A Brief Introduction to Literate Analytics With org-babel
30. April 2018
As an engineer on an analytics and data infrastructure team, I spend quite a bit of time exploring and performing ad-hoc analyses of data scattered across SQL databases; Hive and Spark clusters; and blob storage. These investigations can be a chore – a single instance might involve a half dozen ETL steps, ad-hoc querying to assess whether data needs to be cleaned, and moving code between a REPL and a program destined for production.
Once an initial conclusion has been reached, I need to go back and document my process for posterity, remembering exactly what worked along the way. This is a recipe for mistakes and omissions, so why not document as you go?
What I need to do is:
- Log each command
- Document what it does and why it’s necessary
- Execute and examine the results as I go
org-mode and org-babel provide an amazing environment for this sort
of polyglot, multi-modal series of tasks. Prose and code can be freely
intermingled in a literate
programming style,
any references used during the process can be cited, and the final
product can be exported into human- or machine-readable formats. If
you’re keen to see the final product of an analysis, here’s an
example
I recently created for Tea With
Strangers, reporting on growth in
recent months that demonstrates these capabilities.
Getting Started
As I typically begin my explorations with a SQL-like datastore, I’ll
begin my examples there. sql-mode and ob-sql ship with emacs, so
all you have to do is allow for execution of SQL src blocks in the
org-babel environment; to do this, add the following to your
init.el:
1
2
3
4
5
6
7
(add-to-list 'org-babel-load-languages '(sql . t))
;;If you also want to disable confirmation for SQL blocks:
(setq org-confirm-babel-evaluate
(lambda (lang body)
(not (string= lang "sql"))))
;;or disable it for all blocks
(setq org-confirm-babel-evaluate nil)
n.b. If you’d like to learn more about how
babel‘s evaluation works, itsinfopage is very complete; call(info "(org) Evaluating code blocks")to view .
Now that org-babel is configured, it’s time create a new org buffer
and add some properties that will configure our SQL environment. src
blocks accept a variety of arguments that control the behaviour of their
output and export.
#+PROPERTY: header-args:sql :session literate-analytics :engine postgresql- This line adds two default arguments to any
sqlsrcblocks::sessionand:engine. The former enables session-based evaluation for our SQL blocks, the latter tellssql-modethatpostgresqlis the backend we’re working against; this is equivalent to usingsql-set-productin asql-modebuffer.
- This line adds two default arguments to any
#+PROPERTY: header-args:sql+ :exports results :database tws_dev- The
+aftersqlenables appending to existing arguments instead of overriding them.:exports resultsincludes the output ofsrcblocks will be included in the output of anorg-exportbackend by default. Finally,:databaseis the connection string that will be passed to our SQL client.
- The
Now add a src block in sql-mode by typing <s TAB sql. Contents of
this block will be syntax highlighted as if they were in their own
buffer, and you can edit its contents inline or in an indirect buffer
using the appropriate major mode by typing C-c ' while your point is
over the block.
Your buffer should now look like this:
1
2
3
4
5
6
#+PROPERTY: header-args:sql :session literate-analytics :engine postgresql
#+PROPERTY: header-args:sql+ :exports results :database tws_dev
#+BEGIN_SRC sql
SELECT COUNT(*) FROM users;
#+END_SRC
Now, place your point inside the src block and call
org-babel-execute-src-block with C-c C-c. If everything’s configured
correctly, a RESULT block should be inserted below:
1
2
3
4
#+RESULTS:
| count |
|-------|
| 5 |
Results are automatically coerced into an org table, which can be
quite helpful if you want to perform some tabulation or filtering with
elisp, or simply want to export the results later via org-export.
Using Remote Resources via TRAMP
Since it’s rare you’ll be interacting exclusively with local resources,
it’s quite useful to be able to run commands on other machines. For
example, we frequently ssh into our Hadoop cluster head nodes to
rename, move, or find files in HDFS. Fortunately, babel is tightly
integrated with TRAMP.
The :dir property on src blocks accepts a TRAMP reference; execution
of the block will then occur on the remote machine, e.g.:
1
2
3
4
5
6
7
8
9
10
#+BEGIN_SRC sh :dir /ssh:spark: :results output list
hdfs dfs -ls $hdfsPath/ServiceInterface
#+END_SRC
#+RESULTS:
: - Found 5 items
: - drwxrwxrwx+ - ... 0 2018-04-25 17:06 $hdfsPath/ServiceInterface/Delinked
: - drwxrwxrwx+ - ... 0 2018-04-25 16:43 $hdfsPath/ServiceInterface/PartialValidated
: - drwxrwxrwx+ - ... 0 2018-04-23 19:58 $hdfsPath/ServiceInterface/PreValidated
: - drwxrwxrwx+ - ... 0 2018-04-25 21:24 $hdfsPath/ServiceInterface/Unvalidated
If you’re going to be executing multiple commands on the same remote
machine, you can use the :session header arg to keep the connection
open. See the info page for
:session for details.
Further Reading
I’ve only covered a fraction of what org-babel is capable of, but
hopefully it’ll give you some ideas. For further reading, I recommend
watching Howard Abram’s Literate DevOps with emacs
talk and perusing the
info manual for babel.