A Brief Introduction to Literate Analytics With Org-mode
30. April 2018
As an engineer on an analytics and data infrastructure team, I spend quite a bit of time exploring and performing ad-hoc analyses of data scattered across SQL databases; Hive and Spark clusters; and blob storage. These investigations can be a chore – a single instance might involve a half dozen ETL steps, ad-hoc querying to assess whether data needs to be cleaned, and moving code between a REPL and a program destined for production.
Once an initial conclusion has been reached, I need to go back and document my process for posterity, remembering exactly what worked along the way. This is a recipe for mistakes and omissions, so why not document as you go?
What I need to do is:
- Log each command
- Document what it does and why it’s necessary
- Execute and examine the results as I go
org-babel provides an amazing environment for this sort of polyglot, multi-modal series of tasks. Prose and code can be freely intermingled in a literate org document, any references used during the process can be cited, and the final product can be exported into human- or machine-readable formats. If you're keen to see the final product of an analysis, here's an example I recently created for Tea With Strangers, reporting on growth in recent months, that demonstrates these capabilities.
As I typically begin my explorations with a SQL-like datastore, I’ll
begin my examples there.
ob-sql ships with org-mode, so all you have to do is allow for execution of SQL src blocks in the org-babel environment; to do this, add the following to your init file:

```elisp
(add-to-list 'org-babel-load-languages '(sql . t))

;; If you also want to disable confirmation for SQL blocks:
(setq org-confirm-babel-evaluate
      (lambda (lang body) (not (string= lang "sql"))))

;; or disable it for all blocks
(setq org-confirm-babel-evaluate nil)
```
n.b. If you'd like to learn more about how babel's evaluation works, its info page is very complete; call `(info "(org) Evaluating code blocks")` to view it.
Now that org-babel is configured, it's time to create a new org file and add some properties that will configure our SQL environment. src blocks accept a variety of header arguments that control the behaviour of their evaluation, output, and export.
```
#+PROPERTY: header-args:sql :session literate-analytics :engine postgresql
```

- This line adds two default arguments to any sql block: `:session` and `:engine`. The former enables session-based evaluation for our SQL blocks; the latter tells ob-sql that postgresql is the backend we're working against.

```
#+PROPERTY: header-args:sql+ :exports results :database tws_dev
```

- The `+` after `sql` enables appending to existing arguments instead of overriding them. `:exports results` means that only the output of src blocks will be included in the output of an org-export backend by default. Finally, `:database` is the connection string that will be passed to our SQL client.
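Any of these file-level defaults can be overridden on an individual block by repeating the header argument there. As a quick sketch (this block is my own illustration, not from the setup above), a single query that exports both its source and its results despite the `:exports results` default:

```org
#+BEGIN_SRC sql :exports both
-- :exports both on this block overrides the file-level :exports results
SELECT COUNT(*) FROM users;
#+END_SRC
```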
Now add a
src block in
sql-mode by typing
<s TAB sql. Contents of
this block will be syntax highlighted as if they were in their own
buffer, and you can edit its contents inline or in an indirect buffer
using the appropriate major mode by typing
C-c ' while your point is
over the block.
Your buffer should now look like this:
```
#+PROPERTY: header-args:sql :session literate-analytics :engine postgresql
#+PROPERTY: header-args:sql+ :exports results :database tws_dev

#+BEGIN_SRC sql
SELECT COUNT(*) FROM users;
#+END_SRC
```
Now, place your point inside the src block and call `C-c C-c`. If everything's configured correctly, a RESULTS block should be inserted below:
```
#+RESULTS:
| count |
|-------|
|     5 |
```
Results are automatically coerced into an org table, which can be quite helpful if you want to perform some tabulation or filtering with elisp, or simply want to export the results later via org-export.
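Because the results land in a real org table, they can be fed straight into another block. A minimal sketch of what that might look like (the `#+NAME` and the emacs-lisp block are my additions, not part of the original setup):

```org
#+NAME: user-count
#+BEGIN_SRC sql
SELECT COUNT(*) FROM users;
#+END_SRC

#+BEGIN_SRC emacs-lisp :var table=user-count
;; `table' arrives as a list of rows; pull out the first column
(mapcar #'car table)
#+END_SRC
```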
Using Remote Resources via TRAMP
Since it’s rare you’ll be interacting exclusively with local resources,
it’s quite useful to be able to run commands on other machines. For
example, we frequently
ssh into our Hadoop cluster head nodes to
rename, move, or find files in HDFS. Fortunately,
babel is tightly
integrated with TRAMP.
The `:dir` property on
src blocks accepts a TRAMP reference; execution
of the block will then occur on the remote machine, e.g.:
```
#+BEGIN_SRC sh :dir /ssh:spark: :results output list
hdfs dfs -ls $hdfsPath/ServiceInterface
#+END_SRC

#+RESULTS:
: - Found 5 items
: - drwxrwxrwx+ - ... 0 2018-04-25 17:06 $hdfsPath/ServiceInterface/Delinked
: - drwxrwxrwx+ - ... 0 2018-04-25 16:43 $hdfsPath/ServiceInterface/PartialValidated
: - drwxrwxrwx+ - ... 0 2018-04-23 19:58 $hdfsPath/ServiceInterface/PreValidated
: - drwxrwxrwx+ - ... 0 2018-04-25 21:24 $hdfsPath/ServiceInterface/Unvalidated
```
If you’re going to be executing multiple commands on the same remote
machine, you can use the
:session header arg to keep the connection
open. See the info page for
:session for details.
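A sketch of what that looks like in practice (the session name `remote-shell` is my own choice; `spark` is the ssh host alias from the example above): two blocks naming the same session share one shell on the remote machine, so state like the working directory persists between them.

```org
#+BEGIN_SRC sh :dir /ssh:spark: :session remote-shell
cd /tmp
#+END_SRC

#+BEGIN_SRC sh :session remote-shell
pwd  # runs in the same remote shell, so we're still in /tmp
#+END_SRC
```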
I've only covered a fraction of what org-babel is capable of, but hopefully this has given you some ideas. For further reading, I recommend watching Howard Abrams's Literate DevOps with Emacs talk and perusing the org-mode info manual.