Remote Debugging Spark Jobs

11. January 2018

As we move past kicking the tires of Spark, we’re finding quite a few areas where the documentation doesn’t quite cover the scenarios you’ll run into day after day in a production environment. Today’s exemplar? Debugging jobs running in our staging cluster.

While taking a functional approach to data transformations allows us to write easily testable and composable code, we don’t claim to be perfect, and being able to set breakpoints and inspect values at runtime is an invaluable tool in a programmer’s arsenal when println just won’t do.

The mechanics of debugging a JVM application in production are simple, and among the most powerful (and impressive!) capabilities of the Java platform. The Java Debug Wire Protocol (JDWP) lets you attach to a process on a remote server as if it were running on your own machine. Just launch the JVM for your Spark driver with debug flags and attach. Easy, right?
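
For a standalone JVM that really is a one-liner; a generic sketch (the jar name and port here are placeholders, not our actual job):

# Launch any JVM with the JDWP agent listening on a socket; suspend=y holds
# the process at startup until a debugger attaches to port 5005.
java -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005 -jar myapp.jar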

Not quite.

Our production and staging environments run atop HDInsight, which – like many Spark distributions – uses YARN to manage cluster resources, and our Spark applications run in cluster mode to take advantage of driver failover and log aggregation. This means the node we launch our job from may not be the one responsible for hosting our application’s driver.

Additionally, rather than starting jobs from the command line with spark-submit, we rely on Apache Livy to expose a REST API for job submission, checking the status of running jobs, and fetching logs from a job’s YARN container. Further complicating matters, our worker nodes are isolated on a private vLAN and accessible only from a head node.

To debug in this scenario, we need to:

  1. Submit our job through the Livy API with an appropriate conf to configure the Spark driver for debugging
  2. Find out which node in the cluster our driver is running on
  3. Forward traffic from our machine to the driver node in the remote cluster, enabling us to connect to worker nodes on the vLAN

The first stop on our incredible journey is the Livy Batch API. We’ll use cURL to POST to the endpoint and submit a job:

# Pick a random debug port in the unprivileged range (10000-42767)
PORT=$(( (RANDOM % 32768) + 10000 ))

SPARK_DEBUG_OPTS=$(cat <<EOF
{
  "spark.driver.extraJavaOptions": "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=$PORT"
}
EOF
)

JSON=$(cat <<EOF
{
  "file": "$FILENAME",
  "className": "$CLASSNAME",
  "args": [ $ARGS ],
  "conf": $SPARK_DEBUG_OPTS
}
EOF
)

curl -k \
  --user "$SPARK_CLUSTER_USERNAME:$SPARK_CLUSTER_PASSWORD" \
  -v \
  -f \
  -s \
  -H 'Content-Type: application/json' \
  -d "$JSON" \
  "https://$SPARK_CLUSTER_URL/livy/batches"

The spark.driver.extraJavaOptions key, covered in Spark’s configuration documentation, specifies extra JVM options to pass to the driver. Oracle’s documentation on the Java Platform Debugger Architecture (JPDA) helpfully breaks these settings down. The short version: we’re telling the driver to load the debug agent, make it accessible via a socket, and suspend execution until a debugger attaches.
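
For reference, the same conf submitted with spark-submit instead of Livy would look something like the sketch below; the file, class, and args placeholders mirror the Livy payload above, and the comments spell out each jdwp setting:

# transport=dt_socket  -> expose the debug agent over a TCP socket
# server=y             -> the driver JVM listens for an incoming debugger
# suspend=y            -> pause execution until a debugger attaches
# address=$PORT        -> the port the agent listens on
spark-submit \
  --deploy-mode cluster \
  --class "$CLASSNAME" \
  --conf "spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=$PORT" \
  "$FILENAME" $ARGS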

If your request succeeds, you should receive a response similar to this:

{
  "id": 719,
  "state": "starting",
  "appId": null,
  "appInfo": {
    "driverLogUrl": null,
    "sparkUiUrl": null
  },
  "log": []
}

Since it takes a few seconds for the job to be loaded by YARN, we have to wait a bit to get the info we’re looking for: the driver’s hostname, which can be found inside the driverLogUrl. To do so, take the id returned by our POST request and ask Livy for an update:

ID=719
curl -k \
  --user "$SPARK_CLUSTER_USERNAME:$SPARK_CLUSTER_PASSWORD" \
  "https://$SPARK_CLUSTER_URL/livy/batches/$ID" > appInfo.json

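In practice the driver can take a little while to be scheduled, so it’s handy to poll until the log URL appears. A minimal sketch, assuming jq is available (we use it below anyway):

# Keep asking Livy for the batch until YARN has assigned a driver and the
# log URL is populated, overwriting appInfo.json with the latest response.
until jq -e '.appInfo.driverLogUrl != null' appInfo.json > /dev/null 2>&1; do
  sleep 5
  curl -k -s \
    --user "$SPARK_CLUSTER_USERNAME:$SPARK_CLUSTER_PASSWORD" \
    "https://$SPARK_CLUSTER_URL/livy/batches/$ID" > appInfo.json
done
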
The null values from our first response should now be populated:

{
  "id": 719,
  "state": "starting",
  "appId": "application_1513157636215_0120",
  "appInfo": {
    "driverLogUrl": "https://$SPARK_CLUSTER_URL/yarnui/10.0.0.11/node/containerlogs/$yarnID/livy",
    "sparkUiUrl": "https://$SPARK_CLUSTER_URL/yarnui/hn/proxy/$yarnID/"
  },
  "log": []
}

The hostname of our driver node is the first segment after /yarnui/ in the appInfo.driverLogUrl value. Let’s parse it out with jq, grep, and cut, and save it to an environment variable:

DRIVER_HOST=$(jq -r '.appInfo.driverLogUrl' appInfo.json |
  grep -oE 'yarnui/[^/]+' | cut -f2 -d'/')

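If you’d rather keep the whole extraction in jq, something like this regex capture should also work (assuming a jq version with regex support, 1.5 or newer):

# Alternative sketch: let jq pull the host out of the log URL directly
DRIVER_HOST=$(jq -r '.appInfo.driverLogUrl | capture("yarnui/(?<host>[^/]+)").host' appInfo.json)
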
All that’s left is forwarding traffic from a port on our local machine to the debugger running on the driver’s host. To do that, we’ll use ssh port forwarding, accessing the internal network connecting all of our workers by tunneling traffic through our head node:

ssh -Nn -L 5005:$DRIVER_HOST:$PORT \
  $SPARK_CLUSTER_SSH_ACCOUNT@$SPARK_CLUSTER_HEAD_NODE

This forwards all traffic on local port 5005 to DRIVER_HOST on the PORT we told the JVM to listen on earlier. Now to connect our debugger and check that everything’s working…
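
Before reaching for a full IDE, a quick way to confirm the tunnel and the JDWP agent are talking is the JDK’s command-line debugger, jdb (assuming it’s on your path):

# Attach jdb to the local end of the tunnel; if it connects, the driver's
# debug agent is reachable and the job is suspended, waiting for us.
jdb -connect com.sun.jdi.SocketAttach:hostname=localhost,port=5005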

IntelliJ Debugger successfully attached!

Et voilĂ ! Remote debugging on a Spark cluster managed by YARN and Livy. Simple, right?