Remote Debugging Spark Jobs
11. January 2018
As we move past kicking the tires of Spark, we’re finding quite a few areas where the documentation doesn’t quite cover the scenarios you’ll run into day after day in a production environment. Today’s exemplar? Debugging jobs running in our staging cluster.
While taking a functional approach to data transformations allows us to write
easily testable and composable code, we don’t claim to be perfect, and being
able to set breakpoints and inspect values at runtime is an invaluable tool in a
programmer’s arsenal when println just won’t do.
The how
of debugging a JVM application in production is simple, and relies on one of the
most powerful (and impressive!) capabilities of the Java platform. The Java
Debug Wire Protocol (JDWP) enables attaching to a remote JVM as if it were running on
your machine. Just launch the JVM for your Spark driver with debug flags and
attach. Easy, right?
Not quite.
Our production and staging environments run atop
HDInsight, which –
like many Spark distributions – uses YARN to manage cluster resources, and our
Spark applications run in cluster
mode to take advantage of driver
failover and log aggregation. This means the node we launch our job from may not
be the one responsible for hosting our application’s driver.
Additionally, rather than starting jobs from the command line with
spark-submit
, we rely on Apache Livy to
expose a REST API for job submission, checking the status of running jobs, and
fetching logs from a job’s YARN container. Further complicating matters, our
worker nodes are isolated on a private vLAN and accessible only from a head
node.
To debug in this scenario, we need to:
- Submit our job through the Livy API with an appropriate conf to configure the Spark driver for debugging
- Find out which node in the cluster our driver is running on
- Forward traffic from our machine to the driver node in the remote cluster, enabling us to connect to worker nodes on the vLAN
The first stop on our incredible journey is the
Livy Batch API.
We’ll use cURL
to POST to the endpoint and submit a job:
# Pick a random debug port in the 10000-42767 range
PORT=$((RANDOM % 32768 + 10000))

SPARK_DEBUG_OPTS=$(cat <<EOF
{
  "spark.driver.extraJavaOptions":
    "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=$PORT"
}
EOF
)

JSON=$(cat <<EOF
{
  "file": "$FILENAME",
  "className": "$CLASSNAME",
  "args": [ $ARGS ],
  "conf": $SPARK_DEBUG_OPTS
}
EOF
)

curl -k \
  --user "$SPARK_CLUSTER_USERNAME:$SPARK_CLUSTER_PASSWORD" \
  -v \
  -f \
  -s \
  -H 'Content-Type: application/json' \
  -d "$JSON" \
  "https://$SPARK_CLUSTER_URL/livy/batches"
The spark.driver.extraJavaOptions
key,
documented here,
specifies extra options to be passed to the driver JVM. Oracle’s
page on the Java Platform Debugger Architecture
helpfully breaks these settings down. The short version: we’re telling the driver to load
the debug agent, make it accessible via a socket, and suspend execution until a
debugger application attaches.
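For reference, here’s the agent string from our conf again, with each JDWP setting annotated:

# -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=$PORT
#
# transport=dt_socket   use a plain TCP socket for the debugger connection
# server=y              the driver JVM listens for an incoming debugger connection
# suspend=y             don't start running the application until a debugger attaches
# address=$PORT         the port to listen on (the random port we generated earlier)

One caveat: on Java 9 and later the agent binds to localhost only unless the address is written as *:$PORT, so check which JDK your cluster is running.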
If your request succeeds, you should receive a response similar to this:
{
  "id": 719,
  "state": "starting",
  "appId": null,
  "appInfo": {
    "driverLogUrl": null,
    "sparkUiUrl": null
  },
  "log": []
}
Since it takes a few seconds for the job to be loaded by YARN, we have to wait a
bit to get the info we’re looking for: the driver’s
hostname, which can be found inside the driverLogUrl
. To do so, take the ID
returned by our POST request and ask Livy for an update:
ID=719
curl -k -s \
  --user "$SPARK_CLUSTER_USERNAME:$SPARK_CLUSTER_PASSWORD" \
  "https://$SPARK_CLUSTER_URL/livy/batches/$ID" > appInfo.json
The null values from our first response should now be populated:
{
  "id": 719,
  "state": "starting",
  "appId": "application_1513157636215_0120",
  "appInfo": {
    "driverLogUrl": "https://$SPARK_CLUSTER_URL/yarnui/10.0.0.11/node/containerlogs/$yarnID/livy",
    "sparkUiUrl": "https://$SPARK_CLUSTER_URL/yarnui/hn/proxy/$yarnID/"
  },
  "log": []
}
The hostname of our driver node is the first segment after /yarnui/
in the
appInfo.driverLogUrl
key. Let’s parse it out with jq
, grep
, and cut
, and
save it in an environment variable:
DRIVER_HOST=$(jq -r '.appInfo.driverLogUrl' appInfo.json |
  grep -oE 'yarnui/([^/]+)' | cut -f2 -d"/")
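If you’d rather keep the parsing in one tool, jq’s regex support can pull the host out on its own; a sketch, assuming jq 1.5 or later:

DRIVER_HOST=$(jq -r '.appInfo.driverLogUrl | capture("yarnui/(?<host>[^/]+)").host' appInfo.json)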
All that’s left is forwarding traffic from a port on our local machine to the
debugger running on the driver’s host. To do that, we’ll use ssh
port
forwarding, accessing the internal network connecting all of our workers by
tunneling traffic through our head node:
ssh -Nn -L 5005:$DRIVER_HOST:$PORT \
  $SPARK_CLUSTER_SSH_ACCOUNT@$SPARK_CLUSTER_HEAD_NODE
This will forward all traffic on local port 5005 to the DRIVER_HOST
on the
PORT
we told the JVM to listen on earlier. Now to connect our debugger and
check that everything’s working…
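Before wiring up an IDE, it’s worth a quick sanity check from the command line. jdb, the debugger that ships with the JDK, can attach through the local end of the tunnel:

# Attach to the forwarded port; if the tunnel and the JDWP agent are healthy,
# jdb drops into a prompt connected to the waiting driver JVM.
jdb -connect com.sun.jdi.SocketAttach:hostname=localhost,port=5005

From IntelliJ or a similar IDE, the equivalent is a remote JVM debug run configuration pointed at localhost:5005.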
Et voilĂ ! Remote debugging on a Spark cluster managed by YARN and Livy. Simple, right?