Setting Environment Variables on an HDInsight Spark Cluster with Script Actions
20. March 2018
It’s a common pattern in web development to use environment variables for app configuration in different environments. My team wanted to use the same pattern for our Spark jobs, but documentation on how to set environment variables for an HDInsight cluster was hard to come by. We eventually found a solution for HDInsight 3.6.
tl;dr: You need to set spark.yarn.appMasterEnv.* and spark.executorEnv.* entries in $SPARK_HOME/conf/spark-defaults.conf. For example, if you want the variable SERVICE_PRINCIPAL_ENDPOINT available from a sys.env call inside your Spark app, you’d add the following lines to spark-defaults.conf:
spark.yarn.appMasterEnv.SERVICE_PRINCIPAL_ENDPOINT <val>
spark.executorEnv.SERVICE_PRINCIPAL_ENDPOINT <val>
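To confirm the variable actually reaches your job, you can run a throwaway check from a head node. This is an illustrative sketch, not part of the original setup: it pipes a one-liner into spark-shell and reads sys.env on an executor (spark.yarn.appMasterEnv only applies to the driver in cluster mode, so checking an executor is the more telling test):
# Quick sanity check from a head node: read the variable on an executor via sys.env.
# Purely illustrative; assumes spark-shell is on the PATH and YARN is the master.
spark-shell --master yarn <<'SCALA'
println(sc.parallelize(1 to 1).map(_ => sys.env.getOrElse("SERVICE_PRINCIPAL_ENDPOINT", "<unset>")).first)
SCALA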
These settings will now be used whenever a job is submitted to the cluster, whether that be via spark-submit or Livy. Though only the head nodes of the cluster need to have their spark-defaults.conf updated, it’s recommended you automate deployment of your customizations through Script Actions.
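For instance, once the defaults are in place, a plain spark-submit needs no extra --conf flags to see the variables; something like the following would do (the class and jar names here are placeholders, not from our project):
# The env settings are read from spark-defaults.conf on the head node at
# submit time; nothing extra is passed on the command line.
# com.example.MyApp and my-app-assembly.jar are placeholders.
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --class com.example.MyApp \
    my-app-assembly.jar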
I wrote a small bash script that generates another bash script, uploads it to Azure Blob Storage, and submits it to the HDInsight Script Action REST endpoint for execution across all the nodes on the cluster. Because a Script Action may be executed on a node multiple times, it’s important to make sure it’s idempotent, lest you end up with the same key set dozens of times in the same file. I get around this by enclosing our Script Action-deployed section between two distinct comments, and using grep and sed to delete any preëxisting block before it’s added back to the file.
env_block() {
    # Environment variables to set on every node, keyed by variable name
    declare -A ClusterEnv=(
        ["ADLS_ROOT_URL"]="adl://nib-dl.azuredatalake.net"
    )
    local concat=""
    local NL=$'\n'
    for key in "${!ClusterEnv[@]}"; do
        concat+="spark.yarn.appMasterEnv.${key} ${ClusterEnv[$key]}$NL"
        concat+="spark.executorEnv.${key} ${ClusterEnv[$key]}$NL"
    done
    local header="### START SCRIPTACTION ENV BLOCK ###"
    local footer="### END SCRIPTACTION ENV BLOCK ###"
    echo "$header$NL$concat$footer"
}

generate_script_action() {
    # Path to the target conf file; the \$ is escaped so SPARK_HOME is expanded
    # on the cluster node when the generated script runs, not here
    CLUSTER_FILE=\$SPARK_HOME/conf/spark-defaults.conf
    # Filename to output the generated script to
    local output=$1
    cat > "$output" <<EOF
env_vars=\$(cat << END
$(env_block)
END
)
# Load /etc/environment so we have \$SPARK_HOME set
source /etc/environment
# Find the line range of any existing Script Action block
lines=\$(cat $CLUSTER_FILE | grep -nE "###.+SCRIPTACTION" \
    | cut -f1 -d":" | tr '\n' ',' | sed -e 's/,$//')
# If the section isn't found, skip deleting from the target file
if ! [ -z "\$lines" ]; then
    sudo sed -i'' -e "\${lines}d" $CLUSTER_FILE
fi
echo "\$env_vars" | sudo tee -a $CLUSTER_FILE
EOF
}
generate_script_action is a bash function that takes a single argument, a filename, and writes out a Script Action that sets the appropriate keys in spark-defaults.conf as described above. In our production version, we use Azure KeyVault to fetch fresh secrets and re-execute the Script Action on each build and deploy to prod.
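To round out the picture, here’s a rough sketch of the upload-and-execute step that the script above leaves out. It’s only an assumption-laden outline, not our production pipeline: the storage account, container, resource group, cluster name, and api-version below are placeholders you’d swap for your own, and the request shape should be checked against the current HDInsight management API docs.
# Generate the Script Action locally (filename is arbitrary)
generate_script_action set-spark-env.sh

# Upload it somewhere the cluster can read it (auth flags omitted;
# account and container names are placeholders)
az storage blob upload \
    --account-name mystorageacct \
    --container-name scripts \
    --name set-spark-env.sh \
    --file set-spark-env.sh

# Submit it to the HDInsight Script Action REST endpoint for the head and
# worker nodes. Endpoint path, body fields, and api-version are assumptions
# based on the HDInsight management API; verify before relying on them.
TOKEN=$(az account get-access-token --query accessToken -o tsv)
curl -X POST \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"scriptActions":[{"name":"set-spark-env","uri":"https://mystorageacct.blob.core.windows.net/scripts/set-spark-env.sh","roles":["headnode","workernode"]}],"persistOnSuccess":true}' \
    "https://management.azure.com/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.HDInsight/clusters/<cluster>/executeScriptActions?api-version=2018-06-01-preview"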