Setting Environment Variables on an HDInsight Spark Cluster with Script Actions
20. March 2018
It’s a common pattern in web development to use environment variables for app configuration in different environments. My team wanted to use the same pattern for our Spark jobs, but documentation on how to set environment variables for an HDInsight cluster was hard to come by. We eventually found a solution for HDInsight 3.6.
tl;dr: You need to set spark.yarn.appMasterEnv.* and spark.executorEnv.* entries in $SPARK_HOME/conf/spark-defaults.conf. For example, if you want the variable SERVICE_PRINCIPAL_ENDPOINT available from a sys.env call inside your Spark app, you’d add the following lines to spark-defaults.conf:
spark.yarn.appMasterEnv.SERVICE_PRINCIPAL_ENDPOINT <val>
spark.executorEnv.SERVICE_PRINCIPAL_ENDPOINT <val>
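To confirm the variable actually reaches your job, you can run a throwaway check from a head node. This is an illustrative sketch, not part of the original setup: it pipes a one-liner into spark-shell and reads sys.env on an executor (spark.yarn.appMasterEnv only applies to the driver in cluster mode, so checking an executor is the more telling test):
# Quick sanity check from a head node: read the variable on an executor via sys.env.
# Purely illustrative; assumes spark-shell is on the PATH and YARN is the master.
spark-shell --master yarn <<'SCALA'
println(sc.parallelize(1 to 1).map(_ => sys.env.getOrElse("SERVICE_PRINCIPAL_ENDPOINT", "<unset>")).first)
SCALA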
These settings will now be used whenever a job is submitted to the cluster, whether that be via spark-submit or Livy. Though only the head nodes of the cluster need to have their spark-defaults.conf updated, it’s recommended you automate deployment of your customizations through Script Actions.
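For instance, once the defaults are in place, a plain spark-submit needs no extra --conf flags to see the variables; something like the following would do (the class and jar names here are placeholders, not from our project):
# The env settings are read from spark-defaults.conf on the head node at
# submit time; nothing extra is passed on the command line.
# com.example.MyApp and my-app-assembly.jar are placeholders.
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --class com.example.MyApp \
    my-app-assembly.jar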
I wrote a small bash script that generates another bash script, uploads it to Azure Blob Storage, and submits it to the HDInsight Script Action REST endpoint for execution across all the nodes on the cluster. Because a Script Action may be executed on a node multiple times, it’s important to make sure it’s idempotent, lest you end up with the same key set dozens of times in the same file. I get around this by enclosing our Script Action-deployed section between two distinct comments, and using grep and sed to delete any preëxisting block before it’s added back to the file.
env_block() {
    # Environment variables to set on every node, keyed by variable name
    declare -A ClusterEnv=(
        ["ADLS_ROOT_URL"]="adl://nib-dl.azuredatalake.net"
    )
    local concat=""
    local NL=$'\n'
    for key in "${!ClusterEnv[@]}"; do
        concat+="spark.yarn.appMasterEnv.${key} ${ClusterEnv[$key]}$NL"
        concat+="spark.executorEnv.${key} ${ClusterEnv[$key]}$NL"
    done
    local header="### START SCRIPTACTION ENV BLOCK ###"
    local footer="### END SCRIPTACTION ENV BLOCK ###"
    echo "$header$NL$concat$footer"
}

generate_script_action() {
    # Path to the target conf file; the \$ is escaped so SPARK_HOME is expanded
    # on the cluster node when the generated script runs, not here
    CLUSTER_FILE=\$SPARK_HOME/conf/spark-defaults.conf
    # Filename to output the generated script to
    local output=$1
    cat > "$output" <<EOF
env_vars=\$(cat << END
$(env_block)
END
)
# Load /etc/environment so we have \$SPARK_HOME set
source /etc/environment
# Find the line range of any existing Script Action block
lines=\$(cat $CLUSTER_FILE | grep -nE "###.+SCRIPTACTION" \
    | cut -f1 -d":" | tr '\n' ',' | sed -e 's/,$//')
# If the section isn't found, skip deleting from the target file
if ! [ -z "\$lines" ]; then
    sudo sed -i'' -e "\${lines}d" $CLUSTER_FILE
fi
echo "\$env_vars" | sudo tee -a $CLUSTER_FILE
EOF
}
generate_script_action is a bash function that takes a single argument, a filename, and writes out a Script Action that sets the appropriate keys in spark-defaults.conf as described above. In our production version, we use Azure KeyVault to fetch fresh secrets and re-execute the Script Action on each build and deploy to prod.
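To round out the picture, here’s a rough sketch of the upload-and-execute step that the script above leaves out. It’s only an assumption-laden outline, not our production pipeline: the storage account, container, resource group, cluster name, and api-version below are placeholders you’d swap for your own, and the request shape should be checked against the current HDInsight management API docs.
# Generate the Script Action locally (filename is arbitrary)
generate_script_action set-spark-env.sh

# Upload it somewhere the cluster can read it (auth flags omitted;
# account and container names are placeholders)
az storage blob upload \
    --account-name mystorageacct \
    --container-name scripts \
    --name set-spark-env.sh \
    --file set-spark-env.sh

# Submit it to the HDInsight Script Action REST endpoint for the head and
# worker nodes. Endpoint path, body fields, and api-version are assumptions
# based on the HDInsight management API; verify before relying on them.
TOKEN=$(az account get-access-token --query accessToken -o tsv)
curl -X POST \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"scriptActions":[{"name":"set-spark-env","uri":"https://mystorageacct.blob.core.windows.net/scripts/set-spark-env.sh","roles":["headnode","workernode"]}],"persistOnSuccess":true}' \
    "https://management.azure.com/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.HDInsight/clusters/<cluster>/executeScriptActions?api-version=2018-06-01-preview"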