
Spark is a powerful engine for processing data on the Analytics Cluster. You can drive it using SQL, Python, R, Java, or Scala.

As of January 2020, we are running Spark 2.4.4, so the Spark 2.4.4 documentation is the most appropriate reference.

Command-line interfaces

There are a number of Spark command-line programs available on the analytics clients:

  • spark2-submit
  • spark2-shell
  • spark2R
  • spark2-sql
  • pyspark2
  • spark2-thriftserver

Note that other Spark documentation will use the standard names for these programs, without the 2 (e.g. spark-submit). We have added the 2 to prevent confusion with the programs from Spark 1.

spark2-sql lets you interact with Hive tables directly via the Spark SQL engine in a pure SQL REPL, without having to write code in a programming language.

The rest of this page uses the spark2 commands, since Spark 2 is the preferred installation of Spark. Note that our spark2 configuration defaults pyspark2 to using python3 (and ipython3 for the driver).

How do I ...

Start a spark shell in yarn

Note: The settings presented here are for a medium-size job on the cluster (~15% of the whole cluster)

  • Scala
spark2-shell --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64
  • Python
pyspark2 --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64
  • R
spark2R --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64
  • SQL
spark2-sql --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64
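
The four commands above differ only in the program name. As an illustration (not part of the cluster tooling), the Python sketch below assembles them from one shared list of options. The `SHELLS` mapping and the `shell_command` helper are hypothetical names introduced here; the program names and options are the ones listed above.

```python
import shlex

# Options shared by all four shells (medium-size job, as in the commands above).
COMMON_OPTS = [
    "--master", "yarn",
    "--executor-memory", "8G",
    "--executor-cores", "4",
    "--driver-memory", "2G",
    "--conf", "spark.dynamicAllocation.maxExecutors=64",
]

# Hypothetical mapping from language to the CLI program listed on this page.
SHELLS = {
    "scala": "spark2-shell",
    "python": "pyspark2",
    "r": "spark2R",
    "sql": "spark2-sql",
}


def shell_command(language):
    """Return the full command line for the given language's Spark shell."""
    args = [SHELLS[language]] + COMMON_OPTS
    # Quote each argument so the result can be pasted into a terminal.
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    for language in SHELLS:
        print(shell_command(language))
```

Keeping the options in one place makes it easier to switch all shells to a different job size at once.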

Set the python version pyspark should use

As of June 2020, our installation of Spark works with Python 3.5 and 3.7. The default on Debian Stretch nodes is 3.5, and on Debian Buster nodes it is 3.7. Most Hadoop workers are Stretch, so if you want to launch pyspark in YARN with Python 3.7, you should always specify the pyspark Python version like:

 PYSPARK_PYTHON=python3.7 pyspark2 --master yarn

See spark logs on my local machine when using spark submit

  • If you are running Spark on local, spark2-submit should write logs to your console by default.
  • How to get logs written to a file?
    • Spark uses log4j for logging, and the log4j config is usually at /etc/spark2/conf/
    • This uses a ConsoleAppender by default. If you want to write to a file instead, an example log4j properties file would be:
# Set everything to be logged to the file
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=/tmp/spark.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

This should write logs to /tmp/spark.log.

  • On the analytics cluster (stat1007):
    • On the analytics cluster, running a Spark job through spark2-submit also writes logs to the console, in both YARN and local modes
    • To write to a file, create a log4j properties file similar to the one above that uses the FileAppender
    • Use the --files argument on spark-submit and upload your custom file:
spark2-shell --master yarn --executor-memory 2G --executor-cores 1 --driver-memory 4G --files /path/to/your/
  • While running a spark job through Oozie
    • The log4j file path now needs to be a location accessible by all drivers/executors running on different machines
    • Putting the file in a temporary directory on HDFS and using an hdfs:// URL should do the trick
    • Note that the logs will be written on the machines where the driver/executors are running, so you'd need access to those machines to look at them

Monitor Spark shell job Resources

If you run more complicated Spark jobs in the shell and want to see how YARN is managing resources, have a look at

Don't hesitate to poke people on #wikimedia-analytics for help!

Use Hive UDF with Spark SQL

Here is an example in R. On stat1007, start a spark shell with the path to jar:

spark2R --master yarn --executor-memory 2G --executor-cores 1 --driver-memory 4G --jars /srv/deployment/analytics/refinery/artifacts/refinery-hive.jar

Then in the R session:

sql("CREATE TEMPORARY FUNCTION is_spider as ''")
sql("Your query")

Spark and Jupyter notebooks

Spark is now supported in our hosted Jupyter notebooks. This is the preferred way to run Spark in a notebook.

Custom virtual environment

If however you want to run a Notebook with a specific python virtual environment, the solution is to set up the environment, and from it launch and connect to notebooks configured to work with spark. This approach does not use JupyterHub and requires connecting to a temporary notebook process.

# Connect to stat1007 through ssh (the remote machine that will host your notebooks)
ssh stat1007.eqiad.wmnet

# Create your python virtual environment (using the http proxy)
http_proxy=http://webproxy.eqiad.wmnet:8080 https_proxy=http://webproxy.eqiad.wmnet:8080 virtualenv -p python3 test_spark_venv

# Activate the newly created virtual environment
source test_spark_venv/bin/activate

# Download the minimal set of needed libraries, again using the proxy (ipython and jupyter, needed to start notebooks)
http_proxy=http://webproxy.eqiad.wmnet:8080 https_proxy=http://webproxy.eqiad.wmnet:8080 pip install ipython jupyter

# Configure pyspark to launch a notebook server when it starts
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --port 8123  --ip='*' --no-browser"

# start the pyspark job that will launch the notebook server
pyspark2 --master yarn --deploy-mode client --executor-memory 8G --executor-cores 4 --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64

The Spark job is now running on the cluster, with its driver running in the notebook server on stat1007. The terminal from which you launched the commands shows something like:

 To access the notebook, open this file in a browser:
    Or copy and paste one of these URLs:

We will need the last URL to connect to the notebook server, but first we need to set up an ssh tunnel that lets your local computer reach the notebook server on stat1007:

# From your local machine
ssh -N stat1007.eqiad.wmnet -L 8123:stat1007.eqiad.wmnet:8123

Now you can browse from your local machine to the last url given by the notebook app, start a new notebook and use the 'spark' variable to access the spark session.

pyspark and external packages

To use external packages like graphframes:

pyspark2 --packages graphframes:graphframes:0.3.0-spark2.0-s_2.11 --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=webproxy.eqiad.wmnet -Dhttp.proxyPort=8080 -Dhttps.proxyHost=webproxy.eqiad.wmnet -Dhttps.proxyPort=8080"

The proxy options keep the shell from getting stuck at:

resolving dependencies :: org.apache.spark#spark-submit-parent;1.0

confs: [default]

Spark and Oozie

Oozie has a spark action, allowing you to launch Spark jobs as you'd do (almost ...) with spark-submit:

<spark xmlns="uri:oozie:spark-action:0.1">
             <spark-opts>--conf spark.yarn.jar=${spark_assembly_jar} --executor-memory ${spark_executor_memory} --driver-memory ${spark_driver_memory} --num-executors ${spark_number_executors} --queue ${queue_name} --conf spark.yarn.appMasterEnv.SPARK_HOME=/bogus --driver-class-path ${hive_lib_path} --driver-java-options "-Dspark.executor.extraClassPath=${hive_lib_path}" --files ${hive_site_xml}</spark-opts>

The tricky parts here are in the spark-opts element, with the need for spark to be given specific configuration settings not automatically loaded as they are with spark-submit:

  • Core spark jar is needed in configuration:
--conf spark.yarn.jar=${spark_assembly_jar}
# on analytics-hadoop:
#    spark_assembly_jar = hdfs://analytics-hadoop/user/spark/share/lib/spark-assembly.jar
  • When using Python, you need to set the SPARK_HOME environment variable (to a dummy value, for instance):
--conf spark.yarn.appMasterEnv.SPARK_HOME=/bogus
  • If you want to use HiveContext in spark, you need to add the hive lib jars and hive-site.xml to spark (not done by default in our version):
--driver-class-path ${hive_lib_path} --driver-java-options "-Dspark.executor.extraClassPath=${hive_lib_path}" --files ${hive_site_xml}
# on analytics-hadoop: 
#   hive_lib_path = /usr/lib/hive/lib/*
#   hive_site_xml = hdfs://analytics-hadoop//util/hive/hive-site.xml

SparkR in production (stat100* machines) examples

SparkR: Basic example

From stat100*, and with the latest {SparkR} installed:

Note: This example starts a medium-size application (~15% of the cluster resources)


# - load SparkR (assumes the package is on your R library path)
library(SparkR)

# - set environmental variables
Sys.setenv("SPARKR_SUBMIT_ARGS"="--master yarn-client sparkr-shell")

# - start SparkR api session
sparkR.session(master = "yarn", 
   appName = "SparkR", 
   sparkHome = "/usr/lib/spark2/", 
   sparkConfig = list(spark.driver.memory = "2g", 
                      spark.executor.cores = "4", 
                      spark.executor.memory = "8g",
                      spark.dynamicAllocation.maxExecutors = "64",
                      spark.enableHiveSupport = TRUE))

# - a somewhat trivial example w. linear regression on iris 

# - iris becomes a SparkDataFrame
df <- createDataFrame(iris)

# - GLM w. family = "gaussian"
model <- spark.glm(data = df, Sepal_Length ~ Sepal_Width + Petal_Length + Petal_Width, family = "gaussian")

# - summary
summary(model)

# - end SparkR session
sparkR.session.stop()

SparkR: Large(er) file from HDFS

Also from stat100*, and with the latest {SparkR} installed:

Note: This example starts a large application (~30% of the cluster)

### --- flights dataset Multinomial Logistic Regression
### --- SparkDataFrame from HDFS
### --- NOTE: in this example, 'flights.csv' is found in /home/goransm/testData on stat1007

# - load SparkR (assumes the package is on your R library path)
library(SparkR)

Sys.setenv("SPARKR_SUBMIT_ARGS"="--master yarn-client sparkr-shell")

### --- Start SparkR session w. Hive Support enabled
sparkR.session(master = "yarn",
               appName = "SparkR",
               sparkHome = "/usr/lib/spark2/",
               sparkConfig = list(spark.driver.memory = "4g",
                                  spark.dynamicAllocation.maxExecutors = "128",
                                  spark.executor.cores  = "4",
                                  spark.executor.memory = "8g",
                                  spark.enableHiveSupport = TRUE))

# - copy flights.csv to HDFS
system('hdfs dfs -put /home/goransm/testData/flights.csv hdfs://analytics-hadoop/user/goransm/flights.csv', 
       wait = T)

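# - load flights (CSV source; the relative path resolves against your HDFS home directory)
df <- read.df("flights.csv",
               source = "csv",
               header = "true",
               inferSchema = "true",
               na.strings = "NA")

# - structure
str(df)

# - dimensionality
dim(df)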
# - clean up df from NA values
df <- filter(df, isNotNull(df$AIRLINE) & isNotNull(df$ARRIVAL_DELAY) & isNotNull(df$AIR_TIME) & isNotNull(df$TAXI_IN) & 
                 isNotNull(df$TAXI_OUT) & isNotNull(df$DISTANCE) & isNotNull(df$ELAPSED_TIME))

# - dimensionality after cleaning
dim(df)

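# - Generalized Linear Model w. family = "multinomial"
# - NOTE: the formula is an assumption (AIRLINE as the response, the columns cleaned above as predictors)
model <- spark.logit(data = df, 
                     formula = AIRLINE ~ ARRIVAL_DELAY + AIR_TIME + TAXI_IN + TAXI_OUT + DISTANCE + ELAPSED_TIME,
                     family = "multinomial")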

# - Regression Coefficients
res <- summary(model)

# - delete flights.csv from HDFS
system('hdfs dfs -rm hdfs://analytics-hadoop/user/goransm/flights.csv', wait = T)

# - close SparkR session
sparkR.session.stop()

Spark Resource Settings

Spark jobs are highly configurable and no setting is optimal for all jobs. However, this section provides some good guidelines and starting points.

Regular jobs

A good starting point for regular jobs is the following combination of settings, which lets a job use up to roughly 15% of cluster resources.

"spark.driver.memory": "2g",
"spark.dynamicAllocation.maxExecutors": 64,
"spark.executor.memory": "8g",
"spark.executor.cores": 4,
"spark.sql.shuffle.partitions": 256

Large jobs

A good starting point for large jobs is the following combination of settings, which lets a job use up to roughly 30% of cluster resources.

"spark.driver.memory": "4g",
"spark.dynamicAllocation.maxExecutors": 128,
"spark.executor.memory": "8g",
"spark.executor.cores": 4,
"spark.sql.shuffle.partitions": 512

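To make the two profiles easier to compare, the following Python sketch computes the upper bound on executor cores and executor memory that each one can reach when dynamic allocation scales up to maxExecutors. It is only a back-of-the-envelope illustration: it ignores the driver and the per-executor memory overhead, and the helper names (`gib`, `max_executor_footprint`) are hypothetical.

```python
# Settings copied from the "Regular jobs" and "Large jobs" profiles above.
REGULAR = {
    "spark.driver.memory": "2g",
    "spark.dynamicAllocation.maxExecutors": 64,
    "spark.executor.memory": "8g",
    "spark.executor.cores": 4,
    "spark.sql.shuffle.partitions": 256,
}

LARGE = {
    "spark.driver.memory": "4g",
    "spark.dynamicAllocation.maxExecutors": 128,
    "spark.executor.memory": "8g",
    "spark.executor.cores": 4,
    "spark.sql.shuffle.partitions": 512,
}


def gib(value):
    """Parse a size such as '8g' into whole GiB (only the 'g' suffix is handled)."""
    if not value.lower().endswith("g"):
        raise ValueError("expected a size ending in 'g', got %r" % value)
    return int(value[:-1])


def max_executor_footprint(settings):
    """Upper bound on executor cores and memory if all executors are allocated."""
    executors = settings["spark.dynamicAllocation.maxExecutors"]
    return {
        "cores": executors * settings["spark.executor.cores"],
        "memory_gib": executors * gib(settings["spark.executor.memory"]),
    }


if __name__ == "__main__":
    for name, settings in (("regular", REGULAR), ("large", LARGE)):
        print(name, max_executor_footprint(settings))
```

With the values above, the regular profile tops out at 256 executor cores and 512 GiB of executor memory, and the large profile at 512 cores and 1024 GiB.
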
Extra large jobs

Many Spark default settings are not optimal for large scale jobs (roughly, those that handle a terabyte or more of data across stages or that have tens of thousands of stages). This article from the Facebook technical team gives hints at how to better tune Spark in those cases. In this section we try to explain how the tuning helps.

Scaling the driver

  • First, make sure your job uses dynamic allocation. It's enabled by default on the analytics-cluster, but can be turned off. This will ensure a better use of resources across the cluster. If your job fails because of errors at shuffle (due to the external shuffle service), the tuning below should help.
  • Allow for more consecutive attempts per stage (default is 4, 10 is suggested): spark.stage.maxConsecutiveAttempts = 10. This tweak helps deal with fetch failures, which usually happen when an executor is no longer available (dead because of OOM or cluster resource preemption, for instance). In such a case, other executors fail to fetch data, which leads to failed stages. Raising the number of possible consecutive attempts leaves more room for error recovery.
  • Increase the RPC server threads to prevent out of memory errors: = 64 (no information is available as to why this helps. Presumably, since spark.rpc.connect.threads = 64, it is better to have the same number of server threads answering, but I have not found a proper source).


  • Manually set spark.yarn.executor.memoryOverhead when using big executors or a lot of string values (interned strings are stored in the memory buffer). By default Spark allocates 0.1 * total-executor-memory for the buffer, which can be too small.
  • Increase the shuffle file buffer size to reduce the number of disk seeks and system calls made: spark.shuffle.file.buffer = 1 MB and spark.unsafe.sorter.spill.reader.buffer.size = 1 MB
  • Optimize spill file merging by facilitating the merging of newly computed streams into existing files (useful when the job spills a lot): spark.file.transferTo = false, spark.shuffle.file.buffer = 1 MB and spark.shuffle.unsafe.file.output.buffer = 5 MB
  • Reduce spilled data size by increasing the compression block size: = 512KB
  • If needed, enable off-heap memory if GC pauses become problematic (not needed for analytics jobs so far): spark.memory.offHeap.enabled = true and spark.memory.offHeap.size = 3g (don't forget that the off-heap memory is part of the YARN container, so your container size is executor-memory + memory.offHeap.size)
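
As a rough illustration of the default memory overhead mentioned in the list above, the Python sketch below computes it for a given executor size. The 0.1 factor comes from this page; the 384 MiB floor is not stated here and is an assumption based on Spark's documented default. The function names are hypothetical, and the result does not account for any rounding YARN applies to container sizes.

```python
OVERHEAD_FACTOR = 0.10   # "0.1 * total-executor-memory", as stated above
MIN_OVERHEAD_MIB = 384   # assumption: Spark's documented minimum overhead


def default_overhead_mib(executor_memory_mib):
    """Default memory overhead (in MiB) for an executor of the given size."""
    return max(int(OVERHEAD_FACTOR * executor_memory_mib), MIN_OVERHEAD_MIB)


def requested_container_mib(executor_memory_mib, overhead_mib=None):
    """Executor memory plus overhead; pass overhead_mib to model a manual setting."""
    if overhead_mib is None:
        overhead_mib = default_overhead_mib(executor_memory_mib)
    return executor_memory_mib + overhead_mib


if __name__ == "__main__":
    eight_gib = 8 * 1024
    print(default_overhead_mib(eight_gib))           # default overhead for an 8g executor
    print(requested_container_mib(eight_gib))        # executor memory + default overhead
    print(requested_container_mib(eight_gib, 2048))  # with a manually set 2 GiB overhead
```

For an 8g executor the default overhead is about 819 MiB, which shows why string-heavy jobs may need a larger value set by hand.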

External shuffle service

  • Speed up file retrieval by bumping the cache size available for the file index: spark.shuffle.service.index.cache.size = 2048
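
The per-job settings named in this section can be passed on the command line as --conf flags. The Python sketch below collects only the settings whose names appear on this page and renders them as flags. Writing "1 MB" and "5 MB" as "1m" and "5m" is an assumption about Spark's size-string format. The shuffle service index cache size is left out on the assumption that it is configured on the shuffle service itself rather than per job. `EXTRA_LARGE_CONF` and `conf_flags` are hypothetical names.

```python
# Per-job tuning values from the "Extra large jobs" section above.
# Sizes are written in Spark's short size format (assumption: "1 MB" -> "1m").
EXTRA_LARGE_CONF = {
    "spark.stage.maxConsecutiveAttempts": "10",
    "spark.shuffle.file.buffer": "1m",
    "spark.unsafe.sorter.spill.reader.buffer.size": "1m",
    "spark.file.transferTo": "false",
    "spark.shuffle.unsafe.file.output.buffer": "5m",
}


def conf_flags(conf):
    """Turn a {name: value} mapping into a flat list of --conf arguments."""
    flags = []
    for key, value in conf.items():
        flags.extend(["--conf", "%s=%s" % (key, value)])
    return flags


if __name__ == "__main__":
    # Print the flags so they can be appended to a spark2-submit command.
    print(" ".join(conf_flags(EXTRA_LARGE_CONF)))
```

Treat these values as starting points to adjust per job, in line with the rest of this section.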