Revision as of 13:44, 7 April 2017 by Milimetric (Milimetric moved page Analytics/Cluster/Spark to Analytics/Systems/Cluster/Spark: Reorganizing documentation)

How do I ...

Start spark shell

spark-shell --master yarn --executor-memory 2G --executor-cores 1 --driver-memory 4G

See Spark logs on my local machine when using spark-submit

  • If you are running Spark on local, spark-submit should write logs to your console by default.
  • How to get logs written to a file?
    • Spark uses log4j for logging, and the log4j config is usually at /etc/spark/
    • The default config uses a ConsoleAppender; to write to a file instead, a log4j properties file like the following works:
# Set everything to be logged to the file
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=/tmp/spark.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

This should write logs to /tmp/spark.log

  • On the analytics cluster (stat1002):
    • Running a Spark job through spark-submit writes logs to the console too, in both yarn and local modes
    • To write to a file, create a log4j properties file similar to the one above, using the FileAppender
    • One option is to upload your custom file with the --files argument of spark-submit
    • The other is to use the extraJavaOptions config
    • For both options, see the Spark "Running on YARN" documentation (under "Debugging your Application" for an explanation)
  • While running a spark job through Oozie
    • The log4j file path now needs to be a location accessible by all drivers/executors running in different machines
    • Putting the file on a temp directory on Hadoop and using a hdfs:// url should do the trick
    • Note that the logs will be written on the machine where the driver/executors are running - so you'd need access to go look at them
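Putting the two options above together, a spark-submit invocation that ships a custom log4j file might look like this (the properties file name and job script are hypothetical; the standard -Dlog4j.configuration property points the driver and executor JVMs at the shipped file):

```
spark-submit \
  --master yarn \
  --files /home/me/custom-log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=custom-log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=custom-log4j.properties" \
  my_job.py
```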

Monitor Spark shell job Resources

If you run some more complicated spark in the shell and you want to see how Yarn is managing resources, you'll need to tunnel:

ssh -N stat1002.eqiad.wmnet -L 8088:analytics1001.eqiad.wmnet:8088

Now you can browse to http://localhost:8088/. If a link appears broken, replace the hostname with localhost (the links are absolute URLs pointing at analytics1001.eqiad.wmnet).

  • Poke people on #wikimedia-analytics for help!

Spark and IPython

The Spark Python API makes working with data in HDFS super easy. For exploratory tasks, I like using IPython notebooks. You can run Spark from an IPython notebook by doing the following:

On stat1002:

Tell pyspark to start the IPython notebook server when called:

export IPYTHON_OPTS="notebook --pylab inline --port 8123  --ip='*' --no-browser"

Start pyspark

pyspark --master yarn --deploy-mode client --executor-memory 2g --conf spark.dynamicAllocation.maxExecutors=32

On Your Laptop:

Create a tunnel from your machine to the Ipython Notebook server

ssh -N stat1002.eqiad.wmnet -L 8123:stat1002.eqiad.wmnet:8123

Finally, navigate to http://localhost:8123, create a notebook and start coding!

When you're done, don't forget to kill your pyspark process :)
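If you lose track of a session, a quick way to find (and then kill) leftover pyspark processes is something like the following (pgrep and pkill are standard procps tools; this sketch only lists matches so you can confirm before killing):

```shell
# List any pyspark processes you own; print a note if there are none
pgrep -u "$(id -un)" -af pyspark || echo "no pyspark sessions running"
# Once you've confirmed the right process, kill it:
# pkill -u "$(id -un)" -f pyspark
```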

Spark and Oozie

Oozie has a Spark action, allowing you to launch Spark jobs almost as you would with spark-submit:

<spark xmlns="uri:oozie:spark-action:0.1">
    ...
    <spark-opts>--conf spark.yarn.jar=${spark_assembly_jar} --executor-memory ${spark_executor_memory} --driver-memory ${spark_driver_memory} --num-executors ${spark_number_executors} --queue ${queue_name} --conf spark.yarn.appMasterEnv.SPARK_HOME=/bogus --driver-class-path ${hive_lib_path} --driver-java-options "-Dspark.executor.extraClassPath=${hive_lib_path}" --files ${hive_site_xml}</spark-opts>
    ...
</spark>

The tricky parts here are in the spark-opts element, with the need for spark to be given specific configuration settings not automatically loaded as they are with spark-submit:

  • Core spark jar is needed in configuration:
--conf spark.yarn.jar=${spark_assembly_jar}
# on analytics-hadoop:
#    spark_assembly_jar = hdfs://analytics-hadoop/user/spark/share/lib/spark-assembly.jar
  • When using Python, you need to set the SPARK_HOME environment variable (a dummy value such as /bogus works):
--conf spark.yarn.appMasterEnv.SPARK_HOME=/bogus
  • If you want to use HiveContext in spark, you need to add the hive lib jars and hive-site.xml to spark (not done by default in our version):
--driver-class-path ${hive_lib_path} --driver-java-options "-Dspark.executor.extraClassPath=${hive_lib_path}" --files ${hive_site_xml}
# on analytics-hadoop: 
#   hive_lib_path = /usr/lib/hive/lib/*
#   hive_site_xml = hdfs://analytics-hadoop//util/hive/hive-site.xml
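To sanity-check these cluster-specific paths before submitting a workflow, you can list them with the hdfs CLI (using the assembly jar path given above as the example):

```
hdfs dfs -ls hdfs://analytics-hadoop/user/spark/share/lib/spark-assembly.jar
```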