Data Engineering/Systems/Jupyter
The analytics clients include a hosted version of JupyterHub, allowing easy analysis of internal data using Jupyter notebooks.
Access
To access Jupyter, you need:
- Production data access in the analytics-privatedata-users POSIX group.
- Your SSH configured correctly.
- Kerberos credentials.
- Membership in either the wmf or nda LDAP group.
Once you have this access, open an SSH tunnel to one of the analytics clients, e.g.
ssh -N stat1005.eqiad.wmnet -L 8880:127.0.0.1:8880
replacing stat1005 with the hostname of any other analytics client if you prefer.
Then, open localhost:8880 in your browser and log in with your shell username and LDAP password (the one you use for Wikitech). You'll be prompted to select or create a Conda environment. See the section on Conda environments below.
Note that this will give you access to your Jupyter notebook server on the chosen analytics client host only. Notebooks and files are saved to your home directory on that host. If you need shared access to files, consider putting those files in HDFS.
Authenticate to Hadoop via Kerberos
Once you've logged in, if you want to access data from Hadoop, you will need to authenticate with Kerberos.
This can be done either in an SSH session or in a Jupyter terminal.
In a terminal session, just type
kinit
You'll be prompted for your Kerberos password.
Querying data
The Data Engineering and Product Analytics teams maintain software packages to make accessing data from the analytics clients as easy as possible by hard-coding all the setup and configuration.
In Python
For Python, there is Wmfdata-Python. It can access data through MariaDB, Hive, Presto, and Spark and has a number of other useful functions, like creating Spark sessions. For details, see the repository and particularly the quickstart notebook.
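For example, here is a minimal sketch of running a query with Wmfdata-Python (the table and column names are only illustrative, and the exact API may differ between versions; see the quickstart notebook for confirmed usage):
import wmfdata as wmf

# Run a SQL query against the Data Lake via Hive; wmf.presto.run, wmf.spark.run,
# and wmf.mariadb.run work similarly for the other engines.
# The result is returned as a pandas DataFrame.
pageviews = wmf.hive.run("""
    SELECT year, month, day, SUM(view_count) AS views
    FROM wmf.pageview_hourly
    WHERE year = 2023 AND month = 1 AND day = 1
    GROUP BY year, month, day
""")
pageviews.head()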
In R
For R, there is wmfdata-r. It can access data from MariaDB and Hive and has many other useful functions, particularly for graphing and statistics.
Scala-Spark or Spark-SQL using Toree (not recommended)
To use either Scala-Spark or Spark-SQL notebooks you need to have Apache Toree available in your Conda environment. Note that Toree is a relatively inactive project, so this is not recommended.
An easy way to do so is to install it via the notebook terminal interface: in your notebook interface, click New -> Terminal, and in the terminal type
pip install toree
That's it.
Now you can create a Jupyter kernel using Toree as a gateway between the notebook and a Spark session running on the cluster (note: the Spark session is managed by Toree, no need to create it manually).
To create both Scala-Spark and Spark-SQL kernels, run the following in your notebook terminal:
NOTE: Please change the kernel name and the Spark options as you see fit; you can find the default wmfdata Spark parameters on this GitHub page.
jupyter toree install \
--user \
--spark_home="/usr/lib/spark2/" \
--interpreters=Scala,SQL \
--kernel_name="Scala Spark" \
--spark_opts="--master yarn --driver-memory 2G --executor-memory 8G --executor-cores 4 --conf spark.dynamicAllocation.maxExecutors=64 --conf spark.sql.shuffle.partitions=256"
Conda environments
- Main article: Analytics/Systems/Conda
After logging into JupyterHub, when you start your Jupyter Notebook Server it is launched from your chosen Conda environment. This means that the packages in anaconda-wmf or conda-analytics are available to import in your Python notebooks. If you need different or newer versions of packages, you can install them into your active Conda environment with conda (preferred) or pip, and they will be imported from there.
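For instance, here is a minimal sketch of installing an extra package from a notebook cell (assuming a reasonably recent IPython; the package names are just examples):
# The %conda and %pip magics install into the environment used by the running kernel.
%conda install --yes pyarrow
# Fall back to pip if no conda package is available:
%pip install tqdm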
You can create as many Conda environments as you need, but you can only run one Jupyter Notebook Server, and therefore use one Conda environment, at a time. To use a different Conda environment, stop your Jupyter Notebook Server from the JupyterHub Control Panel, then start a new server and select the other environment for it to use.
These Conda environments may also be used outside of Jupyter on the CLI.
Migrating from anaconda-wmf to conda-analytics
In tandem with the migration from Spark 2 to Spark 3, we will be moving from the current anaconda-wmf Conda base environment to a new conda-analytics base environment. When we release the conda-analytics setup (date not yet determined), any new environment created from the Jupyter start-up menu will be based on the conda-analytics environment. Existing anaconda-wmf environments will continue to work until Spark 2 support is removed on 31 March 2023.
For more information on the differences, visit Analytics/Systems/Conda.
Choosing between Spark 3 and Spark 2
Spark 2 has been deprecated and all Spark jobs must be migrated to Spark 3. See Analytics/Systems/Cluster/Spark/Migration to Spark 3 for more info. In the interim, you can run Spark 2 jobs with the overrides discussed below.
New Spark 3 conda-analytics environment
As part of the Spark 3 rollout, we have also updated the base Conda environment. This new environment supports Spark 3 and is based on Miniconda, a minimal version of Anaconda. We are calling this environment conda-analytics, and you can see which packages are included by default here.
Using Spark 2 in the interim
To use Spark 2, you can still use previously created environments based on anaconda-wmf. You can also create new anaconda-wmf environments by hand.
On a stat machine terminal run:
# clean package cache as some of them, like python 3.10, may clash with anaconda-wmf
rm -rf ~/.conda/pkgs/*
# create new stacked environment
/usr/lib/anaconda-wmf/bin/conda-create-stacked name_of_your_env_here
Then on JupyterHub:
File -> Hub Control Panel -> Stop My Server -> Start My Server
At the last JupyterHub prompt you should be able to choose your newly created environment.
Troubleshooting
pip fails to install a newer version of a package
If you use pip to install a newer version of a package that already exists in the base anaconda-wmf environment, you might get an error like:
Attempting uninstall: wmfdata
Found existing installation: wmfdata 1.0.4
Uninstalling wmfdata-1.0.4:
ERROR: Could not install packages due to an EnvironmentError: [Errno 30] Read-only file system: 'WHEEL'
To work around this, tell pip to --ignore-installed when running pip install, like:
pip install --ignore-installed --upgrade git+https://github.com/wikimedia/wmfdata-python.git@release
See also: Analytics/Systems/Anaconda#Installing_packages_into_your_user_conda_environment
Trouble installing R packages
See Analytics/Systems/Anaconda#R_support
Browser disconnects
If your browser session disconnects from the kernel on the server (for example, if your SSH connection times out), any work the kernel is doing will continue, and you'll be able to access the results the next time you connect to the kernel. However, no further display output from that work (such as print() calls logging progress) will accumulate, even if you reopen the notebook (JupyterLab issue 4237).
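For long-running jobs, one workaround (a sketch, not an official recommendation) is to also write progress messages to a file in your home directory, since the file survives a disconnect:
import logging

# Write progress to a log file so a record survives even if the browser
# disconnects from the kernel; "long_job.log" is just an example filename.
logging.basicConfig(filename="long_job.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")
logging.info("Finished step 1 of the long-running job")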
My Python notebook will not start
Your IPython configuration may be broken. Try deleting your ~/.ipython directory (you'll lose any configurations you've made or extensions you've installed, but it won't affect your notebooks, files, or Python packages).
My kernel restarts when I run a large query
It may be that your Jupyter Notebook Server ran out of memory and the operating system's out-of-memory killer decided to kill your kernel to cope with the situation. You won't get any notification that this has happened other than the notebook restarting, but you can assess the state of the memory on the notebook server by checking its host overview dashboard in Grafana (host-overview dashboard), or by using the command line to see which processes are using the most memory, e.g.
ps aux --sort -rss | head
Viewing Jupyter Notebook Server logs
JupyterHub logs are viewable by normal users in Kibana.
A dashboard named JupyterHub has been created, and it is also linked from the Home Dashboard.
At present the logs are not split per user, but we are working to make this possible.
They are no longer written by default to /var/log/syslog, but they are retained on the host in the systemd journal.
You might need to see JupyterHub logs to troubleshoot login issues or resource issues affecting the cluster.
An individual user's notebook server log can be examined with the following command:
sudo journalctl -f -u jupyter-$USERNAME-singleuser.service
Viewing JupyterHub logs
TODO: Make this work for regular users!
You might need to see JupyterHub logs to troubleshoot login issues:
sudo journalctl -f -u jupyterhub
Tips
Analytics/Systems/Jupyter/Tips