
Data Engineering/Systems/Jupyter


The analytics clients include a hosted version of JupyterHub, allowing easy analysis of internal data using Jupyter notebooks.

Access

To access Jupyter, you need shell access to the analytics clients (and, to query data in Hadoop, Kerberos credentials).

Once you have this access, open an SSH tunnel to one of the analytics clients, e.g.

 ssh -N stat1005.eqiad.wmnet -L 8880:127.0.0.1:8880

replacing stat1005 with the hostname of any other analytics client if you prefer.

Then, open localhost:8880 in your browser and log in with your shell username and LDAP password (the one you use for Wikitech). You'll be prompted to select or create a Conda environment. See the section on Conda environments below.

Note that this will give you access to your Jupyter notebook server on the chosen analytics client host only. Notebooks and files are saved to your home directory on that host. If you need shared access to files, consider putting those files in HDFS.

Authenticate to Hadoop via Kerberos

Once you've logged in, if you want to access data from Hadoop, you will need to authenticate with Kerberos.

This can be done either in an SSH session or in a Jupyter terminal.

In a terminal session, just type

 kinit

You'll be prompted for your Kerberos password.

Querying data

The Data Engineering and Product Analytics teams maintain software packages that make accessing data from the analytics clients as easy as possible by handling the setup and configuration for you.

In Python

For Python, there is Wmfdata-Python. It can access data through MariaDB, Hive, Presto, and Spark and has a number of other useful functions, like creating Spark sessions. For details, see the repository and particularly the quickstart notebook.
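As a rough sketch of what this looks like in a notebook (the table, columns, and partition below are illustrative assumptions; check the quickstart notebook for the current API):

import wmfdata as wmf

# Run a Hive query and get the result back as a pandas DataFrame.
# The table, partition, and column names here are examples only.
top_projects = wmf.hive.run("""
    SELECT project, SUM(view_count) AS views
    FROM wmf.projectview_hourly
    WHERE year = 2023 AND month = 1 AND day = 1
    GROUP BY project
    ORDER BY views DESC
    LIMIT 10
""")

# For larger jobs, wmfdata can also create a Spark session for you
# (see the quickstart for the available options).
spark = wmf.spark.create_session()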

In R

For R, there is wmfdata-r. It can access data from MariaDB and Hive and has many other useful functions, particularly for graphing and statistics.

Scala-Spark or Spark-SQL using Toree (not recommended)

To use either Scala-Spark or Spark-SQL notebooks you need to have Apache Toree available in your Conda environment. Note that Toree is a relatively inactive project, so this is not recommended.

An easy way to do this is to install it from the notebook terminal: in your notebook interface, click New -> Terminal, and in the terminal run pip install toree. And that's it.

Now you can create a Jupyter kernel that uses Toree as a gateway between the notebook and a Spark session running on the cluster (note: Toree manages the Spark session, so there is no need to create it manually).

To create both Scala-Spark and Spark-SQL kernels, run the following in your notebook terminal:

NOTE: Please change the kernel name and the Spark options as you see fit - you can find the default wmfdata Spark parameters on this GitHub page.

jupyter toree install \
    --user \
    --spark_home="/usr/lib/spark2/" \
    --interpreters=Scala,SQL \
    --kernel_name="Scala Spark" \
    --spark_opts="--master yarn --driver-memory 2G --executor-memory 8G --executor-cores 4 --conf spark.dynamicAllocation.maxExecutors=64 --conf spark.sql.shuffle.partitions=256"

Conda environments

Main article: Analytics/Systems/Conda

After logging into JupyterHub, your Jupyter Notebook Server is launched from your chosen Conda environment. This means that the packages in anaconda-wmf or conda-analytics are available to import in your Python notebooks. If you need different or newer versions of packages, you can install them into your active Conda environment with conda (preferred) or pip, and they will be imported from there.
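For example, from a notebook cell you can install into the active environment with a shell escape or the pip magic (the package name here is just a placeholder):

# In a notebook cell: install with conda (preferred) via a shell escape,
# or with the %pip magic; "some-package" is a placeholder.
!conda install --yes some-package
%pip install some-package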

You can create as many Conda environments as you need, but you can only run one Jupyter Notebook Server at a time, so you can only use one Conda environment in Jupyter at a time. To switch to a different Conda environment, stop your Jupyter Notebook Server from the JupyterHub Control Panel, then start a new server and select the environment you want it to use.

These Conda environments may also be used outside of Jupyter on the CLI.

Migrating from anaconda-wmf to conda-analytics

In tandem with the migration from Spark 2 to Spark 3, we will be moving from the current anaconda-wmf Conda base environment to a new conda-analytics base environment. Once conda-analytics is released (date not yet determined), any new environment created from the Jupyter start-up menu will be based on conda-analytics. Existing anaconda-wmf environments will continue to work until Spark 2 support is removed on 31 March 2023.

For more information on the differences, visit Analytics/Systems/Conda.

Choosing between Spark 3 and Spark 2

Spark 2 has been deprecated and all Spark jobs must be migrated to Spark 3. See Analytics/Systems/Cluster/Spark/Migration to Spark 3 for more info. In the interim, you can run Spark 2 jobs with the overrides discussed below.

New Spark 3 conda-analytics environment

As part of the Spark 3 rollout, we have also updated the base Conda environment. This new environment supports Spark 3 and is based on Miniconda, a minimal version of Anaconda. We are calling this environment conda-analytics, and you can see what packages are included by default here.
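As a sketch, starting a Spark 3 session by hand from a notebook in a conda-analytics environment looks roughly like this (the app name and resource settings are illustrative assumptions; wmfdata's Spark helpers are the usual shortcut):

from pyspark.sql import SparkSession

# Start a Spark 3 session on YARN; all settings here are illustrative only.
spark = (
    SparkSession.builder
    .master("yarn")
    .appName("spark3-example")
    .config("spark.driver.memory", "2g")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .config("spark.dynamicAllocation.maxExecutors", "64")
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()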

Using Spark 2 in the interim

To use Spark 2, you can still use previously created environments based on anaconda-wmf. You can also create new anaconda-wmf environments by hand.

On a stat machine terminal run:

# clean the package cache, since some cached packages (e.g. Python 3.10) may clash with anaconda-wmf
rm -rf ~/.conda/pkgs/*
# create a new stacked environment
/usr/lib/anaconda-wmf/bin/conda-create-stacked name_of_your_env_here

Then on JupyterHub:

File -> Hub Control Panel -> Stop My Server -> Start My Server

At the last JupyterHub prompt you should be able to choose your newly created environment.

Troubleshooting

pip fails to install a newer version of a package

If you use pip to install a package into your Conda environment and that package already exists in the base anaconda-wmf environment, you might get an error like:

  Attempting uninstall: wmfdata
    Found existing installation: wmfdata 1.0.4
    Uninstalling wmfdata-1.0.4:
ERROR: Could not install packages due to an EnvironmentError: [Errno 30] Read-only file system: 'WHEEL'

To work around this, pass --ignore-installed when running pip install, like:

pip install --ignore-installed --upgrade git+https://github.com/wikimedia/wmfdata-python.git@release

See also: Analytics/Systems/Anaconda#Installing_packages_into_your_user_conda_environment

Trouble installing R packages

See Analytics/Systems/Anaconda#R_support

Browser disconnects

If your browser session disconnects from the kernel on the server (for example, if your SSH connection times out), any work the kernel is doing will continue, and you'll be able to access the results the next time you connect to the kernel. However, no further display output from that work (such as print() calls logging progress) will accumulate, even if you reopen the notebook (JupyterLab issue 4237).
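One possible workaround (a sketch, not an official recommendation) is to write progress to a file in your home directory instead of relying on notebook output, for example with Python's standard logging module:

import logging
import time

# Write progress to a file so it survives browser disconnects.
logging.basicConfig(
    filename="long_job_progress.log",
    level=logging.INFO,
    format="%(asctime)s %(message)s",
)

for step in range(10):
    time.sleep(1)  # stand-in for a long-running unit of work
    logging.info("finished step %d", step)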

My Python notebook will not start

Your IPython configuration may be broken. Try deleting your ~/.ipython directory (you'll lose any configurations you've made or extensions you've installed, but it won't affect your notebooks, files, or Python packages).

My kernel restarts when I run a large query

It may be that your Jupyter Notebook Server ran out of memory and the operating system's out-of-memory killer killed your kernel to cope with the situation. You won't get any notification that this has happened other than the notebook restarting. You can assess the state of the memory on the notebook server by checking its host overview dashboard in Grafana (host-overview dashboard) or by using the command line to see which processes are using the most memory (with ps aux --sort -rss | head or similar).

Viewing Jupyter Notebook Server logs

JupyterHub logs are viewable by normal users in Kibana.

A dashboard has been created named JupyterHub and this is also linked from the Home Dashboard.

At present the logs are not split per user, but we are working to make this possible.

They are no longer written by default to /var/log/syslog but they are retained on the host in the systemd journal.

You might need to see JupyterHub logs to troubleshoot login issues or resource issues affecting the cluster.

An individual user's notebook server log can be examined with the following command:

sudo journalctl -f -u jupyter-$USERNAME-singleuser.service

Viewing JupyterHub logs

TODO: Make this work for regular users!

You might need to see JupyterHub logs to troubleshoot login issues:

sudo journalctl -f -u jupyterhub

Tips

Analytics/Systems/Jupyter/Tips

Administration

Analytics/Systems/Jupyter/Administration