

Data Platform/Systems/Conda

From Wikitech

We use Conda to manage packages and virtual environments on the stat hosts.

Environments are created by cloning Conda-Analytics, a custom Conda distribution maintained by the Data Platform Engineering team.

Use with Jupyter

For instructions on using Conda within our Jupyter environment, see Data Platform/Systems/Jupyter#Conda environments.

Use outside Jupyter

This section applies to Conda use outside of Jupyter (that is, when you connect to one of the analytics clients with a plain SSH terminal session).

In most cases, you can use the standard Conda commands (e.g. conda install, conda remove, conda list, conda deactivate). This section covers the exceptions where we have custom commands to support our cloning-based workflow.

Creating a new environment

In the terminal, run conda-analytics-clone. This creates a new clone of conda-analytics for you in ~/.conda/envs.

By default, the clone is named with the creation time and your username. If you prefer, you can give it a custom name: conda-analytics-clone my-cool-env.
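The creation step above can be sketched as a small script. This is only an illustration: the name my-cool-env is the example name from the text, the existence check is an added precaution rather than documented behaviour, and the script does nothing useful on machines where conda-analytics-clone is not installed.

```shell
# Sketch only: conda-analytics-clone is provided on the stat hosts, so this
# script checks for it first and only prints a message elsewhere.
env_name="my-cool-env"                  # custom name, as in the example above
env_dir="$HOME/.conda/envs/$env_name"   # clone location stated in the text

if command -v conda-analytics-clone >/dev/null 2>&1; then
    if [ -d "$env_dir" ]; then
        # Added precaution (an assumption): do not clone over an existing directory.
        echo "Environment already exists: $env_dir"
    else
        conda-analytics-clone "$env_name"
    fi
else
    echo "conda-analytics-clone not found; run this on a stat host"
fi
```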

Listing environments

$ conda-analytics-list
# conda environments:
#
2022-11-04T19.32.00_xcollazo     /home/xcollazo/.conda/envs/2022-11-04T19.32.00_xcollazo
2022-11-08T15.39.32_xcollazo     /home/xcollazo/.conda/envs/2022-11-08T15.39.32_xcollazo
2022-11-09T20.10.01_xcollazo     /home/xcollazo/.conda/envs/2022-11-09T20.10.01_xcollazo
base                  *  /opt/conda-analytics

Activating an environment

Run source conda-analytics-activate my-cool-env.

You can achieve the same thing with vanilla commands:

$ source /opt/conda-analytics/etc/profile.d/conda.sh
$ conda activate my-cool-env

You can also activate the read-only base environment: run source conda-analytics-activate base.

Installing packages

With a Conda environment activated, you can install packages by running conda install {{package}} in the terminal. If you are using Conda outside of Jupyter, you will first have to set your environment to use the HTTP proxy.

Conda will install packages from the Conda Forge channel by default. You can manually select a different channel by adding --channel {{channel}} to the command. The easiest way to search Conda Forge for a specific package is to do a regular web search with the qualifier "site:anaconda.org/conda-forge/".

If a Python package you need is not available from Conda Forge, you can use Pip instead.
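As a rough sketch of the install flow described above, the script below tries Conda with an explicit channel and falls back to Pip if that fails. The package name, the RUN_INSTALL switch, and the install_package helper are hypothetical and exist only for this illustration. By default the script just prints the commands it would run. It also assumes an environment is already activated and that the HTTP proxy has been configured where required.

```shell
# Hypothetical values for illustration; substitute your own.
package="pyarrow"
channel="conda-forge"

# Hypothetical helper: try Conda first, then fall back to Pip.
install_package() {
    conda install --channel "$channel" "$package" || pip install "$package"
}

# Default to a dry run that only prints the commands. Set RUN_INSTALL=1
# inside an activated environment to actually perform the installation.
if [ "${RUN_INSTALL:-0}" = "1" ]; then
    install_package
else
    echo "conda install --channel $channel $package"
    echo "pip install $package"
fi
```

Keeping the dry run as the default makes it easy to review the exact commands before changing an environment.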

Pinned package management

Each cloned environment comes with a pinned file whose main purpose is to prevent core packages from being automatically upgraded.

The pinned file is located in the conda-meta directory of each environment and can be customised to your liking.
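A minimal sketch of customising the pinned file follows. It assumes the file is named pinned, as in standard Conda, and that each line is a Conda match spec such as numpy 1.26.* (an arbitrary example version). The pin_package helper is hypothetical and not part of conda-analytics. The demonstration writes to a throwaway directory, so no real environment is modified.

```shell
# Hypothetical helper: append a version pin to <prefix>/conda-meta/pinned.
pin_package() {
    prefix="$1"   # environment directory, e.g. "$CONDA_PREFIX"
    spec="$2"     # Conda match spec, e.g. "numpy 1.26.*"
    mkdir -p "$prefix/conda-meta"
    # Append the spec only if this exact line is not already present.
    grep -qxF -- "$spec" "$prefix/conda-meta/pinned" 2>/dev/null ||
        printf '%s\n' "$spec" >> "$prefix/conda-meta/pinned"
}

# Demonstrate on a temporary directory standing in for an environment.
demo_prefix="$(mktemp -d)"
pin_package "$demo_prefix" "numpy 1.26.*"
pin_package "$demo_prefix" "numpy 1.26.*"   # second call adds nothing
cat "$demo_prefix/conda-meta/pinned"
```

In an activated environment, the equivalent call would be pin_package "$CONDA_PREFIX" "numpy 1.26.*".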

Troubleshooting

Spark 3 insert statement requirements

Using an INSERT statement in Spark 3 SQL or write.insertInto() in PySpark 3 results in the environment's Python executable being called. If the code is run from a cron job that loads a custom Python environment, this can cause errors because that executable isn't available on the cluster. One way to solve this is to use wmfdata.spark.create_session(ship_python_env=True) to create a custom Spark session that ships the Python environment to the cluster nodes.
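The workaround above might look like the following cron wrapper. This is a sketch under several assumptions: the environment name and table names are placeholders, run_spark_insert is a hypothetical function name, wmfdata is assumed to be importable in the activated environment, and create_session is assumed to return a regular Spark session. Only the ship_python_env=True argument comes from the text.

```shell
#!/bin/bash
# Hypothetical cron wrapper; names below are placeholders for illustration.
run_spark_insert() {
    # Activate the cloned environment using the vanilla Conda commands.
    . /opt/conda-analytics/etc/profile.d/conda.sh
    conda activate my-cool-env

    # Assumption: wmfdata is installed and create_session returns a
    # Spark session. The table names are placeholders.
    python - <<'EOF'
import wmfdata

# Ship the Python environment so cluster nodes can find the executable.
spark = wmfdata.spark.create_session(ship_python_env=True)
spark.sql("INSERT INTO my_db.my_table SELECT * FROM my_db.my_staging_table")
spark.stop()
EOF
}

# Run the job only when the script is invoked with the "run" argument.
if [ "${1:-}" = "run" ]; then
    run_spark_insert
fi
```

A crontab entry would then call the script with the run argument.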

R support

R is not included by default, but can easily be installed. See Data Platform/Systems/R for details.

Administration

The Conda-Analytics base environment is based on Miniconda and has extra packages specific to our needs as well as scripts for cloning the environment. On the stat hosts, it is available in /opt/conda-analytics.

The code used to build new releases of conda-analytics lives in gitlab:repos/data-engineering/conda-analytics/. The actual releases live in the associated package registry.