Data Platform/Systems/Conda
We use Conda to manage packages and virtual environments on the stat hosts.
Environments are created by cloning Conda-Analytics, a custom Conda distribution maintained by the Data Platform Engineering team.
Use with Jupyter
For instructions on using Conda within our Jupyter environment, see Data Platform/Systems/Jupyter#Conda environments.
Use outside Jupyter
This section applies to Conda use outside of Jupyter (that is, when you connect to one of the analytics clients with a plain SSH terminal session).
In most cases, you can use the standard Conda commands (e.g. conda install, conda remove, conda list, conda deactivate). This section covers the exceptions where we have custom commands to support our cloning-based workflow.
Creating a new environment
In the terminal, run conda-analytics-clone and a new clone of conda-analytics will be created for you in ~/.conda/envs.
It will be automatically named with the time and your username. If you prefer, you can give it a custom name: conda-analytics-clone my-cool-env.
Listing environments
$ conda-analytics-list
# conda environments:
#
2022-11-04T19.32.00_xcollazo /home/xcollazo/.conda/envs/2022-11-04T19.32.00_xcollazo
2022-11-08T15.39.32_xcollazo /home/xcollazo/.conda/envs/2022-11-08T15.39.32_xcollazo
2022-11-09T20.10.01_xcollazo /home/xcollazo/.conda/envs/2022-11-09T20.10.01_xcollazo
base * /opt/conda-analytics
Activating an environment
Run source conda-analytics-activate my-cool-env.
You can achieve the same thing with vanilla commands:
$ source /opt/conda-analytics/etc/profile.d/conda.sh
$ conda activate my-cool-env
To activate the read-only base environment instead, run source conda-analytics-activate base.
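Because automatically generated environment names begin with a timestamp, they sort chronologically, which makes it possible to pick the newest clone out of the listing. The following is a minimal sketch, not part of conda-analytics: latest_clone is a hypothetical helper, and it assumes the listing format shown under "Listing environments" (one environment per line, with the name in the first column).

```shell
# Hypothetical helper (not shipped with conda-analytics).
# Reads `conda-analytics-list` output on stdin and prints the name of the
# newest timestamp-named clone. Assumes names like 2022-11-04T19.32.00_user,
# which sort chronologically when sorted as plain text.
latest_clone() {
    # 1. keep only rows whose name starts with a date stamp
    # 2. take the first column (the environment name)
    # 3. sort as text, so the newest name ends up last
    grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}T' |
        awk '{print $1}' |
        sort |
        tail -n 1
}
```

The result could then be passed to the activation script, for example: source conda-analytics-activate "$(conda-analytics-list | latest_clone)". Environments with custom names are ignored by this sketch.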
Installing packages
With a Conda environment activated, you can install packages by running conda install {{package}} in the terminal. If you are using Conda outside of Jupyter, you will first have to set your environment to use the HTTP proxy.
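As a rough illustration of the proxy step, the sketch below exports the usual proxy environment variables for the current shell session, which tools such as conda and pip generally honor. use_http_proxy is a hypothetical helper, and the URL in the usage comment is a placeholder, not the real proxy address; take the actual value from the HTTP proxy documentation.

```shell
# Hypothetical helper: point this shell session at an HTTP proxy.
# The proxy URL is supplied by the caller; this sketch does not know the
# real address (see the HTTP proxy documentation for that).
use_http_proxy() {
    proxy_url="${1:-}"
    if [ -z "$proxy_url" ]; then
        echo "usage: use_http_proxy <proxy-url>" >&2
        return 1
    fi
    # Export both lower- and upper-case forms, because tools differ in
    # which spelling they read.
    export http_proxy="$proxy_url" https_proxy="$proxy_url"
    export HTTP_PROXY="$proxy_url" HTTPS_PROXY="$proxy_url"
}

# Example with a placeholder address:
#   use_http_proxy http://proxy.example:8080
#   conda install {{package}}
```

The variables only last for the current shell session, so the function needs to be called again in each new terminal.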
Conda will install packages from the Conda Forge channel by default. You can manually select a different channel by adding --channel {{channel}} to the command. The easiest way to search Conda Forge for a specific package is to do a regular web search with the qualifier "site:anaconda.org/conda-forge/".
If a Python package you need is not available from Conda Forge, you can use Pip instead.
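The conda-first, pip-second approach can be sketched as a small shell function. install_pkg is a hypothetical helper, not something provided on the stat hosts. It treats any failure of conda install as "not available", which is a simplification: the command can also fail for other reasons, such as a missing proxy configuration.

```shell
# Hypothetical helper: try conda first, fall back to pip.
# Assumes a Conda environment is already activated, so that both tools
# install into that environment.
install_pkg() {
    pkg="${1:-}"
    if [ -z "$pkg" ]; then
        echo "usage: install_pkg <package>" >&2
        return 1
    fi
    # --yes skips conda's interactive confirmation prompt.
    if conda install --yes "$pkg"; then
        echo "installed ${pkg} with conda" >&2
    else
        echo "conda install failed for ${pkg}; trying pip" >&2
        pip install "$pkg"
    fi
}
```

Status messages go to standard error so that the output of conda and pip themselves stays easy to capture or log.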
Pinned package management
Each cloned environment comes with a pinned file whose main purpose is to prevent core packages from being automatically upgraded.
The pinned file is located in the conda-meta directory of each environment and can be customised to your liking.
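As an illustration of customising the pinned file, the sketch below lists and appends pins for the active environment. pinned_file, show_pins and add_pin are hypothetical helpers, and the package spec in the comment is made up for the example. The sketch relies on CONDA_PREFIX, which conda sets to the active environment's path, and on the pinned file holding one package spec per line.

```shell
# Hypothetical helpers for the active environment's pinned file.
# CONDA_PREFIX is set by conda when an environment is activated.
pinned_file() {
    if [ -z "${CONDA_PREFIX:-}" ]; then
        echo "no active Conda environment" >&2
        return 1
    fi
    printf '%s\n' "${CONDA_PREFIX}/conda-meta/pinned"
}

# Print the current pins, one package spec per line.
show_pins() {
    file="$(pinned_file)" || return 1
    cat "$file"
}

# Append a pin, e.g. add_pin 'numpy 1.23.*' (illustrative spec only).
add_pin() {
    file="$(pinned_file)" || return 1
    printf '%s\n' "$1" >> "$file"
}
```

Removing or loosening an existing pin would mean editing the file by hand; this sketch only appends to it.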
Troubleshooting
Spark 3 insert statement requirements
Using an INSERT statement in Spark 3 SQL or write.insertInto() in PySpark 3 results in the environment's Python executable being called. If the code is run from a cron job that loads a custom Python environment, this might result in errors being thrown because that executable isn't available on the cluster. One way to solve this is to use wmfdata.spark.create_session(ship_python_env = True) to create a custom Spark session that ships the Python environment to the cluster nodes.
R support
R is not included by default, but can easily be installed. See Data Platform/Systems/R for details.
Administration
The Conda-Analytics base environment is based on Miniconda and adds extra packages specific to our needs, as well as scripts for cloning the environment. On the stat hosts, it is available in /opt/conda-analytics.
The code used to build new releases of conda-analytics lives in gitlab:repos/data-engineering/conda-analytics/. The actual releases live in the associated package registry.