Data Engineering/Systems/conda-analytics
conda-analytics is a prepackaged conda distribution intended mostly for Python-based analytics and research. WMF maintains a custom Debian package of Miniconda that includes some extra packages as well as scripts for cloning the base conda environment. The cloned conda user environments let users install packages into their own environment without modifying the base environment.
Note that conda-analytics supersedes anaconda-wmf, the former conda distribution maintained by WMF.
Usage
Listing conda environments
conda-analytics-list
# conda environments:
#
2022-11-04T19.32.00_xcollazo /home/xcollazo/.conda/envs/2022-11-04T19.32.00_xcollazo
2022-11-08T15.39.32_xcollazo /home/xcollazo/.conda/envs/2022-11-08T15.39.32_xcollazo
2022-11-09T20.10.01_xcollazo /home/xcollazo/.conda/envs/2022-11-09T20.10.01_xcollazo
base * /opt/conda-analytics
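When scripting against this listing, it can help to reduce it to environment names only. The following is a minimal sketch, not part of conda-analytics: list_env_names is a hypothetical helper that assumes the output format shown above (comment lines starting with #, then one environment per line with the name in the first column).

```shell
# Hypothetical helper (not shipped with conda-analytics).
# Reads a listing in the format shown above on stdin and prints
# only the first column (the environment name) of each entry,
# skipping comment lines and blank lines.
list_env_names() {
    awk '!/^#/ && NF > 0 { print $1 }'
}

# Intended usage on a stat host (assumption):
#   conda-analytics-list | list_env_names
```

Note that an environment listed without a name would show its path in the first column instead.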
Base environment
To use the read-only base environment, you can simply run python or other executables directly out of /opt/conda-analytics/bin. If you prefer to activate the base environment, run source conda-analytics-activate base.
Creating a new conda user environment
Run
conda-analytics-clone
and a new conda environment will be cloned for you in ~/.conda/envs.
If you prefer, you can name your conda environment:
conda-analytics-clone my-cool-env
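Before choosing a name, you may want to check that it is not already taken. This is a minimal sketch, assuming user environments live in ~/.conda/envs as described above; user_env_exists is a hypothetical helper, not a conda-analytics command.

```shell
# Hypothetical helper (not shipped with conda-analytics).
# Succeeds if a user environment directory with the given name
# already exists under ~/.conda/envs.
user_env_exists() {
    [ -d "${HOME}/.conda/envs/$1" ]
}

# Intended usage (assumption):
#   user_env_exists my-cool-env || conda-analytics-clone my-cool-env
```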
Activating a conda user environment
Run
source conda-analytics-activate my-cool-env
Or, you can activate it by running vanilla conda commands:
source /opt/conda-analytics/etc/profile.d/conda.sh
conda activate my-cool-env
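To confirm which environment is active afterwards, you can inspect the CONDA_DEFAULT_ENV variable that conda activate exports. The sketch below assumes conda-analytics-activate behaves like a standard conda activation in this respect; active_conda_env is a hypothetical helper name.

```shell
# Hypothetical helper (not shipped with conda-analytics).
# Prints the name of the active conda environment, based on the
# CONDA_DEFAULT_ENV variable exported by `conda activate`.
active_conda_env() {
    if [ -n "${CONDA_DEFAULT_ENV:-}" ]; then
        printf '%s\n' "$CONDA_DEFAULT_ENV"
    else
        printf '%s\n' "no conda environment active"
    fi
}
```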
Listing installed packages
To see the packages installed in your current environment, first activate the target environment as shown above. Then you can run conda list.
Installing packages into your user conda environment
After activating your user conda environment, you can set http proxy env vars and install conda and pip packages. E.g.
export http_proxy=http://webproxy.eqiad.wmnet:8080
export https_proxy=http://webproxy.eqiad.wmnet:8080
conda install -c conda-forge <desired_conda_package>
pip install <desired_pip_package>
If the package you need is available via conda, conda is much preferred over pip, because conda tracks packages and their install locations better than pip does.
These packages will be installed into the currently activated Conda user environment.
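If you install packages often, the proxy settings can be wrapped in a pair of shell functions, for example in your ~/.bashrc. This is only a sketch: the function names enable_web_proxy and disable_web_proxy are hypothetical, and the proxy URL is the one given above.

```shell
# Hypothetical helpers (not shipped with conda-analytics).
# Export the HTTP proxy variables used above for conda and pip.
enable_web_proxy() {
    export http_proxy=http://webproxy.eqiad.wmnet:8080
    export https_proxy=http://webproxy.eqiad.wmnet:8080
}

# Remove the proxy variables again when they are no longer needed.
disable_web_proxy() {
    unset http_proxy https_proxy
}

# Intended usage (assumption), after activating your environment:
#   enable_web_proxy
#   conda install -c conda-forge <desired_conda_package>
#   disable_web_proxy
```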
Deactivating your user conda environment
source conda-analytics-deactivate
Or, since the user conda environment's bin directory has been added to your PATH, you should also be able to just run
conda deactivate
Troubleshooting
TODO.
R support
WMF's conda-analytics environment support was built with Python in mind. Other languages are passively supported.
R is not included in conda-analytics. To install R into your user environment, do the following:
# Make sure you are using a conda env. This is not necessary if running in Jupyter.
source conda-analytics-activate my-cool-r-env
# Enable http proxy. This is not necessary if running in Jupyter
export http_proxy=http://webproxy.eqiad.wmnet:8080
export https_proxy=http://webproxy.eqiad.wmnet:8080
export no_proxy=127.0.0.1,localhost,.wmnet
# Install the conda R package into your user conda environment.
conda install -c conda-forge R
# R is now fully contained in your user conda environment.
which R
/home/xcollazo/.conda/envs/my-cool-r-env/bin/R
You should now be able to install R packages using R's package manager via install.packages().
However, just like with Python, installing R packages with conda is preferred over using R's package manager. If a conda R package exists, you should be able to just install it like:
conda install -c conda-forge r-tidyverse
Jupyter
Install the R kernel the same way you installed R (see instructions above):
conda install -c conda-forge r-irkernel
Open a Terminal in JupyterLab (if you haven't yet) and run the following in R (simply run R to launch it):
IRkernel::installspec()
NOTE: It will not work if you are running the code in an SSH session. You need to be inside JupyterLab for the command to find the Jupyter configuration.
Tips
Since the version of R coming from conda-forge is 4.2 (or newer), we now have access to newer features such as a new syntax for specifying strings and a built-in pipe operator (|>), which replaces the need for magrittr's %>%.
NOTE: Stacked environments based on anaconda-wmf are stuck with R 3.6 as the latest available version.
It is also recommended to create a ~/.Rprofile file with the following:
options(
  repos = c(
    CRAN = "https://cran.rstudio.com/",
    STAN = "https://mc-stan.org/r-packages/"
  ),
  mc.cores = 4
)
Sys.setenv(MAKEFLAGS = "-j4")
Sys.setenv(DOWNLOAD_STATIC_LIBV8 = 1)
wmfdata for R
To query Hive and MariaDB wiki replicas in R you will need to install wmfdata for R. This package is no longer actively maintained, but it's there if you really need it. However, it is strongly recommended to access data via wmfdata for Python (included in conda-analytics without any additional installation steps needed).
First, install the remotes package the same way you installed R (see instructions above):
conda install -c conda-forge r-remotes
Open a Terminal in JupyterLab or an SSH session (if you haven't yet) and run the following in R:
remotes::install_github("wikimedia/wmfdata-r")
brms and lme4
If you would like to use brms and/or lme4 for statistical modeling, install the packages the same way you installed R (see instructions above):
conda install -c conda-forge r-brms r-lme4
Open a Terminal in JupyterLab or an SSH session (if you haven't yet) and run the following in R:
install.packages("BH")
For some reason BH (a dependency for brms) needs to be installed that way even when installing brms via conda.
Administration
The code used to build new releases of conda-analytics lives in GitLab.