Data Engineering/Systems/conda-analytics

conda-analytics is a prepackaged conda distribution intended mostly for Python-based analytics and research. WMF maintains a custom Debian package of Miniconda that includes some extra packages, along with scripts for cloning the base conda environment. The resulting conda user environments let users install packages into their own environment without modifying the base environment.

Note that conda-analytics supersedes anaconda-wmf, the former conda distribution maintained by WMF.

Usage

Listing conda environments

conda-analytics-list
# conda environments:
#
2022-11-04T19.32.00_xcollazo     /home/xcollazo/.conda/envs/2022-11-04T19.32.00_xcollazo
2022-11-08T15.39.32_xcollazo     /home/xcollazo/.conda/envs/2022-11-08T15.39.32_xcollazo
2022-11-09T20.10.01_xcollazo     /home/xcollazo/.conda/envs/2022-11-09T20.10.01_xcollazo
base                  *  /opt/conda-analytics
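
Because the listing is plain text, environment names can be extracted with standard tools. The following is a minimal sketch, assuming the output keeps the layout shown above (comment lines starting with #, then one environment per line with its name in the first column). env_names is a hypothetical helper and is not part of conda-analytics.

```shell
# Hypothetical helper (not shipped with conda-analytics): print only the
# environment names from a conda-style listing read on standard input.
env_names() {
  # Skip comment lines and blank lines; the name is the first field.
  awk '!/^#/ && NF { print $1 }'
}

# Only call the real command where it is installed.
if command -v conda-analytics-list >/dev/null 2>&1; then
  conda-analytics-list | env_names
fi
```

The helper reads from standard input, so it can also be tried on saved output of the listing command.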

Base environment

To use the read-only base environment, you can run python or other executables directly out of /opt/conda-analytics/bin. If you prefer to activate the base environment, run source conda-analytics-activate base.

Creating a new conda user environment

Run

conda-analytics-clone

and a new conda environment will be cloned for you in ~/.conda/envs.

If you prefer, you can give your conda environment a name:

conda-analytics-clone my-cool-env

Activating a conda user environment

Run

 source conda-analytics-activate my-cool-env

Or, you can activate it by running vanilla conda commands:

source /opt/conda-analytics/etc/profile.d/conda.sh
conda activate my-cool-env
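
To confirm which environment is active, you can inspect the variables that conda activate exports. This is a minimal sketch, assuming conda-analytics-activate ends up activating the environment through conda itself so that CONDA_DEFAULT_ENV and CONDA_PREFIX are set. active_env is a hypothetical helper name.

```shell
# Hypothetical helper: report the active conda environment, if any.
# CONDA_DEFAULT_ENV and CONDA_PREFIX are exported by `conda activate`.
active_env() {
  if [ -n "${CONDA_DEFAULT_ENV:-}" ]; then
    printf '%s (%s)\n' "$CONDA_DEFAULT_ENV" "${CONDA_PREFIX:-unknown prefix}"
  else
    printf 'no conda environment active\n'
  fi
}

active_env
```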

Listing installed packages

To see the packages installed in an environment, first activate it as shown above, then run conda list.

Installing packages into your user conda environment

After activating your user conda environment, set the HTTP proxy environment variables, then install conda and pip packages. For example:

export http_proxy=http://webproxy.eqiad.wmnet:8080
export https_proxy=http://webproxy.eqiad.wmnet:8080
conda install -c conda-forge <desired_conda_package>
pip install <desired_pip_package>

Prefer conda over pip whenever the package you need is available via conda, because conda tracks packages and their install locations better than pip does.

These packages will be installed into the currently activated Conda user environment.
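
If you would rather not leave the proxy variables exported in your shell, one option is to set them only for the duration of a single command. The sketch below uses the proxy URL from the example above. with_wmf_proxy is a hypothetical helper name, not something provided by conda-analytics.

```shell
# Hypothetical helper: run one command with the proxy variables set.
# The function body is a subshell, so the exports do not leak into the
# calling shell.
with_wmf_proxy() (
  export http_proxy=http://webproxy.eqiad.wmnet:8080
  export https_proxy=http://webproxy.eqiad.wmnet:8080
  "$@"
)

# Example usage (package names are placeholders):
#   with_wmf_proxy conda install -c conda-forge some_conda_package
#   with_wmf_proxy pip install some_pip_package
```

Run it from a shell where your user environment is already activated, so that conda and pip resolve to that environment.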

Deactivating your user conda environment

 source conda-analytics-deactivate

Or, since the user conda environment's bin directory has been added to your PATH, you should also be able to just run

 conda deactivate

Troubleshooting

TODO.

R support

WMF's conda-analytics environment support was built with Python in mind. Other languages are passively supported.

R is not included in conda-analytics. To install R into your user environment, do the following:

# Make sure you are using a conda env. This is not necessary if running in Jupyter.
source conda-analytics-activate my-cool-r-env
# Enable the HTTP proxy. This is not necessary if running in Jupyter.
export http_proxy=http://webproxy.eqiad.wmnet:8080
export https_proxy=http://webproxy.eqiad.wmnet:8080
export no_proxy=127.0.0.1,localhost,.wmnet

# Install the conda R package into your user conda environment.
conda install -c conda-forge R

# R is now fully contained in your user conda environment.
which R
/home/xcollazo/.conda/envs/my-cool-r-env/bin/R

You should now be able to install R packages using R's package manager via install.packages().

However, just like with Python, installing R packages with conda is preferred over using R's package manager. If a conda R package exists, you should be able to just install it like:

conda install -c conda-forge r-tidyverse
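
R packages on conda-forge are conventionally named r- followed by the lowercased CRAN name, as with r-tidyverse above. The sketch below derives a candidate name under that assumption. cran_to_conda is a hypothetical helper, and a derived name is only a guess until conda search confirms that the package exists.

```shell
# Hypothetical helper: derive the conventional conda-forge name for a
# CRAN package ("r-" prefix plus the lowercased package name).
cran_to_conda() {
  printf 'r-%s\n' "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
}

cran_to_conda tidyverse   # prints r-tidyverse
cran_to_conda lme4        # prints r-lme4

# To check that the derived name really exists before installing
# (requires conda and the proxy settings):
#   conda search -c conda-forge "$(cran_to_conda lme4)"
```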

Jupyter

Install the R kernel the same way you installed R (see instructions above):

conda install -c conda-forge r-irkernel

Open a Terminal in JupyterLab (if you haven't yet) and run the following in R (simply run R to launch it):

IRkernel::installspec()

NOTE: It will not work if you are running the code in an SSH session. You need to be inside JupyterLab for the command to find the Jupyter configuration.
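
To check whether the kernel was registered, you can look for it in the output of jupyter kernelspec list from the same JupyterLab terminal. This is a minimal sketch, assuming the kernel was installed under IRkernel's default name, ir. has_ir_kernel is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if a kernelspec listing read on standard
# input contains a kernel named "ir" (assumed IRkernel default name).
has_ir_kernel() {
  awk '$1 == "ir" { found = 1 } END { exit !found }'
}

# Only query Jupyter where the jupyter command is available.
if command -v jupyter >/dev/null 2>&1; then
  if jupyter kernelspec list | has_ir_kernel; then
    echo "R kernel is registered"
  else
    echo "R kernel not found; run IRkernel::installspec() from a JupyterLab terminal"
  fi
fi
```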

Tips

Since the version of R from conda-forge is 4.2 (or newer), we now have access to newer features such as a new syntax for specifying strings (raw strings) and a built-in pipe operator (|>), replacing the need for magrittr's %>%.

NOTE: Stacked environments based on anaconda-wmf are stuck with R 3.6 as the latest available version.

It is also recommended to create a ~/.Rprofile file with the following:

options(
  repos = c(
    CRAN = "https://cran.rstudio.com/",
    STAN = "https://mc-stan.org/r-packages/"
  ),
  mc.cores = 4
)
Sys.setenv(MAKEFLAGS = "-j4")
Sys.setenv(DOWNLOAD_STATIC_LIBV8 = 1)

wmfdata for R

To query Hive and MariaDB wiki replicas in R you will need to install wmfdata for R. This package is no longer actively maintained, but it's there if you really need it. However, it is strongly recommended to access data via wmfdata for Python (included in conda-analytics without any additional installation steps needed).

First, install the remotes package the same way you installed R (see instructions above):

conda install -c conda-forge r-remotes

Open a Terminal in JupyterLab or an SSH session (if you haven't yet) and run the following in R:

remotes::install_github("wikimedia/wmfdata-r")

brms and lme4

If you would like to use brms and/or lme4 for statistical modeling, install the packages the same way you installed R (see instructions above):

conda install -c conda-forge r-brms r-lme4

Open a Terminal in JupyterLab or an SSH session (if you haven't yet) and run the following in R:

install.packages("BH")

For reasons that are not yet understood, BH (a dependency of brms) needs to be installed this way even when brms itself is installed via conda.

Administration

The code used to build new releases of conda-analytics lives in GitLab.