
Analytics/Systems/Anaconda: Difference between revisions

imported>Bearloga
(→‎Troubleshooting: Streamline TAR environment variable setting)
imported>Neil P. Quinn-WMF
(Neil P. Quinn-WMF moved page Analytics/Systems/Anaconda to Analytics/Systems/Conda: The key thing we use is Conda (the environment and package manager) not Anaconda (the Python distribution))
 
Anaconda is a prepackaged Conda distribution aimed mostly at Python-based analytics and research.  WMF maintains a custom Debian package of Anaconda that includes some extra packages as well as scripts for creating 'stacked' conda user environments.  These conda user environments allow users to install packages into their own conda environment without modifying the base Anaconda environment.
#REDIRECT [[Analytics/Systems/Conda]]
 
= Usage =
== Listing conda environments ==
 
<syntaxhighlight lang="shell">
/usr/lib/anaconda-wmf/bin/conda env list
# conda environments:
#
2020-08-19T16.19.37_otto    /home/otto/.conda/envs/2020-08-19T16.19.37_otto
2020-08-19T16.47.40_otto    /home/otto/.conda/envs/2020-08-19T16.47.40_otto
2020-08-19T16.56.54_otto    /home/otto/.conda/envs/2020-08-19T16.56.54_otto
2020-08-19T16.59.40_otto    /home/otto/.conda/envs/2020-08-19T16.59.40_otto
2020-12-13T19.40.09_otto    /home/otto/.conda/envs/2020-12-13T19.40.09_otto
base                  *  /usr/lib/anaconda-wmf
</syntaxhighlight>
 
== Anaconda base environment ==
To use the read-only Anaconda base environment, you can run Python or other executables directly out of <code>/usr/lib/anaconda-wmf/bin</code>.  If you prefer to activate the Anaconda base environment, run <code>source /usr/lib/anaconda-wmf/bin/activate</code>.
 
== Creating a new conda user environment ==
Run
 
  conda-create-stacked
 
and a new conda environment will be created for you in <code>~/.conda/envs</code>.  When used, this environment will automatically append the base conda environment's Python load paths to its own.  If the same package is installed in both environments, the version in your user conda environment takes precedence.
 
If you prefer, you can give your conda environment a name:
 
  conda-create-stacked my-cool-env
 
== Activating a conda user environment ==
There are several ways to activate a conda user environment.  Running
 
  source conda-activate-stacked
 
on its own will attempt to guess the most recent conda environment and activate it.  If you only have one conda environment, this will work.
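 
The selection logic of <code>conda-activate-stacked</code> is not shown on this page. As a rough sketch of how "most recent" could be guessed, the hypothetical helper below (<code>latest_conda_env</code> is not part of the WMF scripts) relies on the default timestamp-prefixed names sorting chronologically as plain strings. It ignores custom-named environments.

```shell
# Hypothetical helper, not part of anaconda-wmf: print the newest
# timestamp-named environment found under a directory.
latest_conda_env() {
    envs_dir="${1:-$HOME/.conda/envs}"
    # Default names such as 2020-08-19T16.19.37_otto start with a timestamp,
    # so a plain byte-wise sort puts the most recent one last.
    ls -1 "$envs_dir" 2>/dev/null \
        | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}T' \
        | LC_ALL=C sort \
        | tail -n 1
}
```

Calling <code>latest_conda_env</code> with no argument inspects <code>~/.conda/envs</code>; the printed name can then be passed to <code>source conda-activate-stacked</code>.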
 
You can also specify the name of the conda environment to activate. Run <code>/usr/lib/anaconda-wmf/bin/conda info --envs</code> to list the available conda environments. E.g.
 
  source conda-activate-stacked 2020-08-17T20.52.02_otto
 
Or, you can run the 'activate' script out of your conda environment path:
 
  source ~/.conda/envs/2020-08-17T20.52.02_otto/bin/activate
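 
To confirm which environment is active after any of the commands above, you can inspect <code>CONDA_PREFIX</code>, which Conda's own activation sets to the active environment's path. The sketch below assumes the stacked activation script sets it as well. <code>conda_env_status</code> is a hypothetical helper name.

```shell
# Hypothetical helper: report the active conda environment, if any.
# Assumes activation exported CONDA_PREFIX, as stock conda activation does.
conda_env_status() {
    if [ -n "${CONDA_PREFIX:-}" ]; then
        echo "active: $(basename "$CONDA_PREFIX")"
    else
        echo "no conda environment active"
    fi
}
```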
 
== Installing packages into your user conda environment ==
After activating your user conda environment, you can set http proxy env vars and install conda and pip packages. E.g.
  export http_proxy=http://webproxy.eqiad.wmnet:8080
  export https_proxy=http://webproxy.eqiad.wmnet:8080
  conda install -c conda-forge <desired_conda_package>
  pip install --ignore-installed <desired_pip_package>
 
Prefer Conda over pip whenever the package you need is available via Conda, because Conda tracks packages and their install locations better than pip does.
 
Note the <code>--ignore-installed</code> flag for <code>pip install</code>. It is only needed when the pip package you are installing into your Conda environment already exists in the base anaconda-wmf environment.
 
These packages will be installed into the currently activated Conda user environment.
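 
If you set the proxy often, the exports can be wrapped in a small shell function, for example in your <code>~/.bashrc</code>. This is only a convenience sketch: the function name <code>use_wmf_proxy</code> is made up, and the <code>no_proxy</code> value is the one used in the R section of this page.

```shell
# Hypothetical convenience wrapper around the proxy settings shown above.
use_wmf_proxy() {
    export http_proxy=http://webproxy.eqiad.wmnet:8080
    export https_proxy=http://webproxy.eqiad.wmnet:8080
    # Keep local and internal .wmnet traffic off the proxy.
    export no_proxy=127.0.0.1,localhost,.wmnet
}
```

After running <code>use_wmf_proxy</code> in an activated environment, the <code>conda install</code> and <code>pip install</code> commands above can be used as shown.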
 
== Deactivating your user conda environment ==
 
  source conda-deactivate-stacked
 
Or, since the user conda env's bin dir has been added to your path, you should also be able to just run
 
  source deactivate
 
= Stacked conda environments =
Conda supports activating an environment 'stacked' on top of another one.  However, by default all this 'stacking' does is leave the base conda environment's <tt>bin</tt> directory on your <tt>PATH</tt>.  It does not make Python packages from both environments importable.
 
Our customization fixes this.  When <tt>conda-create-stacked</tt> is run, an <tt>anaconda.pth</tt> file is created in the new conda environment's site-packages directory.  This file tells Python to add the anaconda-wmf base environment's Python search paths to its own.  If a package is present in both environments, the stacked conda environment's version will take precedence.
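 
The <tt>.pth</tt> mechanism itself is standard Python: each line of a <tt>.pth</tt> file in a site directory that names an existing directory is appended to <tt>sys.path</tt>. The sketch below demonstrates this with two temporary stand-in directories rather than real conda environments. The function name <code>demo_pth_stacking</code> is hypothetical, the actual contents of the <tt>anaconda.pth</tt> file written by <tt>conda-create-stacked</tt> are not shown on this page, and <code>python3</code> is assumed to be on your <tt>PATH</tt>.

```shell
# Hypothetical demo: show that a .pth entry lands on sys.path *after* the
# directory that contains the .pth file.
demo_pth_stacking() {
    base_dir=$(mktemp -d)   # stands in for the base environment's site-packages
    user_dir=$(mktemp -d)   # stands in for the user environment's site-packages

    # One directory per line; existing directories are appended to sys.path.
    printf '%s\n' "$base_dir" > "$user_dir/anaconda.pth"

    # site.addsitedir() adds a directory and processes its .pth files,
    # as Python does for site-packages at startup.
    python3 - "$user_dir" "$base_dir" <<'EOF'
import os
import site
import sys

user_dir, base_dir = (os.path.abspath(p) for p in sys.argv[1:3])
site.addsitedir(user_dir)
# True means the user directory is searched before the base directory.
print(sys.path.index(user_dir) < sys.path.index(base_dir))
EOF

    rm -rf "$base_dir" "$user_dir"
}
```

Running <code>demo_pth_stacking</code> prints <code>True</code>. Because the base paths are appended after the user environment's own path, imports resolve to the user environment first, which matches the precedence described above.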
 
= R support =
WMF's anaconda environment support was built with Python in mind.  Other languages are passively supported.
 
R is included in the base anaconda-wmf environment, but it is not installed into the user conda environment by default.  Doing so makes the size of user environments much larger, and makes distributing them to HDFS take much longer.
 
To install R packages into your user environment, do the following:
 
<syntaxhighlight lang="bash">
# Make sure you are using a conda env. This is not necessary if running in Jupyter.
source conda-activate-stacked
# Enable http proxy.  This is not necessary if running in Jupyter
export http_proxy=http://webproxy.eqiad.wmnet:8080; export https_proxy=http://webproxy.eqiad.wmnet:8080; export no_proxy=127.0.0.1,localhost,.wmnet
 
# R is currently the base anaconda-wmf R.
which R
# /usr/lib/anaconda-wmf/bin/R
 
# Install the conda R package into your user conda environment.
conda install R
 
# R is now fully contained in your user conda environment.
which R
# /home/otto/.conda/envs/2021-04-07T21.37.00_otto/bin/R
</syntaxhighlight>
 
You should now be able to install R packages using R's package manager via <code>install.packages()</code>.
 
However, just like with Python, installing R packages with conda is preferred over using R's package manager.  If a conda R package exists, you should be able to just install it like:
 
<code>
$ conda install r-tidyverse
</code>
 
It is also recommended to create a '''~/.Rprofile''' file with the following:<syntaxhighlight lang="r">
Sys.setenv("http_proxy" = "http://webproxy.eqiad.wmnet:8080")
Sys.setenv("https_proxy" = "http://webproxy.eqiad.wmnet:8080")
options(
  repos = c(
    CRAN = "https://cran.rstudio.com/",
    STAN = "https://mc-stan.org/r-packages/"
  ),
  mc.cores = 4
)
Sys.setenv(MAKEFLAGS = "-j4")
Sys.setenv(DOWNLOAD_STATIC_LIBV8 = 1)
</syntaxhighlight>
 
=== Troubleshooting ===
If you attempt to install a package from a Git repository, e.g. [https://github.com/wikimedia/wmfdata-r wmfdata] via <code>remotes::install_github("wikimedia/wmfdata-r")</code>, you may get the following error:
 Downloading GitHub repo wikimedia/wmfdata-r@HEAD
 sh: 1: /bin/gtar: not found
 sh: 1: /bin/gtar: not found
 Error: Failed to install 'wmfdata' from GitHub:
   error in running command
 In addition: Warning messages:
 1: In system(cmd) : error in running command
 2: In utils::untar(tarfile, ...)
This appears to be an issue with Conda's R, which looks for <code>tar</code> at <code>/bin/gtar</code>. The workaround is to run <code>Sys.setenv(TAR = system("which tar", intern = TRUE))</code> before the install commands.
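 
The same setting can probably be applied from the shell before starting R. This sketch assumes that Conda's R only falls back to <code>/bin/gtar</code> when <code>TAR</code> is not already set in the environment, which has not been verified on these hosts.

```shell
# Assumption: R keeps a TAR value inherited from the environment.
# Point TAR at whatever tar is first on PATH, then start R as usual.
TAR="$(command -v tar)" && export TAR
```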

Latest revision as of 00:17, 18 November 2022