User:4nn1l2/Help:Toolforge/Summary
This page may be outdated or contain incorrect details. Please update it if you can.
Tool Labs is a hosting environment for community developers working on tools and bots that help users maintain and use wikis. Tool Labs provides access to replicas of Wikimedia databases, allowing developers to easily re-use this information, for analytics, bot work, or by creating tools that help editors and other volunteers in their work. The infrastructure is supported by a dedicated group of Wikimedia Foundation staff and volunteers.
Quick start
- On wikitech, visit Create an account and create your Labs wiki account.
- Make careful note of the wiki username and "Instance shell account name" you choose.
- On wikitech, fill out an access request for the Tools project.
- In a command-line terminal, generate an SSH-2 RSA key. See Generating and uploading an SSH key if you don't know how.
- In a command-line terminal, enter: $ cat ~/.ssh/id_rsa.pub (or similar) to display your public SSH key that you created above, then copy it.
- On wikitech, log in with your labs wiki account, visit Preferences > OpenStack tab and paste in your public SSH key.
- Wait for your requests to be completed (you should receive messages on your wikitech talk page).
Once this is all done, you should be able to:
- Use SSH to log in to Tool Labs. In a command-line terminal, enter: ssh nn1l2@login.tools.wmflabs.org
- Use SSH-based utilities such as scp and sftp to transfer files between Tool Labs and your computer.
- Create tools (see § Creating a new Tool account).
Gotchas
- Your wikitech wiki username and your shell login username may be different. Visit Preferences > User profile and check "Instance shell account name".
- The passwords you chose for your wikitech login and SSH key may be different.
- When you log in with SSH, you are in your personal folder. To quickly go to your tool account, enter: become tool_name
- The web service for your tool is not started by default. To start it, enter: webservice start
What is Tool Labs
Rationale
Tool Labs was developed in response to the need to support external tools and their developers and maintainers. The system is designed to make it easy for maintainers to share responsibility for their tools and bots, which helps ensure that no useful tool gets ‘orphaned’ when one person needs a break. The system is designed to be reliable, scalable and simple to use, so that developers can hit the ground running and start coding.
Features
Architecture and terminology
Bastion hosts
- tools-login.wmflabs.org = login.tools.wmflabs.org
The grid
The Tool Labs grid, implemented with Open Grid Engine (the open-source fork of Sun Grid Engine), permits users to submit jobs from either a log-in account on the bastion host or from a web service. Submitted jobs are added to a work queue, and the system finds a host to execute them. Jobs can be scheduled synchronously or asynchronously, continuously, or simply executed once. If a continuous job fails, the grid will automatically restart the job so that it keeps going. For more information about the grid, please see § Submitting, managing and scheduling jobs on the grid.
Getting access to Tool Labs
See: SSH Help and Getting Started with Tool Labs
We strongly recommend against saving data or tools in any space that is accessible to individuals only. Tools and bots should be maintained in Tool accounts, which have flexible memberships (i.e., multiple people can help maintain the code!).
Using Tool Labs and managing your files
Tool Labs can be accessed in a variety of ways – from its public IP to a GUI client. Please see Help:Access for general information about accessing Labs.
The tools list
The Tool Labs tools list page is publicly available and contains a list of all currently-hosted Tool accounts along with their maintainers. Tool accounts that have an associated web page appear as links. Users with access to the 'tools' project can create new tool accounts here, and add or remove maintainers to and from existing tool accounts.
SSH
Once your account is set up, you can ssh to Tool Labs via its bastion host login.tools.wmflabs.org, provided that a public SSH key has been uploaded to the Labs account.
ssh nn1l2@login.tools.wmflabs.org
If you get disconnected frequently during ssh, consider setting the ServerAliveInterval option to a smaller number (~5-20 seconds) when connecting:
ssh -o ServerAliveInterval=5 yourshellaccountname@login.tools.wmflabs.org
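To avoid retyping the option on every connection, the same setting can be stored in the OpenSSH client configuration file. The sketch below is only an illustration: add_keepalive_host is a hypothetical helper (not a Tool Labs command), the 15-second default is an assumption within the range suggested above, and yourshellaccountname remains a placeholder.

```shell
# Append a Host block with a keep-alive setting to an OpenSSH client config.
# add_keepalive_host is a hypothetical helper, not a Tool Labs command.
add_keepalive_host() {
    local config_file="$1" shell_user="$2" interval="${3:-15}"
    mkdir -p "$(dirname "$config_file")"
    cat >> "$config_file" <<EOF

Host login.tools.wmflabs.org
    User $shell_user
    ServerAliveInterval $interval
EOF
    # OpenSSH rejects client config files that other users can write to.
    chmod 600 "$config_file"
}

# Example: add_keepalive_host ~/.ssh/config yourshellaccountname 15
```

With such a block in place, a plain ssh login.tools.wmflabs.org should pick up both the user name and the keep-alive interval.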
Updating files
After you can ssh successfully, you can transfer files via sftp and scp. Note that the transferred files will be owned by you. You will likely wish to transfer ownership to your tool account. To do this:
1. become your tool account:
yourshellaccountname@tools-login:~$ become toolaccount
tools.toolaccount@tools-login:~$
2. As your tool account, take ownership of the files:
tools.toolaccount@tools-login:~$ take FILE
The take command will change the ownership of the file(s) and directories recursively to the calling user (in this case, the tool account).
Handling permissions
If you're getting permission errors, note that you can also transfer files the other way around: copy the files as your tool account to /data/project/<toolname>.
Another, probably easier, way is to make the tool's directory group-writable. For example, if your shell account's name is alice and your tool name is alicetools, you could do something like this after logging in as a shell user:
become alicetools
chmod -R g+w /data/project/alicetools
logout
cp -rv /home/alice/* /data/project/alicetools/
One-time steps per tool
First, you have to do some preparatory steps which you need only once per tool.
become <YOURTOOL>
If you have not installed composer yet:
mkdir ~/bin
curl -sS https://getcomposer.org/installer | php -- --install-dir=$HOME/bin --filename=composer
If your local bin directory is not in your $PATH (use echo $PATH to find out), then create or alter the file ~/.profile and add the lines:
# set PATH so it includes user's private bin if it exists
if [ -d "$HOME/bin" ] ; then
    PATH="$HOME/bin:$PATH"
fi
Finish your session as <YOURTOOL> and start a new one, or:
. ~/.profile
Now you are done with the one-time preparations.
For each instance of core
The following steps are needed for each new installation of MediaWiki. We assume that you want to access MediaWiki via the web in a directory named MW, but you are free to use another name. If not already done:
become <YOURTOOL>
Then:
cd ~/public_html
If you plan to submit changes:
git clone ssh://<YOURUSERNAME>@gerrit.wikimedia.org:29418/mediawiki/core.git MW
or else, if you only want to use MediaWiki without submitting changes:
git clone https://gerrit.wikimedia.org/r/p/mediawiki/core.git MW
The HTTPS clone is sufficient in that case and uses fewer resources. Next, recent versions of MediaWiki have external dependencies, so you need to install those:
cd MW
composer install
git review -s
Now you should be able to access the initial pre-install screen of MediaWiki from your web browser as:
https://tools.wmflabs.org/<YOURTOOL>/MW/
and proceed as usual. See how to create new databases for your MediaWiki installations.
Joining and creating a Tool account
What is a Tool account?
A Tool account is the "user" associated with a Tool on Tool Labs. Although each tool account has a user ID, tool accounts are not personal accounts (like a Labs account); rather, they are services that consist of a user and group ID (i.e., a unix uid-gid pair) and are intended to run the actual tool or bot. Anyone who has access to Tool Labs can create a Tool account.
- Unix user: tools.toolname
- Unix group: tools.toolname
Members of the Tool account's Unix group include:
- the tool account creator
- the tool account itself
- (optionally, but encouraged!) additional tool maintainers
Maintainers may have more than one tool account, and tool accounts may have more than one maintainer. Every member of the group has the authorization to sudo to the tool account. By default, only members of the group have access to tool account's code and data.
A simple way for maintainers to switch to the tool account is with become:
maintainer@tools-login:~$ become toolname
tools.toolname@tools-login:~$
In addition to the user/group pair, each tool account includes:
- A home directory on shared storage: /data/project/toolname
- A ~/public_html/ directory, which is visible at http://tools.wmflabs.org/toolname/
- Database access credentials: ~/replica.my.cnf, which provide access to the production database replicas as well as to project-local databases.
- Access to the continuous and task queues of the compute grid
Joining an existing Tool account
All tool accounts hosted in Tool Labs are listed on the Tools list. If you would like to be added to an existing account, you must contact the maintainer(s) directly.
If you would like to add (or remove) maintainers to a tool account that you manage, you may do so with the 'add' link found beneath the tool name on the Tools home page.
Creating a new Tool account
Members of the ‘tools’ project can create tool accounts from the Tools home page:
- Navigate to the Tools home page.
- Select the "create new tool" link (found in the "Develop your own tool" section).
- Enter a "Service group name". The service group name will be used as the name of your tool account.
Customizing a Tool account
Once you have created a tool account, there are a few things that you can customize to make the tool more easily understood and used by other users. These include:
- adding a tool account description (the description will appear on the Tools home page beside the tool name)
- creating a home page for your tool (if you create a home page for the tool, it will be linked from the Tools home page automatically)
Creating a tool web page
To create a web page for your tool account, simply place an index.html file in the tool account's ~/public_html directory. The page can be a simple description of the tool or bot with basic information on how to set it up or shut it down, or it can contain an interface for the web service. To see examples of existing tool web pages, click any of the linked tool names on the Tools list.
You will also need to start a webservice for your tool.
1. Log into your Labs account and become your tool account:
nn1l2@tools-bastion-03:~$ become nn1l2bot
2. Start the web service:
tools.nn1l2bot@tools-bastion-03:~$ webservice start
Creating a tool description
To create a tool description:
1. Log into your Labs account and become your tool account:
nn1l2@tools-bastion-03:~$ become nn1l2bot
2. Create a .description file in the tool account’s home directory. Note that this file must be HTML:
tools.nn1l2bot@tools-bastion-03:~$ vim .description
3. Add a brief description (no more than 25 words or so) and save the file. You can use basic HTML markup in the file.
4. Navigate to the Tools list. Your tool account description should now appear beside your tool account name.
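The steps above can also be scripted instead of using an editor. This is a minimal sketch, not an official tool: write_description is a hypothetical helper, and its rough word count (which also counts HTML tags as words) only approximates the "25 words or so" guidance.

```shell
# Write a short HTML description into a tool's .description file.
# write_description is a hypothetical helper; the 25-word cap mirrors the guidance above.
write_description() {
    local tool_home="$1" text="$2"
    local words
    # Rough count: whitespace-separated tokens, HTML tags included.
    words=$(( $(printf '%s\n' "$text" | wc -w) ))
    if [ "$words" -gt 25 ]; then
        echo "description has $words words; aim for 25 or fewer" >&2
        return 1
    fi
    printf '%s\n' "$text" > "$tool_home/.description"
}

# Example (run as the tool account):
# write_description "$HOME" 'Archives stale talk page threads on <b>request</b>.'
```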
Configuring bots and tools
Tool and bot code should be stored in your tool account, where it can be managed by multiple users and accessed by all execution hosts. Specific information about configuring web services and bots, along with information about licensing, package installation, and shared code storage, is available at the § Developing on Tool Labs section.
Submitting, managing and scheduling jobs on the grid
Help improve content for this page: https://phabricator.wikimedia.org/T232405
WMCS is in the process of transitioning from grid engine to Kubernetes. You are encouraged to run your tool on the Kubernetes platform when possible.
Every non-trivial task performed in Toolforge should be dispatched by the Grid Engine, which ensures that the job is run in a suitable place with sufficient resources.
The basic principle of running jobs is fairly straightforward:
- You submit a job to a work queue from a submission server (for example login.toolforge.org)
- The grid engine master finds a suitable execution host to run the job on, and starts it there once resources are available
- As it runs, your job will send output and errors to files until the job completes or is aborted.
Jobs can be scheduled synchronously or asynchronously, continuously, or simply executed once. If a continuous job fails, the grid will automatically restart the job so that it keeps going.
To schedule jobs to run on specific days or at specific times of day, you can use cron to submit the jobs to the grid.
Scheduling a command more often than every five minutes (e.g. * * * * * command) is highly discouraged, even if the command is "only" jsub. In these cases, you very probably want to use 'jstart' instead. The grid engine ensures that jobs submitted with 'jstart' are automatically restarted if they exit.
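As a rough illustration of the five-minute guidance, the sketch below builds a crontab line that submits a command through jsub and refuses shorter intervals. cron_entry is a hypothetical helper, the script path in the example is made up, and the generated line uses plain jsub without any options; check the jsub documentation before relying on additional flags.

```shell
# Print a crontab line that submits COMMAND to the grid every N minutes.
# cron_entry is a hypothetical helper; it refuses intervals under five minutes.
cron_entry() {
    local minutes="$1" command="$2"
    if [ "$minutes" -lt 5 ] || [ "$minutes" -gt 59 ]; then
        echo "interval must be 5-59 minutes; use jstart for continuous jobs" >&2
        return 1
    fi
    printf '*/%s * * * * jsub %s\n' "$minutes" "$command"
}

# Example (hypothetical script path):
# cron_entry 10 /data/project/toolname/update.sh
```

The printed line can then be pasted into the tool account's crontab with crontab -e.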
Mail to users
Mail sent to user@tools.wmflabs.org (where user is a shell account) will be forwarded to the email address that user has set in their Wikitech preferences, if it has been verified (the same as the 'Email this user' function on wikitech).
Any existing .forward in the user's home will be ignored.
Mail to tools
Mail can also be sent "to a tool" with:
toolname.anything@tools.wmflabs.org
Where "anything" is an arbitrary alphanumeric string. Mail will be forwarded to the first of:
- The email(s) listed in the tool's ~/.forward.anything, if present;
- The email(s) listed in the tool's ~/.forward, if present; or
- The wikitech email of the tool's individual maintainers.
Additionally, tools.toolname@tools.wmflabs.org is an alias pointing to toolname.maintainers@tools.wmflabs.org, which is mostly useful for automated email generated from within Labs.
~/.forward and ~/.forward.anything need to be readable by the user Debian-exim; to achieve that, you probably need to run chmod o+r ~/.forward*
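The forwarding rules can be set up with a few lines of shell. The following is a sketch under stated assumptions: set_forward is a hypothetical helper, the addresses in the example are placeholders, and each file is assumed to hold a single address.

```shell
# Point mail for toolname.SUFFIX@tools.wmflabs.org at one address.
# set_forward is a hypothetical helper; an empty SUFFIX writes the catch-all ~/.forward.
set_forward() {
    local tool_home="$1" suffix="$2" address="$3"
    local file="$tool_home/.forward"
    if [ -n "$suffix" ]; then
        file="$file.$suffix"
    fi
    printf '%s\n' "$address" > "$file"
    # The mail server reads the file as Debian-exim, so others need read access.
    chmod o+r "$file"
}

# Example (run as the tool account; placeholder addresses):
# set_forward "$HOME" bugs alice@example.org   # toolname.bugs@tools.wmflabs.org
# set_forward "$HOME" "" team@example.org      # every other toolname.*@ address
```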
Web server
Overview
Every Toolforge tool can run a dedicated <toolname>.toolforge.org website. Toolforge provides the webservice command which is used to start and stop the web server for each tool. Toolforge supports websites written in several programming languages including PHP, Python, Node.js, Java, Ruby and others. Toolforge also provides some support services which can help you make your website’s visitors safe from tracking by third party services.
The webservice command uses convention over configuration for some aspects of how the website is deployed. You’ll find details for different programming languages below.
Using the webservice command
You can use the webservice command to start, stop, restart, and check the status of a webserver.
webservice command example:
$ ssh login.toolforge.org
$ become my_cool_tool
$ webservice start
Use webservice --help to get a full list of arguments.
Without any additional arguments or configuration files, webservice start will currently start a PHP 7.3 Kubernetes container serving content from your tool's $HOME/public_html directory using lighttpd as the web server software.
Webservice templates
The webservice command has the concept of a "template" file which can be used to store arguments (and eventually other structured content) for starting a webservice. The code will look for a --template=... command line argument and fall back to looking for a $HOME/service.template file. The $HOME/service.template file is what most tools will be expected to use, but we may find interesting uses for multiple templates in a single tool as well.
A webservice template file is a YAML document. It can contain these settings:
- backend: the backend to use (equivalent to --backend=...)
- cpu: the CPU reservation to ask for on Kubernetes (equivalent to --cpu=...)
- mem: the memory reservation to ask for on Kubernetes (equivalent to --mem=...)
- release: the operating system to ask for on Grid Engine (equivalent to --release=...)
- replicas: the number of Pod replicas to use (equivalent to --replicas=...)
- type: the type of webservice to start (equivalent to TYPE)
- extra_args: extra arguments to pass to the backend (not used by most backends)
By saving desired startup state in a file, the user can use simple webservice stop; webservice start commands again!
Example $HOME/service.template:
# Toolforge webservice template
# Provide default arguments for `webservice start` commands for this tool.
#
# Uncomment lines below and adjust as needed
# Set backend cluster to run this webservice (--backend={gridengine,kubernetes})
backend: kubernetes
# Set Kubernetes cpu limit (--cpu=...)
#cpu: 500m
# Set Kubernetes memory limit (--mem=...)
#mem: 512Mi
# Set ReplicaSet size for a Kubernetes deployment (--replicas=...)
#replicas: 2
# Runtime type
# See "Supported webservice types" in `webservice --help` output for valid values.
#type: python3.7
# Extra arguments to be parsed by the chosen TYPE
#extra_args:
# - arg0
# - arg1
# - arg2
Choosing a backend
Toolforge provides two different execution environments for web servers: Kubernetes and Grid Engine.
The Kubernetes backend provides more modern software versions and is the default backend. The Grid Engine backend is used primarily by legacy tools which were developed before Kubernetes was available. Toolforge administrators recommend that you try using Kubernetes first for new tools and only use the Grid Engine backend if there is a technical limitation that prevents your tool from running inside Kubernetes.
Common features
Both the Kubernetes and Grid Engine backends share common infrastructure services for serving web sites. Toolforge has an Nginx server configured as a proxy server which handles all inbound requests to your tool's web server. This proxy server takes care of providing TLS termination and then reverse proxies the inbound request to your tool's web service. Web servers running on Kubernetes have a second Nginx proxy server running as the "Ingress" component inside the Kubernetes cluster. See Portal:Toolforge/Admin/Kubernetes/Networking and ingress for detailed information about the network and web request routing used by the Toolforge Kubernetes cluster.
Toolforge also includes a 404 handler service which will respond to HTTP requests for tools which do not exist and tools which are not currently running a web service. This service is implemented as the fourohfour tool which runs on the Kubernetes backend.
Kubernetes
Kubernetes (k8s) is a platform for running containers. Kubernetes web servers have access to newer versions of most software than the Grid Engine provides. K8s also provides a more robust system for restarting tools automatically following an application crash.
Maintainer visible differences from Grid Engine based Web services
- Each process runs inside a Docker container, orchestrated by Kubernetes.
- Provides better resource isolation (one tool cannot take down other tools by consuming all RAM or CPU)
- Better health checking (monitoring built into Kubernetes, not a hack we wrote)
- Less complex proxy setup, leading to fewer proxy related outages / issues
- Containers available based on newer Debian versions (Buster)
- Newer software versions than those available with Debian Stretch
- It is not possible to interact with the Grid Engine from Kubernetes (no jsub ...)
- Kubernetes backend has specific webservice options:
  -m MEMORY, --mem MEMORY: Set higher Kubernetes memory limit
  -c CPU, --cpu CPU: Set a higher Kubernetes cpu limit
  -r REPLICAS, --replicas REPLICAS: Set the number of pod replicas to use
Grid Engine
The Grid Engine backend runs your web server as a job on a Debian Stretch grid exec node. This is similar to the way that jsub runs any grid job you submit, but there is a separate exec queue on the grid for running jobs started by webservice.
Switching between Kubernetes and Grid Engine
From Kubernetes to Grid Engine
$ webservice --backend=kubernetes stop
$ webservice --backend=gridengine start
From Grid Engine to Kubernetes
$ webservice --backend=gridengine stop
$ webservice --backend=kubernetes <type> start
Default web server (lighttpd + PHP)
See: Help:Toolforge/Web/Lighttpd
PHP
Python
See: Help:Toolforge/Web/Python
Node.js web services
See: Help:Toolforge/Web/Node.js
Java
Other / generic web servers
You can run other web servers that are not directly supported. This can be accomplished using the generic webservice type on the Grid Engine backend or a runtime specific type on the Kubernetes backend.
webservice --backend=kubernetes golang start|stop|restart|shell SCRIPT
webservice --backend=kubernetes jdk11 start|stop|restart|shell SCRIPT
webservice --backend=kubernetes perl5.32 start|stop|restart|shell SCRIPT
webservice --backend=kubernetes ruby25 start|stop|restart|shell SCRIPT
webservice --backend=gridengine generic start|stop|restart SCRIPT
To start a webserver that is launched by a script at /data/project/toolname/code/server.bash, you would launch it with:
$ webservice --backend=gridengine generic start /data/project/toolname/code/server.bash
Your script will be passed an HTTP port to bind to in an environment variable named PORT. This is the port that the Nginx proxy will forward requests for https://YOUR_TOOL.toolforge.org/ to. When using the Kubernetes backend, PORT will always be 8000. When using the Grid Engine backend, PORT will change each time the webservice start or webservice restart command is run.
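A server script for the generic type mainly needs to read PORT and hand it to whatever HTTP server it launches. The sketch below is one possible shape, not the canonical server.bash: resolve_port and run_server are hypothetical names, the fallback to 8000 is an assumption meant only for local testing, and Python's built-in http.server (Python 3.7 or later for --directory) stands in for a real application server.

```shell
#!/bin/bash
# Sketch of a server script for the generic webservice type.
# resolve_port and run_server are hypothetical names; the 8000 fallback is
# only an assumption for local testing, since webservice sets PORT itself.

resolve_port() {
    local port="${PORT:-8000}"
    case "$port" in
        ''|*[!0-9]*)
            echo "PORT must be a number, got '$port'" >&2
            return 1
            ;;
    esac
    printf '%s\n' "$port"
}

run_server() {
    local port
    port="$(resolve_port)" || return 1
    # Any HTTP server works here; Python's built-in one keeps the sketch short.
    exec python3 -m http.server "$port" --directory "$HOME/public_html"
}

# A deployed script would end by calling: run_server
```

Because the Grid Engine backend changes PORT on every start or restart, the port should be read each time the script runs rather than hard-coded.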
Common tasks and guides
Hosting large files
Toolforge storage uses NFS, which has limited storage and network bandwidth. If your tool requires a static file larger than 1GB (for example, serving up a container image or tarball), please store that file in the 'Download' project rather than storing it in your tool's home directory.
The Download project hosts https://download.wmcloud.org, a public read-only web server for large file storage. If you would like a file added, create a Phabricator ticket or contact WMCS staff directly to have the file added.
Serving static files
Files placed in a tool's $HOME/www/static directory are available directly from the URL tools-static.wmflabs.org/toolname. This does not require any action on the tool's part: putting the files in the appropriate folder (and making the directory readable) should 'just work'.
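Publishing static assets therefore comes down to a copy followed by a permissions fix. This is a minimal sketch: publish_static is a hypothetical helper, and it assumes the parent directories of www/static are already traversable by other users.

```shell
# Copy assets into a tool's static directory and make them world-readable.
# publish_static is a hypothetical helper name.
publish_static() {
    local source_dir="$1" tool_home="$2"
    local static_dir="$tool_home/www/static"
    mkdir -p "$static_dir"
    # The trailing /. copies the contents of source_dir, not the directory itself.
    cp -R "$source_dir"/. "$static_dir"/
    # Directories get read and search permission, regular files only read permission.
    chmod -R a+rX "$static_dir"
}

# Example (run as the tool account): publish_static ./assets "$HOME"
```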
You can use this to serve static assets (CSS, HTML, JS, etc) or to host simple websites that don't require a server-side component.
Load external assets using our CDN services
To preserve the privacy of our users, avoid embedding assets (images, CSS, JavaScript) from servers outside of Wikimedia Foundation control.
- Libraries (Browse libraries)
- Toolforge provides an anonymizing reverse proxy to cdnjs.
- Fonts (Search fonts)
- Toolforge provides an anonymizing reverse proxy to Google Fonts.
- Maps (Documentation)
- Wikimedia provides maps servers with data from OpenStreetMap.
Runtime memory limits
- Kubernetes: 2GiB for most runtimes (Java's limit is 4GiB).
- Grid Engine: 4GiB
Requesting additional tool memory
Kubernetes web servers start with a default limit on both runtime memory and cpu power. These limits vary slightly based on which runtime language (PHP, Python, Java, etc) you are using. The --cpu and --mem command line arguments can be used to increase these defaults up to the quota limit for your tool's Kubernetes namespace. See Kubernetes#Quotas and Resources for instructions on requesting an increased quota for your tool.
For Grid Engine webservices, request more tool memory as follows:
- Open a Phabricator task.
- Notify the #wikimedia-cloud IRC channel that you have filed a request.
A Cloud Services administrator will review your request and can create a /data/project/.system/config/$TOOLNAME.web-memlimit configuration file that will adjust the limit.
Response buffering
An Nginx proxy sits between your webservice and the user. By default this proxy buffers the response sent from your server. For some use cases, including streaming large quantities of data to the browser, this can be undesirable. Buffering can be disabled on a per-request basis by sending an X-Accel-Buffering: no header in your response.[1]
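How the header is sent depends on your web framework. As one hedged illustration, the sketch below assumes the script runs as a CGI program, where lines printed before the first blank line become HTTP response headers; stream_headers is a hypothetical helper name.

```shell
# Emit CGI-style response headers that ask the proxy not to buffer.
# stream_headers is a hypothetical helper; it assumes the script runs as CGI,
# so lines before the first blank line become HTTP headers.
stream_headers() {
    printf 'Content-Type: text/plain; charset=utf-8\r\n'
    printf 'X-Accel-Buffering: no\r\n'
    printf '\r\n'
}

# Example: call stream_headers first, then write the body in small chunks.
```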
/favicon.ico
A default image will be served by the shared proxy layer if your webservice returns a 404 Not Found response when asked for /favicon.ico. This default icon is the same as the one found at https://tools-static.wmflabs.org/toolforge/favicons/favicon.ico.
/robots.txt
A default response will be served by the shared proxy layer if your webservice returns a 404 Not Found response when asked for /robots.txt. The default robots.txt response denies access to all compliant web crawlers. We decided that this "fail closed" approach would be safer than a "fail open" telling all crawlers to crawl all tools.
Any tool that does wish to be indexed by search engines and other crawlers can serve their own /robots.txt content. Please see https://www.robotstxt.org/ for more information on /robots.txt in general.
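For a tool that wants to opt in to indexing, a two-line robots.txt is enough. The sketch below makes two assumptions: allow_crawlers is a hypothetical helper, and the tool's webservice serves static files from its public_html directory (as the default lighttpd setup does); tools with a different server need to route /robots.txt themselves.

```shell
# Write a robots.txt that lets compliant crawlers index the whole tool.
# allow_crawlers is a hypothetical helper; it assumes the webservice serves
# static files from the tool's public_html directory.
allow_crawlers() {
    local tool_home="$1"
    mkdir -p "$tool_home/public_html"
    # An empty Disallow value means nothing is off limits.
    cat > "$tool_home/public_html/robots.txt" <<'EOF'
User-agent: *
Disallow:
EOF
}

# Example (run as the tool account): allow_crawlers "$HOME"
```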
Communication and support
Support and administration of the WMCS resources is provided by the Wikimedia Foundation Cloud Services team and Wikimedia movement volunteers. Please reach out with questions and join the conversation:
- Chat in real time in the IRC channel #wikimedia-cloud, the bridged Telegram channel, or the bridged Mattermost channel
- Discuss via email after subscribing to the cloud@ mailing list
References
See also
Dumps
The 'tools' project, like all labs projects, has access to a directory storing the public Wikimedia datasets (i.e. the dumps generated by Wikimedia). The most recent two dumps can be found in:
/public/dumps/public
This directory is read-only, but you can copy files to your tool's home directory and manipulate them in whatever way you like.
If you need access to older dumps, you must manually download them from the Wikimedia downloads server.
/public/dumps/pagecounts-raw contains some years of the pagecount/projectcount data derived by Erik Zachte from Domas Mituzas' archives.
CatGraph (aka Graphserv/Graphcore)
CatGraph is a custom graph database that provides tool developers fast access to the Wikipedia category structure. For more information, please see the documentation.