Portal:Toolforge/Admin
Documentation of backend components and admin procedures for Toolforge. See Help:Toolforge for user facing documentation about actually using Toolforge to run your bots and webservices.
Failover
Tools should be able to survive the failure of any one virt* node. Some items may need manual failover.
WebProxy
There are several webproxies, currently tools-proxy-05 and tools-proxy-06. They are on different virt hosts and act as 'hot spares' - you can switch between them without downtime. Webservices register themselves with the active proxy (specified by the hiera setting active_proxy), and this information is stored in redis. The proxying information is also replicated to the standby proxy via simple redis replication. When the proxies are switched, new webservice starts will fail until puppet has run on all the web nodes and the proxies, but current HTTP traffic will continue to be served.
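A rough sketch of a manual switchover, assuming the hiera key named in this section (the exact key prefix and the IP/DNS move may differ in the current puppet code):
# 1. Point active_proxy at the standby via project hiera (Horizon), and move the floating IP/DNS.
# 2. Confirm the standby already has the replicated proxy entries, then run puppet:
$ ssh tools-proxy-06.tools.eqiad1.wikimedia.cloud
$ redis-cli info replication
$ sudo puppet agent --test      # also run puppet on the old proxy and the web nodes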
Static webserver
This is a simple, stateless nginx HTTP server. To switch over, move the floating IP from tools-static-10 to tools-static-11 (or vice versa). Recovery is equally trivial - just bring the machine back up and make sure puppet is ok.
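A sketch of the floating IP move with the OpenStack CLI (it can equally be done from Horizon; the variable is a placeholder for the actual floating IP):
$ openstack server remove floating ip tools-static-10 $FLOATING_IP
$ openstack server add floating ip tools-static-11 $FLOATING_IP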
Checker service
This is the service that catchpoint (our external monitoring service) hits to check the status of several services. It's totally stateless, so just switching the public IP from tools-checker-03 to -04 (or vice versa) should be fine. Same procedure as the static webserver.
See Portal:Toolforge/Admin/Toolschecker
GridEngine Master
The gridengine scheduler/dispatcher runs on the grid master node and manages dispatching jobs to execution nodes and reporting. The active master writes its name to /var/lib/gridengine/default/common/act_qmaster, where all enduser tools pick it up. tools-sgegrid-master normally serves in this role, but tools-sgegrid-shadow can also be manually started as the master - if and only if there are currently no active masters - with service gridengine-master start on the shadow master.
For Grid Engine 8 (stretch/Son of Grid Engine), the service is not ensured running by puppet, and systemd may give up restarting it if it keeps failing for a while. In those situations it requires a manual restart.
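A minimal sketch of checking and restarting it manually (path and service name as used in this section; run on the master):
$ cat /var/lib/gridengine/default/common/act_qmaster    # which host the grid currently considers master
$ sudo systemctl status gridengine-master
$ sudo systemctl start gridengine-master                 # manual restart if systemd has given up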
Redundancy
Every 30s, the master touches the file /var/spool/gridengine/qmaster/heartbeat. On tools-sgegrid-shadow there is a shadow master that watches this file for staleness and will fire up a new master on itself if the heartbeat has been stale for too long (currently set at 10m - 2m in the stretch grid). This only works if the running master crashed or was killed uncleanly (including the server hosting it crashing), because a clean shutdown creates a lockfile forbidding shadows from starting a master (as would be expected in the case of willfully stopped masters). The lock file may also be left in place for other reasons, depending on how the master died. Delete the lock file at /var/spool/gridengine/qmaster/lock if the takeover is desired.
If the shadow does take over, it changes /data/project/.system_sge/gridengine/default/common/act_qmaster to point to itself, redirecting all userland tools. This move is unidirectional; once the original master is ready to take over again, the gridengine-master on tools-sgegrid-shadow needs to be shut down manually (note: on the stretch grid this doesn't seem to be true - it failed back smoothly in testing, but manually failing back is still smart), and the one on the master started (this is necessary to prevent flapping, or split brain, if the master only failed temporarily). This is simply done with service gridengine-master {stop/start}.
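A sketch of manually forcing a takeover on the shadow master when the heartbeat is stale but a leftover lock file is blocking it (paths from this section):
you@tools-sgegrid-shadow:~$ stat -c '%y' /var/spool/gridengine/qmaster/heartbeat   # check how stale the heartbeat is
you@tools-sgegrid-shadow:~$ sudo rm /var/spool/gridengine/qmaster/lock             # only if a takeover is really desired
you@tools-sgegrid-shadow:~$ sudo service gridengine-master start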
Because of the heartbeat file and act_qmaster mechanisms, after a failover the gridengine-master service will not start while act_qmaster points to the shadow master. You must manually stop the gridengine-shadow service on tools-sgegrid-shadow, then start the gridengine-master service on tools-sgegrid-master, and then start gridengine-shadow on tools-sgegrid-shadow again to restore the "normal" state. The services are largely kept under manual systemctl control because of these sorts of dances.
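Put together as commands (hosts and unit names as described above; a sketch, not a script):
you@tools-sgegrid-shadow:~$ sudo systemctl stop gridengine-shadow
you@tools-sgegrid-shadow:~$ sudo systemctl stop gridengine-master   # if the shadow had started one
you@tools-sgegrid-master:~$ sudo systemctl start gridengine-master
you@tools-sgegrid-shadow:~$ sudo systemctl start gridengine-shadow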
Redis
Redis uses Sentinel to automatically fail over in case of a node failure.
Services
Note: This section needs a refresh. Webservicemonitor now runs on the cron nodes, among other things.
These are services that run off service manifests for each tool - currently just the webservicemonitor service. They run in a warm-standby setup requiring manual switchover. tools-services-01 and tools-service-02 both run exactly the same code, but only one of them is 'active' at a time. Which one is active is determined by the puppet role parameter role::labs::tools::services::active_host. Set that via hiera to the FQDN of the host that should be 'active' and run puppet on all the services hosts; this starts the services on the active host and stops them on the standby. Since the services hold no internal state, they can be run from any host and there is no need to switch back afterwards.
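A sketch of the switchover (the hostname is illustrative; use whichever services hosts currently exist):
# 1. In Horizon hiera, set role::labs::tools::services::active_host to the FQDN of the new active host.
# 2. Run puppet on both services hosts so webservicemonitor is started/stopped accordingly:
$ ssh tools-sge-services-03.tools.eqiad1.wikimedia.cloud
$ sudo puppet agent --test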
Service nodes also run the Toolforge internal aptly service, to serve .deb packages as a repository for all the other nodes.
Command orchestration
Toolforge and Toolsbeta both have a local cumin server.
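For example, an ad-hoc command across a set of instances from the Toolforge cumin server (selector syntax as used elsewhere on this page; the name pattern is illustrative):
$ sudo cumin "O{project:tools name:^tools-sgeexec-09..}" 'uptime'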
Administrative tasks
Note: Any qmod or qconf commands listed below can only be run on the master and shadow nodes (tools-sgegrid-master and tools-sgegrid-shadow) in the Son of Grid Engine (stretch) grid.
Logging in as root
For normal login root access see Portal:Toolforge/Admin#What_makes_a_root/Giving_root_access.
In case the normal login does not work, for example due to an LDAP failure, administrators can also log in directly as root. To prepare for that occasion, generate a separate key with ssh-keygen, add an entry to the passwords::root::extra_keys hash in Horizon's 'Project Puppet' section with your shell username as key and your public key as value, and wait a Puppet cycle for your key to be added to the root accounts. Then add to your ~/.ssh/config:
# Use different identity for Tools root.
Match host *.tools.eqiad1.wikimedia.cloud user root
    IdentityFile ~/.ssh/your_secret_root_key
The code that reads passwords::root::extra_keys
is in labs/private:modules/passwords/manifests/init.pp.
Disabling all ssh logins except root
Useful for dealing with security critical situations. Just touch /etc/nologin
and PAM will prevent any and all non-root logins.
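For example:
$ sudo touch /etc/nologin   # block all non-root logins during the incident
$ sudo rm /etc/nologin      # restore normal logins afterwards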
Complaints of bastion being slow
Users are increasingly noticing slowness on tools-login due to either CPU or IOPS exhaustion caused by people running processes there instead of on the Job Grid. Here are some tips for finding the processes in need of killing:
- Look for IOPS hogs
$ iotop
- Look for abnormal processes:
$ ps axo user:32,pid,cmd | grep -Ev "^($USER|root|daemon|diamond|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|Debian-exim|www-data)" | grep -ivE 'screen|tmux|-bash|mosh-server|sshd:|/bin/bash|/bin/zsh'
- If you see pyb.py, kill it with extreme prejudice.
- If the rogue job is running as a tool, !log something like:
!log tools.$TOOL Killed $PROC process running on tools-bastion-03. See https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid for instructions on running jobs on the grid.
SGE resources
Son of Grid Engine doesn't appear to be actively developed, but it is somewhat updated from the last open source release of Univa Grid Engine (8.0.0), which is an active commercial product that is not open source at this point.
Documentation for Son of Grid Engine is mostly archives of the Sun/Oracle documents. This can be found at the University of Liverpool website.
PDF manuals for the older grid engine can be found using [1]. Most of the information in these still applies to Son of Grid Engine (version 8.1.9).
Nearly all installation guides for any version of Grid Engine are incorrect because they assume some process of untarring executables on NFS, like one would on a classic Solaris installation. The execs on NFS in our environment are purely a set of symlinks to the exec files on local disk that are installed via deb packages.
With that in mind, see this page for most of the current how-tos: https://arc.liv.ac.uk/SGE/howto/howto.html
Dashboard
In addition to the cli commands below, an overview can be viewed at https://sge-status.toolforge.org/
List of handy commands
Most commands take -xml as a parameter to enable xml output. This is useful when lines get cut off. These are unchanged between grid versions.
Note that qmod and qconf commands will only work on masters and shadow masters (tools-sgegrid-master and tools-sgegrid-shadow) in the grid because bastions are not admin hosts.
Queries
- list queues on given host:
qhost -q -h $hostname
- list jobs on given host:
qhost -j -h $hostname
- list all queues:
qstat -f
- qmaster log file:
tail -f /data/project/.system_sge/gridengine/spool/qmaster/messages
Configuration
The global and scheduler configs are managed by puppet. See the files under modules/profile/files/toolforge/grid-global-config and modules/profile/files/toolforge/grid-scheduler-config
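To compare the live configuration with what puppet manages, the standard qconf query flags can be used on an admin host (these are stock gridengine flags, not specific to our setup):
qconf -sconf     # show the current global configuration
qconf -ssconf    # show the current scheduler configuration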
See also: http://gridscheduler.sourceforge.net/howto/commontasks.html
- modify host group config:
qconf -mhgrp \@general
- print host group config:
qconf -shgrp \@general
- modify queue config:
qconf -mq queuename
- print queue config:
qconf -sq continuous
- enable a queue:
qmod -e 'queue@node_name'
- disable a queue:
qmod -d 'queue@node_name'
- add host as exec host:
qconf -ae node_name
- print exec host config:
qconf -se node_name
- remove host as exec host (see also the decommissioning section below):
qconf -de node_name
- add host as submit host:
qconf -as node_name
- remove host as submit host:
qconf -ds node_name
- add host as admin host:
qconf -ah node_name
- remove host as admin host:
qconf -dh node_name
Accounting
- retrieve information on finished job:
qacct -j [jobid or jobname]
- there are a few scripts in /home/valhallasw/accountingtools: (need to be puppetized)
- vanaf.py makes a copy of recent entries in the accounting file
- accounting.py contains python code to read in the accounting file
- Usage:
valhallasw@tools-bastion-03:~/accountingtools$ php time.php "-1 hour"
1471465675
valhallasw@tools-bastion-03:~/accountingtools$ python vanaf.py 1471465675 mylog
Seeking to timestamp 1471465675 ... done!
valhallasw@tools-bastion-03:~/accountingtools$ grep mylog -e '6727696' | python merlijn_stdin.py
25 1970-01-01 00:00:00 1970-01-01 00:00:00 tools-webgrid-lighttpd-1206.eqiad1.wikimedia.cloud tools.ptwikis lighttpd-precise-ptwikis 6727696
0 2016-08-17 21:01:42 2016-08-17 21:01:46 tools-webgrid-lighttpd-1207.eqiad1.wikimedia.cloud tools.ptwikis lighttpd-precise-ptwikis 6727696
Traceback (most recent call last):
  File "merlijn_stdin.py", line 4, in <module>
    line = raw_input()
EOFError: EOF when reading a line
- Ignore the EOFError; the relevant lines are above that. Error codes (first entry) are typically 0 (finished successfully), 19 ('before writing exit_status' = crashed?), 25 (rescheduled) or 100 ('assumedly after job' = lost job?). I'm not entirely sure about the codes when the job stops because of an error.
Orphan processes
Hunt for orphan processes (parent process id == 1) that have leaked from grid jobs:
$ clush -w @exec-stretch -w @webgrid-generic-stretch -w @webgrid-lighttpd-stretch -b 'ps axwo user:20,ppid,pid,cmd | grep -Ev "^($USER|root|daemon|Debian-exim|diamond|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|systemd|www-data|sgeadmin)"|grep -v systemd|grep -v perl|grep -E " 1 "'
The exclusion for perl processes is because there are 2-3 tools built with perl that make orphans via the "normal" forking process.
Kill orphan processes:
$ clush -w @exec-stretch -w @webgrid-generic-stretch -w @webgrid-lighttpd-stretch -b 'ps axwo user:20,ppid,pid,cmd | grep -Ev "^($USER|root|daemon|Debian-exim|diamond|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|systemd|www-data|sgeadmin)"|grep -v systemd|grep -v perl|grep -E " 1 "|awk "{print \$3}"|xargs sudo kill -9'
Creating a new node
Clearing error state
Sometimes, due to various hiccups (like LDAP or DNS malfunction), grid jobs can move to an error state from which they will not come out without explicit user action. Error states can also be created by repeated job failures caused by user error on healthy nodes. This includes an 'A' state from heavy job load. Nodes in this state are unschedulable, so unless the condition persists, it's not usually necessary to do anything about it. A persistent 'A' error state could, however, mean a node is broken. Lastly, the error state 'au' generally means the host isn't reachable; this too is likely attributable to load. If it persists, check the host's job queue and ensure gridengine is still running on the host.
To view any potential error states and messages for each node:
qstat -explain E -xml | grep -e name -e state -e message
Once you have ascertained the cause of the error state and fixed it, you can clear the job error states using the cookbook:
dcaro@vulcanus$ cookbook wmcs.toolforge.grid.cleanup_queue_errors -h
usage: cookbooks.wmcs.toolforge.grid.cleanup_queue_errors [-h] [--project PROJECT] [--task-id TASK_ID] [--no-dologmsg]
                                                          [--master-hostname MASTER_HOSTNAME]

WMCS Toolforge - grid - cleanup queue errors

Usage example: cookbook wmcs.toolforge.grid.cleanup_queue_errors --project toolsbeta --master-hostname toolsbeta-sgegrid-master

options:
  -h, --help            show this help message and exit
  --project PROJECT     Relevant Cloud VPS openstack project (for operations, dologmsg, etc). If this cookbook is for hardware, this only affects dologmsg calls. Default is 'toolsbeta'.
  --task-id TASK_ID     Id of the task related to this operation (ex. T123456).
  --no-dologmsg         To disable dologmsg calls (no SAL messages on IRC).
  --master-hostname MASTER_HOSTNAME
                        The hostname of the grid master node. Default is '<project>-sgegrid-master'
Or manually with:
user@tools-sgegrid-master:~$ sudo qmod -c '*'
You also need to clear all the queues that have gone into error state. Failing to do so prevents jobs from being scheduled on those queues. You can clear all error states on queues with:
qstat -explain E -xml | grep 'name' | sed 's/<name>//' | sed 's/<\/name>//' | xargs qmod -cq
If a single job is stuck in the dr state (it is stuck being deleted but never goes away), run the following:
user@tools-sgegrid-master:~$ sudo qdel -f 9999850
root forced the deletion of job 9999850
Draining a node of Jobs
In real life, you just do this with the exec-manage script: run sudo exec-manage depool $fqdn on the grid master or shadow master (e.g. tools-sgegrid-master.tools.eqiad1.wikimedia.cloud). What follows are the detailed steps that are handled by that script.
- Disable the queues on the node with qmod -d '*@$node_name'
- Reschedule continuous jobs running on the node (see below)
- Wait for non-restartable jobs to drain (if you want to be nice!) or qdel them
- Once whatever needed doing is done, re-enable the node with qmod -e '*@$node_name'
There is no simple way to delete or reschedule jobs on a single host, but the following snippet produces a job list suitable for the command line:
$(qhost -j -h $NODE_NAME | awk '{print $1}' | egrep ^[0-9])
which makes reasonable arguments for qdel or qmod -rj.
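For example, to reschedule (or delete) every job on a node in one go, feeding that snippet to the commands above:
sudo qmod -rj $(qhost -j -h $NODE_NAME | awk '{print $1}' | egrep ^[0-9])
sudo qdel $(qhost -j -h $NODE_NAME | awk '{print $1}' | egrep ^[0-9])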
Decommission a node
- Drain the node (see above!). Give the non-restartable jobs some time to finish (maybe even a day if you are feeling generous?).
- Disable puppet on the node.
- Delete its files in /data/project/.system_sge/gridengine/etc (an exec node will have a file named after itself in the hosts, exechosts and submithosts subdirectories).
- Delete the alias for the host in /data/project/.system_sge/gridengine/default/common/host_aliases.
- Remove the node from any hostgroups it is present in. You can check/remove with sudo qconf -mhgrp @general on any admin host. This opens the list in a text editor, where you can carefully delete the name of the host and save; be careful to keep the line continuations intact.
- Remove the node from any queues it might be included in directly. Look at sudo qconf -sql for the list of queues, and then sudo qconf -mq $queue_name to see the list of hosts in each. Note that this seems to be mostly required only for webgrid hosts (yay consistency!).
- Remove the node from gridengine with sudo qconf -de $fqdn
- If the node is a webgrid node, also remove it from being a submit host with sudo qconf -ds $fqdn.
- Double check that you got rid of the node(s) from the grid config by checking the output of sudo qconf -sel. (See phab:T149634 for what can happen.)
- Wait for a while, then delete the VM!
Local package management
Local packages are provided by an aptly repository on tools-sge-services-03.
On tools-sge-services-03, you can manipulate the package database with various commands; cf. aptly(1). Afterwards, you need to publish the database to the Packages file with (for the trusty-tools repository) aptly publish --skip-signing update trusty-tools. To use the packages on the clients you need to wait up to 30 minutes or run apt-get update. In general, you should never just delete packages, but move them to ~tools.admin/archived-packages.
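A sketch of a typical session on the services host (the .deb filename is hypothetical; the publish command is the one given above):
$ aptly repo add trusty-tools misctools_1.42_all.deb     # add a freshly built package to the repo
$ aptly publish --skip-signing update trusty-tools       # publish the updated repository
$ sudo apt-get update    # on a client, to pick up the change without waiting for the next cycle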
You can always see where a package is (would be) coming from with apt-cache showpkg $package.
Local package policy
Package repositories
- We only install packages from trustworthy repositories.
  - OK are:
    - The official Debian and Ubuntu repositories, and
    - Self-built packages (apt.wikimedia.org and aptly)
  - Not OK are:
    - PPAs
    - other 3rd party repositories
Packagers effectively get root on our systems, as they could add a rootkit to a package or upload an unsafe sshd version, and apt-get will happily install it.
Hardness clause: in extraordinary cases, and for 'grandfathered in' packages, we can deviate from this policy, as long as security and maintainability are kept in mind.
apt.wikimedia.org
We assume that whatever is good for production is also OK for Toolforge.
aptly
We manage the aptly repository ourselves.
- Packages in aptly need to be built by Toolforge admins
  - we cannot import .deb files from untrusted 3rd party sources
- Package source files need to come from a trusted source
  - a source file from a trusted source (e.g. backports), or
  - we build the Debian source files ourselves
  - we cannot build .dsc files from untrusted 3rd party sources
- Packages need to be easy to update and build
  - cowbuilder/pdebuild is OK
  - fpm is OK
  - See Deploy new jobutils package for an example walk-through of building and adding packages to aptly.
- We only package if strictly necessary
  - infrastructure packages
  - packages that should be available for effective development (e.g. composer or sbt)
  - not: python-*, lib*-perl, ..., which should just be installed with the available platform-specific package managers
- For each package, it should be clear who is responsible for keeping it up to date
  - for infrastructure packages, this should be one of the paid staffers
A list of locally maintained packages can be found under /local packages.
Building packages
Moved to Portal:Toolforge/Admin/Packaging.
Deploy new jobutils package
Moved to Portal:Toolforge/Admin/Packaging.
Deploy new misctools package
Moved to Portal:Toolforge/Admin/Packaging.
Testing/QA for a new tools-webservice package
See also tools-webservice source tree README.
There is a simple flask app in toolsbeta using the tool test
that is set up to be deployed via webservice on Kubernetes. If you need to test something on Son of Grid Engine, the test3 tool is more appropriate, but similar.
After running become test
, you can go to the qa/tools-webservice
directory. This is checked out via anonymous https, and is suitable for checking out a patch you are reviewing. There is an untracked file in there that is usually useful: the webservice file at the root of the checkout is just a copy of the one in the scripts folder in the repo. The only difference is:
9d8
< sys.path.insert(0, '')
That puts the local directory ahead of the distribution-installed package on the Python path, so if you run ./webservice $somecommand
it will run what is in your local folder rather than what is in /usr/lib/python3/dist-packages/
. If you are testing changes made directly to scripts/webservice
in the repo, you will likely need to copy that over the file and add sys.path.insert(0, "")
after the import sys line.
If there is no import sys
line in this version of the code, add one! This should let you bang on your new version without having to mess with packaging just yet.
Deploy new tools-webservice package
Moved to Portal:Toolforge/Admin/Packaging.
Webserver statistics
To get a look at webserver statistics, goaccess is installed on the webproxies. Usage:
goaccess --date-format="%d/%b/%Y" --log-format='%h - - [%d:%t %^] "%r" %s %b "%R" "%u"' -q -f/var/log/nginx/access.log
Interactive key bindings are documented on the man page. HTML output is supported by piping to a file. Note that nginx logs are rotated (twice?) daily, so there is only very recent data available.
Restarting all webservices
This is sometimes necessary if the proxy entries are out of whack. It can be done with:
$ ssh tools-sgegrid-master.tools.eqiad1.wikimedia.cloud
$ qstat -q webgrid-generic -q webgrid-lighttpd -u '*' | awk '{print $1}' | xargs -L1 sudo qmod -rj
The qstat gives us a list of all jobs from all users under the two webgrid queues, and the qmod -rj asks gridengine to restart them. This can be run as root on tools-login.wmflabs.org.
To restart webservices in the kubernetes cluster, run the following on cloud-cumin-01
$ sudo cumin "O{project:tools name:^tools-worker-10..}" 'docker ps --format "{{.ID}}" --filter "label=io.kubernetes.container.name=webservice"|xargs docker rm -f'
Banning an IP from tool labs
On Hiera:Tools, add the IP to the list of dynamicproxy::banned_ips, then force a puppet run on the webproxies. Add a note to Help:Toolforge/Banned explaining why. The user will get a message like [2].
Deploying the main web page
This website (plus the 403/500/503 error pages) is hosted under tools.admin. To deploy:
$ become admin
$ cd tool-admin-web
$ git pull
Regenerate replica.my.cnf
This requires access to the active labstore host, and can be done as follows:
$ ssh nfs-tools-project.svc.eqiad.wmnet
$ sudo /usr/local/sbin/maintain-dbusers delete tools.${NAME}
# or
$ sudo /usr/local/sbin/maintain-dbusers delete ${USERNAME} --account-type=user
Once the account has been deleted, the maintain-dbusers service will automatically recreate the user account.
Debugging bad mysql credentials
Sometimes things go wrong and a user's replica.my.cnf
credentials don't propagate everywhere. You can check the status on various servers to try to narrow down what went wrong.
The database credentials needed are in /etc/dbusers.yaml
on the labstore servers, for example nfs-tools-project.svc.eqiad.wmnet
.
$ ssh nfs-tools-project.svc.eqiad.wmnet
$ sudo cat /etc/dbusers.yaml
# look for the accounts-backend['password'] for the m5-master connections (user: labsdbaccounts)
# look for the labsdbs['password'] for the other connections (user: labsdbadmin)
$ CHECK_UID=u12345 # User id to check for
# Check if the user is in our meta datastore
$ mysql -h m5-master.eqiad.wmnet -u labsdbaccounts -p -e "USE labsdbaccounts; SELECT * FROM account WHERE mysql_username='${CHECK_UID}'\G"
# Check if all the accounts are created in the labsdb boxes from the meta datastore.
$ ACCT_ID=.... # Account_id is foreign key (id from account table)
$ mysql -h m5-master.eqiad.wmnet -u labsdbaccounts -p -e "USE labsdbaccounts; SELECT * FROM labsdbaccounts.account_host WHERE account_id=${ACCT_ID}\G"
# Check the actual labsdbs if needed
$ mysql -h labsdb1009.eqiad.wmnet -u labsdbadmin -p -e "SELECT User, Password from mysql.user where User like '${CHECK_UID}';"
# Resynchronize account state on the replicas by finding missing GRANTS on each db server
$ sudo maintain-dbusers harvest-replicas
See phab:T183644 for an example of fixing automatic credential creation caused when an old LDAP user becomes a Toolforge member and has an untracked user account on toolsdb.
Regenerate kubernetes credentials for tools (.kube/config)
With admin credentials (root@controlplane node will do), run kubectl -n tool-<toolname> delete cm maintain-kubeusers
Adding K8S Components
See Portal:Toolforge/Admin/Kubernetes#Building_new_nodes
Deleting a tool
For batch or CLI deletion of tools, use the 'mark_tool' command on a cloudcontrol node:
andrew@cloudcontrol1003:~$ sudo mark_tool
usage: mark_tool [-h] [--ldap-user LDAP_USER] [--ldap-password LDAP_PASSWORD]
[--ldap-base-dn LDAP_BASE_DN] [--project PROJECT] [--disable]
[--delete] [--enable]
tool
mark_tool: error: the following arguments are required: tool
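For example, a sketch of marking a (hypothetical) tool for deletion in the tools project, using the flags shown in the usage output above:
andrew@cloudcontrol1003:~$ sudo mark_tool --project tools --delete my-obsolete-tool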
Maintainers can mark their tools for deletion using the "Disable tool" button on the tool's detail page on https://toolsadmin.wikimedia.org/. In either case, the immediate effect of disabling a tool is to stop any running jobs, prevent users from logging in as that tool, and schedule archiving and deletion for 40 days in the future.
Tool archives are stored on the tools NFS server, currently labstore1004.eqiad.wmnet:
root@labstore1004:/srv/disable-tool# ls -ltrah /srv/tools/archivedtools/
total 1.8G
drwxr-xr-x 5 root root 4.0K Jun 21 19:37 ..
-rw-r--r-- 1 root root 102K Jul 22 22:15 andrewtesttooltwo
-rw-r--r-- 1 root root 45 Oct 13 00:47 andrewtesttooltwo.tgz
-rw-r--r-- 1 root root 8.3M Oct 13 03:20 mediaplaycounts.tgz
-rw-r--r-- 1 root root 1.8G Oct 13 04:01 projanalysis.tgz
-rw-r--r-- 1 root root 1.3M Oct 13 21:05 reportsbot.tgz
drwxr-xr-x 2 root root 4.0K Oct 13 21:10 .
-rw-r--r-- 1 root root 719K Oct 13 21:10 wsm.tgz
-rw-r--r-- 1 root root 4.8K Oct 13 21:20 andrewtesttoolfour.tgz
The actual deletion process is shockingly complicated. A tool will only be archived and deleted if all of the prior steps succeed, but disabling of a tool should be a sure thing.
SSL certificates
See Portal:Toolforge/Admin/SSL_certificates.
Granting a tool write access to Elasticsearch
- Generate a random password and the mkpasswd crypt entry for it using the script new-es-password.sh. (This must be run on a host with the `mkpasswd` command installed; mkpasswd is part of the whois Debian package.)
$ ./new-es-password.sh tools.an-example
tools.example elasticsearch.ini
----
[elasticsearch]
user=tools.example
password=A3rJqgFKxa/x4NlnIhmw2cXcV92it/Zv0Yt+a7yhxCw=
----
tools.example puppet master private (hieradata/labs/tools/common.yaml)
----
profile::toolforge::elasticsearch::haproxy::elastic_users:
- name: 'tools.example'
password: '$6$FYwP3wxT4K7O9EE$OA3P5972NWJVG/WUnD240sal34/dsNabbcawItevMYO9uoR.fJBrjSABex0EDW0wlkWHID1Tf4oJoiNvYFGmy/'
- Add the private SHA512 hash to the tools puppetmaster:
$ ssh tools-puppetmaster-02.tools.eqiad1.wikimedia.cloud
$ cd /var/lib/git/labs/private
$ sudo -i vim /var/lib/git/labs/private/hieradata/labs/tools/common.yaml
... paste in SHA512 crypt data ...
:wq
$ sudo git add hieradata/labs/tools/common.yaml
$ sudo git commit -m "[local] Elasticsearch credentials for $TOOL"
- Force a puppet run on tools-elastic-[123] using the tools clushmaster
tools-clushmaster-02:~$ clush -w tools-elastic-1,tools-elastic-2,tools-elastic-3 'sudo puppet agent --test'
- Create the credentials file in the tool's $HOME:
$ ssh tools-dev.wmflabs.org
$ umask 0026
$ sudo -i touch /data/project/$TOOL/.elasticsearch.ini
$ ls -al /data/project/$TOOL/.elasticsearch.ini
# confirm the created file is automatically chmod o-rwx by umask.
# e.g. -rw-r----- 1 root tools.$TOOL 0 Jan 19 19:04 /data/project/$TOOL/.elasticsearch.ini
# If not, `sudo -i chmod o-rwx /data/project/$TOOL/.elasticsearch.ini`, hoping that NFS fixes any attempt to exploit race conditions with errno 5
$ sudo -i vim /data/project/$TOOL/.elasticsearch.ini
... paste in username and raw password in ini file format ...
:wq
- Resolve the ticket!
Package upgrades
See Managing package upgrades.
Creating a new Docker image (e.g. for new versions of Node.js)
We maintain a number of Docker images for different languages (PHP, Python, Node.js, etc.). When a new major/LTS version of a language is released, we need to build a new Docker image to support it, and eventually deprecate the image using the old version. For example, https://phabricator.wikimedia.org/T310821 was raised to add support for Node.js v16.
These are the required steps to create the new image, and make it available for use in Toolforge:
- If you need a package that is not available in Debian repositories yet, you can add a new base image to production-images (example patch). In this example, this was done by the SRE Team and we just used the base image they created as the base for the Toolforge image.
- Add a new Dockerfile in toollabs-images (example patch)
- To build the new image: on tools-docker-imagebuilder-01.tools.eqiad1.wikimedia.cloud you can find a clone of the image repository at /srv/images/toolforge, and there you can either use build.py to build a specific image or use the rebuild_all.sh bash script to rebuild all the images.
- Add the new image name in image-config
- Deploy this change to toolsbeta:
cookbook wmcs.toolforge.k8s.component.deploy --git-url https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config/
- Deploy this change to tools:
cookbook wmcs.toolforge.k8s.component.deploy --git-url https://gitlab.wikimedia.org/repos/cloud/toolforge/image-config/ --project tools --deploy-node-hostname tools-k8s-control-1.tools.eqiad1.wikimedia.cloud
- Recreate the jobs-api pods in the Toolsbeta cluster, to make them read the new ConfigMap
- SSH to the bastion:
ssh toolsbeta-sgebastion-05.toolsbeta.eqiad1.wikimedia.cloud
- Find the pod ids:
kubectl get pod -n jobs-api
- Delete the pods, K8s will replace them with new ones:
kubectl sudo delete pod -n jobs-api {pod-name}
- SSH to the bastion:
- Do the same in the Tools cluster (same instructions, but use
login.toolforge.org
as the SSH bastion)
- From a bastion, check you can run the new image with
webservice {image-name} shell
- From a bastion, check the new image is listed when running
toolforge-jobs images
- Update the Toolforge/Kubernetes wiki page to include the new image
- Update the wiki related to the specific language, if relevant. E.g. if you've created a new Node.js image, mention it in Web/Node.js.
- Finally, send an email to cloud-announce@lists.wikimedia.org to let everybody know about the new image!
Kubernetes
See Portal:Toolforge/Admin/Kubernetes
Tools-mail / Exim
See Portal:Toolforge/Admin/Exim and Portal:Cloud_VPS/Admin/Email#Operations
Emergency guides
- Setting up an emergency webservice for a tool / phab:T103056 / etherpad:T103056
- Deployment of grrrit-wm and wikibugs / phab:T102984 / etherpad:T102984
- 'tool labs is down' notification on tools.wmflabs.org / phab:T102971 / etherpad:T102971
Users and community
Some information about how to manage users and the general community, and their relationship with Toolforge.
Project membership request approval
User access requests show up in https://toolsadmin.wikimedia.org/tools/membership/
Some guidelines for account approvals, based on advice from scfc:
- If the request contains any defamatory or abusive information as part of the username(s), reason, or comments → mark as Declined and check the "Suppress this request (hide from non-admin users)" checkbox.
- You should also block the user on Wikitech and consider contacting a Steward for wider review of the SUL account.
- If the user name "looks" like a bot or someone else who could not consent to the Terms of use and Rules → mark as Declined.
- Check the status of the associated SUL account. If the user is banned on one or more wikis → mark as Declined.
- If the stated purpose is "tangible" ("I want to move my bot x to Labs", "I want to build a web app that does y", etc.) → mark as Approved.
- If you know that someone else has been working on the same problem, add a message explaining who the user should contact or where they might find more information.
- If the stated purpose is "abstract" ("research", "experimentation", etc.) and there is a hackathon ongoing or planned, the user has a non-throw-away mail address, the user has created a user page with coherent information about themselves or linked a SUL account in good standing, etc. → mark as Approved.
- Otherwise add a comment asking for clarification of their reason for use and mark as Feedback needed. The request is not really "denied", but more (indefinitely) "delayed".
Requests left in Feedback needed (waiting for more information) for more than 30 days should usually be declined with a message like "Feel free to apply again later with more complete information."
Manually associate an LDAP account with wikitech
Developer accounts in the LDAP directory are often, but not always, attached to wikitech as wiki users. The wikitech account is automatically "attached" when the developer account CN and password are used to login to the wiki.
When a user with an active developer account that is not attached to wikitech needs a password reset or blocking, a maintenance script can be run to force-attach the account:
- Login to a wikitech host (ie, labweb1001.wikimedia.org)
- Check and confirm the LDAP information for the user. Lookup with:
- Run the maintenance script
user@labweb1001:~$ mwscript /srv/mediawiki/php/extensions/LdapAuthentication/maintenance/attachLdapUser.php --wiki=labswiki --user=$user --email=$email
- Confirm the account creation by checking the wikitech new users log.
Other
How do Toolforge web services actually work?
See Portal:Toolforge/Admin/Webservice
What makes a root/Giving root access
Users who need to do administrative work in Toolforge need to be listed at several places:
- OpenStack project administrator: This allows a user to add and delete other users from the Toolforge project.
- sudo policy "roots": This allows a user to use
sudo
to becomeroot
on Toolforge instances. - 'admin' tool maintainer: This allows a user to log into infrastructure instances and perform tasks as the
admin
tool. (note that for toolsbeta you will need to add it through the command line usingmodify-ldap-group toolsbeta.admin
frommwmaint1002
) - Gerrit group "toollabs-trusted": This allows a user to
+2
changes in repositories exclusive to Toolforge.
Servicegroup log
tools.admin runs /data/project/admin/bin/toolhistory, which provides an hourly snapshot of ldaplist -l servicegroup as a git repository in /data/project/admin/var/lib/git/servicegroups
HBA: How does it work?
See wikibooks:en:OpenSSH/Cookbook/Host-based_Authentication#Client_Configuration_for_Host-based_Authentication. If things don't work, check every point listed in that guide - sshd doesn't give you much output to work with.
Central syslog servers
tools-logs-01 and tools-logs-02 are central syslog servers that receive syslog data from (all?) tools hosts. The logs are stored in /srv/syslog.
Useful administrative tools
These tools offer useful information about Toolforge itself:
- ToolsDB - Statistics about tables owned by tools
- OpenStack Browser - examine projects, instances, web proxies, and Puppet config
- Son of Grid Engine grid status
- Tools running jobs on SGE hosts in the last 7 days
Brainstorming
Sub pages
- APIs
- Archive
- Build Service
- Buildpacks
- Dynamicproxy
- Exim
- Grid
- Infrastructure tools
- Kubernetes
- Kubernetes/2020 Kubernetes cluster rebuild plan notes
- Kubernetes/Certificates
- Kubernetes/Components
- Kubernetes/Deploying
- Kubernetes/Docker-registry
- Kubernetes/Etcd
- Kubernetes/Etcd (deprecated)
- Kubernetes/Jobs framework
- Kubernetes/Networking and ingress
- Kubernetes/Pod tracing
- Kubernetes/RBAC and PSP
- Kubernetes/Upgrading Kubernetes
- Kubernetes/labels
- Kubernetes/lima-kilo
- Legacy redirector for webservices
- Maintenance
- Packaging
- Prometheus
- Redis
- Runbooks
- Runbooks/ToolsGridQueueProblem
- Runbooks/k8s-haproxy
- SSL certificates
- Services
- Son Of Grid Engine Notes
- System Overview
- Toolforge-sync-meeting
- Toolsbeta
- Toolschecker
- Webservicemonitor
- Workgroup
- Workgroup/2022-11-15
- Workgroup/2022-12-13
- Workgroup/2023-01-31
- Workgroup/2023-02-21
- emergency guides
- emergency guides/irc bot deployment
- emergency guides/single tool webservice
- emergency guides/toolforge down notification
- local packages
- puppet refactor
- replagstats
- toolhistory