This page is currently a draft.
More information and discussion about changes to this draft can be found on the talk page.
Toolschecker is a Flask application that runs various active checks on Toolforge and Cloud VPS infrastructure in response to HTTP requests. Each check is exposed as a separate URL on the checker.tools.wmflabs.org host. These URLs are monitored by Icinga for alerting purposes (see "checker.tools.wmflabs.org"). Some URLs are also monitored externally by Catchpoint.
The list of checker hosts is defined in the toollabs::checker_hosts key in https://wikitech.wikimedia.org/wiki/Hiera:Tools and is used to configure the ferm rules for Toolforge's flannel and Kubernetes etcd clusters.
Several tools are involved in the checks:
- Crontab for /cron check
- Webservice for /webservice/gridengine check
- Webservice for /webservice/kubernetes check
The /cron check expects the mtime of /data/project/toolschecker/crontest.txt to be updated every 5 minutes by a grid job executed by the toolschecker tool.
- ssh login.tools.wmflabs.org
- become toolschecker
- crontab -l
*/5 * * * * /usr/bin/jsub -N toolschecker.crontest -once -quiet touch /data/project/toolschecker/crontest.txt
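The freshness test described above can be sketched roughly as follows. This is a minimal illustration, not the checker's actual code: the function name `is_fresh`, the `MAX_AGE_SECONDS` threshold, and the one-minute grace period are assumptions. Only the file path and the 5-minute interval come from this page.

```python
import os
import time


def is_fresh(path, max_age_seconds, now=None):
    """Return True if `path` exists and was modified within `max_age_seconds`."""
    if now is None:
        now = time.time()
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        # A missing file is treated the same as a stale one.
        return False
    return (now - mtime) <= max_age_seconds


# Path taken from this page; the threshold is a hypothetical value:
# the 5-minute cron interval plus a grace period for scheduling delay.
CRON_FILE = "/data/project/toolschecker/crontest.txt"
MAX_AGE_SECONDS = 5 * 60 + 60

if __name__ == "__main__":
    print("OK" if is_fresh(CRON_FILE, MAX_AGE_SECONDS) else "CRITICAL")
```

Allowing some slack beyond the cron interval avoids flapping when the grid takes a little while to schedule the `touch` job. The threshold the real check uses may differ.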
A small script, /data/project/toolschecker/bin/long-running.sh, runs as a job that is never expected to exit. If it stops running, this check goes critical. To guard against that, the following cron entry is defined:
*/5 * * * * jlocal /data/project/toolschecker/bigbrother.sh test-long-running-stretch /data/project/toolschecker/bin/long-running.sh
The bigbrother.sh script checks for the job and restarts it if not found.
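The check-and-restart pattern that bigbrother.sh follows can be illustrated with a small sketch. The real script is a shell script whose contents are not shown on this page. The function `ensure_running` and the two callables passed to it are hypothetical stand-ins for whatever the script uses to list and submit jobs.

```python
def ensure_running(job_name, list_running_jobs, start_job):
    """Start `job_name` if it is not currently running.

    `list_running_jobs` returns an iterable of running job names and
    `start_job` submits a job by name; both are hypothetical hooks.
    Returns True if a restart was triggered, False otherwise.
    """
    if job_name in list_running_jobs():
        return False
    start_job(job_name)
    return True


if __name__ == "__main__":
    # In-memory stand-ins for the grid, for illustration only.
    running = set()
    restarted = ensure_running(
        "test-long-running-stretch",
        list_running_jobs=lambda: running,
        start_job=running.add,
    )
    print("restarted" if restarted else "already running")
```

Because the cron entry invokes the script every 5 minutes, the check needs to be idempotent: a run that finds the job already present should do nothing.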