
Difference between revisions of "Portal:Toolforge/Admin/Toolschecker"

From Wikitech-static
imported>Bstorm
(→‎/cron: Add some useful troubleshooting and update the information.)
imported>Majavah
(→‎/etcd/flannel: no longer there)
Line 4:
  == Servers ==

- * tools-checker-03.tools.eqiad1.wikimedia.cloud
+ * tools-checker-04.tools.eqiad1.wikimedia.cloud

- This list is defined in the <code>toollabs::checker_hosts</code> key in https://wikitech.wikimedia.org/wiki/Hiera:Tools and is used in configuring the ferm rules for Toolforge's flannel and Kubernetes etcd clusters.
+ This list is defined in the <code>profile::toolforge::checker_hosts</code> Hiera key and is used in configuring the ferm rules for Toolforge's Kubernetes etcd clusters. The server needs to be manually configured as a grid submit host.

  == Tools ==
Line 41:
  === /db/wikilabelsrw ===
  === /dns/private ===
- === /etcd/flannel ===
  === /etcd/k8s ===
  === /grid/continuous/stretch ===

Revision as of 17:23, 11 May 2021

Toolschecker is a Flask application that runs various active checks on Toolforge and Cloud VPS infrastructure in response to HTTP requests. Each check is exposed as a separate URL on the checker.tools.wmflabs.org host. These URLs are monitored by Icinga for alerting purposes (see "checker.tools.wmflabs.org").
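
As a rough illustration of the one-check-per-URL layout described above, the sketch below maps URL paths to check functions. It uses a plain dictionary rather than Flask routing so that it runs with the standard library alone. The names CHECKS, check, self_check and run_check, the behaviour of the /self check, and the 200/503 status codes with "OK"/"FAIL" bodies are all assumptions made for illustration; none of them are taken from the Toolschecker source.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry: URL path -> function returning True when healthy.
CHECKS: Dict[str, Callable[[], bool]] = {}


def check(path: str):
    """Register a check function under a URL path (illustrative decorator)."""
    def decorator(func: Callable[[], bool]) -> Callable[[], bool]:
        CHECKS[path] = func
        return func
    return decorator


@check("/self")
def self_check() -> bool:
    # Assumed behaviour: if this code runs at all, the checker itself is up.
    return True


def run_check(path: str) -> Tuple[int, str]:
    """Run the check for a path and return an (HTTP status, body) pair."""
    func = CHECKS.get(path)
    if func is None:
        return 404, "unknown check"
    try:
        healthy = func()
    except Exception as exc:  # a crashing check counts as a failure
        return 503, f"FAIL: {exc}"
    return (200, "OK") if healthy else (503, "FAIL")
```

A monitoring system would then only need to look at the status code returned for each path.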

Servers

  • tools-checker-04.tools.eqiad1.wikimedia.cloud

This list is defined in the profile::toolforge::checker_hosts Hiera key and is used in configuring the ferm rules for Toolforge's Kubernetes etcd clusters. The server needs to be manually configured as a grid submit host.

Tools

Several tools are involved in the checks:

toolschecker
    Crontab for /cron check
toolschecker-ge-ws
    Webservice for /webservice/gridengine check
toolschecker-k8s-ws
    Webservice for /webservice/kubernetes check

Checks

/cron

Expects the mtime of /data/project/toolschecker/crontest.txt to be updated every 5 minutes by a grid job executed by the toolschecker tool.
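
A freshness test of this kind can be sketched as below. This is a minimal illustration, not the checker's actual code: the function name is_fresh and the 10-minute tolerance are assumptions (the source only says the file should be touched every 5 minutes).

```python
import os
import time
from typing import Optional

# Path taken from the check description above.
CRONTEST_PATH = "/data/project/toolschecker/crontest.txt"

# Assumption: allow some slack beyond the 5-minute cron interval.
MAX_AGE_SECONDS = 10 * 60


def is_fresh(path: str, max_age: float = MAX_AGE_SECONDS,
             now: Optional[float] = None) -> bool:
    """Return True if the file at path was modified within max_age seconds."""
    if now is None:
        now = time.time()
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # A missing or unreadable file is treated as a failed check.
        return False
    return (now - mtime) <= max_age
```

Calling is_fresh(CRONTEST_PATH) on a Toolforge host would then report whether the cron job has touched the file recently enough.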

Troubleshooting:

  • ssh login.toolforge.org
  • become toolschecker
  • crontab -l
    */5 * * * * /usr/bin/jsub -N toolschecker.crontest -once -quiet -j y -o /data/project/toolschecker/logs/crontest.log touch /data/project/toolschecker/crontest.txt
  • You can check the log at /data/project/toolschecker/logs/crontest.log
    • Note that a log line such as "[Sun Apr 11 22:45:07 2021] there is a job named 'toolschecker.crontest' already active" is normal. It is caused by latency in the whole system and does not indicate a major issue.
  • Run qstat to see which jobs are currently in the grid. A job that errored is quite likely to still be visible with the E status.
  • If you have a job in the E status:
    • qstat -j $jobid will give you grid-specific error messages (such as an LDAP issue)
    • If a job is no longer in the system but you know its ID, qacct -j $jobid can sometimes tell you more about it. This takes longer because it reads the accounting files rather than the current state of the system. The accounting file *is* rotated, so older jobs will not be in there forever.
    • If an errored job is hanging around, it will block the next execution (unique names are required per user and queue), so run qdel $jobid if you suspect it is doing that.
  • The problem may be with the grid's cron host, currently tools-sgecron-01.tools.eqiad1.wikimedia.cloud. It is a single point of failure and is where cron jobs actually live.
  • If there is an overall grid problem, most of our documentation for that is in Portal:Toolforge/Admin, and Brooke is a good escalation point if needed.
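
When reading crontest.log, it can help to separate the benign "already active" lines mentioned above from anything else. The following is a minimal sketch of that filtering; the function name and the substring match are assumptions, and only the one log message quoted above is known to be harmless.

```python
from typing import Iterable, List

# Substring of the log line that the notes above describe as normal.
BENIGN_MARKER = "there is a job named 'toolschecker.crontest' already active"


def unexpected_lines(lines: Iterable[str]) -> List[str]:
    """Return non-empty log lines that are not the known benign message."""
    result = []
    for line in lines:
        stripped = line.strip()
        if stripped and BENIGN_MARKER not in stripped:
            result.append(stripped)
    return result
```

An open file object for /data/project/toolschecker/logs/crontest.log can be passed directly as lines, since iterating over a file yields one line at a time.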

/db/toolsdb

/db/wikilabelsrw

/dns/private

/etcd/k8s

/grid/continuous/stretch

There is a small script at /data/project/toolschecker/bin/long-running.sh that runs as a job and is meant to run forever. If it stops running, this checker will go critical. To prevent that, a cron job is defined as:

*/5 * * * * jlocal /data/project/toolschecker/bigbrother.sh test-long-running-stretch /data/project/toolschecker/bin/long-running.sh

The bigbrother.sh script checks for the job and restarts it if not found.
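
The contents of bigbrother.sh are not shown here, but the check-and-restart behaviour described above can be sketched as follows. This is a Python illustration, not the actual script; the function name ensure_running and the injected list_jobs and submit callables are hypothetical.

```python
from typing import Callable, Iterable


def ensure_running(job_name: str,
                   command: str,
                   list_jobs: Callable[[], Iterable[str]],
                   submit: Callable[[str, str], None]) -> bool:
    """Submit the job if no job with job_name is currently listed.

    Returns True when a restart was triggered, False when the job was
    already present.
    """
    if job_name in set(list_jobs()):
        return False
    submit(job_name, command)
    return True
```

In practice list_jobs would wrap a query of the grid's job list and submit would wrap a job submission command. Both are left abstract because the commands the real script uses are not documented on this page.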

/grid/start/stretch

/k8s/nodes/ready

/ldap

/nfs/dumps

/nfs/home

/nfs/secondary_cluster_showmount

/redis

/self

/webservice/gridengine

/webservice/kubernetes