WMDE/Wikidata/Alerts: Difference between revisions

imported>Jakob
(Add "Termbox Request Errors" section under grafana alerts)
imported>Ladsgroup
Revision as of 21:24, 8 September 2021

Alertmanager

Wikidata contact information for Alertmanager is set in https://gerrit.wikimedia.org/g/operations/puppet/+/production/modules/alertmanager/templates/alertmanager.yml.erb

The status of alerts can be seen at https://alerts.wikimedia.org/?q=team%3Dwikidata
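The link above filters the alerts view by a `team` label through the `q` query parameter. As a minimal sketch, the same URL can be built for any team value; the helper name `alerts_url` is hypothetical and only illustrates the URL encoding seen in the link.

```python
from urllib.parse import urlencode

# Base URL taken from the link above.
ALERTS_BASE = "https://alerts.wikimedia.org/"


def alerts_url(team: str) -> str:
    """Build an alerts URL filtered by team (hypothetical helper)."""
    # urlencode percent-encodes "=" inside the value, giving "team%3D<team>".
    return ALERTS_BASE + "?" + urlencode({"q": f"team={team}"})


if __name__ == "__main__":
    print(alerts_url("wikidata"))
```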

Internally in WMDE there is a wikidata-monitoring mailing list you can subscribe to.

All alerts coming from Grafana that carry the tag team: "wikidata-team" and use Alertmanager as their receiver will notify the internal wikidata-monitoring mailing list. (See Alertmanager for more information on how to set up alerts that contact the team.)

Grafana

Alertmanager handles the alerts defined on the Wikidata alerts dashboard on Grafana.

The dashboard can be found here: https://grafana.wikimedia.org/d/TUJ0V-0Zk/wikidata-alerts

Maxlag: Above 10 for 1 hour

In the past this has been caused by:

  • high dispatch lag, caused by waiting for replication, in turn caused by an overloaded db server, in turn caused by a long-running query that was not correctly killed
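When this alert fires, it can help to read the current lag directly from the API. The sketch below relies on the MediaWiki `maxlag` parameter: a negative value makes the API refuse the request and report the current lag in the error payload. It assumes the error arrives as a JSON body with `code` and `lag` fields; the function names and the User-Agent string are placeholders, not part of any existing tool.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.wikidata.org/w/api.php"


def extract_lag(payload: dict):
    """Return the lag in seconds from a maxlag error payload, or None."""
    error = payload.get("error", {})
    if error.get("code") != "maxlag":
        return None
    lag = error.get("lag")
    return float(lag) if lag is not None else None


def fetch_current_lag(user_agent: str):
    """Ask the API for the current lag by sending maxlag=-1 (performs a network call)."""
    params = urllib.parse.urlencode(
        {"action": "query", "format": "json", "maxlag": "-1"}
    )
    request = urllib.request.Request(
        f"{API_URL}?{params}", headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_lag(json.load(response))


if __name__ == "__main__":
    # Placeholder User-Agent; replace it with something that identifies you.
    print(fetch_current_lag("maxlag-check-example/0.1"))
```

A value that stays above 10 matches the alert condition; the `info` field of the same error payload usually names what is lagging.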

Dispatch Script starts

A script is run regularly by Cron to inform other wikis about changes in Wikidata. If that script has not run for an extended period, that indicates a problem. The script runs on test.wikidata.org as well, so there is an alert for that too, which may reveal a problem there before it reaches production with the train a day later.

The previous incident: T258062: Wikidata Change Dispatching Broken

See also WMDE/Wikidata/Dispatching and Wikibase: Change propagation

Edits: Wikidata edit rate

The edit rate on Wikidata can be a good indicator that something somewhere is wrong, although it will not always indicate exactly what that is.

You can view the edits dashboard at https://grafana.wikimedia.org/d/000000170/wikidata-edits

If maxlag is high, that might be the reason for a low edit rate.

You may want to investigate what is going on with the API, as all edits go via the API: https://grafana.wikimedia.org/d/000000559/api-requests-breakdown?refresh=5m&orgId=1&var-metric=p50&var-module=wb*

API: Max p95 execute time for write modules

Investigate the wb* API modules at https://grafana.wikimedia.org/d/000000559/api-requests-breakdown?refresh=5m&orgId=1&var-metric=p50&var-module=wb*

In the past this has been caused by:

  • s8 db being overloaded, often for a fixable reason
  • Memcached being overloaded, which in the past has indicated UBN (Unbreak Now!) issues

Termbox Request Errors

This kind of error occurs when Wikibase is unable to reach the Termbox Service, i.e. the HTTP request itself fails and is unlikely to have reached its destination. This error is *not* triggered by erroneous responses, so it points to a problem on the MediaWiki/Wikibase side or to network issues.
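The distinction between a failed request and an erroneous response can be illustrated with Python's standard library. This is only an analogy, not Wikibase's actual implementation: `classify_failure` is a hypothetical helper showing that an error response (the service answered) is a different case from a request that never completed.

```python
import urllib.error


def classify_failure(exc: BaseException) -> str:
    """Classify an exception raised while calling an HTTP service."""
    # HTTPError is a subclass of URLError, so it must be checked first:
    # it means the service answered, just with an error status.
    if isinstance(exc, urllib.error.HTTPError):
        return "error-response"
    # No response at all: DNS failure, refused connection, timeout, etc.
    if isinstance(exc, (urllib.error.URLError, TimeoutError, ConnectionError)):
        return "request-failed"
    return "unknown"


if __name__ == "__main__":
    print(classify_failure(urllib.error.URLError("connection refused")))
```

Only the "request-failed" case corresponds to what this alert counts.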

Oozie Job

These jobs sometimes fail for no obvious reason.

Failed jobs are restarted, so a first failure is no cause for concern.

If jobs keep failing, ask WMF Analytics to investigate, on IRC in #wikimedia-analytics.