WMDE/Wikidata/Alerts

==Alertmanager==
{{see also|Alertmanager}}
Wikidata contact information for Alertmanager is set in https://gerrit.wikimedia.org/g/operations/puppet/+/production/modules/alertmanager/templates/alertmanager.yml.erb


The status of alerts can be seen at https://alerts.wikimedia.org/?q=team%3Dwikidata
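
If you want the same list from a script, the Alertmanager HTTP API can be filtered by the <code>team</code> label. A minimal sketch in Python follows; it assumes the standard Alertmanager v2 API is exposed behind the host above (alerts.wikimedia.org primarily serves the dashboard UI, so the exact API endpoint and any access restrictions are assumptions):

<syntaxhighlight lang="python">
# Minimal sketch: list currently firing alerts labelled team=wikidata.
# Assumption: the standard Alertmanager v2 API is exposed at this host;
# alerts.wikimedia.org itself serves the dashboard UI, so the real API
# endpoint (and whether it is reachable from where you run this) may differ.
import requests

ALERTMANAGER = "https://alerts.wikimedia.org"  # assumption, see note above

resp = requests.get(
    f"{ALERTMANAGER}/api/v2/alerts",
    params={"filter": 'team="wikidata"', "active": "true"},
    timeout=10,
)
resp.raise_for_status()
for alert in resp.json():
    labels = alert.get("labels", {})
    print(labels.get("alertname"), labels.get("severity"), alert.get("startsAt"))
</syntaxhighlight>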


Internally in WMDE there is a wikidata-monitoring mailing list you can subscribe to.


All alerts coming from Grafana with the tag team: "wikidata-team" and the receiver Alertmanager will end up contacting the wikidata-monitoring internal mailing list. (See [[Alertmanager]] for more information on how to set up alerts that will contact the team.)
 


==Grafana==


Alertmanager handles the alerts from the Wikidata alerts dashboard on Grafana.


The dashboard can be found here: https://grafana.wikimedia.org/d/TUJ0V-0Zk/wikidata-alerts
=== Maxlag: Above 10 for 1 hour ===
In the past this has been caused by:
* dispatch lag being high because it was waiting for replication, which was behind because a db server was overloaded by a long-running query that was not correctly killed
* lag on queryservice servers (see {{phab|T302330}})
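
To read the current maxlag value directly (without waiting for Grafana), you can use the MediaWiki API's <code>maxlag</code> parameter: passing -1 makes the request fail immediately with a maxlag error that reports the current lag. A minimal sketch (the user-agent string is just an example):

<syntaxhighlight lang="python">
# Minimal sketch: ask the Wikidata API for the current maxlag value.
# With maxlag=-1 the request always fails with a "maxlag" error whose
# payload reports the current lag -- the same value this alert fires on.
import requests

resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={"action": "query", "format": "json", "maxlag": -1},
    headers={"User-Agent": "wikidata-alerts-runbook-check/0.1 (example)"},
    timeout=10,
)
error = resp.json().get("error", {})
print(error.get("code"), error.get("lag"), error.get("info"))
</syntaxhighlight>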


=== Edits: Wikidata edit rate ===
The edit rate on Wikidata can be a good indicator that something somewhere is wrong, although it will not always indicate exactly what that is.


You can view the edits dashboard at https://grafana.wikimedia.org/d/000000170/wikidata-edits
If maxlag is high, that might be a reason for a low edit rate.


You may want to investigate what is going on with the API (as all edits go via the API) https://grafana.wikimedia.org/d/000000559/api-requests-breakdown?refresh=5m&orgId=1&var-metric=p50&var-module=wb*
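
As a quick cross-check of the dashboard, you can also count recent edits straight from the API; a minimal sketch (the one-minute window and user-agent string are arbitrary choices):

<syntaxhighlight lang="python">
# Minimal sketch: count Wikidata edits in the last minute via the
# recentchanges API, as a rough cross-check of the Grafana edit rate.
# A single request returns at most rclimit rows (500 for anonymous
# clients); hitting that cap just means the rate is at least that high,
# which is fine because this alert is about the rate being unusually low.
from datetime import datetime, timedelta, timezone
import requests

now = datetime.now(timezone.utc)

resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={
        "action": "query",
        "list": "recentchanges",
        "rctype": "edit|new",
        "rcstart": now.strftime("%Y-%m-%dT%H:%M:%SZ"),                         # newer end
        "rcend": (now - timedelta(minutes=1)).strftime("%Y-%m-%dT%H:%M:%SZ"),  # older end
        "rclimit": "max",
        "rcprop": "timestamp",
        "format": "json",
    },
    headers={"User-Agent": "wikidata-alerts-runbook-check/0.1 (example)"},
    timeout=10,
)
edits = resp.json()["query"]["recentchanges"]
print(f"{len(edits)} edits in the last minute")
</syntaxhighlight>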

=== API: Max p95 execute time for write modules ===
Investigate the wb api @ https://grafana.wikimedia.org/d/000000559/api-requests-breakdown?refresh=5m&orgId=1&var-metric=p50&var-module=wb*


In the past this has been caused by:
* s8 db being overloaded, often for a fixable reason
* Memcached being overloaded, which in the past has indicated UBN ("Unbreak Now!") issues


=== Termbox Request Errors ===
 
This kind of error occurs when Wikibase is unable to reach the [https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service Termbox Service], i.e. the HTTP request itself fails and is unlikely to have reached its destination. This error does ''not'' get triggered by erroneous responses, so it means there is a problem on the MediaWiki/Wikibase side or a network issue.
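
When triaging, the useful distinction is therefore between a request that never reaches the service (which is what fires this alert) and a service that answers with an error (which does not). A minimal sketch of that check; the URL below is a hypothetical placeholder, the real service address is documented on the SSR Service page and in the Wikibase configuration:

<syntaxhighlight lang="python">
# Minimal sketch: distinguish "request never reached the service"
# (the case that fires this alert) from "service answered with an error"
# (which does not fire it).
# TERMBOX_URL is a hypothetical placeholder -- look up the real address
# in the Wikibase configuration or on the SSR Service page.
import requests

TERMBOX_URL = "http://termbox.example.internal/termbox"  # hypothetical

try:
    resp = requests.get(TERMBOX_URL, timeout=5)
except requests.ConnectionError as exc:
    print(f"request failed before reaching the service: {exc}")
except requests.Timeout:
    print("request timed out")
else:
    # Reachable but unhappy: a different problem from the one this alert describes.
    print(f"service reachable, responded with HTTP {resp.status_code}")
</syntaxhighlight>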
 
=== Change Dispatching ===
See [[WMDE/Wikidata/Dispatching]] and [https://doc.wikimedia.org/Wikibase/master/php/md_docs_topics_change-propagation.html Wikibase: Change propagation]. More metrics can be found on https://grafana-rw.wikimedia.org/d/hGFN2TH7z/edit-dispatching-via-jobs
 
==== Number of Rows in wb_changes table ====
 
The <code>wb_changes</code> table serves as the "buffer" from which the DispatchChanges job collects changes to dispatch to the client wikis. If that table keeps growing, it implies that there might be a problem with the job and that changes are not getting dispatched.
 
The alert is currently set at 30,000 (30K) rows. We may want to adjust that value as we learn more about the typical number of rows in that table and how it behaves as [[phab:T292728|T292728]] and [[phab:T292609|T292609]] get resolved.
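
When this alert fires, a quick way to see the current size of the buffer is to count the rows on a wikidatawiki replica. A minimal sketch; the host and credentials are hypothetical placeholders (on a maintenance host the interactive <code>sql wikidatawiki</code> wrapper is the usual route):

<syntaxhighlight lang="python">
# Minimal sketch: count the rows currently sitting in wb_changes.
# Host and credentials are hypothetical placeholders -- point this at a
# wikidatawiki replica using whatever credential mechanism you normally use.
import pymysql

conn = pymysql.connect(
    host="wikidatawiki.replica.example",  # hypothetical placeholder
    user="reader",
    password="...",
    database="wikidatawiki",
)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM wb_changes")
        (row_count,) = cur.fetchone()
        print(f"wb_changes currently holds {row_count} rows (alert threshold: 30,000)")
finally:
    conn.close()
</syntaxhighlight>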
 
'''Possible causes for spikes in this table in the past:'''
* deployments of changes to the helmfile controlling these things ([https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/730846 example]) cause a (short) interruption to the queueing and running of jobs, and thus a spike in the number of rows in the <code>wb_changes</code> table. Usually, that spike is gone quickly.
 
==== DispatchChanges normal job backlog ====
 
This job distributes changes on entities to the client wikis subscribed to them. If the backlog keeps growing, that could mean that not enough capacity is available to run this job as often as it needs to run. See [https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/725936 725936: changeprop-jobqueue: Increase concurrancy of DispatchChanges to 15] for how to increase that capacity, if that is the problem.
 
The alert is currently set at 10 minutes (600,000 milliseconds) as 10 minutes is the [[:en:Service-level_objective|SLO]] for the duration of the entire process from a change happening at Wikidata to it appearing in the Recent Changes in the client wikis. The typical backlog is between 0.5 seconds and 1 second.
 
==== Delay injecting Recent Changes, aggregated across client wikis ====
 
This is the actual time between a change being made and it being inserted into the recent changes of the client wiki. The alert is on the 99th percentile of the metric, aggregated across all client wikis. The [[:en:Service-level_objective|SLO]] for this is 10 minutes; the alerting value is 60 minutes.


==Oozie Job==


Sometimes these jobs will fail for random reasons.
 
They will be restarted, so there is no need to worry about a first failure.
 
If things continue to fail, contact WMF analytics to investigate on IRC in #wikimedia-analytics.
