
Incidents/2022-02-06 wdqs updater

Revision as of 17:49, 8 April 2022 by imported>Krinkle (Krinkle moved page Incident documentation/2022-02-06 wdqs updater to Incidents/2022-02-06 wdqs updater)

document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2022-02-06 wdqs updater
Task: T301147
Start: 2022-02-06T23:00:00
End: 2022-02-07T06:20:00
People paged: 0
Responder count: 0
Coordinators: 0
Affected metrics/SLOs: WDQS Updater Lag, Wikidata MaxLag
Impact: Edits throttled (mainly bots)
Summary: The streaming updater stopped functioning properly because a k8s node misbehaved

Documentation:

Actionables

  • phab:T305068: alert when flink does not have the capacity it expects
  • phab:T293063: adapt/create runbooks/cookbooks for the wdqs streaming updater
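
As an illustration of the first actionable, the check below is a minimal sketch of comparing the capacity Flink reports against the capacity it is expected to have. It is not the alert that was implemented for T305068. The function name and the expected values are hypothetical, and the keys `taskmanagers` and `slots-total` are assumed to match the cluster overview returned by Flink's REST API; verify them against the deployed Flink version.

```python
from typing import List, Mapping


def missing_capacity(
    overview: Mapping[str, int],
    expected_taskmanagers: int,
    expected_slots: int,
) -> List[str]:
    """Return human-readable problems; an empty list means capacity is as expected.

    `overview` is assumed to be the parsed JSON of a Flink cluster overview,
    with the keys "taskmanagers" and "slots-total" (an assumption, see above).
    """
    problems = []

    # A missing key is treated as zero capacity so that it triggers an alert.
    taskmanagers = overview.get("taskmanagers", 0)
    slots_total = overview.get("slots-total", 0)

    if taskmanagers < expected_taskmanagers:
        problems.append(
            f"task managers: expected {expected_taskmanagers}, found {taskmanagers}"
        )
    if slots_total < expected_slots:
        problems.append(
            f"task slots: expected {expected_slots}, found {slots_total}"
        )
    return problems


if __name__ == "__main__":
    # Hypothetical numbers: one task manager (and its slots) has gone missing.
    sample = {"taskmanagers": 5, "slots-total": 20}
    for problem in missing_capacity(sample, expected_taskmanagers=6, expected_slots=24):
        print("ALERT:", problem)
```

Keeping the comparison in a pure function makes it easy to test without a running cluster; fetching the overview and routing the alert are left out on purpose.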

Scorecard

Incident Engagement™ ScoreCard
Rubric / Question / Score

People
  • Were the people responding to this incident sufficiently different than the previous N incidents? (0/5pt)
  • Were the people who responded prepared enough to respond effectively? (0/5pt)
  • Did fewer than 5 people get paged? (0/5pt)
  • Were pages routed to the correct sub-team(s)?
  • Were pages routed to online (working hours) engineers? (0/5pt; score 0 if people were paged after-hours)

Process
  • Was the incident status section actively updated during the incident? (0/1pt)
  • If this was a major outage noticed by the community, was the public status page updated? If the issue was internal, was the rest of the organization updated with relevant incident statuses? (0/1pt)
  • Is there a phabricator task for the incident? (0/1pt)
  • Are the documented action items assigned? (0/1pt)
  • Is this a repeat of an earlier incident? (-1 per previous occurrence)
  • Is there an open task that would prevent this incident or make mitigation easier if implemented? (0/-1pt per task)

Tooling
  • Did the people responding have trouble communicating effectively during the incident because of existing tooling or a lack of tooling? (0/5pt)
  • Did existing monitoring notify the initial responders? (1pt)
  • Were all required engineering tools available and in service? (0/5pt)
  • Was there a runbook for all known issues present? (0/5pt)

Total score