Incidents/2022-02-06 wdqs updater
document status: draft
Summary
Incident ID | 2022-02-06 wdqs updater | Start | 2022-02-06T23:00:00 |
---|---|---|---|
Task | T301147 | End | 2022-02-07T06:20:00 |
People paged | 0 | Responder count | 0 |
Coordinators | 0 | Affected metrics/SLOs | WDQS Updater Lag, Wikidata MaxLag |
Impact | Edits throttled (mainly bots) | | |
Summary | The streaming updater stopped functioning properly because a k8s node misbehaved | | |
Documentation:
Actionables
- phab:T305068: alert when flink does not have the capacity it expects
- phab:T293063: adapt/create runbooks/cookbooks for the wdqs streaming updater
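
The capacity alert proposed in phab:T305068 could be sketched roughly as follows. This is only an illustration, not the check that was implemented: the function names, the expected slot count, and the sample numbers are hypothetical, and the input is assumed to resemble the JSON body of Flink's REST `/overview` endpoint, which reports a `slots-total` field.

```python
from typing import Mapping


def capacity_shortfall(overview: Mapping[str, int], expected_slots: int) -> int:
    """Return how many task slots are missing relative to the expected capacity.

    `overview` is assumed to look like the JSON body of Flink's REST
    `/overview` endpoint; a missing `slots-total` field is treated as zero.
    """
    total = int(overview.get("slots-total", 0))
    return max(0, expected_slots - total)


def should_alert(overview: Mapping[str, int], expected_slots: int) -> bool:
    """Alert when the cluster exposes fewer slots than the job expects."""
    return capacity_shortfall(overview, expected_slots) > 0


# Hypothetical numbers for illustration only; not taken from the incident.
healthy = {"taskmanagers": 6, "slots-total": 24, "slots-available": 0}
degraded = {"taskmanagers": 5, "slots-total": 20, "slots-available": 0}

print(should_alert(healthy, expected_slots=24))
print(should_alert(degraded, expected_slots=24))
```

Comparing against a fixed expected value, rather than against the previous reading, means the check keeps firing for as long as a lost node has not been replaced.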
Scorecard
Rubric | Question | Score |
---|---|---|
People | Were the people responding to this incident sufficiently different than the previous N incidents? (0/5pt) | |
People | Were the people who responded prepared enough to respond effectively? (0/5pt) | |
People | Did fewer than 5 people get paged? (0/5pt) | |
People | Were pages routed to the correct sub-team(s)? | |
People | Were pages routed to online (working hours) engineers? (0/5pt) (score 0 if people were paged after-hours) | |
Process | Was the incident status section actively updated during the incident? (0/1pt) | |
Process | If this was a major outage noticed by the community, was the public status page updated? If the issue was internal, was the rest of the organization updated with relevant incident statuses? (0/1pt) | |
Process | Is there a phabricator task for the incident? (0/1pt) | |
Process | Are the documented action items assigned? (0/1pt) | |
Process | Is this a repeat of an earlier incident? (-1 per prev occurrence) | |
Process | Is there an open task that would prevent this incident / make mitigation easier if implemented? (0/-1pt per task) | |
Tooling | Did the people responding have trouble communicating effectively during the incident due to the existing or lack of tooling? (0/5pt) | |
Tooling | Did existing monitoring notify the initial responders? (1pt) | |
Tooling | Were all engineering tools required available and in service? (0/5pt) | |
Tooling | Was there a runbook for all known issues present? (0/5pt) | |
Total score | | |