Incidents/2022-02-22 wdqs updater codfw
document status: draft
Summary
Incident ID | 2022-02-22 wdqs updater codfw | Start | 2022-02-22 17:47:00 |
---|---|---|---|
Task | T302340 | End | 2022-02-22 19:27:00 |
People paged | 0 | Responder count | 3 |
Coordinators | Ryan Kemper | Affected metrics/SLOs | updateQueryServiceLag (Grafana) |
Impact | For about two hours, WDQS updates failed to be processed. As a consequence, bots and tools were unable to edit Wikidata during this time. | | |
Summary | WDQS updaters stopped processing updates because of a Flink failure in codfw. The API maxlag feature on Wikidata.org is configured to incorporate WDQS lag, and the updateQueryServiceLag service transfers this datapoint from Prometheus to MediaWiki. Because bots generally opt in to the "maxlag" parameter to be good citizens, and because the metric was configured to consider both eqiad and codfw, their edits were rejected for two hours. | | |
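For illustration, here is a minimal sketch (not taken from any specific bot framework) of the pattern most well-behaved bots use: pass maxlag with each write request and back off when the API answers with the "maxlag" error code and a Retry-After header. The endpoint, token handling, and the 5-second threshold shown here are simplified assumptions.

```python
import time
import requests

API = "https://www.wikidata.org/w/api.php"

def request_with_maxlag(session, params, maxlag=5, retries=3):
    """Send an API request with maxlag; back off while the server reports lag.

    When the lag figure (which on Wikidata incorporates WDQS update lag)
    exceeds `maxlag` seconds, the API refuses the request with error code
    "maxlag" and a Retry-After header instead of performing the edit.
    """
    params = {**params, "maxlag": maxlag, "format": "json"}
    for _ in range(retries):
        resp = session.post(API, data=params)
        data = resp.json()
        if data.get("error", {}).get("code") == "maxlag":
            # Server reports it is lagged: wait as instructed, then retry.
            time.sleep(int(resp.headers.get("Retry-After", 5)))
            continue
        return data
    raise RuntimeError("gave up: reported lag stayed above maxlag")
```

During this incident, bots following this pattern kept receiving the maxlag error and backing off, which is why bot editing on Wikidata effectively stopped until codfw was removed from the lag calculation.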
Timeline
2022-02-22:
17:30 Search dev deploys a version upgrade (0.3.103) of the flink application to codfw for wdqs
17:31 The flink application is unable to restore from the savepoint
17:51 Search dev does not find a solution to unblock the situation and asks for a depool of wdqs@codfw (so that users no longer see stale results when hitting wdqs@codfw)
17:52 (traffic switched to eqiad) <gehel> !log depooling WDQS codfw (internal + public) - issues with deployment of new updater version on codfw
19:00 wikidata maxlag alert is triggered even though codfw is depooled (known limitation: phab:T238751)
19:20 wdqs@codfw is removed from the wikidata maxlag calculation (bots can resume editing)
19:20 Search dev rolls WDQS codfw flink state back to a previously saved checkpoint, restoring the processing of updates in WDQS. Within a few minutes lag catches up and the user impact resolves.
19:25 <ryankemper> !log T302330 `ryankemper@cumin1001:~$ sudo -E cumin '*mwmaint*' 'run-puppet-agent'` (getting https://gerrit.wikimedia.org/r/c/operations/puppet/+/764875 out)
19:27 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org
20:00 WCQS version 0.3.104 is deployed, which includes a fix for WCQS failures (https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/764864). Note: this patch addressed some WCQS failures but was not the primary cause of the WDQS failures.
2022-02-23
14:00 investigation of the root cause shows that flink can no longer start properly in k8s; the app is restarted in YARN as a workaround
18:00 the flink app is still unable to run from k8s@codfw
2022-02-24
10:00 Search devs link the root cause to a poor implementation of the Swift client protocol and decide to switch to an S3 client; the app will remain running in YARN while the team moves away from the Swift client.
2022-03-08
10:00 The flink app is able to start on k8s@codfw thanks to the switch to the S3 client (see the configuration sketch below)
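As an illustration of the change referenced above, this is a minimal sketch of what pointing Flink checkpoint and savepoint storage at an S3-compatible endpoint can look like in flink-conf.yaml. The bucket name, endpoint, and credentials are placeholders, and the exact plugin and settings used in production may differ.

```yaml
# Sketch only: bucket, endpoint and credentials are placeholders.
# Assumes the flink-s3-fs-presto (or flink-s3-fs-hadoop) filesystem plugin is enabled.
state.checkpoints.dir: s3://example-flink-state/checkpoints
state.savepoints.dir: s3://example-flink-state/savepoints
s3.endpoint: https://object-store.example.org
s3.path.style.access: true
s3.access-key: <access-key>
s3.secret-key: <secret-key>
```

Using a standard S3 filesystem plugin instead of the custom Swift client lets the job write and restore its checkpoints and savepoints reliably, which is what allowed the application to start again on k8s@codfw.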
Documentation
- https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=8&orgId=1&var-cluster_name=wdqs&from=1645548333076&to=1645559701497 Graph of affected lag
Actionables
- https://phabricator.wikimedia.org/T238751 (pre-existing ticket) would have prevented the period in which Wikidata edits could not get through despite the affected hosts having already been depooled (see the hypothetical sketch below)
TODO: Add the #Sustainability (Incident Followup) and the #SRE-OnFIRE (Pending Review & Scorecard) Phabricator tag to these tasks.
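To make the idea behind T238751 concrete, here is a purely hypothetical Python sketch of feeding only pooled clusters into the maxlag figure; the function, data shapes, and example values are illustrative assumptions, not the actual updateQueryServiceLag implementation.

```python
# Hypothetical sketch: names and data shapes are illustrative, not the real service.
from typing import Dict

def max_relevant_lag(lag_by_cluster: Dict[str, float],
                     pooled: Dict[str, bool]) -> float:
    """Return the WDQS lag that should feed into Wikidata's maxlag.

    Only clusters that are actually pooled (serving traffic) are considered,
    so a depooled datacenter that is still catching up does not block bot edits.
    """
    relevant = [lag for cluster, lag in lag_by_cluster.items()
                if pooled.get(cluster, False)]
    return max(relevant, default=0.0)

# Example: codfw is heavily lagged but depooled, so only eqiad's lag counts.
print(max_relevant_lag({"eqiad": 2.0, "codfw": 5400.0},
                       {"eqiad": True, "codfw": False}))  # -> 2.0
```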
Scorecard
Rubric | Question | Score |
---|---|---|
People | Were the people responding to this incident sufficiently different than the previous N incidents? (0/5pt) | |
 | Were the people who responded prepared enough to respond effectively? (0/5pt) | |
 | Did fewer than 5 people get paged? (0/5pt) | |
 | Were pages routed to the correct sub-team(s)? | |
 | Were pages routed to online (working hours) engineers? (0/5pt; score 0 if people were paged after-hours) | |
Process | Was the incident status section actively updated during the incident? (0/1pt) | |
 | If this was a major outage noticed by the community, was the public status page updated? If the issue was internal, was the rest of the organization updated with relevant incident statuses? (0/1pt) | |
 | Is there a phabricator task for the incident? (0/1pt) | |
 | Are the documented action items assigned? (0/1pt) | |
 | Is this a repeat of an earlier incident? (-1 per prev occurrence) | |
 | Is there an open task that would prevent this incident / make mitigation easier if implemented? (0/-1pt per task) | |
Tooling | Did the people responding have trouble communicating effectively during the incident due to the existing or lack of tooling? (0/5pt) | |
 | Did existing monitoring notify the initial responders? (1pt) | |
 | Were all engineering tools required available and in service? (0/5pt) | |
 | Was there a runbook for all known issues present? (0/5pt) | |
 | Total score | |