
Incidents/2022-02-22 wdqs updater codfw


document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2022-02-22 wdqs updater codfw
Task: T302340
Start: 2022-02-22 17:47:00
End: 2022-02-22 19:27:00
People paged: 0
Responder count: 3
Coordinators: Ryan Kemper
Affected metrics/SLOs: updateQueryServiceLag (Grafana)
Impact: For about two hours, WDQS updates failed to be processed. As a consequence, bots and tools were unable to edit Wikidata during this time.
Summary: The WDQS updaters in codfw stopped processing updates due to a Flink failure in that data center.

The API maxlag feature is configured on Wikidata.org to incorporate WDQS lag. The updateQueryServiceLag service exists to transfer this datapoint from Prometheus to MediaWiki. Because bots generally opt in, to be friendly, by setting the "maxlag" parameter, and because the metric was configured to consider both eqiad and codfw, their edits were rejected for two hours.
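
To make the mechanism concrete, here is a minimal sketch of that opt-in behaviour in Python; it is illustrative rather than the code of any particular bot, and the maxlag value of 5 and the retry policy are arbitrary choices. The maxlag parameter, the "maxlag" error code and the Retry-After header are standard MediaWiki API behaviour: when the reported lag exceeds the threshold, the API refuses the request and asks the client to wait, which is why opted-in bots could not edit while WDQS lag was high.

import time
import requests

API = "https://www.wikidata.org/w/api.php"

def api_call(params, max_retries=5):
    """Call the MediaWiki API with maxlag=5 and back off while the
    reported lag (including WDQS lag on Wikidata) is above the threshold."""
    params = {**params, "format": "json", "maxlag": 5}
    for _ in range(max_retries):
        resp = requests.get(API, params=params)
        data = resp.json()
        if data.get("error", {}).get("code") != "maxlag":
            return data
        # The API asks the client to wait; honor Retry-After (default 5s).
        time.sleep(int(resp.headers.get("Retry-After", 5)))
    raise RuntimeError("lag stayed above the maxlag threshold; giving up")

# A harmless read request; during the incident, write requests sent with
# maxlag were rejected in the same way for about two hours.
print(api_call({"action": "query", "meta": "siteinfo"}))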

Timeline

2022-02-22:

17:30 Search dev deploys a version upgrade (0.3.103) of the flink application to codfw for wdqs

17:31 The flink application is unable to restore from the savepoint

17:51 Search dev cannot find a way to unblock the situation and asks for wdqs@codfw to be depooled (so users no longer see stale results from wdqs@codfw)

17:52 (traffic switched to eqiad) <gehel> !log depooling WDQS codfw (internal + public) - issues with deployment of new updater version on codfw

19:00 wikidata maxlag alert is triggered even though codfw is depooled (known limitation: phab:T238751)

19:20 wdqs@codfw is removed from the wikidata maxlag calculation (bots can resume editing)
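
Why removing codfw from the calculation unblocked bots: the lag value handed to MediaWiki's maxlag check is derived from the lag of the configured WDQS clusters, so a stalled but depooled data center can still dominate it. The sketch below is hypothetical (the real updateQueryServiceLag service may aggregate differently; a worst-case max is assumed here, and the lag numbers are made up), but it shows the effect of dropping codfw from the considered list.

def wdqs_lag_for_maxlag(lag_by_dc, considered_dcs):
    """Hypothetical worst-case aggregation of WDQS lag across the data
    centers that the maxlag calculation is configured to consider."""
    return max(lag_by_dc[dc] for dc in considered_dcs)

lag = {"eqiad": 12.0, "codfw": 5400.0}               # codfw updater stalled (made-up numbers)
print(wdqs_lag_for_maxlag(lag, ["eqiad", "codfw"]))  # 5400.0 -> bot edits rejected
print(wdqs_lag_for_maxlag(lag, ["eqiad"]))           # 12.0   -> bots can edit again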

19:20 Search dev rolls WDQS codfw Flink state back to a previously saved checkpoint, restoring the processing of updates in WDQS. Within a few minutes the lag catches up and the user impact is resolved.

19:25 <ryankemper> !log T302330 `ryankemper@cumin1001:~$ sudo -E cumin '*mwmaint*' 'run-puppet-agent'` (getting https://gerrit.wikimedia.org/r/c/operations/puppet/+/764875 out)

19:27 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org

20:00 WCQS version 0.3.104 is deployed, which includes a fix for WCQS failures (https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/764864). Note: this change addressed some WCQS failures but was not the primary cause of the WDQS failures.

2022-02-23

14:00 Investigation of the root cause shows that Flink can no longer start properly in k8s; the app was restarted in YARN

18:00 the flink app is still unable to run from k8s@codfw

2022-02-24

10:00 Search devs link the root cause to a poor implementation of the Swift client protocol and decide to switch to an S3 client; the app will remain running in YARN while we move away from this Swift client.

2022-03-08

10:00 The flink app is able to start on k8s@codfw thanks to the switch to the S3 client protocol
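
For context on what "switching to an S3 client" looks like in practice, here is a minimal, hypothetical sketch using boto3 against an S3-compatible endpoint; the endpoint URL, bucket name, prefix and credentials are placeholders, and this is not the updater's actual code. It only illustrates addressing the object store that holds Flink state through the S3 protocol instead of the Swift client.

import boto3  # any S3-compatible client works; everything below is a placeholder

# Hypothetical S3 API endpoint in front of the object store that holds the
# Flink checkpoints/savepoints (not the real production endpoint).
s3 = boto3.client(
    "s3",
    endpoint_url="https://object-store.example.org",
    aws_access_key_id="PLACEHOLDER_KEY",
    aws_secret_access_key="PLACEHOLDER_SECRET",
)

# List checkpoint objects the way an S3-speaking client (for example
# Flink's S3 filesystem plugin) would address them.
resp = s3.list_objects_v2(Bucket="example-flink-state", Prefix="checkpoints/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])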

Documentation

Actionables


TODO: Add the #Sustainability (Incident Followup) and the #SRE-OnFIRE (Pending Review & Scorecard) Phabricator tags to these tasks.

Scorecard

Incident Engagement™ ScoreCard (question and score, grouped by rubric)

People:
Were the people responding to this incident sufficiently different than the previous N incidents? (0/5pt)
Were the people who responded prepared enough to respond effectively? (0/5pt)
Did fewer than 5 people get paged? (0/5pt)
Were pages routed to the correct sub-team(s)?
Were pages routed to online (working hours) engineers? (0/5pt) (score 0 if people were paged after-hours)

Process:
Was the incident status section actively updated during the incident? (0/1pt)
If this was a major outage noticed by the community, was the public status page updated? If the issue was internal, was the rest of the organization updated with relevant incident statuses? (0/1pt)
Is there a phabricator task for the incident? (0/1pt)
Are the documented action items assigned? (0/1pt)
Is this a repeat of an earlier incident? (-1 per prev occurrence)
Is there an open task that would prevent this incident / make mitigation easier if implemented? (0/-1pt per task)

Tooling:
Did the people responding have trouble communicating effectively during the incident due to the existing or lack of tooling? (0/5pt)
Did existing monitoring notify the initial responders? (1pt)
Were all engineering tools required available and in service? (0/5pt)
Was there a runbook for all known issues present? (0/5pt)

Total score: