Incident documentation/2021-07-26 ruwikinews DynamicPageList

document status: draft

Summary

File:Errors during 2021-07-26 DPL outage.png
Spikes in 503 errors returned to users
File:Overall traffic during 2021-07-26 DPL outage.png
Corresponding graphs of traffic during the outage, showing approximately 15% of requests receiving errors.

Following a bot import of 200,000 pages to Russian Wikinews in the span of 3 days, slow queries originating from ruwikinews's usage of the DynamicPageList extension (also known as "intersection") overloaded the s3 cluster of databases, causing php-fpm processes to hang/stall and eventually taking down all wikis. The outage was resolved by disabling the DynamicPageList extension on ruwikinews and aggressively killing queries on s3 replicas. In the worst case, DPL's database queries scale roughly with the size of the largest category being checked, which on ruwikinews is orders of magnitude larger than on other wikis. One slow query seen during the outage took more than 3 minutes to finish on an idle replica (see EXPLAIN).
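
To illustrate why these queries grow with category size, here is a minimal sketch (not the extension's actual code) of the general shape of the category-intersection query that DPL builds through MediaWiki's database layer; the category names below are hypothetical:

// Sketch only: roughly the query shape produced via DynamicPageListHooks::processQuery.
// Each category requested in a <DynamicPageList> tag adds another join against
// categorylinks, so with a category of ~200,000 members (as on ruwikinews after the
// import) the replica must examine a correspondingly large number of rows for every
// uncached render of a page using the tag.
$dbr = wfGetDB( DB_REPLICA );
$res = $dbr->select(
	[ 'page', 'c1' => 'categorylinks', 'c2' => 'categorylinks' ],
	[ 'page_id', 'page_namespace', 'page_title' ],
	[
		'c1.cl_to' => 'Published', // hypothetical: the very large category
		'c2.cl_to' => 'Politics',  // hypothetical: a second category to intersect
	],
	__METHOD__,
	[ 'ORDER BY' => 'c1.cl_timestamp DESC', 'LIMIT' => 20 ],
	[
		'c1' => [ 'JOIN', 'c1.cl_from = page_id' ],
		'c2' => [ 'JOIN', 'c2.cl_from = page_id' ],
	]
);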

Impact: For roughly 30 minutes, all wikis suffered high latencies, failed page loads, and errors due to unavailable PHP-FPM workers. Based on traffic graphs, the outage affected approximately 15% of all incoming HTTP requests to the wikis, which were either lost, served with high latency, or answered with 5XX error codes. The main impact was on uncached requests, whose availability dropped to 0% at several points during the outage, on all wikis.


Timeline

Prologue

  • 2020-09-07 and 2020-09-08: Following rapid bot imports (~100k pages in 1 day), DynamicPageList queries from ruwikinews caused problems on s3, though it did not lead to a sitewide outage. A summary of that incident is available at T262240#6449531 (TODO: create proper incident report).
  • 2021-07-14 through 2021-07-17: NewsBots imports/creates 200,000 pages on Russian Wikinews in the span of 3 days (per Wikimedia News).
  • 2021-07-26: Something happens, triggering cache invalidation of pages using the <DynamicPageList> tag.

Main outage, all times in UTC.

  • 10:30: Database overload starts OUTAGE BEGINS
  • 10:33: Pages fire for both the appserver and api_appserver clusters: "Not enough idle PHP-FPM workers for Mediawiki"
  • 10:34: Significant IRC alert spam ensues, comms move to #wikimedia-sre and #mediawiki_security
  • 10:35: "upstream connect error or disconnect/reset before headers. reset reason: overflow" on enwiki
  • 10:38: Manuel depools db2149, which seems to be the most affected DB (SAL entry)
  • 10:39: T287362 filed by users unable to access arwiki
  • 10:40: After a brief apparent recovery, the load shifts to another DB
  • 10:42: Slow query identified as coming from DynamicPageListHooks::processQuery
  • 10:42: Link to previous incident from 2020-09 established (T262240), people involved in that ticket pinged on IRC
  • 10:46-10:49: Manuel slowly repools db2149
  • 10:48: Recommendation made to disable DynamicPageList on ruwikinews instead of increasing cache TTL
  • 10:50: Incident opened (private Google Doc).
File:MariaDB traffic during 2021-07-26 DPL outage.png
MariaDB traffic during the outage showing a dramatic increase in rows read and number of open connections.
  • 10:51: Jaime sets the query killer on s3 replicas to 10 seconds for the MediaWiki user (SAL entry)
  • 10:55: Amir disables DPL on ruwikinews (SAL entry; see the configuration sketch after this timeline)
  • 10:56: Icinga recoveries start to fire
  • 10:59: Database throughput back to normal levels OUTAGE ENDS
  • 11:01: Last Icinga recovery
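
Disabling an extension for a single wiki is a small configuration change in Wikimedia's operations/mediawiki-config repository. The sketch below is illustrative only and is not copied from the actual change; the variable name follows the usual wmgUse* convention and the real key names may differ:

// wmf-config/InitialiseSettings.php (illustrative sketch, not the actual patch)
'wmgUseDynamicPageList' => [
	'default' => false,
	'wikinews' => true,    // assumed: DPL enabled for the Wikinews family
	'ruwikinews' => false, // emergency override: switch it off for ruwikinews only
],

// wmf-config/CommonSettings.php then only loads the extension when the flag is set:
if ( $wmgUseDynamicPageList ) {
	wfLoadExtension( 'intersection' );
}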

Detection

Icinga sent two pages at 10:33 for "Not enough idle PHP-FPM workers for Mediawiki" on the appserver and api_appserver clusters.

The first user report on IRC appears to have been at 10:35 in #wikimedia-sre: <RhinosF1> Meta is down.

Because this was a full outage, every host was individually alerting and so were services that depend upon MediaWiki. Each appserver triggered two alerts, like:

<icinga-wm> PROBLEM - Apache HTTP on mw2316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
<icinga-wm> PROBLEM - PHP7 rendering on mw2316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering

icinga-wm sent 145 messages to #wikimedia-operations between 10:34 and 10:36 before being kicked off the Libera Chat network for flooding. That IRC channel was unusable and discussion was moved to #wikimedia-sre and then #mediawiki_security.

Conclusions

What went well?

  • Automated monitoring detected the incident before humans did
  • Once the problematic query was identified, it was immediately linked to the previous incident
  • Once the extension was disabled and the query killer was running on s3, everything came back up

What went poorly?

  • DPL had already been identified as problematic, and the primary mitigation, limiting the concurrency of DPL queries (T263220), had never actually been deployed (see the sketch after this list for what that could look like)
  • Users did not see the standard "Wikimedia error" screen; instead they got a cryptic plaintext error message
  • icinga-wm spammed #wikimedia-operations until the channel was unusable and was then kicked off the network for flooding
  • A formal incident was only opened two-thirds of the way through the outage
  • Sysadmins who were involved in the incident response were harassed afterwards
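
As context for the stalled concurrency work above, MediaWiki already provides PoolCounter for capping how many requests may run the same expensive operation at once. The sketch below shows what wrapping the DPL query could look like; it is illustrative only and not the patch attached to T263220, and the pool name, key, and surrounding variables ($dbr, $query) are hypothetical:

// Illustrative sketch: cap concurrent DPL queries with PoolCounter.
$work = new PoolCounterWorkViaCallback(
	'DynamicPageList',      // pool type; needs a matching $wgPoolCounterConf entry
	'dpl:' . md5( $query ), // hypothetical key, limiting each distinct query separately
	[
		'doWork' => function () use ( $dbr, $query ) {
			// Run the expensive intersection query while holding a pool slot.
			return $dbr->query( $query, __METHOD__ );
		},
		'error' => function ( $status ) {
			// Too many DPL queries already running: fail fast instead of
			// piling more load onto the s3 replicas.
			return false;
		},
	]
);
$result = $work->execute();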

Where did we get lucky?

  • We did not get lucky.

How many people were involved in the remediation?

  • ~7 SREs and 1 software engineer

Links to relevant documentation

Add links to information that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, add an action item to create it.

Actionables

High-level discussion and brainstorming is happening in T287380: Decide on the future of DPL. Out of that, some specific actionables have been identified:

  • Done: 708390 Send queries to "vslow" database group (see the sketch after this list)
  • In progress: T287916 Disable DPL on wikis that aren't using it
  • Stalled: T263220 Limit concurrency of DPL queries
  • T287983 Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage
  • ...
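
For the completed "vslow" item, the general MediaWiki mechanism is to request a replica connection from a specific query group so that expensive DPL queries land on hosts reserved for slow traffic. The lines below are a minimal sketch of that mechanism, not the actual content of change 708390:

// Illustrative sketch: ask for a replica from the "vslow" group instead of the default pool.
$dbr = wfGetDB( DB_REPLICA, 'vslow' );

// Equivalent, using the load balancer service directly:
$lb = MediaWiki\MediaWikiServices::getInstance()->getDBLoadBalancer();
$dbr = $lb->getConnection( DB_REPLICA, [ 'vslow' ] );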