Incidents/2022-03-27 api

document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2022-03-27 api
Start: 2022-03-27 14:36
End: 2022-03-28 12:39
Task:
People paged: N/A
Responder count: 8
Coordinators: Alexandros Kosiaris
Affected metrics/SLOs:
Impact: Errors and elevated latencies for the MediaWiki API cluster
Summary: A template change on itwiki triggered transclusion updates to many pages. Changeprop (with retries) issued thousands of requests to the API cluster to reparse the transcluding pages, including their page summaries, which are generated by Mobileapps.

Timeline

All times in UTC.

2022-03-27

14:36: PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page - OUTAGE BEGINS

14:47: elukey checks access logs on mw1312 - user agent Mobileapps/WMF predominant (67325)

14:55: RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page

15:00: elukey checks access logs on mw1314 - user agent Mobileapps/WMF predominant (57540)
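[A minimal sketch of the kind of per-user-agent request count referenced above; the access log path and the Apache combined log format are assumptions, not details taken from the incident:]

  # Count requests per user agent on an API appserver; with the combined log
  # format the user agent is the sixth double-quote-delimited field.
  # The log path is an assumption.
  sudo awk -F'"' '{print $6}' /var/log/apache2/other_vhosts_access.log \
    | sort | uniq -c | sort -rn | head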

15:03: (potentially unrelated) <akosiaris> sigh, looking at logstash and seeing that mobileapps in codfw is so heavily throttled by kubernetes

15:15: RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page

15:33: PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad

15:39: RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad

15:39: Incident opened. Alexandros Kosiaris becomes IC.

15:45: Suspicions raised regarding the Mobileapps user agent that was making the majority of requests to the API cluster. Mobileapps is an internal service.

15:49: Realization that Mobileapps in codfw is routinely throttled, leading to increased errors and latencies.

16:09: Realization that an update of https://it.wikipedia.org/w/index.php?title=Template:Avviso_utente , a template on itwiki, led to this via changeprop, RESTBase and Mobileapps - OUTAGE ENDS

19:09 Pages again: PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2933 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver - OUTAGE BEGINS AGAIN

19:25: _joe_: restarting php on mw1380

19:35: _joe_: $ sudo cumin -b1 -s20 'A:mw-api and P{mw13[56-82].eqiad.wmnet}' 'restart-php7.2-fpm'

19:56: <_joe_> I restarted a few envoys of the servers that had more connections active in lvs, and now things look more balanced - OUTAGE ENDS AGAIN

[This was because Envoy's long-lived upstream connections prevented the saturation imbalance from self-correcting.]
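[A hedged sketch of what such a targeted Envoy restart could look like, mirroring the cumin invocation above; the host selection and the envoyproxy.service unit name are assumptions, not the exact commands used:]

  # Hypothetical: restart the Envoy sidecar on API appservers holding the most
  # active LVS connections, one host at a time with a 20s pause between hosts
  # (host range copied from the php-fpm restart above for illustration).
  sudo cumin -b1 -s20 'A:mw-api and P{mw13[56-82].eqiad.wmnet}' 'systemctl restart envoyproxy.service'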

2022-03-28

12:08: Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad - OUTAGE BEGINS AGAIN

12:13: Emperor, _joe_ and jayme confirmed the same problem as yesterday

12:39: jayme deployed a changeprop change lowering the concurrency for transclusion updates (from 200 to 100): https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/774462 - OUTAGE ENDS AGAIN
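[For context, a hedged sketch of how a deployment-charts values change like this is typically rolled out from the deployment server; the path, service directory name and environments are assumptions, and the actual change is the Gerrit patch linked above:]

  # Hypothetical rollout of the updated changeprop chart values to both clusters.
  cd /srv/deployment-charts/helmfile.d/services/changeprop
  helmfile -e eqiad -i apply
  helmfile -e codfw -i apply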

Detection

The consequence of the issue, namely not having enough idle PHP-FPM workers available, was detected in a timely manner by Icinga multiple times:

  • 14:36: PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page
  • 15:33: PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad
  • 19:09 Pages again: PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2933 lt 0.3

Unfortunately it took a considerable amount of time to pin down the root cause.

Conclusions

What went well?

  • Multiple people responded
  • Automated monitoring detected the incident
  • Graphs and dashboard showcased the issue quickly

What went poorly?

  • The link between changeprop, Mobileapps, RESTBase and the API cluster wasn't drawn quickly, causing a prolonged and flapping outage.
  • No responders experienced with changeprop were around.

Where did we get lucky?

How many people were involved in the remediation?

  • 8: 7 SREs and 1 software engineer from the Performance Team

Links to relevant documentation

Actionables

Scorecard

Incident Engagement™ ScoreCard

People
  • Were the people responding to this incident sufficiently different than the previous N incidents? (0/5pt): 0
  • Were the people who responded prepared enough to respond effectively? (0/5pt): 2
  • Did fewer than 5 people get paged? (0/5pt): 0
  • Were pages routed to the correct sub-team(s)?: 0
  • Were pages routed to online (working hours) engineers? (0/5pt) (score 0 if people were paged after-hours): 0

Process
  • Was the incident status section actively updated during the incident? (0/1pt): 1
  • If this was a major outage noticed by the community, was the public status page updated? If the issue was internal, was the rest of the organization updated with relevant incident statuses? (0/1pt): N/A
  • Is there a phabricator task for the incident? (0/1pt): 0
  • Are the documented action items assigned? (0/1pt): 0
  • Is this a repeat of an earlier incident? (-1 per prev occurrence): -1
  • Is there an open task that would prevent this incident / make mitigation easier if implemented? (0/-1pt per task): 0

Tooling
  • Did the people responding have trouble communicating effectively during the incident due to the existing or lack of tooling? (0/5pt): 5
  • Did existing monitoring notify the initial responders? (1pt): 1
  • Were all engineering tools required available and in service? (0/5pt): 5
  • Was there a runbook for all known issues present? (0/5pt): 0

Total score