Incidents/2022-03-27 api
document status: draft
Summary
Incident ID | 2022-03-27 api | Start | 2022-03-27 14:36 |
---|---|---|---|
Task | | End | 2022-03-28 12:39 |
People paged | N/A | Responder count | 8 |
Coordinators | Alexandros | Affected metrics/SLOs | |
Impact | Errors and elevated latencies for the MediaWiki API cluster | | |
Summary | A template change on itwiki triggered transclusion updates to many pages. Changeprop (with retries) issued thousands of requests to the API cluster to reparse the transcluding pages, including their page summaries, which are rendered by Mobileapps. | | |
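To make the fan-out concrete, here is a back-of-the-envelope sketch. The page and retry counts below are illustrative assumptions, not measured values; the concurrency of 200 is the pre-fix setting mentioned later in the timeline.
 # Illustrative arithmetic only: PAGES and ATTEMPTS are assumptions, not measurements.
 PAGES=20000        # pages transcluding the edited template
 ATTEMPTS=3         # initial reparse plus an assumed two Changeprop retries
 CONCURRENCY=200    # Changeprop transclusion concurrency before the 2022-03-28 change
 echo "up to $((PAGES * ATTEMPTS)) summary re-renders, $CONCURRENCY in flight at once against the API cluster"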
Timeline
All times in UTC.
2022-03-27
14:36: PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page - OUTAGE BEGINS
14:47: elukey checks access logs on mw1312 - user agent Mobileapps/WMF predominant (67325)
14:55: RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page
15:00: elukey checks access logs on mw1314 - user agent Mobileapps/WMF predominant (57540)
15:03: (potentially unrelated) <akosiaris>sigh, looking at logstash and seeing that mobileapps in codfw is so heavily throttled by kubernetes
15:15: RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page
15:33: PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad
15:39: RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad
15:39 Incident opened. Alexandros Kosiaris becomes IC.
15:45 Suspicions raised regarding the Mobileapps user agent, which was responsible for the majority of requests to the API cluster. Mobileapps is an internal service.
15:49 Realization that Mobileapps in codfw is routinely throttled, leading to increased errors and latencies.
16:09 Realization that an update to https://it.wikipedia.org/w/index.php?title=Template:Avviso_utente , a template on itwiki, led to this via changeprop, RESTBase and mobileapps - OUTAGE ENDS
19:09 Pages again: PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2933 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver - OUTAGE BEGINS AGAIN
19:25 _joe_: restarting php on mw1380
19:35 _joe_: $ sudo cumin -b1 -s20 'A:mw-api and P{mw13[56-82].eqiad.wmnet}' 'restart-php7.2-fpm'
19:56 <_joe_> I restarted a few envoys of the servers that had more connections active in lvs, and now things look more balanced - OUTAGE ENDS AGAIN
[This was because Envoy’s long-lived upstream connections prevented the saturation imbalance from self-correcting.]
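The per-user-agent counts at 14:47 and 15:00 above can be reproduced roughly as follows. This is a sketch only; the access-log path and the quoted-field position of the user agent are assumptions based on a standard Apache combined log format.
 # Hedged sketch: count requests per user agent on one API appserver.
 # Log path and awk field are assumptions (Apache combined/vhost_combined format).
 awk -F'"' '{print $6}' /var/log/apache2/other_vhosts_access.log \
   | sort | uniq -c | sort -rn | head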
2022-03-28
12:08: Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad - OUTAGE BEGINS AGAIN
12:13: Emperor, _joe_ and jayme confirmed the same problem as yesterday
12:39: jayme deployed a changeprop change lowering the concurrency for transclusion updates (from 200 to 100): https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/774462 - OUTAGE ENDS AGAIN
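For context, a change like the 12:39 one is normally rolled out from the deployment-charts helmfile tree. The sketch below is illustrative only; the directory and environment names are assumptions rather than details taken from the incident.
 # Hedged sketch of deploying a changeprop values change (paths/environments assumed).
 cd /srv/deployment-charts/helmfile.d/services/changeprop
 helmfile -e eqiad diff     # review the lowered transclusion concurrency (200 -> 100)
 helmfile -e eqiad apply    # roll it out to the eqiad cluster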
Detection
The consequence of the issue, namely not having enough idle PHP-FPM workers available, was detected in a timely manner by Icinga multiple times:
- 14:36: PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page
- 15:33: PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad
- 19:09: Pages again: PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2933 lt 0.3 (the fraction of idle workers had dropped to 0.2933, below the 0.3 threshold)
Unfortunately it took a considerable amount of time to pin down the root cause.
Conclusions
What went well?
- Multiple people responded
- Automated monitoring detected the incident
- Graphs and dashboard showcased the issue quickly
What went poorly?
- The link between changeprop, mobileapps, restbase and the API cluster wasn't drawn quickly, causing a prolonged and flapping outage
- No engineers experienced with changeprop were around.
Where did we get lucky?
- One of the responders linked the https://it.wikipedia.org/w/index.php?title=Template:Avviso_utente template change to the event.
How many people were involved in the remediation?
- 8: 7 SREs and 1 software engineer from the Performance team
Links to relevant documentation
- https://wikitech.wikimedia.org/wiki/Changeprop
- Mobileapps (service)
Actionables
- Mobileapps is often throttled in codfw: T305482
- Limit changeprop transclusion concurrency: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/774462
Scorecard
Rubric | Question | Score |
---|---|---|
People | Were the people responding to this incident sufficiently different than the previous N incidents? (0/5pt) | 0 |
 | Were the people who responded prepared enough to respond effectively? (0/5pt) | 2 |
 | Did fewer than 5 people get paged? (0/5pt) | 0 |
 | Were pages routed to the correct sub-team(s)? (0/5pt) | 0 |
 | Were pages routed to online (working hours) engineers? (0/5pt) (score 0 if people were paged after-hours) | 0 |
Process | Was the incident status section actively updated during the incident? (0/1pt) | 1 |
 | If this was a major outage noticed by the community, was the public status page updated? If the issue was internal, was the rest of the organization updated with relevant incident statuses? (0/1pt) | N/A |
 | Is there a phabricator task for the incident? (0/1pt) | 0 |
 | Are the documented action items assigned? (0/1pt) | 0 |
 | Is this a repeat of an earlier incident? (-1 per prev occurrence) | -1 |
 | Is there an open task that would prevent this incident / make mitigation easier if implemented? (0/-1pt per task) | 0 |
Tooling | Did the people responding have trouble communicating effectively during the incident due to the existing or lack of tooling? (0/5pt) | 5 |
 | Did existing monitoring notify the initial responders? (1pt) | 1 |
 | Were all engineering tools required available and in service? (0/5pt) | 5 |
 | Was there a runbook for all known issues present? (0/5pt) | 0 |
Total score | | 13 |