You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Incident documentation/2021-09-01 partial parsoid outage

From Wikitech-static
< Incident documentation
Revision as of 15:04, 29 September 2021 by imported>Alexandros Kosiaris (→‎Where did we get lucky?)
Jump to navigation Jump to search

document status: draft

Summary

At 02:05 UTC, the parse2007 server in Codfw started to spontaneously respond with fatal error, possibly due to a php-opcache corruption. At 11:20 UTC, the server was restarted per the (now common) procedure in response to such corruptions after which the errors immediately stopped. Other parse servers and MW servers were not affected.

Impact: For 9 hours, 10% of submissions to Parsoid to parse or save wiki pages were failing on all wikis.

Timeline

File:2021-09-01 Parsoid fatals.png
Fatal errors in Logstash.
File:2021-09-01 Parsoid POST 5xx.png
Failing POST requests in Grafana.

All times in UTC.

  • 02:05: Outage starts (Retroactively) Logs indicate that parse2007: Cannot declare class XWikimediaDebug in XWikimediaDebug.php started at this time. This is affecting about 10% of POST requests
  • 03:54: mmodell, as train conductor this week, notices the spike and creates T290120.
  • 12:00 Krinkle finds the report during routine triage, notices that it is still on-going at high frequency, still unattended, and investigates.
  • 12:10: The issue is investigated and understood to be a likely opcache corruption, given that the source code in question has not changed recently and is not known to be defined or referenced in an unusual manner. The usual procedure for opcache corruptions is to restart php-fpm.
  • 12:19: Effie restarts php-fpm on parse2007
  • 12:20: Outage ends

Detection

Human reporting an error.

Write how the issue was first detected. Was automated monitoring first to detect it? Or a human reporting an error?

Copy the relevant alerts that fired in this section.

Did the appropriate alert(s) fire? Was the alert volume manageable? Did they point to the problem with as much accuracy as possible?

TODO: If human only, an actionable should probably be to "add alerting".

Conclusions

What went well?

  • A simliar issue was found a year earlier at T245183#6212267 (and remembered today), which let us jump to the opcache suspicion and thus restart the php-fpm service as mitigation.
  • Logstash dashboards.
  • Grafana dashboards.

What went poorly?

  • For 9 hours, no-one triaged the task, or independently noticed the error spike, or otherwise investigated it.
  • No alerts fired.
    • There is an alert query for appserver fatals, but it did not fire because for two reasons. 1) The alert only measures overall HTTP traffic, with the majority of requests being GET rather than POST. 2) For the parsoid cluster there is a high level of timeouts and OOMs that regularly produce fatals. Thus when new deterministic fatal errors are introduced that affect user submissions, due to bad code or an unhealthy server, this alert can't fire unless it rises significantly above the combined rate of timeouts, OOMs, and other GET failures.
    • We also measure "mediawiki-new-errors" in Logstash, where this error stood out by more than 100X. However, we don't have an alert on that.
    • We also measure "appserver HTTP 5xx for POST" in Prometheus/Grafana, where this stood as a 20X increase from <0.1% to >10%. However, we don't alert on that.
  • Krinkle unable to ssh to parse2007.

Where did we get lucky?

  • Urbanecm was around to offer a workaround for the ssh issue using Keyholder.
  • Effie was around as well, who did the actual restart.
  • (To be confirmed) Are there retries at some level? I suspect not given POST retries are usualy unsafe, but there is also an apparent lack of observed impact on edit count, Restbase reqs, and VisualEditor saves.

Browsers definitely retry if a connection is prematurely closed (that is no status code is ever returned). It's part of RFC2616. See https://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html#sec8.2.4 And it includes POSTs. However there may be retries on the RESTBase level here as well. RESTBase does retries for sure, not sure whether is honors the rfc though

How many people were involved in the remediation?

  • 1 engineer.
  • 1 SRE.

Links to relevant documentation

Actionables

  • Let deployers ssh to parse hosts. T290144
  • TODO: Re-evaluate alerting strategy around "mediawiki-exceptions". We have a breakdown by cluster (web,api,job,parsoid). Do we need a breakdown by HTTP verb? (E.g. "read" GET/HEAD/OPTIONS vs write "POST/DELETE" or some such).
  • TODO: I was unable to find stats on error rates of api.php requests in Grafana. HTTP-level app server stats are insufficient since api errors are HTTP 200. The Graphite metrics for API req breakdown don't measure errors currently. The Logstash data for api-appservers errors is also insufficient since properly handled errors wouldn't be exceptions and wouldn't be logged there as such (e.g. when action=visualeditoredit finds Restbase/Parsoid respond with http 500, it responds to the client with an error. Where do we measure this?)

TODO: Add the #Sustainability (Incident Followup) Phabricator tag to these tasks.