Incident documentation/20161127-API


Summary

[DRAFT]

On 2016-11-27 at around 02:00 UTC, HHVM on the API cluster suffered from excessive memory consumption, eventually leading to an API outage. One contributing factor was an expensive template being rendered repeatedly as part of batch re-renderings. This is likely a recurrence of a similar event on 2016-10-17 (https://phabricator.wikimedia.org/T148652) with similar underlying causes. The investigation was carried out in https://phabricator.wikimedia.org/T151702.
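
For illustration only: a common way to blunt this failure mode is to coalesce concurrent renders of the same expensive template, so that a single render runs and later callers reuse its result instead of each consuming memory. The sketch below is a generic single-flight cache in Python; the class and function names are illustrative assumptions and not part of MediaWiki or of this incident's remediation.

import threading

class TemplateRenderCoalescer:
    # Wraps an expensive render function with a cache and a "single-flight"
    # guard: the first caller for a given key performs the render, concurrent
    # callers for the same key wait and then reuse the cached result.
    def __init__(self, render_fn):
        self._render_fn = render_fn
        self._lock = threading.Lock()
        self._results = {}       # key -> rendered output
        self._in_flight = {}     # key -> Event set when the render finishes

    def render(self, key):
        with self._lock:
            if key in self._results:
                return self._results[key]        # already rendered once
            event = self._in_flight.get(key)
            if event is None:
                event = threading.Event()
                self._in_flight[key] = event
                is_owner = True                  # this caller does the work
            else:
                is_owner = False                 # someone else is rendering
        if is_owner:
            result = self._render_fn(key)        # the expensive part
            with self._lock:
                self._results[key] = result
                del self._in_flight[key]
            event.set()
            return result
        event.wait()                             # wait for the owner to finish
        with self._lock:
            return self._results[key]

A caller would construct this once around the real render function, e.g. TemplateRenderCoalescer(expand_template).render(template_key), where expand_template is a hypothetical stand-in for the expensive render.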

Timeline

  • ~02:00 memory usage starts increasing
  • 02:27 first page for api.svc.eqiad.wmnet is dispatched; Filippo starts investigating
  • ~02:40 first round of rolling restarts to help lessen the issue
  • ~03:30 5xx rates are down after three or four peaks; investigation continues. It is noted that many of the failing requests are for euwiki
[Image: 20161127-api outage.png]
  • 05:00 symptoms are back; ori deploys a band-aid to mw1290 that bans (responds 500 to) requests for euwiki coming from Parsoid (see the sketch after this list)
  • ~05:30 5xx errors are back
  • 05:50 the band-aid is confirmed working
  • 06:20 the band-aid is deployed to the API servers
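
The band-aid itself is not included in this report; as a rough illustration of the idea, the following WSGI-style Python sketch rejects API requests for euwiki that come from Parsoid with an HTTP 500. The hostname and User-Agent matching rules here are assumptions, not the actual patch that was deployed.

def reject_euwiki_parsoid(app):
    # WSGI middleware: short-circuit requests for euwiki that come from
    # Parsoid (identified here, as an assumption, by the User-Agent header)
    # with an HTTP 500, and pass everything else through untouched.
    def middleware(environ, start_response):
        host = environ.get('HTTP_HOST', '')
        user_agent = environ.get('HTTP_USER_AGENT', '')
        if 'eu.wikipedia.org' in host and 'Parsoid' in user_agent:
            start_response('500 Internal Server Error',
                           [('Content-Type', 'text/plain')])
            return [b'euwiki requests from Parsoid are temporarily rejected\n']
        return app(environ, start_response)
    return middleware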

Conclusions

Effective protection against excessive resource usage is still an open issue, in particular for the API cluster. Furthermore, there is no per-use-case isolation, which leads to a larger blast radius than necessary.
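
As an illustration of what per-use-case isolation could mean, one option is to cap how many API workers any single caller can hold at once, so an expensive client degrades only its own requests rather than the whole cluster. The sketch below is a generic per-caller concurrency limiter in Python; the names and the default limit of 10 are assumptions, not an existing Wikimedia mechanism.

import threading
from collections import defaultdict

class PerCallerLimiter:
    # Tracks how many requests each caller has in flight and rejects new ones
    # once a caller exceeds its cap, so one heavy user cannot occupy every
    # worker on the cluster.
    def __init__(self, max_in_flight_per_caller=10):
        self._limit = max_in_flight_per_caller
        self._lock = threading.Lock()
        self._active = defaultdict(int)   # caller id -> in-flight count

    def try_acquire(self, caller):
        with self._lock:
            if self._active[caller] >= self._limit:
                return False              # over quota: fail fast
            self._active[caller] += 1
            return True

    def release(self, caller):
        with self._lock:
            self._active[caller] -= 1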

Actionables