
Incident documentation/20161127-API

Revision as of 23:14, 28 November 2016 by Filippo Giunchedi (Draft incident report)



On 2016-11-27 at around 02:00 UTC, HHVM on the API cluster suffered from excessive memory consumption, eventually leading to an API outage. One contributing factor was an expensive template being requested repeatedly as part of batch re-renderings. This is likely a recurrence of a similar event on 2016-10-17, with similar underlying causes. The investigation was carried out in


  • ~2.00 Memory usage starts increasing
  • 2.27 First page for api.svc.eqiad.wmnet dispatched; Filippo starts investigating
  • ~2.40 First round of rolling restarts to mitigate the issue
  • ~3.30 5xx are down after three or four peaks; investigation continues. Many failing requests from euwiki are noted

20161127-api outage.png

  • 5.00 Symptoms are back; Ori deploys a bandaid to mw1290 that rejects (with HTTP 500) requests for euwiki coming from Parsoid
  • ~5.30 5xx are back
  • 5.50 Bandaid is confirmed working
  • 6.20 Bandaid is deployed to the API cluster
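
The bandaid in the timeline above can be illustrated with a minimal sketch. This is not the code that was deployed: the report only states that requests for euwiki coming from Parsoid were answered with a 500. The `Request` type, the `status_for` function, and the idea of recognising Parsoid by a user-agent substring are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical match criteria; the report does not say how Parsoid
# requests were actually identified.
BLOCKED_WIKI = "euwiki"
BLOCKED_AGENT_MARKER = "parsoid"


@dataclass
class Request:
    """Hypothetical, simplified view of an incoming API request."""
    wiki: str
    user_agent: str


def status_for(request: Request) -> int:
    """Return 500 for requests matching the ban, 200 otherwise."""
    from_parsoid = BLOCKED_AGENT_MARKER in request.user_agent.lower()
    if request.wiki == BLOCKED_WIKI and from_parsoid:
        return 500  # reject early, before any expensive rendering happens
    return 200


if __name__ == "__main__":
    print(status_for(Request("euwiki", "Parsoid/0.6")))  # matches the ban
    print(status_for(Request("enwiki", "Parsoid/0.6")))  # other wiki, allowed
```

Rejecting on both conditions keeps the filter narrow: other clients of euwiki and Parsoid requests for other wikis are unaffected.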


Effective protection against excessive resource usage is still an open issue, in particular for the API cluster. Further, there is no per-usage isolation, which leads to a larger blast radius than necessary.