Incident documentation/20160407-Mediawiki: Difference between revisions
imported>Gehel (Created page with "== Summary == All Mediawiki servers were serving mostly HTTP 5XX for about 5 minutes at 1350 UTC == Timeline == * 13:50 UTC: switch CirrusSearch traffic to codfw, with a bugg...") |
imported>Gehel |
Revision as of 08:40, 8 April 2016
Summary
All MediaWiki servers served mostly HTTP 5XX errors for about 5 minutes, starting at 13:50 UTC.
Timeline
- 13:50 UTC: CirrusSearch traffic is switched to codfw with a buggy configuration (see https://gerrit.wikimedia.org/r/#/c/282163/ for the correction)
- almost immediately, HTTP 5XX errors rise to 400K errors / minute
- 13:53 UTC: the change is rolled back
- 13:55 UTC: the error rate is back to a reasonable level
Conclusions
- unit testing wmf-config is hard
- datacenter-related configuration changes cannot be tested on labs
- carefully testing this kind of change on the test nodes (mw1017/mw1099/mw2017/mw2099) before wider deployment is the minimum required
Actionables
Immediate issues have been addressed. This incident is mainly about human error (mine) and insufficient testing (me again).
- a standardized and automated canary test system would help mitigate this kind of issue, but it is probably a long-term action outside the scope of a post-incident action.
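As a rough illustration of what an automated canary check could look like, the sketch below probes the test nodes named above and refuses to proceed if too many responses are HTTP 5XX. It is not the tooling used at the time: the function names, the probed URL, the number of attempts, and the 599 placeholder status are all assumptions made for this example, and a real check would need to send requests that actually exercise the changed configuration (for instance a search request with the right Host header).

```python
from typing import Callable, Iterable
from urllib import error, request

# Test nodes listed in the Conclusions section of this report.
CANARY_HOSTS = ["mw1017", "mw1099", "mw2017", "mw2099"]


def fetch_status(host: str, path: str = "/", timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET to the host.

    The URL shape is an assumption for illustration only.
    Connection failures are reported as 599 (a placeholder, not a real status).
    """
    try:
        with request.urlopen(f"http://{host}{path}", timeout=timeout) as resp:
            return resp.status
    except error.HTTPError as exc:
        # urllib raises HTTPError for 4xx/5xx responses; keep the real code.
        return exc.code
    except (error.URLError, OSError):
        return 599


def canary_passes(
    hosts: Iterable[str],
    probe: Callable[[str], int] = fetch_status,
    attempts: int = 10,
    max_5xx_ratio: float = 0.0,
) -> bool:
    """Probe each host `attempts` times; pass only if the 5XX ratio is acceptable."""
    total = 0
    failures = 0
    for host in hosts:
        for _ in range(attempts):
            total += 1
            if probe(host) >= 500:
                failures += 1
    # With no probes at all there is no evidence the change is safe.
    return total > 0 and failures / total <= max_5xx_ratio
```

A deployment script could call `canary_passes(CANARY_HOSTS)` after applying the change to the test nodes only, and roll back instead of continuing when it returns `False`. The probe is injectable so the decision logic can be tested without network access.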