
Incident documentation/20160407-Mediawiki: Difference between revisions

imported>Gehel
 
imported>Krinkle
 
== Summary ==
#REDIRECT [[Incidents/20160407-Mediawiki]]
All MediaWiki servers served mostly HTTP 5XX responses for about 5 minutes, starting at 13:50 UTC.
 
== Timeline ==
* 13:50 UTC: switch CirrusSearch traffic to codfw, with a buggy configuration (see https://gerrit.wikimedia.org/r/#/c/282163/ for the correction)
* almost immediate rise in HTTP 5XX errors, to 400K errors per minute
* 13:53 UTC: rollback
* 13:55 UTC: error rate back to a reasonable level
 
== Conclusions ==
# unit testing wmf-config is hard
# testing datacenter-related configuration changes is not possible on labs
# carefully testing this kind of change on test nodes (mw1017/mw1099/mw2017/mw2099) is the minimum required
 
== Actionables ==
Immediate issues have been addressed. One long-term systemic issue remains:
<onlyinclude>
* a standardized and automated canary test system would help mitigate this kind of issue
</onlyinclude>
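
As an illustration of what an automated canary check could look like, here is a minimal Python sketch. It is not an existing Wikimedia tool. The host names are the test nodes listed in the Conclusions; the function names, the <code>probe</code> callback, the number of requests per host and the 5% error threshold are all assumptions chosen for the example.

```python
from typing import Callable, Iterable

# Test nodes named in the Conclusions section. Everything else in this
# sketch (names, thresholds, the probe callback) is hypothetical.
CANARY_HOSTS = ["mw1017", "mw1099", "mw2017", "mw2099"]


def error_rate(status_codes: Iterable[int]) -> float:
    """Return the fraction of responses that are HTTP 5XX."""
    codes = list(status_codes)
    if not codes:
        # No responses at all is treated as a total failure.
        return 1.0
    return sum(1 for c in codes if 500 <= c <= 599) / len(codes)


def canary_passes(
    probe: Callable[[str], int],
    hosts: Iterable[str] = CANARY_HOSTS,
    requests_per_host: int = 20,
    max_error_rate: float = 0.05,
) -> bool:
    """Probe each canary host; pass only if every host stays under the threshold."""
    for host in hosts:
        codes = [probe(host) for _ in range(requests_per_host)]
        if error_rate(codes) > max_error_rate:
            return False
    return True
```

Here <code>probe</code> stands for any function that sends one request to the given host and returns the HTTP status code. A deployment script would apply the change to the canary hosts only, call <code>canary_passes</code>, and continue to the remaining servers only if it returns <code>True</code>.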
 
[[Category:Incident documentation]]

Latest revision as of 17:45, 8 April 2022