Incident documentation/20160407-Mediawiki

Revision as of 20:55, 7 April 2016 by Gehel

Summary

All MediaWiki servers served mostly HTTP 5XX errors for about 5 minutes, starting at 13:50 UTC.

Timeline

  • 13:50 UTC: CirrusSearch traffic is switched to codfw with a buggy configuration (see https://gerrit.wikimedia.org/r/#/c/282163/ for the fix)
  • almost immediately, HTTP 5XX errors rise to 400K errors per minute
  • 13:53 UTC: the change is rolled back
  • 13:55 UTC: the error rate is back to a reasonable level

Conclusions

  1. unit testing wmf-config is hard
  2. configuration changes that depend on the datacenter cannot be tested on labs
  3. carefully testing this kind of change on the test nodes (mw1017/mw1099/mw2017/mw2099) is the minimum required

Actionables

Immediate issues have been addressed. The long-term, systemic issue remains:

  • a standardized and automated canary test system would help mitigate this kind of issue
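As a rough illustration of what such a canary gate could check, here is a minimal Python sketch. It is not an existing Wikimedia tool: the function names `error_rate` and `canary_passes` and the 1% threshold are assumptions made for this example, and collecting the HTTP status codes from the canary hosts is left out.

```python
from typing import Iterable


def error_rate(status_codes: Iterable[int]) -> float:
    """Return the fraction of sampled responses with an HTTP 5xx status."""
    codes = list(status_codes)
    if not codes:
        # No samples means no evidence; refuse to guess.
        raise ValueError("no responses sampled from canary hosts")
    errors = sum(1 for code in codes if 500 <= code <= 599)
    return errors / len(codes)


def canary_passes(status_codes: Iterable[int], max_error_rate: float = 0.01) -> bool:
    """Decide whether a change may proceed beyond the canary hosts.

    max_error_rate is a hypothetical threshold (1% here), not a documented value.
    """
    return error_rate(status_codes) <= max_error_rate
```

A deployment script could apply the change to the test nodes only, sample responses from them, and continue to the rest of the fleet only if `canary_passes` returns `True`. Raising on an empty sample is deliberate: a canary that received no traffic should block the rollout rather than silently approve it.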