
Incident documentation/20160610-ORES

Revision as of 20:39, 10 June 2016 by imported>Halfak (→‎Actionables)



ORES was down for an unknown number of hours on 2016-06-10 due to a broken configuration file (99-redis.yaml).


At least 6 hours pass.

  • 2016-06-10 @ 1930 UTC -- 503 errors and timeouts were noted
  • 2016-06-10 @ 2030 UTC -- 99-redis.yaml files were deleted and the workers were restarted. Service was restored.

Conclusions

The change that broke 99-redis.yaml should not have been merged. We need a better testing process around puppet merges to make sure that they don't take down the service. Unlike a deploy, there's no clear event at which puppet is run.

Also, this downtime did not cause a paging event.
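
One way to close the paging gap is an external probe that treats 503 responses and timeouts (the symptoms noted in the timeline) as failures and pages after several consecutive misses. The following is a minimal sketch only, not the monitoring ORES actually uses: the function names `probe` and `should_page`, the injectable `opener` parameter, and the threshold of 3 are all assumptions made for illustration.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return (healthy, detail) for one HTTP GET against url.

    `opener` is a hypothetical injection point so the probe can be
    exercised without network access; it defaults to urlopen.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses such as 503.
        return False, f"HTTP {err.code}"
    except OSError as err:
        # URLError and socket timeouts are both OSError subclasses.
        return False, f"unreachable: {err}"
    return status == 200, f"HTTP {status}"


def should_page(results, threshold=3):
    """Page only when the last `threshold` probes all failed.

    `results` is a list of (healthy, detail) tuples, oldest first.
    The threshold of 3 is an arbitrary illustrative value.
    """
    recent = results[-threshold:]
    return len(recent) == threshold and not any(ok for ok, _ in recent)
```

Requiring several consecutive failures before paging is a design choice that trades a slightly slower alert for fewer false pages on a single dropped request.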