You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Incident documentation/20160610-ORES: Difference between revisions

From Wikitech-static
imported>Halfak
 
imported>Halfak
No edit summary
Line 17: Line 17:


== Actionables ==
<includeonly>
* Investigate why we were not paged when the downtime started [[Phab:T137592]]
* [[Phab:T137592]]
 
</includeonly>


[[Category:Incident documentation]]

Revision as of 21:18, 14 June 2016

This is a template for an Incident Report. Replace notes with your own description.

Summary

ORES was down for an unknown number of hours today due to a broken configuration file (99-redis.yaml).

Timeline

At least 6 hours pass.

  • 2016-06-10 @ 1930 UTC -- 503 errors and timeouts were noted
  • 2016-06-10 @ 2030 UTC -- 99-redis.yaml files were deleted and the workers were restarted. Service was restored.

Conclusions

https://github.com/wikimedia/operations-puppet/commit/78119152c47b7873fdd7bd0c38a356b5bff27226 should not have been merged. We need a better testing process around puppet merges to make sure that they don't take down the service. Unlike a deploy, there's no clear event at which puppet is run.

Also, this downtime did not cause a paging event.
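
As a rough illustration of the kind of check that could have flagged this outage, here is a minimal sketch of an HTTP probe that treats 5xx responses and timeouts as critical. It is not the monitoring that was or is in place for ORES. The function names `classify` and `probe`, the state labels, and the default timeout are all hypothetical, and the URL to probe is left to the caller because this report does not name one.

```python
import socket
import urllib.error
import urllib.request


def classify(status):
    """Map an HTTP status code (or None for no response) to a check state.

    Hypothetical helper: the state names are illustrative only.
    """
    if status is None:
        return "CRITICAL"  # no response at all: timeout or connection failure
    if 500 <= status <= 599:
        return "CRITICAL"  # server-side failure, such as the 503s noted above
    if 200 <= status <= 299:
        return "OK"
    return "WARNING"  # anything else (e.g. 4xx) deserves a look


def probe(url, timeout=5.0):
    """Fetch url once and return the check state for the result."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; the code is on the exception.
        return classify(err.code)
    except (urllib.error.URLError, socket.timeout, OSError):
        # Connection refused, DNS failure, or timeout: treat as no response.
        return classify(None)
```

A check like this only helps if a CRITICAL result is actually routed to a pager, which is the gap the actionable below is meant to investigate.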

Actionables

  • Investigate why we were not paged when the downtime started Phab:T137592