You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Incident documentation/20160610-ORES: Difference between revisions

From Wikitech-static
imported>Halfak
 
imported>Krinkle
 
(One intermediate revision by one other user not shown)
Line 1: Line 1:
''This is a template for an Incident Report.  Replace notes with your own description.''
#REDIRECT [[Incidents/20160610-ORES]]
 
== Summary ==
ORES was down for an unknown number of hours today due to a broken configuration file (<code>99-redis.yaml</code>).
 
== Timeline ==
 
* ??? -- https://github.com/wikimedia/operations-puppet/commit/78119152c47b7873fdd7bd0c38a356b5bff27226 was merged to the ''production'' branch of wikimedia puppet
''at least 6 hours pass''
* 2016-06-10 @ 1930 UTC -- 503 errors and timeouts were noted
* 2016-06-10 @ 2030 UTC -- The 99-redis.yaml files were deleted and the workers were restarted.  Service was restored.
 
== Conclusions ==
https://github.com/wikimedia/operations-puppet/commit/78119152c47b7873fdd7bd0c38a356b5bff27226 should not have been merged.  We need a better testing process around puppet merges to make sure that they don't take down the service.  Unlike a deploy, there's no clear event at which puppet is run.
 
Also, this downtime did not cause a paging event. 
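
As a rough illustration of the kind of check that could have turned the 503 errors and timeouts into a page, here is a minimal sketch in Python. It is not the monitoring that was in place or that was later adopted. The function names <code>probe</code> and <code>should_page</code>, the 5-second timeout, and the threshold of 3 consecutive failures are all hypothetical, and the URL to probe is left as a parameter.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return True if `url` answers with an HTTP status below 500 in time.

    Hypothetical helper: a 5xx response (such as 503), a timeout, or a
    connection error all count as a failed probe.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; only 5xx counts as "down".
        return err.code < 500
    except OSError:
        # Covers URLError, socket timeouts and refused connections.
        return False


def should_page(results, threshold=3):
    """Return True if the last `threshold` probe results are all failures.

    `results` is a list of booleans, oldest first, as returned by probe().
    The threshold is an assumed value, chosen to avoid paging on one blip.
    """
    if len(results) < threshold:
        return False
    return not any(results[-threshold:])
```

A scheduler would call <code>probe</code> at a fixed interval, append each result to a list, and page once <code>should_page</code> returns true. Requiring several consecutive failures trades a few minutes of detection delay for fewer false alarms.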
 
== Actionables ==
<includeonly>
* [[Phab:T137592]]
</includeonly>
 
[[Category:Incident documentation]]

Latest revision as of 17:45, 8 April 2022