
Incident documentation/2018-11-29 ores

imported>Krinkle
#REDIRECT [[Incidents/2018-11-29 ores]]
== Summary ==
ores.wikimedia.org returned HTTP 500 for all score requests for roughly three and a half hours on 29 November 2018, starting at about 06:25 UTC. The cause was a config change made as part of upgrading the Celery version used by ORES from 3 to 4, which changed the task serializer. The change only took effect when the ORES services were next restarted.
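
As an illustration of why a serializer change can break every request, the following is a minimal sketch using only the Python standard library. It is not ORES or Celery code: the helper names (<code>encode_task</code>, <code>decode_task</code>, <code>roundtrip_ok</code>) and the sample payload are hypothetical, and it only assumes that the producer and the worker disagree on whether task payloads are pickle or JSON.

```python
import json
import pickle


def encode_task(args, serializer):
    """Serialize task arguments as a producer configured with `serializer` might."""
    if serializer == "json":
        return json.dumps(args).encode("utf-8")
    if serializer == "pickle":
        return pickle.dumps(args)
    raise ValueError(f"unknown serializer: {serializer}")


def decode_task(payload, accepted):
    """Decode a payload using the single format the worker is configured to accept."""
    if accepted == "json":
        return json.loads(payload.decode("utf-8"))
    if accepted == "pickle":
        return pickle.loads(payload)
    raise ValueError(f"unknown serializer: {accepted}")


def roundtrip_ok(args, producer_serializer, worker_accepts):
    """Return True if the worker can decode what the producer sent."""
    try:
        payload = encode_task(args, producer_serializer)
        return decode_task(payload, worker_accepts) == args
    except Exception:
        # Any decode failure here stands in for a failed task (an HTTP 500 upstream).
        return False


# Hypothetical task arguments, for illustration only.
sample = {"rev_id": 123, "models": ["damaging"]}

print(roundtrip_ok(sample, "pickle", "pickle"))  # both sides agree
print(roundtrip_ok(sample, "json", "json"))      # both sides agree
print(roundtrip_ok(sample, "pickle", "json"))    # mismatch: decoding fails
```

In this sketch the mismatch fails for every payload, not just some, which is consistent with all score requests failing at once.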
 
== Timeline ==
* November 28th, 12:04 UTC: [https://gerrit.wikimedia.org/r/c/operations/puppet/+/476250 The problematic] Puppet change was merged
* November 29th, 06:25 UTC: Logrotate restarted the ORES uwsgi services, causing them to pick up the new config and start returning 500s
* November 29th, 09:51 UTC: [https://gerrit.wikimedia.org/r/c/operations/puppet/+/476458 The revert] was created and deployed
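
The intervals implied by the timeline can be worked out as follows. This is a small sketch using the timestamps listed above; it assumes the 09:51 entry falls on November 29th and treats the revert time as the end of the outage, which is an approximation since the timeline does not say when requests recovered.

```python
from datetime import datetime, timezone

# Timestamps taken from the timeline above (UTC).
merged = datetime(2018, 11, 28, 12, 4, tzinfo=timezone.utc)
restarted = datetime(2018, 11, 29, 6, 25, tzinfo=timezone.utc)
reverted = datetime(2018, 11, 29, 9, 51, tzinfo=timezone.utc)  # assumed same day

# Time the change sat on disk before any service picked it up.
latent = restarted - merged
# Approximate duration of the 500s, using the revert as the end point.
outage = reverted - restarted

print(latent)  # 18:21:00
print(outage)  # 3:26:00
```

The change was therefore latent for over 18 hours before the restart exposed it.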
 
== Conclusions ==
* Puppet should bind the ORES services to the ORES config files so that the services pick up config changes right away.
* <s>Logrotate should restart services at a better time. Not really doable.</s>
* Contact numbers for WMDE staff should be available to SREs.
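
The first conclusion can be illustrated with a toy model of a service that reads its config only at start. This is a sketch with hypothetical names (<code>Service</code>, <code>config_version</code>), not ORES or Puppet code. It only shows why a config change stays invisible until something restarts the service.

```python
class Service:
    """Toy service that snapshots its config at (re)start, like a long-running daemon."""

    def __init__(self, config_store):
        self._store = config_store
        self.restart()

    def restart(self):
        # Copy the config as it exists right now; later edits are not seen.
        self.config = dict(self._store)


# Hypothetical config "on disk".
store = {"config_version": 1}
service = Service(store)

# A config change is deployed, but nothing restarts the service.
store["config_version"] = 2
before_restart = service.config["config_version"]

# An unrelated restart (for example by logrotate) finally applies it.
service.restart()
after_restart = service.config["config_version"]

print(before_restart, after_restart)  # 1 2
```

Tying the restart to the config change, as the conclusion proposes, would make the effect of a change visible when it is deployed and not at the next unrelated restart.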
== Links to relevant documentation ==
* [[ORES/Deployment]]
* [http://docs.celeryproject.org/en/v4.1.0/whatsnew-4.0.html#lowercase-setting-names celery4 setting change]
* [[phab:T206333|Phabricator: Change default serializer of celery from pickle to json]]
 
== Actionables ==
 
* [[phab:T210719|ORES services should bind to ores config files]]
* [[phab:T210720|Logrotate should restart services when more people are around]]
* [[phab:T210721|Contact number of some WMDE staff should be available to SRE/RelEng]]
 
{{#ifeq:{{SUBPAGENAME}}|Report Template||
[[Category:Incident documentation]]
}}

Latest revision as of 17:46, 8 April 2022