You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org
Incident documentation/2018-11-29 ores
Jump to navigation
Jump to search
Summary
ores.wikimedia.org was sending 500 for all score requests for 3 hours starting from 6AM UTC. It was due to config changes that was done as part of upgrading celery version of ores from three to two causing it to change its task serializer.
Timeline
- November 28th 12:04 UTC: the problematic puppet change got merged
- November 29th, 6:25 UTC: Logrotate restarted uwsgi services of ORES causing it to pick up the new config and start sending 500s
- 9:51 UTC: The revert was created and deployed
Conclusions
- Puppet should bind ores services to ores configs so it picks up the changes right away.
Logrotate should restart services in a better time. Not really doable- Contact number of WMDE staff should be avalible to SREs.
Links to relevant documentation
- ORES/Deployment
- celery4 setting change
- Phabricator: Change default serializer of celery from pickle to json