
Incident documentation/2018-11-29 ores

Revision as of 21:34, 31 March 2021 by imported>Krinkle (Krinkle moved page Incident documentation/20181129-ores to Incident documentation/2018-11-29 ores)

Summary

ores.wikimedia.org returned HTTP 500 for all score requests for about 3 hours, starting around 6 AM UTC. The cause was a config change made as part of upgrading the Celery version used by ORES from 3 to 4, which changed its task serializer.

Timeline

  • November 28th, 12:04 UTC: The problematic Puppet change was merged.
  • November 29th, 6:25 UTC: Logrotate restarted the ORES uwsgi services, causing them to pick up the new config and start sending 500s.
  • November 29th, 9:51 UTC: The revert was created and deployed.
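
One way a serializer change can produce this kind of failure is sketched below. This is an illustration under stated assumptions, not ORES or Celery code: it assumes that only the restarted uwsgi side picked up the new serializer while the workers still accepted only the old one. The names publish, consume, ENCODERS and DECODERS and the task payload are all hypothetical, and the sketch uses only the Python standard library.

```python
import json
import pickle

# Hypothetical stand-ins for a task queue's serializer registry.
# This is NOT ORES or Celery code; it only illustrates a serializer mismatch.
ENCODERS = {
    "json": lambda payload: json.dumps(payload).encode("utf-8"),
    "pickle": pickle.dumps,
}
DECODERS = {
    "json": lambda body: json.loads(body.decode("utf-8")),
    "pickle": pickle.loads,
}


def publish(payload, serializer):
    """Producer side: encode the task and tag it with its content type."""
    return {"content_type": serializer, "body": ENCODERS[serializer](payload)}


def consume(message, accept):
    """Worker side: refuse any content type outside the accepted set."""
    if message["content_type"] not in accept:
        raise ValueError("refusing content type " + message["content_type"])
    return DECODERS[message["content_type"]](message["body"])


# Hypothetical task payload, for illustration only.
task = {"model": "example", "rev_id": 12345}

# Both sides agree on the serializer: the task round-trips unchanged.
assert consume(publish(task, "pickle"), accept={"pickle"}) == task

# Assumed failure mode: the producer switched serializers, the worker did not.
try:
    consume(publish(task, "json"), accept={"pickle"})
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```

In this sketch the request fails only after one side reloads its configuration, which would be consistent with errors starting at the logrotate restart rather than at merge time.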

Conclusions

  • Puppet should bind the ORES services to the ORES configs so that they pick up config changes right away.
  • Logrotate should restart services at a better time, although this is not really doable.
  • A contact number for WMDE staff should be available to SREs.

Links to relevant documentation

Actionables