Incident documentation/20160620-ores

Revision as of 18:23, 20 June 2016 by imported>Ladsgroup (→‎Summary)

Summary

ores.wikimedia.org was down today for about twenty minutes after we deployed a commit that changed how the config directory is read, without a proper order.

Timeline

SAL log

  • 10:58 Amir1: deploying bdc1e2b in ores nodes
  • 11:04 deployment finished and ores went down
    • The puppet agent ran and restarted the services (uwsgi-ores, celery-ores-worker). This didn't solve the problem
    • The logs showed that the problem persisted because the config was being read incorrectly
  • 11:27 Amir1: rolling back ae71d842dfc0958e06922062dd09d49243332a6a
    • ORES went live again
  • 12:13 Amir1: started deploying ores in scb2001 bdc1e2bd
    • Did not work as expected. There was no downtime because it was only one node in codfw
  • 13:04 Amir1: deploying 8e65182 to scb2001
    • We fixed it in 295214
    • Worked perfectly fine
  • 13:06 Amir1: full deployment for 8e65182 in ores nodes
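
The second attempt in the timeline above (one node first, then the full deployment) can be sketched roughly as follows. This is only an illustration and not the tooling that was actually used: the function staged_deploy and the deploy, is_healthy and rollback hooks are hypothetical names that the caller would have to supply.

```python
from typing import Callable, Iterable, List


def staged_deploy(
    canary: str,
    remaining: Iterable[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Deploy to one canary node first; continue only if it stays healthy.

    Returns the list of nodes that ended up running the new revision.
    All callables are hypothetical hooks supplied by the caller.
    """
    deploy(canary)
    if not is_healthy(canary):
        # Stop here: only the canary was touched, so only it needs a rollback.
        rollback(canary)
        return []

    deployed = [canary]
    for node in remaining:
        deploy(node)
        if not is_healthy(node):
            # Undo everything deployed so far, newest first.
            for done in reversed(deployed + [node]):
                rollback(done)
            return []
        deployed.append(node)
    return deployed
```

With a check that fails on the canary, the function returns an empty list and no other node is touched, which mirrors what happened at 12:13 on scb2001.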

Conclusions

A very shallow explanation would be to blame the reading of config directories, which has changed a lot and is now in a rather stable state, but stopping there is dangerous. What we really need is a safe method to deploy ores, which is what we used the second time today. The only thing left is to document it.
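
As an illustration of the config side only: if "proper order" refers to the order in which files in a config directory are read, a deterministic loader could look like the sketch below. It is not the ORES code. The function load_config_dir is a hypothetical name, and JSON is used only so that the example runs with the standard library.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_config_dir(directory: str) -> Dict[str, Any]:
    """Merge every *.json file in `directory`, in sorted filename order.

    Later files override top-level keys from earlier ones, so the result
    does not depend on the order the filesystem happens to list files in.
    """
    merged: Dict[str, Any] = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            merged.update(json.load(handle))
    return merged
```

Numeric filename prefixes (for example a hypothetical 00-main.json and 99-local.json) then make the override order explicit.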

Actionables

  • Status: Unresolved. Document safe steps to deploy ores in prod (bug T138234)