Incident documentation/20160925-ores

Revision as of 22:15, 2 October 2016 by Ladsgroup

Summary

On September 25th, 2016, the ORES service had an elevated timeout ratio (~14%) for about six hours because it ran out of disk space due to overly verbose logging.

Timeline

  • Sept 25 10:34:40 UTC 2016: The Icinga test on ORES failed due to a timeout.
  • 14:13 UTC: phab:T146581 was created.
  • 16:03 UTC: The fix was deployed in labs.
  • 16:26 UTC: The fix was deployed in prod.

Conclusions

We should monitor disk space more closely and be careful about the log verbosity of production services.
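
As an illustration of both points, here is a minimal Python sketch, not the actual ORES or Icinga configuration. The function names, the 80%/90% thresholds, and the 10 MB rotation size are all assumptions chosen for the example. It shows a disk-usage check that maps to Icinga-style statuses, and a logger that logs only WARNING and above and caps the space its log files can take.

```python
import logging
import shutil
from logging.handlers import RotatingFileHandler


def disk_usage_percent(path="/"):
    """Return the percentage of space used on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify_usage(percent, warn_at=80.0, crit_at=90.0):
    """Map a usage percentage to a status string (thresholds are assumptions)."""
    if percent >= crit_at:
        return "CRITICAL"
    if percent >= warn_at:
        return "WARNING"
    return "OK"


def make_bounded_logger(name, log_path, max_bytes=10 * 1024 * 1024, backups=3):
    """Create a logger limited to WARNING and above, with size-capped log files.

    Disk use is bounded by roughly max_bytes * (backups + 1).
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.WARNING)
    handler = RotatingFileHandler(log_path, maxBytes=max_bytes, backupCount=backups)
    logger.addHandler(handler)
    return logger
```

A monitoring job could call `classify_usage(disk_usage_percent("/srv"))` on a schedule and alert on anything other than "OK"; the path `/srv` is only a placeholder here.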

Actionables