Incidents/20160925-ores
Revision as of 17:45, 8 April 2022 by imported>Krinkle (Krinkle moved page Incident documentation/20160925-ores to Incidents/20160925-ores)
Summary
On September 25, 2016, the ORES service had an elevated timeout ratio (~14%) for about six hours because it ran out of disk space due to overly verbose logging.
Timeline
- Sept 25 10:34:40 UTC 2016: The Icinga check on ORES failed due to a timeout.
- 14:13 UTC: phab:T146581 was created.
- 16:03: The fix was deployed in labs.
- 16:26: The fix was deployed in prod.
Conclusions
We should have better monitoring of disk space and be careful about the verbosity of production service logs.
Actionables
- Make ORES logging less verbose (task T146581)
- Add a Grafana monitor for disk space on ORES (task T147163)
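As an illustration of the two actionables, the following is a minimal sketch and not the actual ORES fix or the Grafana check. It assumes a plain Python service; the function names, the logger name `ores_example`, and the 90% threshold are hypothetical choices made for this example.

```python
import logging
import shutil


def disk_usage_ratio(path="."):
    """Return the fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def check_disk(path=".", threshold=0.9):
    """Return (ok, ratio); ok is False once usage reaches the threshold.

    The 0.9 default is an assumed value, not one taken from the incident.
    """
    ratio = disk_usage_ratio(path)
    return ratio < threshold, ratio


def quiet_logger(name="ores_example"):
    """Return a hypothetical logger capped at WARNING.

    DEBUG and INFO messages are dropped, which limits log growth on disk.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.WARNING)
    return logger
```

A check like `check_disk` could be run periodically and alert when `ok` is False, while a logger capped at WARNING keeps routine messages out of the production logs.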