Talk:Incident documentation/2019-08-23 network codfw
Timeline and detection
We should spend some time reconstructing the timeline of detection at the start of the incident. It wasn't immediately obvious that codfw connectivity was at fault or that the link was flapping -- the first Icinga alarms were at ~20:30 UTC for HTTP unavailability (possibly unrelated?), and then at ~21:20 UTC came "search.svc.codfw.wmnet is DOWN: CRITICAL - Time to live exceeded", which made it more apparent that networking might be at fault. Smokeping sent its first mail at 21:32.
But Icinga raised no alarms about OSPF status, or any other obvious indication that a network link was at fault, until 21:49 (PROBLEM - OSPF status on cr2-eqiad is CRITICAL). By this time @Ayounsi had already diagnosed it as a network link issue and raised the cost of the link -- I'd appreciate hearing how, and especially whether it involved having router access. ✍ CDanis 14:17, 24 August 2019 (UTC)
Weird monitoring artifacts in LibreNMS at the time of the incident
These are almost certainly just monitoring artifacts (counter resets being detected inappropriately due to failed scrapes?) but I wanted to point them out:
- asw-b-codfw: https://librenms.wikimedia.org/graphs/to=1566608700/device=96/type=device_bits/from=1566590700/legend=yes/?
- asw-c-codfw: https://librenms.wikimedia.org/graphs/to=1566608700/device=97/type=device_bits/from=1566590700/legend=yes/?
✍ CDanis 14:17, 24 August 2019 (UTC)
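For anyone lining these graphs up against the alert timeline: the `from=` and `to=` values in the two links above look like Unix epoch seconds. The sketch below decodes them under that assumption. The helper name `graph_window` is made up for illustration and is not part of LibreNMS.

```python
import re
from datetime import datetime, timezone


def graph_window(url):
    """Return the (from, to) bounds of a graph URL as UTC datetimes.

    Assumes the URL path carries from=<epoch seconds> and to=<epoch seconds>
    segments, as the two links above appear to.
    """
    bounds = {}
    for key in ("from", "to"):
        match = re.search(rf"/{key}=(\d+)(?:/|$)", url)
        if match is None:
            raise ValueError(f"no {key}= segment found in {url!r}")
        bounds[key] = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
    return bounds["from"], bounds["to"]


url = (
    "https://librenms.wikimedia.org/graphs/to=1566608700/device=96/"
    "type=device_bits/from=1566590700/legend=yes/?"
)
start, end = graph_window(url)
print(start.isoformat())  # 2019-08-23T20:05:00+00:00
print(end.isoformat())    # 2019-08-24T01:05:00+00:00
```

If that reading is right, both graphs cover a five-hour window starting at 20:05 UTC on 23 August, which spans all of the alert times mentioned in the section above.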