Incident documentation/2021-07-20 asw-a2-codfw crash

document status: draft

Summary

asw-a2-codfw, the switch handling network traffic for rack A2 in codfw, became unresponsive, rendering 14 hosts unreachable. Besides the loss of those 14 hosts, two additional load balancers lost access to codfw's row A.

Impact: Degraded service for RESTBase, the MediaWiki API, and the edge caching service in codfw. Two services had to be migrated to eqiad: the Maps Tile Service (kartotherian) and the Wikidata Query Service (WDQS). High availability was lost for the high-traffic1 load balancer in codfw.

Timeline

All times in UTC.

Friday, July 16th

  • 13:16 asw-a2-codfw becomes unresponsive OUTAGE BEGINS
  • 13:36 authdns2001 is depooled to restore ns1.wikimedia.org (see the command sketch after this list)
  • 14:07 kartotherian is moved from codfw to eqiad
  • 14:11 wdqs gets pooled in eqiad
  • 14:30 ports on the affected switch are marked as disabled on asw-a-codfw virtual-chassis
  • 14:37 disable affected network interface in lvs2010
  • 14:41 disable affected network interface in lvs2009
  • 15:14 re-enable affected network interface in lvs2010
  • 15:31 remote hands in codfw power-cycle the affected switch, without success
  • 15:38 re-enable affected network interface in lvs2009
  • 15:48 Decrease depool threshold for the edge caching services
  • 16:29 Decrease depool threshold for MediaWiki API service
  • 16:56 Error rate recovers OUTAGE ENDS
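
The exact commands are not recorded in this report; the following is a minimal sketch of what the depool and interface-disable steps above typically look like, assuming conftool/confctl for pooling state, Junos configuration on the asw-a-codfw virtual chassis, and standard Linux tooling on the load balancers. Host names, interface names and port numbers are illustrative, not taken from the incident.

  # Depool an LVS-backed host in the affected rack from its service pools
  # (hostname is illustrative)
  confctl select 'name=ms-fe2005.codfw.wmnet' set/pooled=no

  # Pool a service in eqiad via DNS discovery (as done for wdqs at 14:11)
  confctl --object-type discovery select 'dnsdisc=wdqs,name=eqiad' set/pooled=true

  # On the asw-a-codfw virtual chassis (Junos), disable the failed member's ports,
  # repeated per port or via an interface-range; member/port numbers are hypothetical
  configure
  set interfaces ge-2/0/0 disable
  commit

  # On lvs2009/lvs2010, take down the interface facing the failed switch
  # (interface name is hypothetical); re-enabling is the same command with "up"
  sudo ip link set dev eno2 down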

Monday, July 19th

  • 08:15 depool text cache codfw PoP (see the sketch after this list)
  • 17:10 defective switch gets replaced
  • 17:21 authdns2001 is pooled

TODO: Add timestamps for lvs restore in codfw

  • 20:29 pool text cache codfw PoP
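
Depooling an edge caching PoP happens in the authoritative DNS layer rather than in conftool: GeoDNS is told to stop sending users to codfw. A minimal sketch, assuming the gdnsd admin_state mechanism used for geographic routing; the resource name is an assumption, not taken from the incident.

  # In gdnsd's admin_state file (operations/dns repo), force the codfw datacenter
  # down for the text caches so users are routed to other PoPs
  geoip/text-addrs/codfw => DOWN

Repooling at 20:29 corresponds to reverting that entry.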

Detection

The incident was detected via automated monitoring, which reported several hosts in rack A2 going down at the same time:

  • <icinga-wm> PROBLEM - Host elastic2038 is DOWN: PING CRITICAL - Packet loss = 100%
  • <icinga-wm> PROBLEM - Host kafka-logging2001 is DOWN: PING CRITICAL - Packet loss = 100%
  • <icinga-wm> PROBLEM - Host ns1-v4 is DOWN: PING CRITICAL - Packet loss = 100%
  • <icinga-wm> PROBLEM - Host authdns2001 is DOWN: PING CRITICAL - Packet loss = 100%
  • <icinga-wm> PROBLEM - Host ms-be2051 is DOWN: PING CRITICAL - Packet loss = 100%
  • <icinga-wm> PROBLEM - Host lvs2007 is DOWN: PING CRITICAL - Packet loss = 100%
  • <icinga-wm> PROBLEM - Host thanos-fe2001 is DOWN: PING CRITICAL - Packet loss = 100%
  • <icinga-wm> PROBLEM - Host elastic2055 is DOWN: PING CRITICAL - Packet loss = 100%
  • <icinga-wm> PROBLEM - Host elastic2037 is DOWN: PING CRITICAL - Packet loss = 100%
  • <icinga-wm> PROBLEM - Host ms-fe2005 is DOWN: PING CRITICAL - Packet loss = 100%
  • <icinga-wm> PROBLEM - Host ms-be2040 is DOWN: PING CRITICAL - Packet loss = 100%

Conclusions

As a result of losing a single switch, services were affected more than expected due to several weaknesses:

  • Three load balancers, including the backup one, receive row A traffic through a single network switch
  • The depool threshold of several services is too restrictive for them to keep working as expected after losing a complete row (see the worked example below)
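
To illustrate the second weakness: with PyBal-style semantics, the depool threshold is the minimum fraction of a service's backends that must stay pooled, so the load balancer refuses to automatically depool servers beyond that point even when they fail health checks. A worked example with hypothetical numbers:

  backends in the service       = 12
  backends in row A (all lost)  = 4
  depool threshold              = 0.8  -> at least ceil(12 x 0.8) = 10 must stay pooled
  backends allowed to depool    = 12 - 10 = 2
  unreachable but still pooled  = 4 - 2 = 2  -> traffic keeps going to dead hosts

Lowering the threshold (as done at 15:48 and 16:29) lets the load balancer depool all of the unreachable backends, at the cost of a smaller safety margin against mass depooling.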

What went well?

  • Automated monitoring detected the incident
  • Even though several services were affected, the user-facing impact was mild.

What went poorly?

  • (Use bullet points) for example: documentation on the affected service was unhelpful, communication difficulties, etc

Where did we get lucky?

  • (Use bullet points) for example: user's error report was exceptionally detailed, incident occurred when the most people were online to assist, etc

How many people were involved in the remediation?

  • 5 SREs troubleshooting the issue plus 1 incident commander

Links to relevant documentation

Add links to information that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, add an action item to create it.

Actionables

Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.

  • To do #1 (TODO: Create task)
  • To do #2 (TODO: Create task)

TODO: Add the #Sustainability (Incident Followup) Phabricator tag to these tasks.