Incident documentation/2019-10-16 network eqsin
document status: final
An Equinix Singapore IXP peer flapped heavily, which overwhelmed the routing daemon on cr1-eqsin and caused all its BGP and OSPF sessions to flap or go down.
Beyond the external connectivity issues, the primary transport link to codfw terminates on cr1, so the local caches could not reach their peers in the main datacenters and served 500 errors instead.
An estimated 170k errors were surfaced to users of the eqsin PoP.
The following automated alerts got triggered:
- Varnish traffic drop between 30min ago and now at eqsin
- HTTP availability for Nginx -SSL terminators- at eqsin
- HTTP availability for Varnish at eqsin
- BFD status on cr1-codfw
- LVS HTTPS text-lb.eqsin.wikimedia.org - PAGE
This quickly pointed to a network issue in eqsin.
Was the alert volume manageable? yes
Did they point to the problem with as much accuracy as possible? yes
All times in UTC.
- 17:15 SSL terminator alerts in eqsin fire, non-paging -- OUTAGE BEGINS
- 17:28 First page fires -- LVS HTTPS text-lb.eqsin.wikimedia.org
- 17:29 eqsin depooled
- 17:29 Recovery on its own -- OUTAGE ENDS
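The durations implied by the timeline above can be worked out with a short sketch. It uses only the timestamps and the ~170k error estimate given in this report; the variable and function names are illustrative, and the per-minute figure assumes the errors were spread evenly across the outage window, which the report does not state.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all UTC, same day).
TIMELINE = {
    "outage_begins": "17:15",
    "first_page": "17:28",
    "depooled": "17:29",
    "outage_ends": "17:29",
}

# Estimate quoted in the summary; treated here as a round number.
ESTIMATED_ERRORS = 170_000


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


outage_minutes = minutes_between(TIMELINE["outage_begins"], TIMELINE["outage_ends"])
minutes_to_first_page = minutes_between(TIMELINE["outage_begins"], TIMELINE["first_page"])

# Assumption: errors evenly distributed over the outage window.
avg_errors_per_minute = ESTIMATED_ERRORS / outage_minutes

print(f"Outage duration: {outage_minutes} min")
print(f"Time to first page: {minutes_to_first_page} min")
print(f"Average errors per minute (rough): {avg_errors_per_minute:.0f}")
```

Under these assumptions, the first page fired only about a minute before the outage ended on its own, so nearly the whole outage passed with only non-paging alerts.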
What went well?
- The issue was quickly identified
- The issue recovered on its own
What went poorly?
- A single flapping peer should not be able to overwhelm a router's routing daemon, but this router model is known to be weak
- The logs did not contain any information on why OSPF and BGP were behaving that way
Where did we get lucky?
- Several SREs were around when the issue started
How many people were involved in the remediation?
- 6 SREs
Links to relevant documentation
Depooling the site: DNS#Change GeoDNS
NOTE: Please add the #wikimedia-incident Phabricator project to these follow-up tasks and move them to the "follow-up/actionable" column.
- T236878 Improve resiliency of the eqsin transport link by one of:
  - Terminating it on cr2-eqsin
  - Adding a 2nd link
  - Configuring link damping
- Replace cr1-eqsin with a better router (next FY)