Incident documentation/2018-04-10 Routing: Difference between revisions

== Summary ==
#REDIRECT [[Incidents/2018-04-10 Routing]]
 
A configuration change on routers in the Ashburn (eqiad) and Singapore (eqsin) datacenters caused a service interruption of ~10 min (22:53-23:03 UTC) for users directed to Ashburn, and of ~40 min (22:47-23:24 UTC) for users directed to Singapore.
 
More details on: {{Phabricator|T191940}}
 
== Timeline ==
* 22:47 Change pushed to cr1-eqsin
* 22:53 Change pushed to cr2-eqiad
* 22:58 cr2-eqiad rolled back
* 23:03 eqiad full recovery (after routing convergence)
* 23:22 cr1-eqsin rolled back (partial recovery)
* 23:31 eqsin depooled
* 23:36 eqsin full recovery
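
The intervals between the timeline entries above can be checked with a short calculation. This is only a sketch: the timestamps are copied from the timeline, while the dictionary keys and the helper <code>minutes_between</code> are hypothetical names chosen for illustration. It assumes all events fall on the same UTC day, which holds for this incident.

```python
from datetime import datetime

# Timestamps (UTC, same day) copied from the timeline above.
# The key names are illustrative, not taken from any tooling.
TIMELINE = {
    "eqsin_change": "22:47",
    "eqiad_change": "22:53",
    "eqiad_rollback": "22:58",
    "eqiad_recovery": "23:03",
    "eqsin_rollback": "23:22",
    "eqsin_depool": "23:31",
    "eqsin_recovery": "23:36",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# eqiad: change pushed until full recovery.
eqiad_outage = minutes_between(TIMELINE["eqiad_change"], TIMELINE["eqiad_recovery"])
# eqiad: time needed for routing to converge after the rollback.
eqiad_convergence = minutes_between(TIMELINE["eqiad_rollback"], TIMELINE["eqiad_recovery"])
# eqsin: change pushed until rollback (partial recovery).
eqsin_until_rollback = minutes_between(TIMELINE["eqsin_change"], TIMELINE["eqsin_rollback"])
# eqsin: change pushed until full recovery.
eqsin_until_recovery = minutes_between(TIMELINE["eqsin_change"], TIMELINE["eqsin_recovery"])

print("eqiad outage (min):", eqiad_outage)
print("eqiad convergence after rollback (min):", eqiad_convergence)
print("eqsin until rollback (min):", eqsin_until_rollback)
print("eqsin until full recovery (min):", eqsin_until_recovery)
```

The eqiad figure matches the ~10 min stated in the summary. For eqsin, the timeline gives 35 minutes until the rollback and 49 minutes until full recovery, while the summary quotes a 22:47-23:24 window; the sketch makes no claim about which endpoint best reflects user impact.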
 
== Conclusions ==
* Changes, even if already live in part of the infrastructure, need to be discussed more thoroughly with the team
* POPs (especially non-redundant ones) should be depooled before changes are applied if there is any doubt
* The same change had different results across the deployment:
** No issues, working as expected (e.g. switches, cr2-esams)
** Partial failure (cr1-eqsin): connectivity to the router and rpd appeared healthy, but user traffic was being dropped
** Full failure (cr2-eqiad): connectivity to the router was lost instantly
 
== Actionables ==
''Explicit next steps to prevent this from happening again as much as possible, with Phabricator tasks linked for every step.''<onlyinclude>
* Tickets have been opened with the vendor [[phab:T191667]] (update: crash reason found)
 
</onlyinclude>
 
{{#ifeq:{{SUBPAGENAME}}|Report Template||
[[Category:Incident documentation]]
}}

Latest revision as of 17:46, 8 April 2022