== Summary ==
A configuration change on routers in the Ashburn and Singapore datacenters caused a service interruption of about 10 minutes (22:53-23:03 UTC) for users redirected to Ashburn, and about 40 minutes (22:47-23:24 UTC) for users redirected to Singapore.
More details in {{Phabricator|T191940}}.
== Timeline ==
* 22:47 Change pushed to cr1-eqsin
* 22:53 Change pushed to cr2-eqiad
* 22:58 cr2-eqiad rolled-back
* 23:03 eqiad full recovery (after routing convergence)
* 23:22 cr1-eqsin rolled-back (partial recovery)
* 23:31 eqsin de-pooled
* 23:36 eqsin full recovery
== Conclusions ==
* Changes, even if already live in part of the infrastructure, need to be discussed more thoroughly with the team
* POPs (especially non-redundant ones) should be depooled before applying changes if there is any doubt
* The same change had different results across the deployment:
** No issues, working as expected (e.g. switches, cr2-esams)
** Partial failure (cr1-eqsin): connectivity to the router and rpd appeared healthy, but user traffic was being dropped
** Full failure (cr2-eqiad): connectivity to the router was lost instantly
== Actionables ==
* Tickets have been opened with the vendor [[phab:T191667]] (update: crash reason found)
