
Difference between revisions of "Incident documentation/2021-10-08 network provider"

imported>RhinosF1
(→‎Summary: change said -> a. Said doesn't make sense as it hasn't been said above.)
 
imported>CDanis
Line 15: Line 15:
at 16:24 UTC, the first reports of recoveries arrived as the change propagated through the [[:en:Default-free_zone|DFZ]]
at 16:24 UTC, the first reports of recoveries arrived as the change propagated through the [[:en:Default-free_zone|DFZ]]


Unfortunately, due to the preponderance of the eqiad impact and its recovery, we didn't notice that it also impacted users from Russia to reach the wikis through esams.
Unfortunately, due to the preponderance of the eqiad impact and its recovery, we didn't notice that it also impacted users from Russia to reach the wikis through esams.  We also didn't receive any [[NEL]] reports from most Russia users until 16:24 or later, as the site they use for NELs is itself eqiad.  


At around 17:15 UTC, our monitoring shows a full recovery, indicating the issue being resolved upstream.
At around 17:15 UTC, our monitoring shows a full recovery, indicating the issue being resolved upstream.
Line 36: Line 36:
==Actionables==
==Actionables==
*[[phab:T292792|2021-10-07 network provider issues causing all Wikimedia sites to be unreachable for many users]]
*[[phab:T292792|2021-10-07 network provider issues causing all Wikimedia sites to be unreachable for many users]]
**Liked CR about making NEL alert paging
**[[gerrit:c/operations/puppet/+/727594|patch to make NEL alert paging]]
*Request RFO from provider
*Request RFO from provider

Revision as of 18:18, 11 October 2021

document status: in-review

Summary

At around 16:11 UTC our (non-paging) monitoring and users reported connectivity issues to and from our eqiad site.

Traceroutes showed a routing loop in a provider's network.
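A routing loop usually shows up in a traceroute as the same router address reappearing further along the path after other hops have answered. The following is a minimal sketch of that check, not the tooling used during the incident; the function name `find_routing_loop` and the hop-list format (one address per TTL, `None` for a hop that did not answer) are assumptions made for illustration, and the addresses are documentation-range placeholders.

```python
from typing import List, Optional, Tuple


def find_routing_loop(hops: List[Optional[str]]) -> Optional[Tuple[str, int, int]]:
    """Return (address, first_index, repeat_index) for the first address that
    reappears after a different responding hop, or None if no loop is seen.

    `hops` is a hypothetical, already-parsed traceroute: one entry per TTL,
    with None standing for a hop that timed out.
    """
    first_seen = {}   # address -> index of its first appearance
    prev = None       # last address that actually responded
    for i, addr in enumerate(hops):
        if addr is None:
            continue  # a timeout tells us nothing about loops
        # The same router answering consecutive TTLs is not a loop;
        # only flag an address that comes back after a different hop.
        if addr in first_seen and addr != prev:
            return addr, first_seen[addr], i
        first_seen.setdefault(addr, i)
        prev = addr
    return None


# Placeholder path in which two routers hand the packet back and forth.
looping_path = ["192.0.2.1", "198.51.100.1", "198.51.100.2",
                "198.51.100.1", "198.51.100.2"]
clean_path = ["192.0.2.1", None, "198.51.100.1", "203.0.113.7"]
```

With these placeholder paths, `find_routing_loop(looping_path)` reports the first repeated address and the two positions where it appears, while `find_routing_loop(clean_path)` returns `None`.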

At 16:19 UTC, using their APIs, we asked that provider to stop advertising the prefixes for that site on our behalf.

At 16:24 UTC, the first reports of recoveries arrived as the change propagated through the DFZ.

Unfortunately, because the eqiad impact and its recovery dominated our attention, we didn't notice that the issue also prevented users in Russia from reaching the wikis through esams. We also didn't receive any NEL reports from most users in Russia until 16:24 or later, because the site they send NEL reports to is itself eqiad.

At around 17:15 UTC, our monitoring showed a full recovery, indicating that the issue had been resolved upstream.


Impact:

Readers:

  • United States east coast: partial connectivity outage for 13 min, as this provider is one of the many we use in the region
  • Russia: partial connectivity outage for 1 h, as this provider is one of the many we use in the region (e.g. we peer directly with Rostelecom)
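The durations above can be cross-checked against the timestamps in the summary. This is only a sketch: it assumes the Russian impact began at the same 16:11 UTC as the eqiad reports, which the report does not state explicitly, and the helper name `minutes_between` is hypothetical.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps (UTC)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# First reports (16:11) to first recoveries after the withdrawal (16:24).
us_east_minutes = minutes_between("16:11", "16:24")

# Assumed same start (16:11) to full recovery in monitoring (17:15).
russia_minutes = minutes_between("16:11", "17:15")
```

Under that assumption the first interval is 13 minutes and the second is 64 minutes, consistent with the 13 min and roughly 1 h figures listed.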

Staff and community members connecting directly to services in eqiad (e.g. Gerrit) were also affected. Because of the span of this provider's network, the farther a user is from eqiad, the more likely they are to be using that provider and so could be impacted.


Documentation:

Actionables