Incident documentation/2021-10-08 network provider: Difference between revisions

From Wikitech-static
imported>Krinkle
{{irdoc|status=review}} <!--
#REDIRECT [[Incidents/2021-10-08 network provider]]
The status field should be one of:
* {{irdoc|status=draft}} - Initial status. When you're happy with the state of your draft, change it to status=review.
* {{irdoc|status=review}} - The incident review working group will contact you then to finalise the report. See also the steps on [[Incident documentation]].
* {{irdoc|status=final}}
-->
 
==Summary==
Around 16:11 UTC our (non-paging) monitoring and users reported connectivity issues to and from our Eqiad location. Traceroutes showed a routing loop in a provider's network.
 
At 16:19 UTC, using the provider's APIs, we asked the provider to stop advertising the prefixes for Eqiad on our behalf.
 
At 16:24 UTC, the first reports of recovery arrived as the change propagated through the [[:en:Default-free_zone|DFZ]].
 
Unfortunately, because our attention was on the Eqiad impact and its recovery, we didn't notice that the issue also prevented users in Russia from reaching the wikis through Esams. We also didn't receive [[NEL]] reports from most users in Russia until 16:24 or later, because the location that receives NEL reports about Esams is itself Eqiad.
 
At around 17:15 UTC, our monitoring showed a full recovery, indicating that the issue had been resolved upstream.
 
'''Impact''': For up to an hour, some regions experienced a partial connectivity outage. This primarily affected the US East Coast for ~13 minutes and Russia for 1 hour. A subset of readers and contributors in these regions were unable to reach any wiki projects. Services such as Phabricator and Gerrit Code Review were affected as well. The outage was partial because the network malfunction was limited to one of the many providers we use in the affected regions.
 
Due to the span of this provider's network, the farther a client is from Eqiad, the more likely it is to reach our network through that provider, and thus the more likely it was to be impacted.<!-- Reminder: No private information on this page! -->
 
'''Documentation''':[[File:Screenshot 2021-10-08 at 08-23-35 NEL (Network Error Logging) - Elastic.png|none|thumb|[https://logstash.wikimedia.org/goto/93cb07d63964e0271ecf8eece8845a7b Logstash: NEL reports (restricted)]]]
*[https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?var-datasource=eqiad%20prometheus%2Fops&var-target_site=All&var-ip_version=ipv4&var-country_code=All&var-asn=All&from=1633620224947&to=1633632035019 Grafana: Ripe Atlas]
 
==Actionables==
*[[phab:T292792|2021-10-07 network provider issues causing all Wikimedia sites to be unreachable for many users]]
**[[gerrit:c/operations/puppet/+/727594|patch to make NEL alert paging]]
*Request RFO from provider

Latest revision as of 17:49, 8 April 2022