Incident documentation/2021-10-08 network provider
Revision as of 19:47, 1 November 2021
document status: in-review
Summary
Around 16:11 UTC our (non-paging) monitoring and users reported connectivity issues to and from our Eqiad location. Traceroutes showed a routing loop in a provider's network.
At 16:19 UTC, using the provider's APIs, we asked the provider to stop advertising the prefixes for Eqiad on our behalf.
At 16:24 UTC, the first reports of recoveries arrived as the change propagated through the DFZ.
Unfortunately, because the Eqiad impact and its recovery dominated our attention, we didn't notice that the issue also prevented users in Russia from reaching the wikis through Esams. We also didn't receive any NEL reports from most users in Russia until 16:24 or later, because the location that receives NEL reports about Esams is itself Eqiad.
At around 17:15 UTC, our monitoring showed a full recovery, indicating that the issue had been resolved upstream.
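A routing loop of the kind the traceroutes showed can be recognised mechanically: the same router address reappears later in the path after other hops have been crossed. The following is only a minimal sketch of that check, not the tooling used during the incident. The function name `find_routing_loop` and the sample addresses (taken from the reserved documentation ranges) are assumptions for illustration.

```python
from typing import Dict, List, Optional, Sequence


def find_routing_loop(hops: Sequence[Optional[str]]) -> Optional[List[str]]:
    """Return the repeating cycle of hop addresses, or None if no hop recurs.

    `hops` is the ordered list of responding router addresses from a single
    traceroute. None marks a hop that did not answer.
    """
    last_seen: Dict[str, int] = {}
    for index, hop in enumerate(hops):
        if hop is None:
            continue
        previous = last_seen.get(hop)
        # The same router answering two consecutive hops is not a loop,
        # so require at least one other hop in between.
        if previous is not None and index - previous > 1:
            return [h for h in hops[previous:index] if h is not None]
        last_seen[hop] = index
    return None


# Hypothetical path: the packet bounces between two provider routers.
sample_path = [
    "192.0.2.1",
    "198.51.100.1",
    "198.51.100.2",
    "198.51.100.1",
    "198.51.100.2",
]
loop = find_routing_loop(sample_path)
```

Here `loop` holds the two addresses the packet alternates between. A path with no repeated router yields `None`.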
Impact: For up to an hour, some regions experienced a partial connectivity outage. This primarily affected the US East Coast for ~13 minutes, and Russia for 1 hour. A subset of clients from these regions were unable to reach the wiki projects, whether as readers or contributors. Services such as Phabricator and Gerrit Code Review were affected as well.
Readers:
- United States east coast: partial connectivity outage for 13 min. Partial, as this provider is only one of the many we use in the region.
- Russia: partial connectivity outage for 1 h. Partial, as this provider is only one of the many we use in the region (e.g. we peer directly with Rostelecom).
Staff and community members connecting directly to services in Eqiad (e.g. Gerrit): because of the span of this provider's network, the farther a user is from Eqiad, the more likely they are to be routed through that provider and so to have been impacted.
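The durations quoted above can be cross-checked against the timestamps in the summary. The sketch below assumes, as one reading of the timeline, that the outage began at 16:11 UTC, that the US East Coast impact ended with the first recoveries at 16:24 UTC, and that the impact in Russia lasted until the full recovery at around 17:15 UTC. The helper name `minutes_between` is hypothetical.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps (UTC)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Assumed boundaries, taken from the summary timeline.
us_east_minutes = minutes_between("16:11", "16:24")  # 13
russia_minutes = minutes_between("16:11", "17:15")   # 64, i.e. roughly 1 hour
```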
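Documentation:
- Logstash: NEL reports (restricted): https://logstash.wikimedia.org/goto/93cb07d63964e0271ecf8eece8845a7b
- Grafana: RIPE Atlas: https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?var-datasource=eqiad%20prometheus%2Fops&var-target_site=All&var-ip_version=ipv4&var-country_code=All&var-asn=All&from=1633620224947&to=1633632035019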