You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org
Incident documentation/2021-10-08 network provider
document status: in-review
Summary and Metadata
The metadata is aimed at helping provide a quick snapshot of context around what happened during the incident.
|Incident ID||2021-10-08 network provider||UTC Start Timestamp:||YYYY-MM-DD hh:mm:ss|
|Incident Task||https://phabricator.wikimedia.org/T292792||UTC End Timestamp||YYYY-MM-DD hh:mm:ss|
|People Paged||<amount of people>||Responder Count||<amount of people>|
|Coordinator(s)||Names - Emails||Relevant Metrics / SLO(s) affected||Relevant metrics
% error budget
|Summary:||For up to an hour, some regions experienced a partial connectivity outage. This primarily affected the US East Coast for ~13 minutes, and Russia for 1 hour. A subset of readers and contributors from these regions were unable to reach any wiki projects. Services such as Phabricator and Gerrit Code Review were affected as well. It was a partial issue because the network malfunction was limited to one of many providers we use in the affected regions.|
Around 16:11 UTC our (non-paging) monitoring and users reported connectivity issues to and from our Eqiad location. Traceroutes showed a routing loop in a provider's network.
At 16:19 UTC, using the provider's APIs, we asked the provider to stop advertising the prefixes for Eqiad on our behalf.
At 16:24 UTC, the first reports of recoveries arrived as the change propagated through the DFZ.
Unfortunately, due to the preponderance of the Eqiad impact and its recovery, we didn't notice that it also impacted users from Russia to reach the wikis through Esams. We also didn't receive any NEL reports from most Russia users until 16:24 or later, as the location we use for NELs about Esams is itself Eqiad.
At around 17:15 UTC, our monitoring shows a full recovery, indicating the issue being resolved upstream.
Impact: For up to an hour, some regions experienced a partial connectivity outage. This primarily affected the US East Coast for ~13 minutes, and Russia for 1 hour. A subset of readers and contributors from these regions were unable to reach any wiki projects. Services such as Phabricator and Gerrit Code Review were affected as well. It was a partial issue because the network malfunction was limited to one of many providers we use in the affected regions.
Due to the span of this provider's network, the further a client is from Eqiad the more likely they will be using that provider to reach our network and thus could be impacted.
|Incident Engagement™ ScoreCard||Score|
|People||Were the people responding to this incident sufficiently different than the previous N incidents? (0/5pt)|
|Were the people who responded prepared enough to respond effectively (0/5pt)|
|Did fewer than 5 people get paged (0/5pt)?|
|Were pages routed to the correct sub-team(s)?|
|Were pages routed to online (working hours) engineers (0/5pt)? (score 0 if people were paged after-hours)|
|Process||Was the incident status section actively updated during the incident? (0/1pt)|
|If this was a major outage noticed by the community, was the public status page updated? If the issue was internal, was the rest of the organization updated with relevant incident statuses? (0/1pt)|
|Is there a phabricator task for the incident? (0/1pt)|
|Are the documented action items assigned? (0/1pt)|
|Is this a repeat of an earlier incident (-1 per prev occurrence)|
|Is there an open task that would prevent this incident / make mitigation easier if implemented? (0/-1p per task)|
|Tooling||Did the people responding have trouble communicating effectively during the incident due to the existing or lack of tooling? (0/5pt)|
|Did existing monitoring notify the initial responders? (1pt)|
|Were all engineering tools required available and in service? (0/5pt)|
|Was there a runbook for all known issues present? (0/5pt)|