You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Incident documentation/2021-10-08 network provider: Difference between revisions

imported>Krinkle
No edit summary
imported>Herron
(Add summary and scorecard sections from template)
Line 6: Line 6:
-->
-->


==Summary==
==Summary and Metadata==
Around 16:11 UTC our (non-paging) monitoring and users reported connectivity issues to and from our Eqiad location. Traceroutes showed a routing loop in a provider's network.
The metadata provides a quick snapshot of the context around what happened during the incident.
{| class="wikitable"
|'''Incident ID'''
|2021-10-08 network provider
|'''UTC Start Timestamp:'''
|YYYY-MM-DD hh:mm:ss
|-
|'''Incident Task'''
| https://phabricator.wikimedia.org/T292792
|'''UTC End Timestamp'''
|YYYY-MM-DD hh:mm:ss
|-
|'''People Paged'''
|<amount of people>
|'''Responder Count'''
|<amount of people>
|-
| '''Coordinator(s)'''
|Names - Emails
|'''Relevant Metrics / SLO(s) affected'''
|Relevant metrics
% error budget
|-
|'''Summary:'''
| colspan="3" |For up to an hour, some regions experienced a partial connectivity outage. This primarily affected the US East Coast for ~13 minutes, and Russia for 1 hour. A subset of readers and contributors from these regions were unable to reach any wiki projects. Services such as Phabricator and Gerrit Code Review were affected as well. It was a partial issue because the network malfunction was limited to one of many providers we use in the affected regions.
|}
Around 16:11 UTC our (non-paging) monitoring and users reported connectivity issues to and from our Eqiad location. Traceroutes showed a routing loop in a provider's network.


At 16:19 UTC, using the provider's APIs, we asked the provider to stop advertising the prefixes for Eqiad on our behalf.
Line 17: Line 42:
At around 17:15 UTC, our monitoring showed a full recovery, indicating that the issue had been resolved upstream.


'''Impact''': For upto an hour, some regions experienced a partial connectivity outage. This primarily affected the US East Coast for ~13 minutes, and Russia for 1 hour. A subset of readers and contributors from these regions were unable to reach any wiki projects. Services such as Phabricator and Gerrit Code Review were affected as well. It was a partial issue because the network malfunction was limited to one of many providers we use in the affected regions.
'''Impact''': For up to an hour, some regions experienced a partial connectivity outage. This primarily affected the US East Coast for ~13 minutes, and Russia for 1 hour. A subset of readers and contributors from these regions were unable to reach any wiki projects. Services such as Phabricator and Gerrit Code Review were affected as well. It was a partial issue because the network malfunction was limited to one of many providers we use in the affected regions.


Because of the reach of this provider's network, the farther a client is from Eqiad, the more likely it is to use that provider to reach our network, and thus the more likely it was to be impacted.<!-- Reminder: No private information on this page! -->
Line 23: Line 48:
'''Documentation''':[[File:Screenshot 2021-10-08 at 08-23-35 NEL (Network Error Logging) - Elastic.png|none|thumb|[https://logstash.wikimedia.org/goto/93cb07d63964e0271ecf8eece8845a7b Logstash: NEL reports (restricted)]]]
*[https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?var-datasource=eqiad%20prometheus%2Fops&var-target_site=All&var-ip_version=ipv4&var-country_code=All&var-asn=All&from=1633620224947&to=1633632035019 Grafana: Ripe Atlas]
 
==Scorecard==
{| class="wikitable"
| colspan="2" |'''Incident Engagement™  ScoreCard'''
| '''Score'''
|-
| rowspan="5" |'''People'''
|Were the people responding to this incident sufficiently different than the previous N incidents? (0/5pt)
|
|-
|Were the people who responded prepared enough to respond effectively? (0/5pt)
|
|-
|Did fewer than 5 people get paged? (0/5pt)
|
|-
|Were pages routed to the correct sub-team(s)?
|
|-
|Were pages routed to online (working hours) engineers (0/5pt)? (score 0 if people were paged after-hours)
|
|-
| rowspan="6" |'''Process'''
|Was the incident status section actively updated during the incident? (0/1pt)
|
|-
|If this was a major outage noticed by the community, was the public status page updated? If the issue was internal, was the rest of the organization updated with relevant incident statuses? (0/1pt)
|
|-
|Is there a phabricator task for the incident? (0/1pt)
|
|-
|Are the documented action items assigned?  (0/1pt)
|
|-
|Is this a repeat of an earlier incident? (-1 per previous occurrence)
|
|-
|Is there an open task that would prevent this incident / make mitigation easier if implemented? (0/-1pt per task)
|
|-
| rowspan="4" |'''Tooling'''
|Did the people responding have trouble communicating effectively during the incident due to existing tooling or a lack of tooling? (0/5pt)
|
|-
|Did existing monitoring notify the initial responders? (1pt)
|
|-
|Were all engineering tools required available and in service? (0/5pt)
|
|-
|Was there a runbook for all known issues present? (0/5pt)
|
|-
| colspan="2" |'''Total Score'''
|
|}
==Actionables==
*[[phab:T292792|2021-10-07 network provider issues causing all Wikimedia sites to be unreachable for many users]]
**[[gerrit:c/operations/puppet/+/727594|patch to make NEL alert paging]]
*Request RFO from provider

Revision as of 18:03, 1 February 2022

document status: in-review

Summary and Metadata

The metadata provides a quick snapshot of the context around what happened during the incident.

{| class="wikitable"
|'''Incident ID'''
|2021-10-08 network provider
|'''UTC Start Timestamp:'''
|YYYY-MM-DD hh:mm:ss
|-
|'''Incident Task'''
|https://phabricator.wikimedia.org/T292792
|'''UTC End Timestamp'''
|YYYY-MM-DD hh:mm:ss
|-
|'''People Paged'''
|<amount of people>
|'''Responder Count'''
|<amount of people>
|-
|'''Coordinator(s)'''
|Names - Emails
|'''Relevant Metrics / SLO(s) affected'''
|Relevant metrics
% error budget
|-
|'''Summary:'''
| colspan="3" |For up to an hour, some regions experienced a partial connectivity outage. This primarily affected the US East Coast for ~13 minutes, and Russia for 1 hour. A subset of readers and contributors from these regions were unable to reach any wiki projects. Services such as Phabricator and Gerrit Code Review were affected as well. It was a partial issue because the network malfunction was limited to one of many providers we use in the affected regions.
|}

Around 16:11 UTC our (non-paging) monitoring and users reported connectivity issues to and from our Eqiad location. Traceroutes showed a routing loop in a provider's network.
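
As an illustration of what a routing loop looks like in traceroute output, here is a minimal Python sketch. It is not the tooling used during the incident: the function name `find_routing_loop` and the hop addresses (taken from the reserved documentation ranges) are hypothetical, and the input is assumed to be the ordered list of responding router addresses from a single traceroute, with `None` for hops that did not answer.

```python
from typing import Dict, List, Optional, Sequence


def find_routing_loop(hops: Sequence[Optional[str]]) -> Optional[List[str]]:
    """Return the repeating cycle of hop addresses, or None if no hop repeats.

    `hops` is the ordered list of router addresses from one traceroute;
    None marks a hop that did not respond (printed as "*" by traceroute).
    """
    last_seen: Dict[str, int] = {}
    for index, hop in enumerate(hops):
        if hop is None:
            continue  # silent hops carry no information about loops
        if hop in last_seen:
            # The same router answered again: the hops between the two
            # sightings are the routers the packet is bouncing between.
            return [h for h in hops[last_seen[hop]:index] if h is not None]
        last_seen[hop] = index
    return None


if __name__ == "__main__":
    # Hypothetical trace: the packet bounces between two provider routers.
    looping_trace = [
        "192.0.2.1",
        "198.51.100.7",
        "198.51.100.9",
        "198.51.100.7",
        "198.51.100.9",
    ]
    print(find_routing_loop(looping_trace))
```

A trace in which every responding hop is distinct returns `None`. A single repeated address is treated as a loop here, which is a simplification: real traces can repeat a hop for other reasons, so a result from this sketch is a hint rather than proof.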

At 16:19 UTC, using the provider's APIs, we asked the provider to stop advertising the prefixes for Eqiad on our behalf.

At 16:24 UTC, the first reports of recovery arrived as the change propagated through the DFZ (default-free zone).

Unfortunately, because our attention was on the Eqiad impact and its recovery, we didn't notice that the issue also prevented users in Russia from reaching the wikis through Esams. We also didn't receive NEL reports from most users in Russia until 16:24 or later, because the location that receives NEL reports about Esams is itself Eqiad.

At around 17:15 UTC, our monitoring showed a full recovery, indicating that the issue had been resolved upstream.

Impact: For up to an hour, some regions experienced a partial connectivity outage. This primarily affected the US East Coast for ~13 minutes, and Russia for 1 hour. A subset of readers and contributors from these regions were unable to reach any wiki projects. Services such as Phabricator and Gerrit Code Review were affected as well. It was a partial issue because the network malfunction was limited to one of many providers we use in the affected regions.

Because of the reach of this provider's network, the farther a client is from Eqiad, the more likely it is to use that provider to reach our network, and thus the more likely it was to be impacted.

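Documentation:

*[https://logstash.wikimedia.org/goto/93cb07d63964e0271ecf8eece8845a7b Logstash: NEL reports (restricted)]
*[https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?var-datasource=eqiad%20prometheus%2Fops&var-target_site=All&var-ip_version=ipv4&var-country_code=All&var-asn=All&from=1633620224947&to=1633632035019 Grafana: Ripe Atlas]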

Scorecard

{| class="wikitable"
| colspan="2" |'''Incident Engagement™ ScoreCard'''
|'''Score'''
|-
| rowspan="5" |'''People'''
|Were the people responding to this incident sufficiently different than the previous N incidents? (0/5pt)
|
|-
|Were the people who responded prepared enough to respond effectively? (0/5pt)
|
|-
|Did fewer than 5 people get paged? (0/5pt)
|
|-
|Were pages routed to the correct sub-team(s)?
|
|-
|Were pages routed to online (working hours) engineers (0/5pt)? (score 0 if people were paged after-hours)
|
|-
| rowspan="6" |'''Process'''
|Was the incident status section actively updated during the incident? (0/1pt)
|
|-
|If this was a major outage noticed by the community, was the public status page updated? If the issue was internal, was the rest of the organization updated with relevant incident statuses? (0/1pt)
|
|-
|Is there a phabricator task for the incident? (0/1pt)
|
|-
|Are the documented action items assigned? (0/1pt)
|
|-
|Is this a repeat of an earlier incident? (-1 per previous occurrence)
|
|-
|Is there an open task that would prevent this incident / make mitigation easier if implemented? (0/-1pt per task)
|
|-
| rowspan="4" |'''Tooling'''
|Did the people responding have trouble communicating effectively during the incident due to existing tooling or a lack of tooling? (0/5pt)
|
|-
|Did existing monitoring notify the initial responders? (1pt)
|
|-
|Were all engineering tools required available and in service? (0/5pt)
|
|-
|Was there a runbook for all known issues present? (0/5pt)
|
|-
| colspan="2" |'''Total Score'''
|
|}

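Actionables

*[[phab:T292792|2021-10-07 network provider issues causing all Wikimedia sites to be unreachable for many users]]
**[[gerrit:c/operations/puppet/+/727594|patch to make NEL alert paging]]
*Request RFO from provider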