Incidents/2022-04-06 esams network: Difference between revisions

imported>Herron
(drafting initial score, based on incident documentation)
 
imported>Krinkle
mNo edit summary
 
Line 3: Line 3:
==Summary==
==Summary==
{{Incident scorecard
{{Incident scorecard
| task =  
| task = T305532
| paged-num =  
| paged-num =  
| responders-num =  
| responders-num =  
| coordinators =  
| coordinators = Jaime C
| start =  
| start = 08:20
| end =  
| end = 08:50
| impact = For 30 minutes, wikis were slow or unreachable for a portion of clients routed to the Esams data center. This is one of two data centers serving Europe, the Middle East, and Africa.
}}
}}
<!-- Reminder: No private information on this page! -->
<!-- Reminder: No private information on this page! -->


<mark>Summary of what happened, in one or two paragraphs. Avoid assuming deep knowledge of the systems here, and try to differentiate between proximate causes and root causes.</mark>
This was due to [https://www.ams-ix.net/ams/news/outage-at-the-ams-ix-platform-in-amsterdam an issue] at the [[w:Amsterdam Internet Exchange|Amsterdam Internet Exchange]] (AMS-IX).


'''Documentation''':
'''Documentation''':
*https://docs.google.com/document/d/1FfWF7LVyDpWvtIcv3G-Z4xaPZvOR3Rn05NZegywfVaY/edit#heading=h.vg6rb6x2eccy
 
* https://www.wikimediastatus.net/incidents/jnqvz8gljzhy
 
*[https://docs.google.com/document/d/1FfWF7LVyDpWvtIcv3G-Z4xaPZvOR3Rn05NZegywfVaY/edit#heading=h.vg6rb6x2eccy Restricted document]


==Actionables==
==Actionables==

Latest revision as of 17:01, 9 May 2022

document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID | 2022-04-06 esams network | Start | 08:20
Task | T305532 | End | 08:50
People paged | | Responder count |
Coordinators | Jaime C | Affected metrics/SLOs |
Impact | For 30 minutes, wikis were slow or unreachable for a portion of clients routed to the Esams data center. This is one of two data centers serving Europe, the Middle East, and Africa.

The disruption was caused by an outage on the Amsterdam Internet Exchange (AMS-IX) platform.

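Documentation:

  • https://www.wikimediastatus.net/incidents/jnqvz8gljzhy
  • Restricted document: https://docs.google.com/document/d/1FfWF7LVyDpWvtIcv3G-Z4xaPZvOR3Rn05NZegywfVaY/edit#heading=h.vg6rb6x2eccy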

Actionables

Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.

  • To do #1 (TODO: Create task)
  • To do #2 (TODO: Create task)

TODO: Add the #Sustainability (Incident Followup) and the #SRE-OnFIRE (Pending Review & Scorecard) Phabricator tag to these tasks.

Scorecard

Incident Engagement™ ScoreCard
Category | Question | Score | Notes
People | Were the people responding to this incident sufficiently different from those in the previous five incidents? (score 1 for yes, 0 for no) | 1 |
People | Were the people who responded prepared enough to respond effectively? (score 1 for yes, 0 for no) | 1 |
People | Were more than 5 people paged? (score 0 for yes, 1 for no) | 0 | Paged via batphone
People | Were pages routed to the correct sub-team(s)? (score 1 for yes, 0 for no) | 0 | Paged via batphone
People | Were pages routed to online (business hours) engineers? (score 1 for yes, 0 if people were paged after business hours) | 0 | Paged via batphone
Process | Was the incident status section actively updated during the incident? (score 1 for yes, 0 for no) | 1 |
Process | Was the public status page updated? (score 1 for yes, 0 for no) | 0 |
Process | Is there a Phabricator task for the incident? (score 1 for yes, 0 for no) | 1 |
Process | Are the documented action items assigned? (score 1 for yes, 0 for no) | 0 |
Process | Is this a repeat of an earlier incident? (score 0 for yes, 1 for no) | 1 |
Tooling | Were there, before the incident occurred, open tasks that would have prevented this incident or made mitigation easier if implemented? (score 0 for yes, 1 for no) | 1 |
Tooling | Were the people responding able to communicate effectively during the incident with the existing tooling? (score 1 for yes, 0 for no) | 1 |
Tooling | Did existing monitoring notify the initial responders? (score 1 for yes, 0 for no) | 1 |
Tooling | Were all engineering tools required available and in service? (score 1 for yes, 0 for no) | 1 |
Tooling | Was there a runbook for all known issues present? (score 1 for yes, 0 for no) | 1 |
Total score | | 10 |