Incidents/2022-04-06 esams network
Latest revision as of 17:01, 9 May 2022
document status: draft
Summary
Incident ID | 2022-04-06 esams network | Start | 08:20 |
---|---|---|---|
Task | T305532 | End | 08:50 |
People paged | | Responder count | |
Coordinators | Jaime C | Affected metrics/SLOs | |
Impact | For 30 minutes, wikis were slow or unreachable for some clients of the Esams data center. Esams is one of two data centers serving Europe, the Middle East, and Africa. |
This was due to an issue (https://www.ams-ix.net/ams/news/outage-at-the-ams-ix-platform-in-amsterdam) at the Amsterdam Internet Exchange (AMS-IX).
Documentation:
- https://www.wikimediastatus.net/incidents/jnqvz8gljzhy
- Restricted document: https://docs.google.com/document/d/1FfWF7LVyDpWvtIcv3G-Z4xaPZvOR3Rn05NZegywfVaY/edit#heading=h.vg6rb6x2eccy
Actionables
Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.
- To do #1 (TODO: Create task)
- To do #2 (TODO: Create task)
TODO: Add the #Sustainability (Incident Followup) and #SRE-OnFIRE (Pending Review & Scorecard) Phabricator tags to these tasks.
Scorecard
Category | Question | Score | Notes |
---|---|---|---|
People | Were the people responding to this incident sufficiently different from those in the previous five incidents? (score 1 for yes, 0 for no) | 1 | |
People | Were the people who responded prepared enough to respond effectively? (score 1 for yes, 0 for no) | 1 | |
People | Were more than 5 people paged? (score 0 for yes, 1 for no) | 0 | Paged via batphone |
People | Were pages routed to the correct sub-team(s)? (score 1 for yes, 0 for no) | 0 | Paged via batphone |
People | Were pages routed to online (business hours) engineers? (score 1 for yes, 0 if people were paged after business hours) | 0 | Paged via batphone |
Process | Was the incident status section actively updated during the incident? (score 1 for yes, 0 for no) | 1 | |
Process | Was the public status page updated? (score 1 for yes, 0 for no) | 0 | |
Process | Is there a Phabricator task for the incident? (score 1 for yes, 0 for no) | 1 | |
Process | Are the documented action items assigned? (score 1 for yes, 0 for no) | 0 | |
Process | Is this a repeat of an earlier incident? (score 0 for yes, 1 for no) | 1 | |
Tooling | Were there, before the incident occurred, open tasks that would prevent this incident or make mitigation easier if implemented? (score 0 for yes, 1 for no) | 1 | |
Tooling | Were the people responding able to communicate effectively during the incident with the existing tooling? (score 1 for yes, 0 for no) | 1 | |
Tooling | Did existing monitoring notify the initial responders? (score 1 for yes, 0 for no) | 1 | |
Tooling | Were all required engineering tools available and in service? (score 1 for yes, 0 for no) | 1 | |
Tooling | Was there a runbook for all known issues present? (score 1 for yes, 0 for no) | 1 | |
Total score | | 10 | |