You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org

Incidents/2022-03-01 ulsfo network

From Wikitech-static
< Incidents
Revision as of 17:49, 8 April 2022 by imported>Krinkle (Krinkle moved page Incident documentation/2022-03-01 ulsfo network to Incidents/2022-03-01 ulsfo network)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID 2022-03-01 ulsfo network Start 2022-03-01 22:35:25
Task End 2022-03-01 22:55:00
People paged Responder count 4
Coordinators jhathaway Affected metrics/SLOs
Impact For 20 minutes, clients normally routed to Ulsfo were unable to reach any of our projects. This includes New Zealand, parts of Canada, and the United States west coast.
Summary Multiple of our redundant network providers for the San Francisco datacenter simultaneously experienced connectivity loss. After 20 minutes, clients were rerouted to other datacenters.

Documentation:

Actionables

  • T303219 Integrate DNS depools with Etcd and automate/remove the need for writing a Git commit.
  • Can we increase fiber redundancy?

Scorecard

Incident Engagement™ ScoreCard
Rubric Question Score
People Were the people responding to this incident sufficiently different than the previous N incidents? (0/5pt)
Were the people who responded prepared enough to respond effectively (0/5pt)
Did fewer than 5 people get paged (0/5pt)?
Were pages routed to the correct sub-team(s)?
Were pages routed to online (working hours) engineers (0/5pt)? (score 0 if people were paged after-hours)
Process Was the incident status section actively updated during the incident? (0/1pt)
If this was a major outage noticed by the community, was the public status page updated? If the issue was internal, was the rest of the organization updated with relevant incident statuses? (0/1pt)
Is there a phabricator task for the incident? (0/1pt)
Are the documented action items assigned? (0/1pt)
Is this a repeat of an earlier incident (-1 per prev occurrence)
Is there an open task that would prevent this incident / make mitigation easier if implemented? (0/-1p per task)
Tooling Did the people responding have trouble communicating effectively during the incident due to the existing or lack of tooling? (0/5pt)
Did existing monitoring notify the initial responders? (1pt)
Were all engineering tools required available and in service? (0/5pt)
Was there a runbook for all known issues present? (0/5pt)
Total score