
Incidents/2022-04-27 cr2-eqord down

From Wikitech-static
Revision as of 05:27, 27 April 2022 by imported>Marostegui (SLOs)

document status: draft

Summary

Incident metadata (see Incident Scorecard)

  • Incident ID: 2022-04-27 cr2-eqord down
  • Start: 05:13
  • End: 07:45
  • Task: (blank)
  • People paged: 18
  • Responder count: 5
  • Coordinators: marostegui
  • Affected metrics/SLOs: No relevant SLOs exist
  • Impact: None

Equinix Chicago confirmed that they were performing power maintenance. It should not have affected racks on our floor, but cr2-eqord lost power nonetheless.

There were also two Telia problems: a fiber cut had taken down the eqord -> eqiad transport since the previous evening, and Telia maintenance took down the remaining transports to codfw and ulsfo.

Documentation:

  • cathal@nbgw:~$ ping -c 3 208.115.136.238
    PING 208.115.136.238 (208.115.136.238) 56(84) bytes of data.

    --- 208.115.136.238 ping statistics ---
    3 packets transmitted, 0 received, 100% packet loss, time 2055ms
  • Telia circuit failure logs:

Apr 25 17:32:42 cr2-eqord fpc0 MQSS(0): CHMAC0: Detected Ethernet MAC Remote Fault Delta Event for Port 5 (xe-0/1/5)

Apr 26 05:11:57 cr2-eqord fpc0 MQSS(0): CHMAC0: Detected Ethernet MAC Remote Fault Delta Event for Port 3 (xe-0/1/3)

Apr 26 05:11:57 cr2-eqord fpc0 MQSS(0): CHMAC0: Detected Ethernet MAC Local Fault Delta Event for Port 0 (xe-0/1/0)

Apr 26 07:15:19 cr2-eqord fpc0 MQSS(0): CHMAC0: Cleared Ethernet MAC Remote Fault Delta Event for Port 3 (xe-0/1/3)

Apr 26 07:15:47 cr2-eqord fpc0 MQSS(0): CHMAC0: Cleared Ethernet MAC Local Fault Delta Event for Port 0 (xe-0/1/0)
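The log lines above pair up into fault windows per interface. As a rough illustration, the following Python sketch matches each "Detected" event with its "Cleared" event and reports how long each fault lasted. The function name `fault_windows` and the regular expression are illustrative assumptions and not part of any Wikimedia tooling. The year 2022 is also assumed, because the syslog timestamps carry no year.

```python
import re
from datetime import datetime

# Log lines copied verbatim from the Telia circuit failure logs above.
LOG_LINES = [
    "Apr 25 17:32:42 cr2-eqord fpc0 MQSS(0): CHMAC0: Detected Ethernet MAC Remote Fault Delta Event for Port 5 (xe-0/1/5)",
    "Apr 26 05:11:57 cr2-eqord fpc0 MQSS(0): CHMAC0: Detected Ethernet MAC Remote Fault Delta Event for Port 3 (xe-0/1/3)",
    "Apr 26 05:11:57 cr2-eqord fpc0 MQSS(0): CHMAC0: Detected Ethernet MAC Local Fault Delta Event for Port 0 (xe-0/1/0)",
    "Apr 26 07:15:19 cr2-eqord fpc0 MQSS(0): CHMAC0: Cleared Ethernet MAC Remote Fault Delta Event for Port 3 (xe-0/1/3)",
    "Apr 26 07:15:47 cr2-eqord fpc0 MQSS(0): CHMAC0: Cleared Ethernet MAC Local Fault Delta Event for Port 0 (xe-0/1/0)",
]

# Hypothetical pattern written for the message format shown above only.
PATTERN = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ .*?"
    r"(?P<action>Detected|Cleared) Ethernet MAC (?P<kind>Remote|Local) Fault "
    r"Delta Event for Port \d+ \((?P<ifname>[^)]+)\)$"
)


def fault_windows(lines, year=2022):
    """Pair Detected/Cleared events per (interface, fault kind).

    Returns (closed, still_open): closed is a list of
    (interface, kind, start, duration) tuples; still_open maps
    (interface, kind) to the start time of faults never cleared.
    The year is an assumption, since syslog timestamps omit it.
    """
    still_open = {}
    closed = []
    for line in lines:
        m = PATTERN.match(line)
        if m is None:
            continue  # skip lines that are not MAC fault events
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        key = (m["ifname"], m["kind"])
        if m["action"] == "Detected":
            still_open.setdefault(key, ts)  # keep the earliest detection
        elif key in still_open:
            start = still_open.pop(key)
            closed.append((m["ifname"], m["kind"], start, ts - start))
    return closed, still_open


closed, still_open = fault_windows(LOG_LINES)
for ifname, kind, start, duration in closed:
    print(f"{ifname} {kind} fault from {start} lasted {duration}")
for (ifname, kind), start in still_open.items():
    print(f"{ifname} {kind} fault from {start} has no Cleared event in this excerpt")
```

On this excerpt the sketch reports two closed windows of roughly two hours each, for xe-0/1/3 and xe-0/1/0, and leaves the xe-0/1/5 fault open because no matching "Cleared" line appears in the logs quoted here.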

Actionables

  • Authorize Cathal to create remote hands cases (DONE)

Scorecard

Incident Engagement™ ScoreCard

Question Score Notes

People

  • Were the people responding to this incident sufficiently different than the previous five incidents? (score 1 for yes, 0 for no)
  • Were the people who responded prepared enough to respond effectively? (score 1 for yes, 0 for no)
  • Were more than 5 people paged? (score 0 for yes, 1 for no)
  • Were pages routed to the correct sub-team(s)? (score 1 for yes, 0 for no)
  • Were pages routed to online (business hours) engineers? (score 1 for yes, 0 if people were paged after business hours)

Process

  • Was the incident status section actively updated during the incident? (score 1 for yes, 0 for no)
  • Was the public status page updated? (score 1 for yes, 0 for no)
  • Is there a phabricator task for the incident? (score 1 for yes, 0 for no)
  • Are the documented action items assigned? (score 1 for yes, 0 for no)
  • Is this a repeat of an earlier incident? (score 0 for yes, 1 for no)

Tooling

  • Were there, before the incident occurred, open tasks that would have prevented this incident or made mitigation easier if implemented? (score 0 for yes, 1 for no)
  • Were the people responding able to communicate effectively during the incident with the existing tooling? (score 1 for yes, 0 for no)
  • Did existing monitoring notify the initial responders? (score 1 for yes, 0 for no)
  • Were all engineering tools required available and in service? (score 1 for yes, 0 for no)
  • Was there a runbook for all known issues present? (score 1 for yes, 0 for no)

Total score
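
The scoring rule in the questions above can be sketched in a few lines of Python: most questions score 1 for "yes", while three of them (more than 5 people paged, repeat incident, pre-existing open tasks) score 0 for "yes". The question keys, the `total_score` function, and the sample answers are hypothetical names and values chosen for illustration. They are not the scores for this incident, which the scorecard does not record.

```python
# Hypothetical short keys for the 15 scorecard questions, in order.
QUESTIONS = [
    "responders_differ_from_last_five",
    "responders_prepared",
    "more_than_5_paged",
    "pages_routed_to_correct_subteam",
    "pages_routed_to_online_engineers",
    "status_section_updated",
    "public_status_page_updated",
    "phabricator_task_exists",
    "action_items_assigned",
    "repeat_of_earlier_incident",
    "preexisting_open_tasks",
    "communication_tooling_worked",
    "monitoring_notified_responders",
    "engineering_tools_available",
    "runbook_present",
]

# Questions where "yes" scores 0 and "no" scores 1, per the rubric text.
INVERTED = {
    "more_than_5_paged",
    "repeat_of_earlier_incident",
    "preexisting_open_tasks",
}


def total_score(answers):
    """Sum the scorecard given a mapping of question key to True (yes) / False (no)."""
    score = 0
    for question in QUESTIONS:
        yes = answers[question]
        if question in INVERTED:
            score += 0 if yes else 1
        else:
            score += 1 if yes else 0
    return score


# Illustrative input only: answering "yes" to every question.
all_yes = {question: True for question in QUESTIONS}
print(total_score(all_yes))
```

Keeping the inverted questions in an explicit set makes the asymmetry in the rubric visible, so a "yes" is not accidentally counted as a point for every row.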