Incidents/2022-04-27 cr2-eqord down

imported>Herron
m (adding link to related gdoc)
imported>Krinkle
 
{{irdoc|status=draft}}
#REDIRECT [[Incidents/2022-04-26 cr2-eqord down]]
 
==Summary==
{{Incident scorecard
| task =
| paged-num = 18
| responders-num = 5
| coordinators = marostegui
| start = 05:13
| end = 07:45
| metrics = No relevant SLOs exist
| impact = None
}}
<!-- Reminder: No private information on this page! -->
 
Equinix Chicago confirmed that they were performing power maintenance. The maintenance should not have affected racks on our floor, but cr2-eqord lost power.
 
There was also a separate problem with Telia: a fiber cut had taken down the eqord -> eqiad transport since the previous evening, and Telia maintenance then took down the remaining transports to codfw and ulsfo.
 
'''Documentation''': https://docs.google.com/document/d/13-kHFdSw33P6NJzS95c24zOHaJAKvVSl8DFF3o6VCmY/edit
*''cathal@nbgw:~$ ping -c 3 208.115.136.238''
*:''PING 208.115.136.238 (208.115.136.238) 56(84) bytes of data.''
*:''--- 208.115.136.238 ping statistics ---''
*:''3 packets transmitted, 0 received, 100% packet loss, time 2055ms''
 
*Telia circuit failure logs:
''Apr 25 17:32:42 cr2-eqord fpc0 MQSS(0): CHMAC0: Detected Ethernet MAC Remote Fault Delta Event for Port 5 (xe-0/1/5) ''
 
''Apr 26 05:11:57 cr2-eqord fpc0 MQSS(0): CHMAC0: Detected Ethernet MAC Remote Fault Delta Event for Port 3 (xe-0/1/3)'' 
 
''Apr 26 05:11:57 cr2-eqord fpc0 MQSS(0): CHMAC0: Detected Ethernet MAC Local Fault Delta Event for Port 0 (xe-0/1/0)''
 
''Apr 26 07:15:19 cr2-eqord fpc0 MQSS(0): CHMAC0: Cleared Ethernet MAC Remote Fault Delta Event for Port 3 (xe-0/1/3)''
 
''Apr 26 07:15:47 cr2-eqord fpc0 MQSS(0): CHMAC0: Cleared Ethernet MAC Local Fault Delta Event for Port 0 (xe-0/1/0)''
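
The fault windows in the log excerpt above can be worked out by pairing each "Detected" event with the matching "Cleared" event for the same interface and fault type. The following is a minimal Python sketch of that arithmetic, not tooling used during the incident. The function name <code>fault_durations</code> and the regular expression are illustrative assumptions, and the year 2022 is taken from the incident date because the log lines do not carry one.

```python
import re
from datetime import datetime

# Log lines copied from the excerpt above.
LOG_LINES = [
    "Apr 25 17:32:42 cr2-eqord fpc0 MQSS(0): CHMAC0: Detected Ethernet MAC Remote Fault Delta Event for Port 5 (xe-0/1/5)",
    "Apr 26 05:11:57 cr2-eqord fpc0 MQSS(0): CHMAC0: Detected Ethernet MAC Remote Fault Delta Event for Port 3 (xe-0/1/3)",
    "Apr 26 05:11:57 cr2-eqord fpc0 MQSS(0): CHMAC0: Detected Ethernet MAC Local Fault Delta Event for Port 0 (xe-0/1/0)",
    "Apr 26 07:15:19 cr2-eqord fpc0 MQSS(0): CHMAC0: Cleared Ethernet MAC Remote Fault Delta Event for Port 3 (xe-0/1/3)",
    "Apr 26 07:15:47 cr2-eqord fpc0 MQSS(0): CHMAC0: Cleared Ethernet MAC Local Fault Delta Event for Port 0 (xe-0/1/0)",
]

# Hypothetical pattern written for the lines above; other message
# variants from the same platform are not covered.
EVENT_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) .*?"
    r"(?P<action>Detected|Cleared) Ethernet MAC (?P<kind>Remote|Local) Fault "
    r"Delta Event for Port \d+ \((?P<iface>[^)]+)\)"
)


def fault_durations(lines, year=2022):
    """Pair Detected/Cleared events per (interface, fault type).

    Returns (durations, still_open): durations maps each cleared fault
    to a timedelta, still_open maps uncleared faults to their start time.
    The year is an assumption, since syslog timestamps omit it.
    """
    still_open = {}
    durations = {}
    for line in lines:
        m = EVENT_RE.search(line)
        if m is None:
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        key = (m["iface"], m["kind"])
        if m["action"] == "Detected":
            still_open[key] = ts
        elif key in still_open:
            durations[key] = ts - still_open.pop(key)
    return durations, still_open


durations, still_open = fault_durations(LOG_LINES)
for (iface, kind), delta in durations.items():
    print(f"{iface} {kind} fault lasted {delta}")
for (iface, kind), start in still_open.items():
    print(f"{iface} {kind} fault since {start} has no Cleared event in the excerpt")
```

On the excerpt, the faults on xe-0/1/3 and xe-0/1/0 each last a little over two hours, and the xe-0/1/5 fault from the previous evening has no matching "Cleared" line.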
 
==Actionables==
* Grant Cathal authorization to create remote hands cases (DONE)
 
==Scorecard==
 
{| class="wikitable"
|+[[Incident Scorecard|Incident Engagement™ ScoreCard]]
!
! Question
!Score
! Notes
|-
! rowspan="5" |People
|Were the people responding to this incident sufficiently different than the previous five incidents? (score 1 for yes, 0 for no)
|0
|Info not logged
|-
|Were the people who responded prepared enough to respond effectively? (score 1 for yes, 0 for no)
|0
|Manually escalated to netops
|-
|Were more than 5 people paged? (score 0 for yes, 1 for no)
|0
|
|-
|Were pages routed to the correct sub-team(s)? (score 1 for yes, 0 for no)
|0
|
|-
|Were pages routed to online (business hours) engineers? (score 1 for yes, 0 if people were paged after business hours)
|0
|
|-
! rowspan="5" |Process
|Was the incident status section actively updated during the incident? (score 1 for yes, 0 for no)
|1
|
|-
|Was the public status page updated? (score 1 for yes, 0 for no)
|0
|
|-
|Is there a phabricator task for the incident? (score 1 for yes, 0 for no)
|0
|
|-
|Are the documented action items assigned? (score 1 for yes, 0 for no)
|1
|
|-
|Is this a repeat of an earlier incident? (score 0 for yes, 1 for no)
|0
|
|-
! rowspan="5" |Tooling
|Were there, before the incident occurred, open tasks that would have prevented this incident or made mitigation easier if implemented? (score 0 for yes, 1 for no)
|1
|
|-
|Were the people responding able to communicate effectively during the incident with the existing tooling? (score 1 for yes, 0 for no)
|1
|
|-
|Did existing monitoring notify the initial responders? (score 1 for yes, 0 for no)
|1
|
|-
|Were all engineering tools required available and in service? (score 1 for yes, 0 for no)
|1
|
|-
|Was there a runbook for all known issues present? (score 1 for yes, 0 for no)
|1
|
|-
! colspan="2" align="right" |Total score
|7
|
|}

Latest revision as of 17:34, 9 May 2022