Incidents/2021-11-25 eventgate-main outage

imported>Krinkle
 
imported>Herron
(Updating to reflect scorecard amendments)
 

Latest revision as of 19:12, 26 April 2022

document status: in-review

Summary and Metadata

Incident ID: 2021-11-25 eventgate-main outage
UTC Start Timestamp: 2021-11-25 07:32
UTC End Timestamp: 2021-11-25 07:35
Incident Task: https://phabricator.wikimedia.org/T299970
People Paged: 0
Responder Count: 1
Coordinator(s): No coordinator needed
Relevant Metrics / SLO(s) affected:
  • 25k MediaWiki backend errors
  • 1k web & API requests resulted in a 500
  • Event intake dropped from ~3k/second to 0 for the duration

No SLO defined, no error budget consumed

Summary: For about 3 minutes (from 7:32 to 7:35 UTC), eventgate-main was unavailable. This resulted in 25,000 unrecoverable MediaWiki backend errors due to inability to queue new jobs. About 1,000 user-facing web requests and API requests failed with an HTTP 500 error. Event intake processing rate measured by eventgate briefly dropped from ~3000/second to 0/second during the outage.

During the helm3 migration of the eqiad Kubernetes cluster, the eventgate-main service experienced an outage. The service was unavailable between 7:32 and 7:35 UTC.

For the helm3 migration, each service had to be removed from the cluster and re-deployed. Most Kubernetes services were explicitly pooled in codfw only during the re-deployments. eventgate-main was wrongly assumed to be served by codfw as well, but it was still pooled in eqiad. As a result, no traffic could be served for this service while its pods were being removed and re-created.

The commands used to migrate and re-deploy codfw (see T251305#7492328) were adapted and re-used for eqiad (see T251305#7526591). Because of a small difference between which Kubernetes services are pooled active-active and which are active-passive, eventgate-main was missing from the depooling command (as it is not currently pooled in codfw).
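
The failure mode described above (a service pooled only in the datacenter being re-deployed, and absent from the depooling command) can be caught by a pre-flight check. The following is a minimal sketch, not the tooling actually used: the function name, the data layout, and the services other than eventgate-main are hypothetical, and a real check would read the pooling state from the live service-discovery configuration rather than from a hard-coded dictionary.

```python
from typing import Dict, List, Set


def services_left_unserved(
    pooled_in: Dict[str, Set[str]],
    redeploy_dc: str,
    depool_list: List[str],
) -> List[str]:
    """Return services that would serve no traffic while redeploy_dc is re-deployed.

    pooled_in maps a service name to the datacenters it is currently pooled in.
    depool_list holds the services the operator plans to depool from redeploy_dc.
    """
    at_risk = []
    for service, datacenters in sorted(pooled_in.items()):
        if redeploy_dc not in datacenters:
            continue  # not pooled here, so the re-deployment does not affect it
        remaining = datacenters - {redeploy_dc}
        if service in depool_list and remaining:
            continue  # depooled here and still served from another datacenter
        at_risk.append(service)
    return at_risk


# Hypothetical pooling state, for illustration only.
pooled_in = {
    "service-a": {"eqiad", "codfw"},  # active-active
    "service-b": {"eqiad", "codfw"},  # active-active
    "eventgate-main": {"eqiad"},      # pooled only in eqiad, as in this incident
}
depool_list = ["service-a", "service-b"]  # eventgate-main is missing

print(services_left_unserved(pooled_in, "eqiad", depool_list))
```

With this hypothetical input the check reports eventgate-main, because it is pooled in eqiad only and does not appear in the depool list. Running such a check before removing any pods would surface the gap between the adapted command and the actual pooling state.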

Impact: For about 3 minutes (from 7:32 to 7:35 UTC), eventgate-main was unavailable. This resulted in 25,000 unrecoverable MediaWiki backend errors due to inability to queue new jobs. About 1,000 user-facing web requests and API requests failed with an HTTP 500 error. Event intake processing rate measured by eventgate briefly dropped from ~3000/second to 0/second during the outage.

Documentation:

Scorecard

Question: Score (no notes were recorded for any question)

People
  • Were the people responding to this incident sufficiently different than the previous five incidents? (score 1 for yes, 0 for no): 1
  • Were the people who responded prepared enough to respond effectively? (score 1 for yes, 0 for no): 1
  • Were more than 5 people paged? (score 0 for yes, 1 for no): 1
  • Were pages routed to the correct sub-team(s)? (score 1 for yes, 0 for no): N/A
  • Were pages routed to online (business hours) engineers? (score 1 for yes, 0 if people were paged after business hours): N/A

Process
  • Was the incident status section actively updated during the incident? (score 1 for yes, 0 for no): N/A
  • Was the public status page updated? (score 1 for yes, 0 for no): N/A
  • Is there a phabricator task for the incident? (score 1 for yes, 0 for no): 0
  • Are the documented action items assigned? (score 1 for yes, 0 for no): 1
  • Is this a repeat of an earlier incident? (score 0 for yes, 1 for no): 0

Tooling
  • Were there, before the incident occurred, open tasks that would prevent this incident / make mitigation easier if implemented? (score 0 for yes, 1 for no): 0
  • Were the people responding able to communicate effectively during the incident with the existing tooling? (score 1 for yes, 0 for no): N/A
  • Did existing monitoring notify the initial responders? (score 1 for yes, 0 for no): N/A
  • Were all engineering tools required available and in service? (score 1 for yes, 0 for no): 1
  • Was there a runbook for all known issues present? (score 1 for yes, 0 for no): 0

Total score: 5
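
The total can be reproduced from the individual scores. The sketch below assumes that N/A answers are simply left out of the sum, which is consistent with the reported total but is not stated explicitly in the scorecard. The short keys are hypothetical labels for the questions above.

```python
from typing import Dict, Optional

# Scores as recorded in the scorecard; None stands for N/A.
scores: Dict[str, Optional[int]] = {
    "people_different_responders": 1,
    "people_prepared": 1,
    "people_not_more_than_5_paged": 1,
    "people_correct_subteam": None,
    "people_business_hours": None,
    "process_status_updated": None,
    "process_public_status_page": None,
    "process_phabricator_task": 0,
    "process_action_items_assigned": 1,
    "process_not_a_repeat": 0,
    "tooling_no_prior_open_tasks": 0,
    "tooling_communication": None,
    "tooling_monitoring_notified": None,
    "tooling_tools_available": 1,
    "tooling_runbook_present": 0,
}

# Assumption: N/A entries contribute nothing to the total.
total = sum(score for score in scores.values() if score is not None)
print(total)
```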

Actionables