Incident documentation/2021-11-25 eventgate-main outage

==Summary and Metadata==
{| class="wikitable"
|'''Incident ID'''
|2021-11-25 eventgate-main outage
|'''UTC Start Timestamp:'''
2021-11-25 07:32:00
|-
|'''Incident Task'''
|https://phabricator.wikimedia.org/T299970
|'''UTC End Timestamp'''
2021-11-25 07:35:00
|-
| '''People Paged'''
|<amount of people>
|'''Responder Count'''
|<amount of people>
|-
|'''Coordinator(s)'''
|Names - Emails
|'''Relevant Metrics / SLO(s) affected'''
|Relevant metrics
% error budget
|-
|'''Summary:'''
| colspan="3" |For about 3 minutes (from 7:32 to 7:35 UTC), eventgate-main was unavailable. This resulted in 25,000 unrecoverable MediaWiki backend errors due to inability to queue new jobs. About 1,000 user-facing web requests and API requests failed with an HTTP 500 error. Event intake processing rate measured by eventgate briefly dropped from ~3000/second to 0/second during the outage.
|}
During the [[Helm|helm3]] migration of the eqiad Kubernetes cluster, the service eventgate-main experienced an outage. The service was unavailable between 7:32 and 7:35 UTC.


For the helm3 migration the service had to be removed and re-deployed to the cluster. Most Kubernetes services were explicitly pooled codfw-only during the re-deployments. eventgate-main was incorrectly assumed to also be served by [[Codfw cluster|Codfw]], but it was still pooled in [[Eqiad cluster|Eqiad]]. So while the pods were being removed and re-created, no traffic could be served for this service.

The commands used to migrate and re-deploy codfw (see T251305#7492328) were adapted and re-used for eqiad (see T251305#7526591). Due to a small difference in which Kubernetes services are pooled as active-active and which as active-passive, eventgate-main was missing from the depooling command (as it is not currently pooled in codfw).
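In set terms, the gap was that the depool list was derived from the services codfw can serve, so a service pooled only in eqiad fell through it. The sketch below is a minimal illustration of that failure mode, not the actual conftool/helmfile tooling; the service names and pooling data are hypothetical apart from eventgate-main being eqiad-only at the time.

<syntaxhighlight lang="python">
# Illustrative only: hypothetical pooling state per service, not real conftool data.
# At the time of the incident, eventgate-main was pooled in eqiad only.
pooled_in = {
    "eventgate-analytics": {"eqiad", "codfw"},  # active-active
    "eventgate-main":      {"eqiad"},           # active-passive, eqiad only
    "sessionstore":        {"eqiad", "codfw"},  # active-active
}

# Depool list adapted from the codfw migration: services that codfw can
# serve, and which can therefore be safely drained from eqiad.
depool_from_eqiad = {svc for svc, dcs in pooled_in.items() if "codfw" in dcs}

# Services that will be torn down and re-deployed in eqiad but are not
# covered by the depool step: they have nowhere to serve from while their
# pods are re-created.
at_risk = {svc for svc, dcs in pooled_in.items()
           if dcs == {"eqiad"} and svc not in depool_from_eqiad}

print(sorted(at_risk))  # ['eventgate-main']
</syntaxhighlight>

In this sketch, checking the at_risk set before tearing down the eqiad releases would have surfaced eventgate-main.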
'''Documentation''':
<gallery>
File:2021-11-25-varnish-http500.png
</gallery>
 
==Scorecard==
{| class="wikitable"
| colspan="2" |'''Incident Engagement™  ScoreCard'''
|'''Score'''
|-
| rowspan="5" |'''People'''
|Were the people responding to this incident sufficiently different than the previous N incidents? (0/5pt)
|
|-
|Were the people who responded prepared enough to respond effectively (0/5pt)
|
|-
|Did fewer than 5 people get paged (0/5pt)?
|
|-
|Were pages routed to the correct sub-team(s)?
|
|-
|Were pages routed to online (working hours) engineers (0/5pt)? (score 0 if people were paged after-hours)
|
|-
| rowspan="6" |'''Process'''
|Was the incident status section actively updated during the incident? (0/1pt)
|
|-
|If this was a major outage noticed by the community, was the public status page updated? If the issue was internal, was the rest of the organization updated with relevant incident statuses? (0/1pt)
|
|-
|Is there a phabricator task for the incident? (0/1pt)
|
|-
|Are the documented action items assigned?  (0/1pt)
|
|-
|Is this a repeat of an earlier incident (-1 per prev occurrence)
|
|-
|Is there an open task that would prevent this incident / make mitigation easier if implemented? (0/-1p per task)
|
|-
| rowspan="4" |'''Tooling'''
|Did the people responding have trouble communicating effectively during the incident due to the existing or lack of tooling? (0/5pt)
|
|-
|Did existing monitoring notify the initial responders? (1pt)
|
|-
|Were all engineering tools required available and in service? (0/5pt)
|
|-
|Was there a runbook for all known issues present? (0/5pt)
|
|-
| colspan="2" |'''Total Score'''
|
|}
==Actionables==


