Incidents/2021-11-25 eventgate-main outage
document status: in-review
Summary and Metadata
Incident ID | 2021-11-25 eventgate-main outage | UTC Start Timestamp | 2021-11-25 07:32
Incident Task | https://phabricator.wikimedia.org/T299970 | UTC End Timestamp | 2021-11-25 07:35
People Paged | 0 | Responder Count | 1
Coordinator(s) | No coordinator needed | Relevant Metrics / SLO(s) affected | No SLO defined, no error budget consumed
Summary | For about 3 minutes (from 7:32 to 7:35 UTC), eventgate-main was unavailable. This resulted in 25,000 unrecoverable MediaWiki backend errors due to inability to queue new jobs. About 1,000 user-facing web requests and API requests failed with an HTTP 500 error. Event intake processing rate measured by eventgate briefly dropped from ~3000/second to 0/second during the outage.
During the helm3 migration of the eqiad Kubernetes cluster, the eventgate-main service experienced an outage. The service was unavailable between 7:32 and 7:35 UTC.
For the helm3 migration, the service had to be removed from the cluster and re-deployed. Most Kubernetes services were explicitly pooled in codfw only during the re-deployments. eventgate-main was incorrectly assumed to also be served from codfw, but it was still pooled only in eqiad. As a result, while its pods were being removed and re-created, no traffic could be served for this service.
The commands used to migrate and re-deploy codfw (see T251305#7492328) were adapted and re-used for eqiad (see T251305#7526591). Because of a small difference between which Kubernetes services are pooled active-active and which are active-passive, eventgate-main was missing from the depooling command (as it is not currently pooled in codfw).
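The failure mode is easiest to see with a small sketch. The snippet below is illustrative only: the service names and pooling data are assumptions standing in for the real conftool/discovery state, but it shows how a depool list built from "whatever was pooled in codfw" silently skips an active-passive service that is pooled only in eqiad.

```python
# Illustrative only: toy pooling data standing in for the real
# conftool/discovery state, not actual Wikimedia configuration.
pooled_in = {
    "eventgate-analytics": {"eqiad", "codfw"},  # active-active
    "eventstreams":        {"eqiad", "codfw"},  # active-active
    "eventgate-main":      {"eqiad"},           # active-passive, eqiad only
}

# Depool list for the eqiad migration, adapted from the codfw run:
# "every service that is pooled in codfw". eventgate-main is not pooled
# in codfw, so it never makes it onto the list.
to_depool_from_eqiad = [svc for svc, dcs in pooled_in.items() if "codfw" in dcs]

print(to_depool_from_eqiad)                      # ['eventgate-analytics', 'eventstreams']
print("eventgate-main" in to_depool_from_eqiad)  # False: it stays pooled in eqiad
                                                 # while its pods are removed and re-created
```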
Impact: For about 3 minutes (from 7:32 to 7:35 UTC), eventgate-main was unavailable. This resulted in 25,000 unrecoverable MediaWiki backend errors due to inability to queue new jobs. About 1,000 user-facing web requests and API requests failed with an HTTP 500 error. Event intake processing rate measured by eventgate briefly dropped from ~3000/second to 0/second during the outage.
Documentation:
- 2021-11-25-mediawiki-exceptions.png
- 2021-11-25-eventgate-statistics.png
- 2021-11-25-varnish-http500.png
Scorecard
| | Question | Score | Notes |
|---|---|---|---|
| People | Were the people responding to this incident sufficiently different than the previous five incidents? (score 1 for yes, 0 for no) | 1 | |
| | Were the people who responded prepared enough to respond effectively? (score 1 for yes, 0 for no) | 1 | |
| | Were more than 5 people paged? (score 0 for yes, 1 for no) | 1 | |
| | Were pages routed to the correct sub-team(s)? (score 1 for yes, 0 for no) | N/A | |
| | Were pages routed to online (business hours) engineers? (score 1 for yes, 0 if people were paged after business hours) | N/A | |
| Process | Was the incident status section actively updated during the incident? (score 1 for yes, 0 for no) | N/A | |
| | Was the public status page updated? (score 1 for yes, 0 for no) | N/A | |
| | Is there a phabricator task for the incident? (score 1 for yes, 0 for no) | 0 | |
| | Are the documented action items assigned? (score 1 for yes, 0 for no) | 1 | |
| | Is this a repeat of an earlier incident? (score 0 for yes, 1 for no) | 0 | |
| Tooling | Were there, before the incident occurred, open tasks that would have prevented this incident or made mitigation easier if implemented? (score 0 for yes, 1 for no) | 0 | |
| | Were the people responding able to communicate effectively during the incident with the existing tooling? (score 1 for yes, 0 for no) | N/A | |
| | Did existing monitoring notify the initial responders? (score 1 for yes, 0 for no) | N/A | |
| | Were all engineering tools required available and in service? (score 1 for yes, 0 for no) | 1 | |
| | Was there a runbook for all known issues present? (score 1 for yes, 0 for no) | 0 | |
| | **Total score** | 5 | |
Actionables
- automate maintenance and proper depooling of Kubernetes services using a cookbook (T277677 and T260663); a rough sketch of the idea follows after this list
- reduce the number of snowflake services that need special treatment and make most/all of them active-active (for example T288685)
- optional: create an lvs/pybal/k8s service dashboard showing which service is pooled in which DC (will create a task)
- T296699: Pool eventgate-main in both datacenters (active/active)
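For the cookbook actionable above, a rough, self-contained sketch of the intended behaviour is shown below. The pooling data and helper functions are a toy in-memory model, not the real conftool/spicerack API: the point is that the cookbook would enumerate every discovery service and depool it from the datacenter under maintenance, pooling active-passive services into the other datacenter first, instead of reusing a list copied from a previous run.

```python
# Toy in-memory model of discovery pooling state; the real cookbook would
# talk to conftool/spicerack instead. Service names are assumptions.
pooled_in = {
    "eventgate-analytics": {"eqiad", "codfw"},
    "eventstreams":        {"eqiad", "codfw"},
    "eventgate-main":      {"eqiad"},  # active-passive, pooled only in eqiad
}

def is_pooled(svc: str, dc: str) -> bool:
    return dc in pooled_in[svc]

def set_pooled(svc: str, dc: str, pooled: bool) -> None:
    (pooled_in[svc].add if pooled else pooled_in[svc].discard)(dc)

def depool_dc_for_maintenance(dc: str, other_dc: str) -> None:
    """Depool every discovery service from `dc`, moving active-passive
    services to `other_dc` first so none of them loses all traffic."""
    for svc in pooled_in:
        if not is_pooled(svc, other_dc):
            # The eventgate-main case: pooled only in the DC under
            # maintenance, so pool it in the other DC before depooling.
            set_pooled(svc, other_dc, True)
        set_pooled(svc, dc, False)

depool_dc_for_maintenance("eqiad", other_dc="codfw")
print(pooled_in)  # every service is now served from codfw only
```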