Incident documentation/2021-11-25 eventgate-main outage
{{irdoc|status=review}} <!--
The status field should be one of:
* {{irdoc|status=draft}} - Initial status. When you're happy with the state of your draft, change it to status=review.
* {{irdoc|status=review}} - The incident review working group will contact you then to finalise the report. See also the steps on [[Incident documentation]].
* {{irdoc|status=final}}
-->
==Summary==
During the [[Helm|helm3]] migration of the eqiad Kubernetes cluster, the service eventgate-main experienced an outage. The service was not available between 7:32 and 7:35 UTC.

For the helm3 migration the service had to be removed and re-deployed to the cluster. Most Kubernetes services were explicitly pooled in codfw only during the re-deployments. eventgate-main was also falsely assumed to be served by [[Codfw cluster|Codfw]], but was still pooled in [[Eqiad cluster|Eqiad]]. So while its pods were being removed and re-created, no traffic could be served for this service.

The commands used to migrate and re-deploy codfw (see [[phab:T251305#7492328|T251305#7492328]]) were adapted and re-used for eqiad (see [[phab:T251305#7526591|T251305#7526591]]). Due to a small difference in which Kubernetes services are pooled as active-active and which as active-passive, eventgate-main was missing from the depooling command (as it is not pooled in codfw currently).
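To make the root cause concrete, here is a minimal, self-contained Python sketch (illustrative service data only, not the real service catalogue or the actual migration commands): building the eqiad depool list from "what is currently serving from codfw" silently skips a service that is pooled in eqiad only, such as eventgate-main.

<syntaxhighlight lang="python">
# Illustrative pooling state only; the real state lives in conftool/etcd.
services = {
    "eventgate-analytics": {"eqiad": True, "codfw": True},        # active-active
    "eventgate-logging-external": {"eqiad": True, "codfw": True},  # active-active
    "eventgate-main": {"eqiad": True, "codfw": False},             # active-passive, eqiad only
}

# Depool list derived from the codfw-based template used for the codfw migration:
depool_before_eqiad_redeploy = [
    name for name, pooled in services.items() if pooled["codfw"]
]

print(depool_before_eqiad_redeploy)
# ['eventgate-analytics', 'eventgate-logging-external']
# eventgate-main is missing, so it kept receiving eqiad traffic
# while its pods were deleted and re-created.
</syntaxhighlight>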
'''Impact''': For about 3 minutes (from 7:32 to 7:35 UTC), eventgate-main was unavailable. This resulted in 25,000 unrecoverable MediaWiki backend errors due to inability to queue new jobs. About 1,000 user-facing web requests and API requests failed with an HTTP 500 error. Event intake processing rate measured by eventgate briefly dropped from ~3000/second to 0/second during the outage.
'''Documentation''':
*[https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?orgId=1&var-datasource=eqiad%20prometheus%2Fops&viewPanel=18&from=1637825199597&to=1637826150301 Grafana Dashboard MediaWiki Exceptions]
*[https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&from=1637824200000&to=1637827200000 Grafana eventgate statistics]
*[https://grafana.wikimedia.org/d/000000503/varnish-http-errors?viewPanel=7&orgId=1&from=1637825257550&to=1637826384984 Grafana Varnish http errors]
<gallery mode="packed">
File:2021-11-25-mediawiki-exceptions.png
File:2021-11-25-eventgate-statistics.png
File:2021-11-25-varnish-http500.png
</gallery>
==Actionables==
*automate maintenance and proper depooling of Kubernetes services using a cookbook, [[phab:T277677|T277677]] and [[phab:T260663|T260663]] (a minimal sketch of the idea follows below)
*reduce snowflake services which need special treatment and make most/all of them active-active (for example [[phab:T288685|T288685]])
*optional: create an lvs/pybal/k8s service dashboard to see which service is pooled in which DC (will create a task)
*[[phab:T296699|T296699: Pool eventgate-main in both datacenters (active/active)]]
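As a very rough, hypothetical Python sketch of what the cookbook in the first actionable could enforce (illustrative helpers only, not the real Spicerack/conftool or helmfile API): check where a service is actually pooled and depool it from a datacenter before its pods are removed there, rather than relying on assumptions about which DC serves it.

<syntaxhighlight lang="python">
def pooled_in(service: str, dc: str) -> bool:
    """Hypothetical lookup of the current pooling state (really conftool/etcd)."""
    # Illustrative state mirroring this incident: eventgate-main was pooled in
    # eqiad only, not in codfw.
    state = {("eventgate-main", "eqiad"): True, ("eventgate-main", "codfw"): False}
    return state.get((service, dc), False)

def redeploy_with_helm3(service: str, dc: str) -> None:
    """Sketch of a safe re-deploy: depool first, redeploy, then repool."""
    was_pooled = pooled_in(service, dc)
    if was_pooled:
        # The step that was missed for eventgate-main in eqiad.
        print(f"would depool {service} from {dc} and wait for traffic to drain")
    print(f"would remove and re-deploy {service} in {dc} with helm3")
    if was_pooled:
        print(f"would repool {service} in {dc} once healthy")

redeploy_with_helm3("eventgate-main", "eqiad")
</syntaxhighlight>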