
Incident documentation/2021-11-25 eventgate-main outage

{{irdoc|status=review}}
==Summary==
During the helm3 migration of the eqiad Kubernetes cluster, the service eventgate-main had reduced availability. The service was unavailable between 7:32 and 7:35 UTC.


For the helm3 migration, each service had to be removed and re-deployed to the cluster. Most Kubernetes services were explicitly pooled only in codfw during the re-deployments. eventgate-main was also assumed to be served by codfw, but it was in fact still pooled in eqiad. So while its pods were being removed and re-created, no traffic could be served for this service.


The commands used to migrate and re-deploy codfw (see [[phab:T251305#7492328|T251305#7492328]]) were adapted and re-used for eqiad (see [[phab:T251305#7526591|T251305#7526591]]). Due to a small difference in which Kubernetes services are pooled as active-active and which as active-passive, eventgate-main was missing from the depooling command (as it is not currently pooled in codfw).
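The failure mode described above can be illustrated with a small sketch. The service lists below are hypothetical placeholders (only eventgate-main's eqiad-only pooling reflects the incident); the point is that a depool list derived from codfw's pooled services cannot cover a service pooled only in eqiad:

```python
# Hypothetical pooled-service lists per datacenter; eventgate-main was
# pooled only in eqiad at the time of the migration.
pooled = {
    "eqiad": {"eventgate-main", "service-a", "service-b"},
    "codfw": {"service-a", "service-b"},
}

# Re-using the codfw depool list for the eqiad migration misses any
# service that is pooled only in eqiad.
depool_for_migration = pooled["codfw"]

# These services keep receiving eqiad traffic while their pods are
# removed and re-created -- the gap that caused the outage.
still_serving_during_redeploy = pooled["eqiad"] - depool_for_migration
print(still_serving_during_redeploy)
```

Building the depool list from the cluster actually being migrated, rather than copying it from the other datacenter, closes this gap.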
==Actionables==
* optional: create an lvs/pybal/k8s service dashboard showing which service is pooled in which DC (will create a task)
*[[phab:T296699|T296699: Pool eventgate-main in both datacenters (active/active)]]
<mark>TODO: Add the [[phab:project/view/4758/|#Sustainability (Incident Followup)]] Phabricator tag to these tasks.</mark>
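The proposed dashboard essentially needs to group pooled state by service and datacenter. A minimal sketch of that grouping over hardcoded sample records (real data would come from conftool/etcd, not this list, and the record shape here is an assumption):

```python
from collections import defaultdict

# Hypothetical (service, datacenter, pooled) records standing in for
# real conftool discovery state.
records = [
    ("eventgate-main", "eqiad", True),
    ("eventgate-main", "codfw", False),
    ("sessionstore", "eqiad", True),
    ("sessionstore", "codfw", True),
]

# Group pooled state per service, per DC.
pooled_by_service = defaultdict(dict)
for service, dc, pooled in records:
    pooled_by_service[service][dc] = pooled

# A service pooled in every DC is active/active; otherwise it needs the
# special handling that eventgate-main was missing during the migration.
for service, dcs in sorted(pooled_by_service.items()):
    mode = "active/active" if all(dcs.values()) else "active/passive"
    print(f"{service}: {dcs} -> {mode}")
```

Such an overview would have made it visible before the migration that eventgate-main was pooled only in eqiad.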

Revision as of 09:04, 30 November 2021

Impact: Between 7:32 and 7:35 UTC, approximately 25,000 MediaWiki exceptions were logged (see MediaWiki dashboard below). For around 5 minutes, elevated user-facing 5XX rates were visible (see Varnish dashboard below), peaking at around 230 5XX errors per minute. Events produced through eventgate dropped sharply during the outage (see eventgate dashboard below).

Documentation:
