Incident documentation/2021-11-25 eventgate-main outage

Revision as of 15:36, 25 November 2021 by imported>Jelto (add eventgate-main outage documentation)

document status: draft


During the helm3 migration of the eqiad Kubernetes environment, the service eventgate-main had reduced availability. The service was not available between 7:32 and 7:35 UTC.

For the helm3 migration, the service had to be removed and re-deployed to the cluster. During the re-deployments, most Kubernetes services were explicitly pooled in codfw only. eventgate-main was wrongly assumed to be served by codfw as well, but it was still pooled in eqiad. As a result, no traffic could be served for this service while its pods were being removed and re-created.

The commands used to migrate and re-deploy codfw (see T251305#7492328) were adapted and re-used for eqiad (see T251305#7526591). Because of a small difference between which Kubernetes services are pooled active-active and which are active-passive, eventgate-main was missing from the depooling command (as it is currently not pooled in codfw).
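The gap described above could in principle be caught by a pre-flight check before any pods are removed. The following is only a minimal sketch of that idea, not the tooling actually used: the `POOLED_IN` snapshot, the function `unsafe_to_redeploy`, and the services `service-a` and `service-b` are all hypothetical, and a real check would have to read the pooled state from the service discovery tooling instead of a hard-coded dictionary.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical snapshot: service name -> datacenters it is currently pooled in.
# Only eventgate-main's state (pooled in eqiad, not in codfw) comes from the
# incident report; the other entries are made up for illustration.
POOLED_IN: Dict[str, Set[str]] = {
    "eventgate-main": {"eqiad"},        # active-passive
    "service-a": {"eqiad", "codfw"},    # active-active (hypothetical)
    "service-b": {"eqiad", "codfw"},    # active-active (hypothetical)
}


def unsafe_to_redeploy(
    pooled_in: Dict[str, Set[str]],
    redeploy: Iterable[str],
    depool: Set[str],
    dc: str,
) -> List[str]:
    """Return the services that would lose traffic if re-deployed in `dc`."""
    at_risk = []
    for svc in redeploy:
        # Copy the pooled set so the snapshot itself is never modified.
        remaining = set(pooled_in.get(svc, set()))
        if svc in depool:
            remaining.discard(dc)
        # At risk if the service still receives traffic in `dc`, or if the
        # planned depool would leave no datacenter serving it at all.
        if dc in remaining or not remaining:
            at_risk.append(svc)
    return sorted(at_risk)


# The depool list was derived from the active-active services, so the
# active-passive eventgate-main is missing from it.
services = ["eventgate-main", "service-a", "service-b"]
depool = {"service-a", "service-b"}
at_risk = unsafe_to_redeploy(POOLED_IN, services, depool, "eqiad")
print(at_risk)  # ['eventgate-main']
```

In this sketch a service is also flagged when depooling it would leave no datacenter serving it, since simply adding an active-passive service to the depool list would cause the same outage unless it is pooled elsewhere first.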

Impact: Between 7:32 and 7:35 UTC approximately 25,000 MediaWiki exceptions were generated (see MediaWiki dashboard below). For around 5 minutes, increased user-facing 5XX rates could be seen (see Varnish dashboard below), peaking at around 230 5XX errors per minute. The number of events produced in eventgate dropped drastically during the outage (see eventgate dashboard below).



  • automate maintenance and proper depooling of Kubernetes services using a cookbook T277677
  • reduce snowflake services which need special treatment and make most/all of them active-active (for example T288685)
  • optional: create a lvs/pybal/k8s service dashboard to see which service is pooled in which DC (will create a task)

TODO: Add the #Sustainability (Incident Followup) Phabricator tag to these tasks.