
Incident documentation/2021-11-25 eventgate-main outage

imported>Legoktm
imported>Krinkle
 
{{irdoc|status=draft}}
#REDIRECT [[Incidents/2021-11-25 eventgate-main outage]]
==Summary==
During the helm3 migration of the eqiad Kubernetes environment, the service eventgate-main had reduced availability: it was unavailable between 7:32 and 7:35 UTC.
 
For the helm3 migration, each service had to be removed and re-deployed to the cluster. During the re-deployments, most Kubernetes services were explicitly pooled in codfw only. eventgate-main was wrongly assumed to be served by codfw as well, but it was still pooled in eqiad. As a result, no traffic could be served for this service while its pods were being removed and re-created.
 
The commands used to migrate and re-deploy codfw (see [[phab:T251305#7492328|T251305#7492328]]) were adapted and re-used for eqiad (see [[phab:T251305#7526591|T251305#7526591]]). Because the set of Kubernetes services pooled as active-active differs slightly from the set pooled as active-passive, eventgate-main was missing from the depooling command (it is currently not pooled in codfw).
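
The failure mode described above can be illustrated with a small pre-flight check. This is only a sketch under stated assumptions: the pooled-state mapping is hard-coded and hypothetical (apart from eventgate-main, the service names are invented), and the function <code>still_pooled_in</code> is not part of any existing tooling. It shows how a depool list copied from another datacenter can be compared against the services that actually receive traffic in the datacenter about to be re-deployed.

```python
from typing import Dict, List, Set


def still_pooled_in(
    target_dc: str,
    pooled: Dict[str, Set[str]],
    depool_list: List[str],
) -> List[str]:
    """Return services that would still receive traffic in target_dc
    after every service in depool_list has been depooled there."""
    to_depool = set(depool_list)
    return sorted(
        service
        for service, datacenters in pooled.items()
        if target_dc in datacenters and service not in to_depool
    )


# Hypothetical pooled state: service name -> datacenters serving traffic.
# Only eventgate-main is taken from the report; the rest are made up.
pooled_state = {
    "eventgate-main": {"eqiad"},              # active-passive, eqiad only
    "example-service-a": {"eqiad", "codfw"},  # active-active
    "example-service-b": {"eqiad", "codfw"},  # active-active
}

# A depool list adapted from the codfw run, which never needed to
# mention eventgate-main because it was not pooled in codfw.
depool_list = ["example-service-a", "example-service-b"]

leftover = still_pooled_in("eqiad", pooled_state, depool_list)
if leftover:
    print("Do not re-deploy yet, still pooled in eqiad:", ", ".join(leftover))
```

With this example data the check reports eventgate-main, which is the service that was overlooked in the incident. A real check would have to read the pooled state from the live configuration rather than from a literal.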
 
 
'''Impact''': Between 7:32 and 7:35 UTC approximately 25,000 MediaWiki exceptions were logged (see MediaWiki dashboard below). For around 5 minutes, increased user-facing 5XX rates could be seen (see Varnish dashboard below), peaking at around 230 5XX errors per minute. The number of events produced in eventgate dropped drastically during the outage (see eventgate dashboard below).
 
'''Documentation''':
*[https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?orgId=1&var-datasource=eqiad%20prometheus%2Fops&viewPanel=18&from=1637825199597&to=1637826150301 Grafana Dashboard MediaWiki Exceptions]
*[https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&from=1637821829000&to=1637829749000 Grafana eventgate statistics]
*[https://grafana.wikimedia.org/d/000000503/varnish-http-errors?viewPanel=7&orgId=1&from=1637825257550&to=1637826384984 Grafana Varnish http errors]
 
==Actionables==
 
* automate maintenance and proper depooling of Kubernetes services using a cookbook ([[phab:T277677|T277677]] and [[phab:T260663|T260663]])
* reduce snowflake services which need special treatment and make most/all of them active-active (for example [[phab:T288685|T288685]])
* optional: create an LVS/PyBal/k8s service dashboard showing which service is pooled in which DC (will create a task)
*[[phab:T296699|T296699: Pool eventgate-main in both datacenters (active/active)]]
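
As a rough illustration of the dashboard and snowflake-service actionables above, the following sketch renders a plain-text overview of which service is pooled in which datacenter and lists the services pooled in only one of them. The input data, the service name <code>example-service</code> and both function names are assumptions made for this example; they do not describe an existing tool or data source.

```python
from typing import Dict, List, Set


def pooling_matrix(pooled: Dict[str, Set[str]], datacenters: List[str]) -> List[str]:
    """Build one text row per service with a yes/no cell per datacenter."""
    width = max([len("service")] + [len(name) for name in pooled])
    rows = ["service".ljust(width) + "  " + "  ".join(datacenters)]
    for service in sorted(pooled):
        cells = [
            ("yes" if dc in pooled[service] else "no").ljust(len(dc))
            for dc in datacenters
        ]
        rows.append(service.ljust(width) + "  " + "  ".join(cells))
    return rows


def single_dc_services(pooled: Dict[str, Set[str]]) -> List[str]:
    """Return services pooled in exactly one datacenter (active-passive)."""
    return sorted(name for name, dcs in pooled.items() if len(dcs) == 1)


# Hypothetical state; only eventgate-main comes from the report.
state = {
    "eventgate-main": {"eqiad"},
    "example-service": {"eqiad", "codfw"},
}

matrix = pooling_matrix(state, ["eqiad", "codfw"])
print("\n".join(matrix))
print("Pooled in one DC only:", ", ".join(single_dc_services(state)))
```

Services listed by <code>single_dc_services</code> are the ones that need special treatment during a per-datacenter re-deployment, so they are natural candidates for becoming active-active.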
 
<mark>TODO: Add the [[phab:project/view/4758/|#Sustainability (Incident Followup)]] Phabricator tag to these tasks.</mark>

Latest revision as of 17:49, 8 April 2022