Incident documentation/2021-07-14 eventgate-analytics latency spike caused MW app server overload

Revision as of 21:54, 14 July 2021 by imported>Cwhite (→‎Actionables)

document status: draft


While working on the task "EventGate should use recent service-runner (^2.8.1) with Prometheus support", Andrew Otto deployed the change to eventgate-analytics in codfw (the active DC). This change removes the prometheus-statsd-exporter container in favor of the direct Prometheus support added in recent versions of service-runner and service-template-node.

The deploy went fine in the unloaded staging and eqiad clusters, but when it reached codfw, request latency from MediaWiki to eventgate-analytics spiked. This filled up PHP worker slots, which in turn caused some MediaWiki API requests to fail.

Helm detected that the eventgate-analytics deploy to codfw itself was not healthy, and automatically rolled back the deployment:

$ kube_env eventgate-analytics codfw; helm history production
4       	Wed Jul 14 16:07:12 2021	SUPERSEDED	eventgate-0.3.1 	           	Upgrade "production" failed: timed out waiting for the co...
5       	Wed Jul 14 16:17:18 2021	DEPLOYED  	eventgate-0.2.14	           	Rollback to 3
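For context, the rollback shown in the history above is standard Helm behavior when an upgrade is run with a timeout and rollback-on-failure semantics. The actual deploy goes through WMF's helmfile tooling, but a conceptually equivalent plain-Helm sequence (release name and revision numbers taken from the output above; chart path illustrative) would be:

```shell
# Sketch only: upgrade with --atomic so a failed rollout is undone automatically.
# If pods never become Ready before the timeout, the upgrade is marked failed
# and Helm restores the previous working revision on its own.
helm upgrade production ./eventgate --atomic --timeout 600s

# The equivalent manual recovery, matching "Rollback to 3" in the history:
helm rollback production 3
```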

Impact: MediaWiki API servers experienced a ~10 minute period of request failures.


Actionables

  • Figure out why this happened and fix it. Based on this log message, it seems likely that a bug in the service-runner Prometheus integration caused the nodejs worker process to die.
    • Further investigation uncovered that calling require('prom-client') within a worker causes the observed issue. Both service-runner and node-rdkafka-prometheus require prom-client. It was proposed to patch node-rdkafka-prometheus to accept a prom-client instance passed in by the caller, rather than requiring its own copy.
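The proposed fix is a dependency-injection pattern: instead of each module loading its own copy of prom-client, the caller hands one shared instance to the Kafka metrics wrapper. A minimal sketch of the shape of that change, using a hypothetical stand-in registry rather than the real prom-client API:

```javascript
// Stand-in for a prom-client-style metrics registry (illustrative only;
// the real prom-client API differs).
function createRegistry() {
  const metrics = new Map();
  return {
    counter(name) {
      if (!metrics.has(name)) metrics.set(name, 0);
      return { inc: (n = 1) => metrics.set(name, metrics.get(name) + n) };
    },
    dump() {
      return Object.fromEntries(metrics);
    },
  };
}

// Before the proposed patch, the library would require('prom-client') itself,
// creating a second instance inside the worker. After the patch, the instance
// is injected, so service-runner and the Kafka wrapper share one registry.
function makeKafkaMetrics(promClient) {
  const delivered = promClient.counter('rdkafka_msgs_delivered_total');
  return { onDelivery: () => delivered.inc() };
}

const registry = createRegistry();                 // owned by the service
const kafkaMetrics = makeKafkaMetrics(registry);   // injected, not re-required
kafkaMetrics.onDelivery();
kafkaMetrics.onDelivery();
console.log(registry.dump());
```

The design point is that only one module ever constructs the metrics state; everything else receives it, which avoids the duplicate-require conflict observed in the worker.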