Incident documentation/2021-07-14 eventgate-analytics latency spike caused MW app server overload

document status: in-review

Summary

While working on the task "EventGate should use recent service-runner (^2.8.1) with Prometheus support", Andrew Otto deployed the changes to eventgate-analytics in codfw (the active DC). This change removes the prometheus-statsd-exporter container in favor of the direct Prometheus support added in recent versions of service-runner and service-template-node.
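
For context, "direct Prometheus support" means the service exposes an HTTP /metrics endpoint itself instead of emitting StatsD metrics for a prometheus-statsd-exporter sidecar to translate. The following is a minimal sketch of that pattern using prom-client; it is illustrative only (the metric name and port are made up), not the actual eventgate or service-runner code.

<syntaxhighlight lang="javascript">
// Minimal sketch: expose Prometheus metrics directly from a Node.js service
// using prom-client, instead of sending StatsD to a prometheus-statsd-exporter
// sidecar. Illustrative only; not the actual eventgate/service-runner code.
const http = require('http');
const client = require('prom-client');

// Collect default Node.js process metrics (heap usage, event loop lag, etc.).
client.collectDefaultMetrics();

const requestCounter = new client.Counter({
  name: 'eventgate_requests_total',   // hypothetical metric name
  help: 'Total number of requests handled',
});

http.createServer(async (req, res) => {
  if (req.url === '/metrics') {
    // Serve the Prometheus text exposition format; in prom-client >= 13
    // register.metrics() returns a Promise, so await works for both cases.
    res.setHeader('Content-Type', client.register.contentType);
    res.end(await client.register.metrics());
    return;
  }
  requestCounter.inc();
  res.end('ok');
}).listen(9102); // hypothetical port for the metrics/service endpoint
</syntaxhighlight>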

The deploy went fine in unloaded staging and eqiad clusters, but when deploying to codfw, request latency from MediaWiki to eventgate-analytics spiked, which caused PHP worker slots to fill up, which in turn caused some MediaWiki API requests to fail.

Helm noticed that the eventgate-analytics deploy to codfw was not going well, and automatically rolled back the deployment:

<pre>
$ kube_env eventgate-analytics codfw; helm history production
REVISION	UPDATED                 	STATUS    	CHART           	APP VERSION	DESCRIPTION
[...]
4       	Wed Jul 14 16:07:12 2021	SUPERSEDED	eventgate-0.3.1 	           	Upgrade "production" failed: timed out waiting for the co...
5       	Wed Jul 14 16:17:18 2021	DEPLOYED  	eventgate-0.2.14	           	Rollback to 3
</pre>


Impact: MediaWiki API servers experienced a ~10 minute period of request failures.

Documentation:

  • Grafana: Envoy telemetry (https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&from=1626278199112&to=1626279999112&var-datasource=codfw%20prometheus%2Fops&var-origin=appserver&var-origin_instance=All&var-destination=eventgate-analytics)
  • Grafana: Application Servers dashboard (https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=9&from=1626276391814&orgId=1&to=1626279991814&var-datasource=codfw%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200)
  • Grafana: Envoy telemetry / Upstream latency (https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?viewPanel=14&orgId=1&from=1626278171118&to=1626279427383)

Actionables

  • Figure out why this happened and fix it. Based on this log message, it seems likely that a bug in the service-runner prometheus integration caused the nodejs worker process to die. [DONE]
    • Further investigation uncovered that require('prom-client') within a worker causes the observed issue. Both service-runner and node-rdkafka-prometheus require prom-client. It was proposed to patch node-rdkafka-prometheus so that a prom-client instance can be passed in instead (see the sketch after this list).
    • node-rdkafka-prometheus is an unmaintained project, so we have forked it to @wikimedia/node-rdkafka-prometheus and fixed the issue there. Additionally, if this issue in prom-client is fixed, we probably won't need the patch we made to node-rdkafka-prometheus.
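
The patch described above is essentially dependency injection of prom-client. The sketch below shows the general idea; the class name, method names, and metric are hypothetical and this is not the actual @wikimedia/node-rdkafka-prometheus API.

<syntaxhighlight lang="javascript">
// Illustrative sketch of the "pass in prom-client" approach described above.
// Hypothetical names; not the actual @wikimedia/node-rdkafka-prometheus code.

// Before: the collector require()s its own copy of prom-client, so a worker
// that also uses service-runner ends up loading prom-client a second time:
//   const promClient = require('prom-client');

// After: the caller hands over the prom-client instance (and optionally a
// registry) it already uses, so only one prom-client is loaded in the worker.
class RdkafkaMetricsCollector {
  constructor(promClient, registry = promClient.register) {
    this.queueGauge = new promClient.Gauge({
      name: 'rdkafka_producer_msg_cnt',   // hypothetical metric name
      help: 'Messages currently in the librdkafka producer queue',
      registers: [registry],
    });
  }

  // Update the gauge from a parsed librdkafka statistics object.
  observe(stats) {
    this.queueGauge.set(stats.msg_cnt || 0);
  }
}

module.exports = RdkafkaMetricsCollector;

// Usage: pass in the same prom-client that service-runner already loaded.
//   const promClient = require('prom-client');
//   const collector = new RdkafkaMetricsCollector(promClient);
</syntaxhighlight>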