You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org
Incidents/2021-09-04 appserver latency
document status: final
Summary
An increase in load on a database server resulted in many queries being much slower to respond. This in turn meant backend traffic occupies appserver php-fpm workers for much longer, and a proportion of those requests will fail entirely due to unavailable workers. The failed requests got an error page with the message "upstream connect error or disconnect/reset before headers. reset reason: overflow".
Impact: For 37 minutes, backends were slow (taking several seconds to respond) and 2% of requests failed entirely. This affected logged-in users, most bots/API queries, and some page views from unregistered users for pages that were recently edited or otherwise expired from the CDN cache.
Documentation:
- Public task about incident. T290373
- Grafana: Application RED dashboard
- 4-Sep-2021-http-status.png
HTTP error rates.
- 4-Sep-2021-latency.png
Latency buckets.
- 4-Sep-2021-latency-quantile.png
Latency quantile estimates.
Actionables
- T277416 (restricted)