

From Wikitech

2025-06-30 Haproxykafka silently stopped sending request data to Kafka

Status Closed
Severity High
Business data steward
Technical data steward Andreas Hoelzl
Incident coordinator Fabrizio Furnari
Incident response team Fabrizio Furnari, Valentin Gutierrez
Date detected 2025-07-21
Date resolved 2025-08-22
Start of issue 2025-06-30
Phabricator ticket T400039 T401246

Description

On July 21st, 2025, we learned that we had lost about 2 weeks of logs from cp5017 and about 10 weeks of logs from cp3071.

The cp5017 haproxykafka process (and its service) was reported as up by systemd, but no messages were sent to Kafka from 2025-06-30@03:23 UTC to 2025-07-07@12:40 UTC and from 2025-07-15@15:59 UTC to 2025-07-21@08:44 UTC, when the service was (inadvertently) restarted during debugging. cp3071 was impacted from 2025-05-11 to 2025-07-23.

Debugging during the issue was difficult: the process refused to serve pprof information (through the /debug/ endpoint), and even strace showed no network, file, or process activity.

While inspecting thread backtraces with gdb for possible deadlocks, the process resumed its usual behavior, processing and sending logs to the Kafka cluster again.

Impact

We estimate that other hosts could have been affected by the same problem but they recovered while cp5017 and cp3071 did not.

Using the up metric in Prometheus we can estimate the duration of the downtime for haproxykafka on the affected cache nodes:

For cp5017, the query sum_over_time((up{instance="cp5017:9341"} == bool 0)[90d:30s]) * 30 returns 1.14 million seconds, or approximately 13.1 days.

For cp3071, the query sum_over_time((up{instance="cp3071:9341"} == bool 0)[90d:30s]) * 30 returns 6.36 million seconds, or approximately 73.6 days.
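As a sanity check on the figures above, the conversion from seconds of exporter downtime to days can be sketched as follows (the input values are the rounded query results quoted above, not live Prometheus output):

```python
# Convert the Prometheus downtime totals (in seconds) into days.
# Inputs are the rounded query results quoted above; this is only a
# sanity check on the arithmetic, not part of the incident tooling.
SECONDS_PER_DAY = 86_400

downtime_seconds = {
    "cp5017": 1_140_000,  # ~1.14 million seconds
    "cp3071": 6_360_000,  # ~6.36 million seconds
}

for host, seconds in downtime_seconds.items():
    days = seconds / SECONDS_PER_DAY
    print(f"{host}: {days:.1f} days of haproxykafka downtime")
```

Note that the rounded 1.14-million-second input yields 13.2 days here; the 13.1-day figure in the text presumably comes from the exact query result.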

I attempted to use Turnilo webrequest_sampled_live ( https://w.wiki/ExKR ) to get a more granular view of traffic per CDN host, but it appears that breakdown is no longer available.

As a fallback, we used pageviews_daily, which reports 780,594,326 total pageviews across the CDN for the last day. Assuming the 56 cache_text hosts are the only ones reporting pageviews, we estimate an average of 13,939,184 pageviews per CDN server per day.

Using this as a baseline:

Estimated pageviews lost for cp5017: 13.9M × 13.1 days ≈ 182.6 million pageviews

Estimated pageviews lost for cp3071: 13.9M × 73.6 days ≈ 1.026 billion pageviews

These are upper-bound estimates and assume full loss of logging for the entire duration and even distribution of traffic.
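The estimate above can be reproduced with a short sketch; the host count, daily total, and outage durations are the figures stated in the text, and the assumptions (even traffic distribution, full logging loss) carry over:

```python
# Upper-bound pageview-loss estimate, reproducing the arithmetic above.
# Assumes even traffic distribution across the 56 cache_text hosts and
# full logging loss for the whole outage duration.
TOTAL_DAILY_PAGEVIEWS = 780_594_326  # pageviews_daily, last day
CACHE_TEXT_HOSTS = 56

per_host_daily = TOTAL_DAILY_PAGEVIEWS / CACHE_TEXT_HOSTS  # ~13.94M/day

outage_days = {"cp5017": 13.1, "cp3071": 73.6}
for host, days in outage_days.items():
    lost_millions = per_host_daily * days / 1e6
    print(f"{host}: ~{lost_millions:,.1f} million pageviews lost")
```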

In total we estimate a 2% loss of pageviews during the given period. Countries impacted are the ones served by the esams and eqsin data centers, see https://phabricator.wikimedia.org/T401246#11094837 .

Root cause

Unknown.

Recommendations

Given that cp5017 was unable to process logs for weeks without anyone noticing, we need to review how HAProxyKafka alerts are defined.

To avoid this in the future we should add a couple of alerts:

Check that the Prometheus exporter is up.

Check that we're sending a reasonable number of messages compared to the number of requests received by HAProxy (essentially replicating the current HaproxyKafka alert for DE).
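The two recommended alerts could look roughly like the following Prometheus alerting rules. This is only an illustrative sketch: the job name, metric names (haproxykafka_messages_sent_total, haproxy_frontend_http_requests_total), thresholds, and durations are all assumptions, not the production configuration.

```yaml
# Illustrative Prometheus alerting rules; names and thresholds are
# assumptions, not the production configuration.
groups:
  - name: haproxykafka
    rules:
      # Alert 1: the exporter itself is down.
      - alert: HaproxyKafkaExporterDown
        expr: up{job="haproxykafka"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "haproxykafka exporter on {{ $labels.instance }} is down"

      # Alert 2: messages sent to Kafka are far below the request
      # rate seen by HAProxy (metric names here are assumptions).
      - alert: HaproxyKafkaMessageRateLow
        expr: |
          rate(haproxykafka_messages_sent_total[10m])
            < 0.5 * rate(haproxy_frontend_http_requests_total[10m])
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "haproxykafka on {{ $labels.instance }} sends far fewer messages than HAProxy receives requests"
```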

We have assessed that the impact of this data loss is low and does not warrant a correction or estimation in our metrics. Moreover, given all the heuristic changes and the backfilling of data during this period (see T395934), there is not much we can do anyway. We will note this for year-over-year comparisons or any future needs.