2025-06-30 Haproxykafka silently stopped sending request data to Kafka
| Status | Closed |
| Severity | High |
| Business data steward | |
| Technical data steward | Andreas Hoelzl |
| Incident coordinator | Fabrizio Furnari |
| Incident response team | Fabrizio Furnari, Valentin Gutierrez |
| Date detected | 2025-07-21 |
| Date resolved | 2025-08-22 |
| Start of issue | 2025-06-30 |
| Phabricator ticket | T400039, T401246 |
Description
On July 21st 2025, we learnt that we had lost about two weeks of logs from cp5017 and about ten weeks of logs from cp3071.
The haproxykafka process (and service) on cp5017 was reported as up by systemd, but no messages were sent to Kafka from 2025-06-30 03:23 UTC to 2025-07-07 12:40 UTC and from 2025-07-15 15:59 UTC to 2025-07-21 08:44 UTC, when the functionality was (inadvertently) restored while debugging. cp3071 was impacted from 2025-05-11 to 2025-07-23.
Debugging during the incident was difficult: the process completely refused to serve pprof information (through the /debug/ endpoint), and even strace showed no network, file, or process activity.
While inspecting thread backtraces with gdb for a possible deadlock, the process resumed its usual behavior and started processing and sending logs to the Kafka cluster again.
Impact
We estimate that other hosts could have been affected by the same problem but recovered on their own, while cp5017 and cp3071 did not.
Using the `up` metric in Prometheus we can estimate the duration of the haproxykafka downtime on the affected cache nodes:
For cp5017, the query `sum_over_time((up{instance="cp5017:9341"} == bool 0)[90d:30s]) * 30` returns 1.14 million seconds, or approximately 13.1 days.
For cp3071, the query `sum_over_time((up{instance="cp3071:9341"} == bool 0)[90d:30s]) * 30` returns 6.36 million seconds, or approximately 73.6 days.
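The conversion from the query results (seconds of `up == 0` over the 90-day window) to days is plain arithmetic; a minimal Python sketch, using the rounded second counts quoted above:

```python
# Convert downtime in seconds (from the sum_over_time queries above)
# into days. Inputs are the rounded values quoted in the text, so the
# last decimal can differ slightly from the raw query results.
SECONDS_PER_DAY = 86_400

downtime_seconds = {
    "cp5017": 1_140_000,  # ~1.14 million seconds
    "cp3071": 6_360_000,  # ~6.36 million seconds
}

downtime_days = {
    host: seconds / SECONDS_PER_DAY for host, seconds in downtime_seconds.items()
}
for host, days in downtime_days.items():
    print(f"{host}: ~{days:.1f} days of haproxykafka downtime")
```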
I attempted to use Turnilo webrequest_sampled_live ( https://w.wiki/ExKR ) to get a more granular view of traffic per CDN host, but it appears that breakdown is no longer available.
As a fallback, we used pageviews_daily, which reports 780,594,326 total pageviews across the CDN for the last day. Assuming the 56 cache_text hosts are the only ones reporting pageviews, we estimate an average of 13,939,184 pageviews per CDN server per day.
Using this as a baseline:
Estimated pageviews lost for cp5017: 13.9M * 13.1 days ≈ 182.6 million pageviews
Estimated pageviews lost for cp3071: 13.9M * 73.6 days ≈ 1.026 billion pageviews
These are upper-bound estimates: they assume a complete loss of logging for the entire duration and an even distribution of traffic across hosts.
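The estimate above can be reproduced in a few lines; a minimal sketch using the figures and assumptions stated in the text:

```python
# Rough reproduction of the pageview-loss estimate above.
# Assumptions (from the text): the pageviews_daily total, 56 cache_text
# hosts reporting pageviews, even traffic distribution, and full logging
# loss for the whole outage.
TOTAL_PAGEVIEWS_PER_DAY = 780_594_326
CACHE_TEXT_HOSTS = 56

per_host_per_day = TOTAL_PAGEVIEWS_PER_DAY / CACHE_TEXT_HOSTS  # ~13.9M/day

outage_days = {"cp5017": 13.1, "cp3071": 73.6}
for host, days in outage_days.items():
    lost = per_host_per_day * days
    print(f"{host}: ~{lost / 1e6:.0f} million pageviews lost")
```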
In total we estimate a 2% loss of pageviews during the given period. Countries impacted are the ones served by the esams and eqsin data centers, see https://phabricator.wikimedia.org/T401246#11094837 .
Root cause
Unknown.
Recommendations
Given that cp5017 was unable to process logs for days without anyone noticing, we need to review how HAProxyKafka alerts are defined.
To avoid this in the future we should add a couple of alerts:
- Check that the Prometheus exporter is up.
- Check that we are sending a reasonable number of messages compared to the number of requests received by HAProxy (essentially replicating the current HaproxyKafka alert for DE).
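The two proposed alerts could be expressed as Prometheus alerting rules along the following lines. This is only a sketch: the rule names, thresholds, durations, and the haproxykafka/HAProxy metric names are assumptions, not the metrics actually exposed by our exporters.

```yaml
# Sketch of the two proposed alerts as Prometheus alerting rules.
# All names and thresholds below are placeholders to be adapted.
groups:
  - name: haproxykafka
    rules:
      - alert: HaproxyKafkaExporterDown
        expr: up{job="haproxykafka"} == 0
        for: 15m
        annotations:
          summary: "haproxykafka exporter on {{ $labels.instance }} is down"
      - alert: HaproxyKafkaLowMessageRate
        # Ratio of messages sent to Kafka vs. requests seen by HAProxy;
        # both metric names are hypothetical placeholders.
        expr: |
          rate(haproxykafka_messages_sent_total[10m])
            / rate(haproxy_frontend_http_requests_total[10m]) < 0.5
        for: 30m
        annotations:
          summary: "haproxykafka on {{ $labels.instance }} sends far fewer messages than HAProxy receives requests"
```

The second rule would have caught this incident: the exporter stayed up, so only a comparison against HAProxy's own request rate reveals that messages silently stopped flowing.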
We have assessed the impact of this data loss as low, so no correction or estimation of our metrics is needed. Moreover, with all the heuristic changes and backfilling of data during this period (see T395934), there is little we could do anyway. We will note this for YoY comparisons or any future needs.