May 2025 spike in bot traffic
| Status | Resolved |
| Severity | High |
| Business data steward | Omari Sefu |
| Technical data steward | Andreas Hoelzl |
| Incident coordinator | Omari Sefu, Andreas Hoelzl |
| Incident response team | Marcel Ruiz Forns, Hamid Ghani (lead analyst), Maya Kampurath, Joseph Allemandou |
| Date detected | June 3rd, 2025 |
| Date resolved | August 28th, 2025 (deployment of new bot detection heuristics); October 8th, 2025 (backfill complete) |
| Start of issue | March 20, 2025 (effective date for the correction to pageview and daily unique device data using the new bot detection heuristic); May 1, 2025 (effective date for the correction to monthly unique device data); May and June 2025 (period where we observed the most notable increase in user pageviews and unique devices before the new bot detection heuristic was implemented) |
| Phabricator ticket | T395934, T405667 (backfill) |
Summary
As Wikipedia traffic comes in, we classify it as coming from either humans (“users”) or bots (“spider” or “automated”). Some bots attempt to bypass rate limits or other restrictions with increasing sophistication.
In June 2025, we observed an unusually large increase in our measurement of unique devices for May 2025 (T395934). The spike was predominantly geolocated to Brazil, which accounted for approximately 300 million additional unique devices compared to the previous year. We also observed a substantial rise in “user” pageviews, with Brazil again showing the largest increase (roughly 1.2 billion additional pageviews compared to May 2024). This was a new peak in a period of traffic growth from Brazil that began in January 2025 with modest increases and spiked at the end of April and into May 2025.
After investigation, it became clear that non-human (bot) traffic was significantly inflating our user pageview and unique device counts. These bots bypassed our existing bot detection heuristics.
Impact on metrics
| Metric | Month | Pre-correction (data reported before October 2025) | Post-correction (data reported after October 2025) | Difference between pre- and post-correction data |
| User pageviews | March 2025 (partial month, March 20-31 corrected) | 16.6B (6.4B for March 20-31) | 16.2B (6.0B for March 20-31) | -2.4% overall month (-6% within corrected period) |
| User pageviews | April 2025 | 15.3B | 14.3B | -6.6% |
| User pageviews | May 2025 | 17.2B | 14.5B | -15.5% |
| User pageviews | June 2025 | 15.5B | 13.3B | -14.3% |
| User pageviews | July 2025 | 15.4B | 14.2B | -8.0% |
| User pageviews | August 2025 (partial month) | 15.4B | 14.4B | -6.3% |
| Unique devices | Note: we capture unique devices on a per-project and per-project-family basis, but they cannot be aggregated across project families. |
| Metric | Month | Pre-correction (data reported before October 2025) | Post-correction (data reported after October 2025) | Difference between pre- and post-correction data |
| User pageviews | March 2025 (only March 20-31 corrected) | 15.3B (5.88B for March 20-31) | 15.1B (5.62B for March 20-31) | -1.7% overall month (-4.4% within corrected period) |
| User pageviews | April 2025 | 14.3B | 13.7B | -4.3% |
| User pageviews | May 2025 | 15.2B | 13.8B | -9.2% |
| User pageviews | June 2025 | 13.8B | 12.5B | -9.0% |
| User pageviews | July 2025 | 14.0B | 13.4B | -4.3% |
| User pageviews | August 2025 (partial month) | 14.2B | 13.7B | -3.8% |
| Unique devices | March 2025 | 1.85B | none (not corrected) | |
| Unique devices | April 2025 | 1.82B | none (not corrected) | |
| Unique devices | May 2025 | 2.15B | 1.78B | -17.0% |
| Unique devices | June 2025 | 2.08B | 1.69B | -18.9% |
| Unique devices | July 2025 | 1.93B | 1.76B | -9.1% |
| Unique devices | August 2025 | 1.81B | 1.73B | -4.3% |
Note: Post-correction metrics better reflect human user behavior. The previously reported metrics were inflated by non-human activity.
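As a rough check, the difference column follows from the relative change, (post - pre) / pre. The sketch below applies it to the rounded May 2025 figures from the table that also lists unique devices. The function name `pct_difference` is an illustrative choice, and because the inputs are rounded, results can differ from the published percentages by a few tenths of a point.

```python
def pct_difference(pre: float, post: float) -> float:
    """Relative change from the pre-correction to the post-correction value, in percent."""
    return (post - pre) / pre * 100.0


# Rounded figures (in billions) copied from the tables in this report.
may_pageviews = pct_difference(15.2, 13.8)
may_unique_devices = pct_difference(2.15, 1.78)

print(f"May 2025 user pageviews: {may_pageviews:.1f}%")
print(f"May 2025 unique devices: {may_unique_devices:.1f}%")
```

The pageview result agrees with the published -9.2%. The unique devices result lands slightly off the published -17.0%, presumably because the published figure was computed from unrounded counts.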
Findings
Signals indicating bot traffic
The suspect traffic showed missing referrer headers, outdated browser versions, and large volumes of requests that diverged from standard browsing behavior.
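To illustrate how signals of this kind could be combined, here is a minimal Python sketch. It is not the deployed heuristic, which is tracked in T395934. The `ActorSummary` schema, the signal names, and both thresholds are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Optional

# All names and thresholds below are hypothetical, chosen only for illustration.
MIN_BROWSER_MAJOR_VERSION = 100  # assumed cutoff for an "outdated" browser
MAX_REQUESTS_PER_HOUR = 500      # assumed cutoff for "unusually high" volume


@dataclass
class ActorSummary:
    """Aggregated view of one actor's requests over an hour (hypothetical schema)."""
    request_count: int
    requests_without_referrer: int
    browser_major_version: Optional[int]  # None if the user agent could not be parsed


def bot_signals(actor: ActorSummary) -> list[str]:
    """Return the names of the signals that this actor triggers."""
    signals = []
    # Every request from this actor arrived without a referrer header.
    if actor.request_count > 0 and actor.requests_without_referrer == actor.request_count:
        signals.append("missing_referrer")
    # The reported browser version is older than the assumed cutoff.
    if (actor.browser_major_version is not None
            and actor.browser_major_version < MIN_BROWSER_MAJOR_VERSION):
        signals.append("outdated_browser")
    # The actor sent more requests than the assumed hourly ceiling.
    if actor.request_count > MAX_REQUESTS_PER_HOUR:
        signals.append("high_volume")
    return signals


def looks_automated(actor: ActorSummary) -> bool:
    """Flag an actor only when all three signals co-occur, to limit false positives."""
    return len(bot_signals(actor)) == 3


suspect = ActorSummary(request_count=4000, requests_without_referrer=4000, browser_major_version=72)
reader = ActorSummary(request_count=12, requests_without_referrer=3, browser_major_version=125)
print(looks_automated(suspect), looks_automated(reader))  # prints: True False
```

Requiring all three signals together is a design choice for this sketch: each signal alone also occurs in legitimate traffic, for example privacy tools that strip referrers.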
Root cause
A new pattern for bot traffic emerged that was not being detected by our existing heuristics.
Affected datasets and services
See T405667 for the full list of datasets and services.
- actor datasets (webrequest & pageview)
- browser_general_daily
- browser_metrics_weekly
- clickstream_monthly
- interlanguage_daily
- referrer_daily
- pageviews datasets (hourly & daily; per project, per article, top articles, top articles per country)
- projectview_hourly
- projectview_geo
- unique_devices datasets (hourly & daily; per domain & per project family)
Resolution & decisions
In addition to correcting the data going forward, we decided to backfill historical data with corrections where possible.
New heuristic: We developed a new heuristic to identify bot traffic based on the signals we identified in our investigation. This heuristic was deployed on August 28, 2025 (T395934#11129327).
Backfill to correct historical data: As of October 8, 2025, we have corrected publicly available pageview data going back to March 20, 2025, and unique device measurements going back to May 1, 2025, using the new heuristic.
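For analysts comparing periods, it can help to check whether a given date falls inside the corrected range. The Python sketch below uses the effective dates listed under "Start of issue" in this report. The metric keys are hypothetical labels, not dataset names, and the sketch assumes that all data on or after an effective date reflects the new heuristic.

```python
from datetime import date

# Effective correction start dates as stated in this report.
# The keys are hypothetical labels chosen for this example, not dataset names.
CORRECTION_START = {
    "pageviews": date(2025, 3, 20),
    "unique_devices_daily": date(2025, 3, 20),
    "unique_devices_monthly": date(2025, 5, 1),
}


def is_corrected(metric: str, day: date) -> bool:
    """True if data for `metric` on `day` falls within the corrected period."""
    return day >= CORRECTION_START[metric]


print(is_corrected("pageviews", date(2025, 3, 19)))              # False: one day before the cutoff
print(is_corrected("unique_devices_monthly", date(2025, 5, 1)))  # True: first corrected month
```

A year-over-year comparison that spans an effective date mixes corrected and uncorrected data, so a check like this is best run on both endpoints.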