Data Platform/Data Lake/Data Issues/2025-06-03 May 2025 spike in bot traffic

May 2025 spike in bot traffic

Status: Resolved
Severity: High
Business data steward: Omari Sefu
Technical data steward: Andreas Hoelzl
Incident coordinator: Omari Sefu, Andreas Hoelzl
Incident response team: Marcel Ruiz Forns, Hamid Ghani (lead analyst), Maya Kampurath, Joseph Allemandou
Date detected: June 3rd, 2025
Date resolved: August 28th, 2025 (deployment of new bot detection heuristics); October 8th, 2025 (backfill complete)
Start of issue:

  • Mar 20, 2025 (effective date for the correction to pageview and daily unique device data using the new bot detection heuristic)
  • May 1, 2025 (effective date for the correction to monthly unique device data)
  • May & June 2025 (period where we observed the most notable increase in user pageviews and unique devices before the new bot detection heuristic was implemented)

Phabricator ticket: T395934, T405667 (backfill)

Summary

As Wikipedia traffic comes in, we classify it as coming from either humans (“users”) or bots (“spider” or “automated”). Some bots attempt to bypass rate limits or other restrictions with increasing sophistication.

In June 2025, we observed an unusually large increase in our measurement of unique devices for May 2025 (T395934). This spike was predominantly geolocated to Brazil, accounting for approximately 300 million additional unique devices compared to the previous year. We also observed a substantial rise in “user” pageviews, with Brazil again showing the largest increase (roughly 1.2 billion additional pageviews compared to May 2024). This represented a new peak in traffic growth from Brazil, which began in January 2025 with modest increases and spiked at the end of April and into May 2025.

After investigation, it became clear that non-human (bot) traffic was significantly inflating our user pageview and unique device counts. These bots bypassed our existing bot detection heuristics.

Impact on metrics

All Wikimedia projects
| Metric | Month | Pre-correction (data reported before October 2025) | Post-correction (data reported after October 2025) | Difference between pre- and post-corrected data |
|---|---|---|---|---|
| User pageviews | March 2025 (partial month; only March 20-31 corrected; figures in parentheses cover the corrected period) | 16.6B (6.4B) | 16.2B (6.0B) | -2.4% overall month (-6% within corrected period) |
| | April 2025 | 15.3B | 14.3B | -6.6% |
| | May 2025 | 17.2B | 14.5B | -15.5% |
| | June 2025 | 15.5B | 13.3B | -14.3% |
| | July 2025 | 15.4B | 14.2B | -8.0% |
| | August 2025 (partial month) | 15.4B | 14.4B | -6.3% |

Note on unique devices: we capture unique devices on a per-project and per-project-family basis, but they cannot be aggregated across project families.

All Wikipedias
| Metric | Month | Pre-correction (data reported before October 2025) | Post-correction (data reported after October 2025) | Difference between pre- and post-corrected data |
|---|---|---|---|---|
| User pageviews | March 2025 (only March 20-31 corrected) | 15.3B (5.88B) | 15.1B (5.62B) | -1.7% overall month (-4.4% within corrected period) |
| | April 2025 | 14.3B | 13.7B | -4.3% |
| | May 2025 | 15.2B | 13.8B | -9.2% |
| | June 2025 | 13.8B | 12.5B | -9.0% |
| | July 2025 | 14.0B | 13.4B | -4.3% |
| | August 2025 (partial month) | 14.2B | 13.7B | -3.8% |
| Unique devices | March 2025 | 1.85B | none (not corrected) | |
| | April 2025 | 1.82B | none (not corrected) | |
| | May 2025 | 2.15B | 1.78B | -17.0% |
| | June 2025 | 2.08B | 1.69B | -18.9% |
| | July 2025 | 1.93B | 1.76B | -9.1% |
| | August 2025 | 1.81B | 1.73B | -4.3% |

Note: Post-correction metrics better reflect user behavior. Previous metrics were skewed by non-human activity.
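
The difference column can be approximated from the rounded figures in the tables above. The following is a minimal sketch, not the calculation used for the published numbers: the function name `pct_change` is our own, and because it works on the rounded values shown here, results can differ from the published percentages by a few tenths of a percentage point (the published values were presumably computed from unrounded counts).

```python
def pct_change(pre: float, post: float) -> float:
    """Relative change from the pre- to the post-correction value, in percent."""
    if pre == 0:
        raise ValueError("pre-correction value must be non-zero")
    return (post - pre) / pre * 100.0


# Rounded figures (in billions) copied from the tables above.
all_projects_pageviews_may = pct_change(17.2, 14.5)
wikipedia_unique_devices_june = pct_change(2.08, 1.69)

print(f"All projects, user pageviews, May 2025: {all_projects_pageviews_may:.1f}%")
print(f"Wikipedias, unique devices, June 2025: {wikipedia_unique_devices_june:.1f}%")
```

For example, the rounded May 2025 pageview figures for all projects give roughly -15.7%, against the published -15.5%.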

Findings

Signals indicating bot traffic

The suspect traffic showed missing referrer headers, outdated browser versions, and large volumes of requests that diverged from standard browsing behavior.
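
To illustrate how signals of this kind could be combined, here is a purely hypothetical sketch. It is not the heuristic that was deployed: the `ActorSummary` fields, the thresholds, and the "two of three signals" rule are all assumptions made for the example. Only the three signal types and the "user" / "automated" labels come from this page.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds, chosen only for illustration.
MIN_REFERER_MISSING_RATIO = 0.95
MIN_BROWSER_MAJOR_VERSION = 100
MAX_REQUESTS_PER_HOUR = 500
SIGNALS_REQUIRED = 2


@dataclass
class ActorSummary:
    """Hypothetical per-actor hourly summary; not the real dataset schema."""
    referer_missing_ratio: float          # share of requests without a Referer header (0.0 to 1.0)
    browser_major_version: Optional[int]  # None when the browser version is unknown
    requests_per_hour: int


def suspicious_signals(actor: ActorSummary) -> list[str]:
    """Return the names of the signals that this actor triggers."""
    signals = []
    if actor.referer_missing_ratio >= MIN_REFERER_MISSING_RATIO:
        signals.append("missing_referer")
    if (actor.browser_major_version is not None
            and actor.browser_major_version < MIN_BROWSER_MAJOR_VERSION):
        signals.append("outdated_browser")
    if actor.requests_per_hour > MAX_REQUESTS_PER_HOUR:
        signals.append("high_volume")
    return signals


def classify(actor: ActorSummary) -> str:
    """Label an actor "automated" when enough signals fire, else "user"."""
    if len(suspicious_signals(actor)) >= SIGNALS_REQUIRED:
        return "automated"
    return "user"


print(classify(ActorSummary(1.0, 80, 2000)))
print(classify(ActorSummary(0.3, 125, 40)))
```

Requiring more than one signal is a design choice in this sketch: a single signal, such as a missing referrer, is also common in legitimate traffic, so using it alone would risk mislabeling human readers.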

Root cause

A new pattern of bot traffic emerged that our existing heuristics did not detect.

Affected datasets and services

The full list of affected datasets and services is in T405667. The main ones are:

  • actor datasets (webrequest & pageview)
  • browser_general_daily
  • browser_metrics_weekly
  • clickstream_monthly
  • interlanguage_daily
  • referrer_daily
  • pageviews datasets (hourly & daily; per project, per article, top articles, top articles per country)
  • projectview_hourly
  • projectview_geo
  • unique_devices datasets (hourly & daily; per domain & per project family)

Resolution & decisions

In addition to correcting the data going forward, we decided to backfill historical data with corrections where possible.

New heuristic: We developed a new heuristic to identify bot traffic based on the signals we identified in our investigation. This heuristic was deployed on August 28, 2025 (T395934#11129327).

Backfill to correct historical data: As of Oct 8, 2025, we have corrected publicly available pageview data going back to March 20, 2025, and unique device measurements going back to May 1, 2025, using the new heuristic.
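
For analysts comparing periods, it can help to check whether a given date falls inside the corrected range. The sketch below encodes the effective dates listed under "Start of issue" on this page. The metric keys and the function name are hypothetical labels for this example, not actual dataset names.

```python
from datetime import date

# Effective dates from the "Start of issue" field of this page.
# The dictionary keys are hypothetical labels, not real dataset names.
CORRECTION_START = {
    "pageviews": date(2025, 3, 20),
    "unique_devices_daily": date(2025, 3, 20),
    "unique_devices_monthly": date(2025, 5, 1),
}


def uses_new_heuristic(metric: str, day: date) -> bool:
    """True if data for `metric` on `day` reflects the new bot detection heuristic."""
    return day >= CORRECTION_START[metric]


print(uses_new_heuristic("pageviews", date(2025, 3, 19)))
print(uses_new_heuristic("unique_devices_monthly", date(2025, 5, 1)))
```

Data before the effective date still uses the older classification, so a series that spans that date mixes the two methods.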