Analytics/AQS/Pageviews
This page documents the Pageview API (v1), a public API developed and maintained by the Wikimedia Foundation that serves analytical data about article pageviews of Wikipedia and its sister projects. With it, you can get pageview trends on specific articles or projects; filter by agent type or access method, and choose different time ranges and granularities; you can also get the most viewed articles of a certain project and timespan, and even check out the countries that visit a project the most. Have fun!
For information about the underlying data itself (in particular the exact definition of a pageview), see meta:Research:Page view and the documentation of the underlying database table pageview_hourly.
Quick start
Wikimedia REST API (includes interactive examples).
Pageview counts by article
Daily counts
Get a pageview count timeseries of en.wikipedia's article Albert Einstein for the month of October 2015:
GET
https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Albert_Einstein/daily/2015100100/2015103100
Get a pageview count timeseries of de.wikipedia's article Johann Wolfgang von Goethe from October 13th 2015 to October 27th 2015, counting only the pageviews generated by human users:
GET
https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/de.wikipedia/all-access/user/Johann_Wolfgang_von_Goethe/daily/2015101300/2015102700
Get the number of pageviews of es.wiktionary's entry hoy generated via mobile web on November 1st, 2015:
GET
https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/es.wiktionary/mobile-web/all-agents/hoy/daily/2015110100/2015110100
Monthly counts
Get monthly pageview counts of de.wikipedia's article Barack_Obama for the year 2016:
GET
https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/de.wikipedia/all-access/all-agents/Barack_Obama/monthly/2016010100/2016123100
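The per-article URLs above all follow the same pattern, so they can be assembled programmatically. The following is a minimal sketch, not an official client: the function name `per_article_url` and its parameters are hypothetical, and it assumes titles use underscores for spaces and that timestamps take the YYYYMMDDHH form seen in the examples, with the hour fixed at 00.

```python
from datetime import date
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews"


def _stamp(day):
    """Format a date as YYYYMMDDHH with the hour fixed at 00 (assumption)."""
    return day.strftime("%Y%m%d") + "00"


def per_article_url(project, title, start, end,
                    access="all-access", agent="all-agents",
                    granularity="daily"):
    """Build a per-article URL; `start` and `end` are datetime.date objects."""
    # Replace spaces with underscores, then percent-encode anything else
    # (including "/") so the title stays a single path segment.
    article = quote(title.replace(" ", "_"), safe="")
    return "/".join([BASE, "per-article", project, access, agent,
                     article, granularity, _stamp(start), _stamp(end)])


print(per_article_url("en.wikipedia", "Albert Einstein",
                      date(2015, 10, 1), date(2015, 10, 31)))
```

Called as shown, this reproduces the first daily example URL in this section; passing `agent="user"` or `granularity="monthly"` yields the other variants.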
Slice and dice pageview counts
Get a daily pageview count timeseries of all projects for the month of October 2015:
GET
https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/all-projects/all-access/all-agents/daily/2015100100/2015103100
Get an hourly timeseries of all projects' pageviews generated by human users via the mobile app on October 1st, 2015:
GET
https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/all-projects/mobile-app/user/hourly/2015100100/2015100123
Get the number of pageviews of ca.wikipedia generated by spiders on mobile web on November 1st, 2015:
GET
https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/ca.wikipedia/mobile-web/spider/daily/2015110100/2015110100
Get a count of the number of pageviews to all projects tagged as 'automated'. For more details on what automated means please see: Analytics/Data_Lake/Traffic/BotDetection
GET
https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/all-projects/all-access/automated/monthly/2018033100/2020050400
Most viewed articles
All data for this metric is from agents that we believe to be human users. We detect and exclude automated agents that are either self-identified or fit heuristic rules we tailored to this dataset.
Get the top 1000 most visited articles from en.wikipedia for October 10th, 2015:
GET
https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2015/10/10
Get the top 1000 articles from pt.wikipedia visited via the mobile app on November 1st, 2015:
GET
https://wikimedia.org/api/rest_v1/metrics/pageviews/top/pt.wikipedia/mobile-app/2015/11/01
Get the top 1000 most visited articles from en.wikisource for all days in October 2015:
GET
https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikisource/all-access/2015/10/all-days
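Unlike the timeseries endpoints, the top endpoint takes the date as separate year, month and day path segments, with the literal `all-days` standing in for a whole month. A small sketch of that pattern follows; the helper name `top_url` and its signature are hypothetical.

```python
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews"


def top_url(project, year, month, day=None, access="all-access"):
    """Build a URL for the top endpoint; day=None requests the whole month."""
    # The examples above zero-pad month and day, and use "all-days"
    # in place of the day to cover every day of the month.
    day_part = "all-days" if day is None else f"{day:02d}"
    return f"{BASE}/top/{project}/{access}/{year:04d}/{month:02d}/{day_part}"


print(top_url("en.wikipedia", 2015, 10, 10))
print(top_url("en.wikisource", 2015, 10))
```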
Pageviews split by country
All data for this metric is from agents that we believe to be human users. We detect and exclude automated agents that are either self-identified or fit heuristic rules we tailored to this dataset.
Get the top countries that visited any Wikimedia project in November 2017 from any access method (desktop, mobile web and app):
GET
https://wikimedia.org/api/rest_v1/metrics/pageviews/top-by-country/all-projects/all-access/2017/11
Get the top countries that visited Portuguese Wikipedia via the mobile app in August 2016:
GET
https://wikimedia.org/api/rest_v1/metrics/pageviews/top-by-country/pt.wikipedia/mobile-app/2016/08
Most viewed articles per country
Get the top 1000 most viewed articles from Japan on January 2nd, 2021:
GET
https://wikimedia.org/api/rest_v1/metrics/pageviews/top-per-country/JP/all-access/2021/01/02
Get the top 1000 most viewed articles visited via the mobile app from the United States on February 28th, 2021:
GET
https://wikimedia.org/api/rest_v1/metrics/pageviews/top-per-country/US/mobile-app/2021/02/28
Pageviews for ALL projects
Daily
GET
https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/all-projects/all-access/all-agents/daily/2015100100/2015103000
Monthly
GET
https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/all-projects/all-access/all-agents/monthly/2015100100/2016103000
Pagecounts (legacy data)
- Analytics/AQS/Legacy Pagecounts: Legacy pagecounts (covers 2008–2016)
The API
What is it?
The Pageview API is a collection of REST endpoints that serve analytical data about pageviews in Wikimedia's projects. It's developed and maintained by WMF's Analytics and Services teams, and is implemented using Analytics' Hadoop cluster and RESTBase. This API is meant to be used by anyone interested in pageview statistics on Wikimedia wikis: Foundation, communities, and the rest of the world.
How to access
The API is accessible via HTTPS at wikimedia.org/api/rest_v1. As it is public, it doesn't need authentication, and it supports CORS. The URLs are structured like this:
/metrics/pageviews/{endpoint}/{parameter 1}/{parameter 2}/.../{parameter N}
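As an illustration of the URL structure above, here is a minimal sketch using only the Python standard library. The names `pageviews_url`, `build_request` and `fetch_json` are hypothetical, as is the placeholder User-Agent string. Sending a descriptive User-Agent is an assumption based on Wikimedia's general User-Agent policy rather than something this page states; no credentials are involved.

```python
import json
import urllib.request
from urllib.parse import quote

API_ROOT = "https://wikimedia.org/api/rest_v1"


def pageviews_url(endpoint, *params):
    """Join an endpoint name and its ordered parameters into a full URL."""
    # Each parameter is percent-encoded so it stays one path segment.
    segments = [quote(str(p), safe="") for p in params]
    return f"{API_ROOT}/metrics/pageviews/{endpoint}/" + "/".join(segments)


def build_request(url, user_agent="example-pageviews-client/0.1 (you@example.org)"):
    """Create an unauthenticated GET request with an identifying User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_json(url):
    """Perform the request and decode the JSON body (makes a network call)."""
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        return json.load(response)


url = pageviews_url("aggregate", "all-projects", "all-access",
                    "all-agents", "daily", "2015100100", "2015103100")
print(url)
```

`fetch_json(url)` would then return the decoded response; it is left uncalled here so the sketch runs without network access.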
Country data and privacy
Pageviews by country privacy study: [1]
Reference
Please see AQS's RESTBase docs for a complete and interactive technical reference on Pageview API endpoints.
Updates and backfilling
The data is loaded at the end of the timespan in question. So data for 2015-12-01 will be loaded on 2015-12-02 00:00:00 UTC; data for 2015-11-10 18:00:00 UTC will be loaded on 2015-11-10 19:00:00 UTC; and so on. The loading can take a long time. It usually takes a few hours, but sometimes 24 hours, and sometimes more if there are problems. See #Gotchas for more details.
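The loading rule can be expressed as simple date arithmetic: the earliest possible load time is the end of the period. The sketch below is illustrative only. The function names are hypothetical, and the 24-hour default slack is an assumption drawn from the "sometimes 24 hours" remark above, not a guarantee.

```python
from datetime import datetime, timedelta, timezone

# Length of one period for the granularities covered by this sketch.
PERIOD = {"hourly": timedelta(hours=1), "daily": timedelta(days=1)}


def earliest_load_time(period_start, granularity):
    """Return the end of the period, the earliest moment its data can load."""
    return period_start + PERIOD[granularity]


def probably_loaded(period_start, granularity, now, slack=timedelta(hours=24)):
    """Heuristic: treat data as loaded once `slack` has passed after the period."""
    return now >= earliest_load_time(period_start, granularity) + slack


day = datetime(2015, 12, 1, tzinfo=timezone.utc)
hour = datetime(2015, 11, 10, 18, tzinfo=timezone.utc)
print(earliest_load_time(day, "daily"))
print(earliest_load_time(hour, "hourly"))
```

A client could use `probably_loaded` to decide whether a missing value is more likely a true zero or simply not loaded yet.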
The API serves data starting at 2015-08-01.
As for the date range available, we only have the quality source data we need going back to May 1st, 2015. We will finish back-filling to that date but we can't go further back since we delete the more sensitive raw logs that we generate this data from (for privacy reasons).
Gotchas
- Very high number of views of the "-" page
- The dash is used as a special value for "no page title found" when extracting titles from URLs, so a page titled "-" may appear to receive an unusually high number of pageviews.
- 404 means zero or not loaded yet
- At some point you may get a 404 not found response from the API. Sometimes this means that there are 0 pageviews for the project, timespan and filters you specified in the query. It can also happen when your client requests data for today and the corresponding data has not yet been loaded into the API's database (see #Updates_and_backfilling). The problem is that the API, for implementation reasons, cannot distinguish between actual zeros and data that hasn't been loaded into the database yet. For now, it's up to the user to handle that.
- 404s within timeseries
- Because of the same caveat (404 means zero or not loaded yet), if you request a timeseries from the API, you might get no data for the dates that have 0 pageviews. This may create holes in the timeseries and break charting libraries. For now, it's up to the user to handle that and fill in the missing zeros.
- 429 throttling
- The client has made too many requests and is being throttled. This will happen if the storage cannot keep up with the request rate from a given IP. Throttling is enforced at the storage layer, meaning that if you request data we have in cache (because another client has requested it earlier) there is no throttling. Throttling will be enabled late May 2016.
- Pageviews for yesterday not available
- Data loads into the pageview API from a large stream of data. This process can take a while. It usually is done within a few hours, but can last 24 hours or more if there are problems. We sometimes also have to re-load data if we find bugs or problems. We will try to announce any and all such significant problems on the analytics-l mailing list.
- Concrete figures in per country data aren't reported
- When a wiki is small enough, per country data from the per country pageviews endpoint could be used to identify the country of a user or group of users if we report the exact numbers untouched. In order to prevent a situation where personally identifiable information is revealed through the API, the endpoint:
- Doesn't report exact figures of pageviews per country. Pageview values are given in a bucketed format: instead of reporting 7,876,451 views, the API will return "views": "1000000-9999999". However, even though two or more countries might appear to have the same views in terms of buckets, their rank field does reveal which one has more views.
- Doesn't report countries with a number of pageviews lower than 100. If there are no countries in a project that go over that threshold, the project is not reported by the API.
- Furthermore, the Pageviews API never reports pageview values of zero. This rule also holds in the by-country endpoint.
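Two of the gotchas above leave work to the client: filling in missing zeros in a timeseries, and interpreting bucketed per-country views. The following is a minimal sketch of both, with hypothetical helper names. It assumes each returned item is a dictionary with a `timestamp` string in YYYYMMDDHH form and a `views` value; the item layout is an assumption, since only the `"views"` field appears on this page. Note that zero-filling recent dates can be misleading, because a missing value may also mean the data has not been loaded yet.

```python
from datetime import date, timedelta


def fill_missing_days(items, start, end):
    """Return (timestamp, views) pairs for every day from start to end inclusive."""
    # Index the items that were returned; absent days are assumed to be zero.
    views_by_stamp = {item["timestamp"]: item["views"] for item in items}
    filled = []
    day = start
    while day <= end:
        stamp = day.strftime("%Y%m%d") + "00"
        filled.append((stamp, views_by_stamp.get(stamp, 0)))
        day += timedelta(days=1)
    return filled


def parse_view_bucket(bucket):
    """Split a bucketed value such as "1000000-9999999" into (low, high)."""
    low, high = bucket.split("-")
    return int(low), int(high)


# Hypothetical sample data with a hole on October 2nd.
sample = [{"timestamp": "2015100100", "views": 5},
          {"timestamp": "2015100300", "views": 7}]
print(fill_missing_days(sample, date(2015, 10, 1), date(2015, 10, 3)))
print(parse_view_bucket("1000000-9999999"))
```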
Sample app
Here is a simple web application sample that shows how to access the Analytics Query Service via JavaScript.
Clients and Tools
The following API client libraries are available:
- mwviews Python package
- waxer R package, which includes pageviews in addition to other metrics
- pageviews R package
- pageviews.js JavaScript library
- pageviewapi Python package
Example applications:
Changelog
- 2015-11-01
- Initial release. Featuring 3 endpoints for pageview metrics: per-article, aggregate and top. Some endpoints do not support all granularities yet.
- 2016-03-09
- Remove "-" from the top pageviews. The "-" page is both a redirect to the Hyphen Minus page, and the way the Analytics team flags pages with unknown titles (such as search pages, diff pages, and other action specific pages). Community globally asked for this title to be removed from the top list.
- 2016-04-08
- Strip out 'www' if it's passed into the project parameter. This is confusing when people try to look up www.mediawiki. Fix decoding bug where articles with % in their titles were causing a 500 error. Fix the date range to include start and end in the results.
- 2020-04-29
- Add the automated agent type for traffic identified as coming from bots that do not self-identify. The new value is available as a filter in the API, and also affects the top metrics, where bot-created artifacts should now be much less present.
Issues with data
Issues with data are documented in the hive data store from which data is extracted: Analytics/Data/Pageview_hourly#Changes_and_known_problems_since_2015-06-16
See also
- "Making our pageview data easily accessible" (announcement blog post, December 2015)