Analytics/Data Lake/Traffic
Latest revision as of 22:50, 22 September 2021
Traffic refers to pageviews to the pages of a wiki project. This page links to detailed information about traffic datasets in the Data Lake.
Most of the datasets below are updated at hourly granularity: a new hour of data arrives every hour, with a delay of 2 to 3 hours (the time needed for the hour to finish and for the data to be computed).
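As a rough illustration of that delay, the sketch below estimates the most recent hourly partition that should be available at a given time. The function name `latest_available_hour` is hypothetical, and it assumes the delay is measured from the start of the hour and that the upper bound of 3 hours applies; the real availability depends on the job schedule of each dataset.

```python
from datetime import datetime, timedelta


def latest_available_hour(now: datetime, delay_hours: int = 3) -> datetime:
    """Return the start of the latest hour assumed to be already computed.

    Assumption (not stated on this page): the data for an hour is ready
    at most `delay_hours` after that hour starts.
    """
    shifted = now - timedelta(hours=delay_hours)
    # Truncate to the start of the hour.
    return shifted.replace(minute=0, second=0, microsecond=0)


print(latest_available_hour(datetime(2021, 9, 22, 22, 50)))
# prints 2021-09-22 19:00:00
```

Using the upper bound of the delay is the conservative choice: a query for that hour should not hit a partition that is still being computed.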
Datasets
Hive tables
These datasets are available as Hive tables and can be queried using one of the available SQL engines, or accessed directly through HDFS.
Dataset Name | Description |
---|---|
webrequest hive table (see also the separate list of Hive tables derived from webrequest) | The webrequest stream contains data on all the hits to Wikimedia's servers. This includes requests for page HTML, images, CSS, and JavaScript, as well as requests to the API. |
pageview_actor hive table | The wmf.pageview_actor table is a smaller version of the webrequest table, with fewer columns. |
pageview_hourly hive table | The wmf.pageview_hourly table contains 'pre-aggregated' webrequest data, filtered to keep only pageviews, and aggregated over a predefined set of dimensions. |
projectview_hourly hive table | The wmf.projectview_hourly table is 'pre-aggregated' webrequest data at the project level. It differs from the wmf.pageview_hourly dataset in that it involves fewer dimensions and is therefore smaller in data size (and faster to query). |
unique devices | This dataset gives the number of distinct devices that visit our projects. |
browser general | This dataset gives pageview statistics broken down by user-agent-related dimensions such as OS family, OS major, browser family, and browser major. |
mediawiki_api_request | The mediawiki_api_request table provides the log of API requests to MediaWiki. |
mobile apps session metrics | Contains aggregate stats about pageview sessions on the Android and iOS Wikipedia mobile apps. |
mobile apps uniques | Counts how many different installs of the Android and iOS Wikipedia mobile apps accessed Wikimedia sites during the given day or month. |
inter language | Traffic between different languages of the same project family. |
virtualpageview_hourly | Provides data about page previews on desktop Wikipedia. |
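To give an idea of what querying one of these tables can look like, here is a minimal sketch that builds a query string for wmf.pageview_hourly, suitable for pasting into one of the SQL engines. The helper name `build_top_pages_query` is hypothetical, and the column names (`project`, `page_title`, `view_count`) and the partition columns (`year`, `month`, `day`, `hour`) are assumptions not stated on this page; check the pageview_hourly page for the actual schema.

```python
import re
from datetime import datetime


def build_top_pages_query(hour_start: datetime,
                          project: str = "en.wikipedia",
                          limit: int = 10) -> str:
    """Build a query for the most viewed pages of one project in one hour.

    Column and partition names are assumed, not taken from this page.
    """
    # Reject anything that is not a plain project name before interpolating it.
    if not re.fullmatch(r"[A-Za-z0-9.\-]+", project):
        raise ValueError(f"unexpected project name: {project!r}")
    return (
        "SELECT page_title, SUM(view_count) AS views\n"
        "FROM wmf.pageview_hourly\n"
        f"WHERE year = {hour_start.year} AND month = {hour_start.month}\n"
        f"  AND day = {hour_start.day} AND hour = {hour_start.hour}\n"
        f"  AND project = '{project}'\n"
        "GROUP BY page_title\n"
        "ORDER BY views DESC\n"
        f"LIMIT {int(limit)}"
    )


print(build_top_pages_query(datetime(2021, 9, 22, 19)))
```

Restricting the query to a single hour through the assumed partition columns keeps the scan small, which matters on tables of this size.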
Dumps
These datasets are made available as files, updated at regular intervals.
- Pageviews and Projectviews dumps [To be updated]
- Compressed pageviews dumps [To be updated]
- mediacounts
Deprecated or Obsolete Datasets
The following datasets are no longer in use, but the pages are kept to document history:
- pagecounts-raw
- pagecounts-all-sites
- Search requests
- Search requests clicks
Access
All data in the Data Lake is private by default; see Analytics/Data access for details. Some of the data above is public in other systems (see the Analytics main page).
History
Some partial information about the evolution of publishing analytics data at WMF is recorded in a timeline: https://meta.wikimedia.org/wiki/Research:Timeline_of_Wikimedia_analytics