
Analytics/Data Lake/Traffic: Difference between revisions

From Wikitech-static
imported>Milimetric
m (Milimetric moved page Analytics/Data to Analytics/Data Lake/Traffic)
 
imported>Nettrom
(→‎Hive tables: add link to Analytics/Data Lake/Traffic/mediawiki api request for the API request dataset #raddocs)
 
(8 intermediate revisions by 6 users not shown)
Line 1: Line 1:
This page contains links to documentation about datasets maintained by the Analytics team.
Traffic refers to pageviews to the pages of a wiki project. This page links to detailed information about traffic datasets in the [[Analytics/Data Lake|Data Lake]].


== Analytics Cluster Data (Hive/Hadoop, etc.) ==
Most of the datasets below are updated at hourly granularity: an hour of new data arrives every hour, with a delay of 2 to 3 hours (the time needed for the hour to finish and for the data to be computed).
* [[/Webrequest|webrequest hive table]] (raw and refined)
 
* [[/Pageview_hourly|pageview_hourly hive table]]
== Datasets ==
* [[/Projectview_hourly|projectview_hourly hive table]]
 
* [[/Pageviews|Pageviews and Projectviews dumps]]
=== Hive tables ===
* [[/Pagecounts-ez|Compressed pageviews dumps]]
 
These datasets are available as Hive tables and can be [[Analytics/Data Lake#Querying|queried]] using one of the available SQL engines, or accessed directly through HDFS.  
{| class="wikitable sortable"
|+
!Dataset Name
!Description
|-
|[[Analytics/Data Lake/Traffic/Webrequest|webrequest hive table]]
 
- See also a separate [[Analytics/Data Lake/Traffic/Webrequest#Derived%20streams|list of Hive tables derived from webrequest]]
|The webrequest stream contains data on all the hits to Wikimedia's servers. This includes requests for page HTML, images, CSS, and JavaScript, as well as requests to the API.
|-
|[[Analytics/Data Lake/Traffic/Pageview actor|pageview_actor hive table]]
|The wmf.pageview_actor table is a smaller version of the [[Analytics/Data/Webrequest|webrequest]] table, with fewer columns.
|-
|[[Analytics/Data Lake/Traffic/Pageview hourly|pageview_hourly hive table]]
|The wmf.pageview_hourly table contains 'pre-aggregated' [[Analytics/Data/Webrequest|webrequest]] data, filtered to keep only pageviews, and aggregated over a predefined set of dimensions.
|-
|[[Analytics/Data Lake/Traffic/Projectview hourly|projectview_hourly hive table]]
|The <code>wmf.projectview_hourly</code> table is 'pre-aggregated' [[Analytics/Data/Webrequest|webrequest]] data at the project level. It differs from the <code>[[Analytics/Data/Pageview hourly|wmf.pageview_hourly]]</code> dataset in that it has fewer dimensions and is therefore smaller in data size (and faster to query).
|-
|[[Analytics/Data Lake/Traffic/Unique Devices|unique devices]]
|This dataset reports how many distinct devices visit our projects.
|-
|[[Analytics/Data Lake/Traffic/Browser general|browser general]]
|This dataset gives you pageview statistics broken down by user-agent related dimensions like OS family, OS major, browser family, browser major
|-
|[[Analytics/Data Lake/Traffic/mediawiki api request|mediawiki_api_request]]
|The <code>mediawiki_api_request</code> table provides the log of API requests to MediaWiki.
|-
|[[Analytics/Data Lake/Traffic/mobile apps session metrics|mobile apps session metrics]]
|Contains aggregate stats about pageview sessions on the Android and iOS Wikipedia mobile apps
|-
|[[Analytics/Data Lake/Traffic/mobile apps uniques|mobile apps uniques]]
|Counts how many distinct installs of the Android and iOS Wikipedia mobile apps accessed Wikimedia sites during the given day or month
|-
|[[Analytics/Data Lake/Traffic/Interlanguage|inter language]]
|Traffic between different languages on the same project family
|-
|[[Analytics/Data Lake/Traffic/Virtualpageview hourly|virtualpageview_hourly]]
|Provides data about page previews on desktop Wikipedia
|}
 
=== Dumps ===
 
These datasets are made available as files, updated at regular intervals.
 
* [[/Pageviews|Pageviews and Projectviews dumps]] [To be updated]
* [[/Pagecounts-ez|Compressed pageviews dumps]] [To be updated]
* [[/Mediacounts|mediacounts]]
* [[/Unique Devices|Unique Devices]]
 
* [[/Browser_general|Browser General]]
== Deprecated or Obsolete Datasets ==
* [[/ApiAction|ApiAction]]
 
* [[/Pagecounts-raw|pagecounts-raw (deprecated)]]
The following datasets are no longer in use, but the pages are kept to document history:
* [[/Pagecounts-all-sites|pagecounts-all-sites (deprecated)]]
 
* [[/Pagecounts-raw|pagecounts-raw]]
* [[/Pagecounts-all-sites|pagecounts-all-sites]]
* [[Obsolete:Analytics/Data Lake/Traffic/Cirrus|Search requests]]
* [[Obsolete:Analytics/Data Lake/Traffic/CirrusQueryClicks|Search requests clicks]]


== Access ==
Some of the data above is public, but some needs special access. For this, reference [[Analytics/Data access]]
All data in the Data Lake is private by default; for how to get access, see [[Analytics/Data access]]. Some of the data above is public in other systems (see the [[Analytics]] main page).


== History ==
The evolution of publishing analytics data at WMF is recorded [https://meta.wikimedia.org/wiki/Research:Timeline_of_Wikimedia_analytics here in a timeline].
Partial information about the evolution of publishing analytics data at WMF is recorded [https://meta.wikimedia.org/wiki/Research:Timeline_of_Wikimedia_analytics here in a timeline].

Latest revision as of 22:50, 22 September 2021

Traffic refers to pageviews to the pages of a wiki project. This page links to detailed information about traffic datasets in the Data Lake.

Most of the datasets below are updated at hourly granularity: an hour of new data arrives every hour, with a delay of 2 to 3 hours (the time needed for the hour to finish and for the data to be computed).
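
As a rough illustration of the delay described above, the following Python sketch estimates the most recent hour whose data should already be computed. It assumes the worst case of 3 hours and assumes the delay is counted from the start of the hour in question; the function name and the constant are hypothetical and not part of any Analytics tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: use the upper bound of the "2 to 3 hours" delay to be safe.
MAX_DELAY_HOURS = 3


def latest_available_hour(now: datetime, delay_hours: int = MAX_DELAY_HOURS) -> datetime:
    """Return the start of the most recent hour whose data should be available."""
    # Truncate the current time to the start of its hour.
    current_hour = now.replace(minute=0, second=0, microsecond=0)
    # Step back by the delay; the result is the start of the newest safe hour.
    return current_hour - timedelta(hours=delay_hours)


if __name__ == "__main__":
    print(latest_available_hour(datetime.now(timezone.utc)))
```

For example, under these assumptions a job running at 12:30 UTC would treat the hour starting at 09:00 UTC as the newest one that is safe to read.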

Datasets

Hive tables

These datasets are available as Hive tables and can be queried using one of the available SQL engines, or accessed directly through HDFS.

{| class="wikitable sortable"
!Dataset Name
!Description
|-
|webrequest hive table
- See also a separate list of Hive tables derived from webrequest
|The webrequest stream contains data on all the hits to Wikimedia's servers. This includes requests for page HTML, images, CSS, and JavaScript, as well as requests to the API.
|-
|pageview_actor hive table
|The wmf.pageview_actor table is a smaller version of the webrequest table, with fewer columns.
|-
|pageview_hourly hive table
|The wmf.pageview_hourly table contains 'pre-aggregated' webrequest data, filtered to keep only pageviews, and aggregated over a predefined set of dimensions.
|-
|projectview_hourly hive table
|The wmf.projectview_hourly table is 'pre-aggregated' webrequest data at the project level. It differs from the wmf.pageview_hourly dataset in that it has fewer dimensions and is therefore smaller in data size (and faster to query).
|-
|unique devices
|This dataset reports how many distinct devices visit our projects.
|-
|browser general
|This dataset gives you pageview statistics broken down by user-agent related dimensions like OS family, OS major, browser family, browser major
|-
|mediawiki_api_request
|The mediawiki_api_request table provides the log of API requests to MediaWiki.
|-
|mobile apps session metrics
|Contains aggregate stats about pageview sessions on the Android and iOS Wikipedia mobile apps
|-
|mobile apps uniques
|Counts how many distinct installs of the Android and iOS Wikipedia mobile apps accessed Wikimedia sites during the given day or month
|-
|inter language
|Traffic between different languages on the same project family
|-
|virtualpageview_hourly
|Provides data about page previews on desktop Wikipedia
|}
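
Since these tables are queried through SQL engines, a query for a single hour of data might be assembled as in the Python sketch below. This is only an illustration: the helper name is hypothetical, and the partition columns <code>year</code>, <code>month</code>, <code>day</code> and <code>hour</code> are an assumption about the table layout that should be checked against each dataset's own page.

```python
from datetime import datetime


def hourly_partition_query(table: str, hour: datetime, limit: int = 10) -> str:
    """Build a SQL query that reads one hourly partition of `table`."""
    # Assumption: the table is partitioned by integer year/month/day/hour
    # columns; verify this on the dataset's documentation page.
    return (
        f"SELECT * FROM {table} "
        f"WHERE year = {hour.year} AND month = {hour.month} "
        f"AND day = {hour.day} AND hour = {hour.hour} "
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # Print a query for the hour starting at 09:00 on 22 September 2021.
    print(hourly_partition_query("wmf.pageview_hourly", datetime(2021, 9, 22, 9)))
```

Restricting a query to explicit partition values keeps it from scanning the whole table, which matters most for the larger datasets such as webrequest.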

Dumps

These datasets are made available as files, updated at regular intervals.

Deprecated or Obsolete Datasets

The following datasets are no longer in use, but the pages are kept to document history:

Access

All data in the Data Lake is private by default; for how to get access, see Analytics/Data access. Some of the data above is public in other systems (see the Analytics main page).

History

Partial information about the evolution of publishing analytics data at WMF is recorded here in a timeline.