Difference between revisions of "Analytics/Data Lake/Traffic"

imported>Aklapper
(Replace non-working anchor link)
imported>Mayakpwiki
(→‎Access: #raddocs)
This page links to detailed information about traffic datasets in the [[Analytics/Data Lake|Data Lake]].
Traffic refers to pageviews to the pages of a wiki project. This page links to detailed information about traffic datasets in the [[Analytics/Data Lake|Data Lake]].


Most of the datasets below are updated at hourly granularity: a new hour of data arrives every hour, with a delay of 2 to 3 hours (the hour must finish, and then the data must be computed).


== Datasets ==
* [[/Webrequest|webrequest hive table]] (raw and refined)
 
** See also a separate [[/Webrequest#Derived streams|list of Hive tables derived from webrequest]]  
=== Hive tables ===
* [[Analytics/Data Lake/Traffic/Pageview actor|pageview_actor hive table]]
 
*[[/Pageview_hourly|pageview_hourly hive table]]
These datasets are available as Hive tables and can be [[Analytics/Data Lake#Querying|queried]] using one of the available SQL engines, or accessed directly through HDFS.
* [[/Projectview_hourly|projectview_hourly hive table]]
{| class="wikitable sortable"
|+
!Dataset Name
!Description
|-
|[[Analytics/Data Lake/Traffic/Webrequest|webrequest hive table]]
 
- See also a separate [[Analytics/Data Lake/Traffic/Webrequest#Derived streams|list of Hive tables derived from webrequest]]
|The webrequest stream contains data on all the hits to Wikimedia's servers. This includes requests for page HTML, images, CSS, and JavaScript, as well as requests to the API.
|-
|[[Analytics/Data Lake/Traffic/Pageview actor|pageview_actor hive table]]
|The wmf.pageview_actor table is a smaller version of the [[Analytics/Data/Webrequest|webrequest]] table, with fewer columns.
|-
|[[Analytics/Data Lake/Traffic/Pageview hourly|pageview_hourly hive table]]
|The wmf.pageview_hourly table contains 'pre-aggregated' [[Analytics/Data/Webrequest|webrequest]] data, filtered to keep only pageviews, and aggregated over a predefined set of dimensions.
|-
|[[Analytics/Data Lake/Traffic/Projectview hourly|projectview_hourly hive table]]
|The <code>wmf.projectview_hourly</code> table is 'pre-aggregated' [[Analytics/Data/Webrequest|webrequest]] data at the project level. It differs from the <code>[[Analytics/Data/Pageview hourly|wmf.pageview_hourly]]</code> dataset in that it involves fewer dimensions and is therefore smaller in data size (and faster to query).
|-
|[[Analytics/Data Lake/Traffic/Unique Devices|unique devices]]
|This dataset gives the number of distinct devices that visit our projects.
|-
|[[Analytics/Data Lake/Traffic/Browser general|browser general]]
|This dataset gives you pageview statistics broken down by user-agent related dimensions like OS family, OS major, browser family, browser major
|-
|api_request
|The <code>mediawiki_api_request</code> table provides the log of API requests to MediaWiki.
|-
|[[Analytics/Data Lake/Traffic/mobile apps session metrics|mobile apps session metrics]]
|Contains aggregate stats about pageview sessions on the Android and iOS Wikipedia mobile apps
|-
|[[Analytics/Data Lake/Traffic/mobile apps uniques|mobile apps uniques]]
|Counts how many distinct installs of the Android and iOS Wikipedia mobile apps accessed Wikimedia sites during a given day or month.
|-
|[[Analytics/Data Lake/Traffic/Interlanguage|inter language]]
|Traffic between different languages on the same project family
|-
|[[Analytics/Data Lake/Traffic/Virtualpageview hourly|virtualpageview_hourly]]
|Provides data about page previews on desktop Wikipedia
|}
 
=== Dumps ===
 
These datasets are made available as files, updated at regular intervals.
 
* [[/Pageviews|Pageviews and Projectviews dumps]] [To be updated]
* [[/Pagecounts-ez|Compressed pageviews dumps]] [To be updated]
* [[/Mediacounts|mediacounts]]
* [[/Unique Devices|Uniques Devices]]
 
* [[/Browser_general|Browser General]]
== Deprecated or Obsolete Datasets ==
* [[/ApiAction|ApiAction]]
 
* [[Analytics/Data Lake/Traffic/mobile apps session metrics|mobile apps session metrics]]
The following datasets are no longer in use, but the pages are kept to document history:
* [[Analytics/Data Lake/Traffic/mobile apps uniques|mobile apps uniques]]
 
* [[/Pagecounts-raw|pagecounts-raw (deprecated)]]
* [[/Pagecounts-raw|pagecounts-raw]]
* [[/Pagecounts-all-sites|pagecounts-all-sites (deprecated)]]
* [[/Pagecounts-all-sites|pagecounts-all-sites]]
* [[Analytics/Data Lake/Traffic/Cirrus|Search requests]]
* [[Obsolete:Analytics/Data Lake/Traffic/Cirrus|Search requests]]
* [[Analytics/Data Lake/Traffic/CirrusQueryClicks|Search requests clicks]]
* [[Obsolete:Analytics/Data Lake/Traffic/CirrusQueryClicks|Search requests clicks]]
* [[/Interlanguage|Inter-language]] (Traffic between different languages on the same project family)
* [[Analytics/Data_Lake/Traffic/Virtualpageview_hourly|virtualpageview_hourly]] (page previews on desktop Wikipedia)


== Access ==
Some of the data above is public in other systems (see the [[Analytics]] main page), but the Data Lake is private by default. For access details, see [[Analytics/Data access]].
All data in the Data Lake is private by default. For access details, see [[Analytics/Data access]]. Some of the data above is public in other systems (see the [[Analytics]] main page).


== History ==
Some partial information about the evolution of publishing analytics data at WMF is recorded [https://meta.wikimedia.org/wiki/Research:Timeline_of_Wikimedia_analytics here in a timeline].

Revision as of 22:50, 21 September 2021

Traffic refers to pageviews to the pages of a wiki project. This page links to detailed information about traffic datasets in the Data Lake.

Most of the datasets below are updated at hourly granularity: a new hour of data arrives every hour, with a delay of 2 to 3 hours (the hour must finish, and then the data must be computed).
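The delay described above can be turned into a quick availability check. The following is a minimal sketch, not an official tool: the function name `latest_available_hour` is hypothetical, and it assumes the delay is counted from the start of the data hour and uses the conservative end of the 2 to 3 hour range.

```python
from datetime import datetime, timedelta, timezone


def latest_available_hour(now: datetime, delay_hours: int = 3) -> datetime:
    """Return the start of the most recent hour expected to be loaded.

    Assumption: delay_hours is measured from the start of the data hour.
    The default of 3 is the conservative end of the stated range.
    """
    # Truncate to the start of the current hour, then step back by the delay.
    current_hour = now.replace(minute=0, second=0, microsecond=0)
    return current_hour - timedelta(hours=delay_hours)


# Example: at 22:50 UTC, the hour starting at 19:00 UTC should be available.
now = datetime(2021, 9, 21, 22, 50, tzinfo=timezone.utc)
print(latest_available_hour(now))  # 2021-09-21 19:00:00+00:00
```

Passing `delay_hours=2` gives the optimistic estimate instead; a job that must not read incomplete data would likely prefer the default.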

Datasets

Hive tables

These datasets are available as Hive tables and can be queried using one of the available SQL engines, or accessed directly through HDFS.
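As a rough illustration of querying one of these tables, the sketch below only builds a SQL string for a single hour of `wmf.pageview_hourly`; it does not connect to any engine. The helper name `hourly_views_query`, the columns `project` and `view_count`, and the partition columns `year`, `month`, `day`, and `hour` are assumptions that are not stated on this page and should be checked against the table's own documentation.

```python
def hourly_views_query(project: str, year: int, month: int, day: int, hour: int) -> str:
    """Build a SQL string that sums views for one project and one hour.

    Column and partition names are assumptions (see the text above).
    """
    # Allow only letters, digits, dots, and hyphens in the project name,
    # since it is interpolated into the SQL text.
    if not project.replace(".", "").replace("-", "").isalnum():
        raise ValueError(f"unexpected project name: {project!r}")
    return (
        "SELECT SUM(view_count) AS views\n"
        "FROM wmf.pageview_hourly\n"
        f"WHERE project = '{project}'\n"
        f"  AND year = {int(year)} AND month = {int(month)}\n"
        f"  AND day = {int(day)} AND hour = {int(hour)}"
    )


print(hourly_views_query("en.wikipedia", 2021, 9, 21, 19))
```

Filtering on the time partitions keeps the query to a single hour of data, which matters on tables of this size.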

{| class="wikitable sortable"
!Dataset Name
!Description
|-
|webrequest hive table (see also a separate list of Hive tables derived from webrequest)
|The webrequest stream contains data on all the hits to Wikimedia's servers. This includes requests for page HTML, images, CSS, and JavaScript, as well as requests to the API.
|-
|pageview_actor hive table
|The wmf.pageview_actor table is a smaller version of the webrequest table, with fewer columns.
|-
|pageview_hourly hive table
|The wmf.pageview_hourly table contains 'pre-aggregated' webrequest data, filtered to keep only pageviews, and aggregated over a predefined set of dimensions.
|-
|projectview_hourly hive table
|The wmf.projectview_hourly table is 'pre-aggregated' webrequest data at the project level. It differs from the wmf.pageview_hourly dataset in that it involves fewer dimensions and is therefore smaller in data size (and faster to query).
|-
|unique devices
|This dataset gives the number of distinct devices that visit our projects.
|-
|browser general
|This dataset gives pageview statistics broken down by user-agent related dimensions like OS family, OS major, browser family, browser major.
|-
|api_request
|The mediawiki_api_request table provides the log of API requests to MediaWiki.
|-
|mobile apps session metrics
|Contains aggregate stats about pageview sessions on the Android and iOS Wikipedia mobile apps.
|-
|mobile apps uniques
|Counts how many distinct installs of the Android and iOS Wikipedia mobile apps accessed Wikimedia sites during a given day or month.
|-
|inter language
|Traffic between different languages on the same project family.
|-
|virtualpageview_hourly
|Provides data about page previews on desktop Wikipedia.
|}
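The difference between aggregating over a set of dimensions and aggregating at the project level can be shown on toy data. This is only a sketch: the records and the three dimensions used here (project, page title, country code) are hypothetical and are not the actual schema of either table.

```python
from collections import Counter

# Hypothetical pageview records: (project, page_title, country_code).
pageviews = [
    ("en.wikipedia", "Cat", "US"),
    ("en.wikipedia", "Cat", "US"),
    ("en.wikipedia", "Dog", "FR"),
    ("fr.wikipedia", "Chat", "FR"),
]


def aggregate(rows, dims):
    """Count rows grouped by the tuple positions listed in dims."""
    counts = Counter()
    for row in rows:
        key = tuple(row[i] for i in dims)
        counts[key] += 1
    return counts


# Grouping by all three dimensions keeps more distinct rows ...
page_level = aggregate(pageviews, dims=(0, 1, 2))
# ... than grouping by project alone.
project_level = aggregate(pageviews, dims=(0,))

print(len(page_level), len(project_level))  # 3 2
print(project_level[("en.wikipedia",)])     # 3
```

Fewer grouping dimensions means fewer output rows for the same traffic, which is why a project-level table is smaller and faster to query.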

Dumps

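These datasets are made available as files, updated at regular intervals.

* Pageviews and Projectviews dumps [To be updated]
* Compressed pageviews dumps [To be updated]
* mediacounts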

Deprecated or Obsolete Datasets

The following datasets are no longer in use, but the pages are kept to document history:

* pagecounts-raw
* pagecounts-all-sites
* Search requests
* Search requests clicks

Access

All data in the Data Lake is private by default. For access details, see Analytics/Data access. Some of the data above is public in other systems (see the Analytics main page).

History

Some partial information about the evolution of publishing analytics data at WMF is recorded here in a timeline.