User:Milimetric/Notebook/Pageview Hourly

Overview

Dataset Fact Sheet
Update Frequency: Hourly, with a lag of about 2 hours
Trusted Dataset: Yes

Description

A page view is a request for the content of a web page. Page views on Wikimedia projects are our most important content consumption metric.

The Wikimedia Foundation has defined what a Pageview means for the projects we host. The data is extracted from Webrequest and has been retained since May 2015.

Once pageview_hourly data is available, it is used to generate all other pageview-related datasets. The data is sanitized and pushed out to the public via [https://dumps.wikimedia.org/other/analytics/ dumps] and [https://wikimedia.org/api/rest_v1/#/Pageviews%20data the pageview API]. It's also processed internally and loaded for querying in various interfaces.

The main source of complexity here is how this data is transformed in order to better protect user privacy. The rest of this description focuses on that transformation. Additional context for how we handle privacy at the Wikimedia Foundation can be found in documents such as [[meta:Data_retention_guidelines]].

Dimensions and Metrics

One way to think of this data is as a collection of buckets. The size of each bucket is measured by view_count, the only metric here, and each bucket is defined by the values of all the dimensions we track. For example:

"In a specific hour, on ro.wikipedia, article X in namespace 0 was viewed N times by probably human users from Spain, through the desktop website, using an iPad"

That would be one row in this dataset.
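
As a rough sketch, the row for that bucket could be pulled out of Hive with a query along the lines of the one below. The table name wmf.pageview_hourly and field names such as namespace_id, access_method, and the user_agent_map key 'device_family' are assumptions to verify against the Schema tab.

  -- Sketch only: fetch the bucket described above for one hour of data.
  -- Table and field names here are assumptions; check the Schema tab.
  SELECT
    project,
    page_title,
    namespace_id,
    agent_type,
    country,
    access_method,
    user_agent_map['device_family'] AS device_family,
    view_count
  FROM wmf.pageview_hourly
  WHERE year = 2022 AND month = 8 AND day = 15 AND hour = 11
    AND project = 'ro.wikipedia'
    AND namespace_id = 0
    AND agent_type = 'user'
    AND country = 'Spain'
    AND access_method = 'desktop'
    AND user_agent_map['device_family'] = 'iPad';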

Data Transformation Process

As raw webrequest data is transformed into pageview_hourly records, the following types of transformations reduce entropy for the purpose of long-term privacy-protecting storage:

  • Extracting information
  • Annotating data
  • Aggregating

For example, the user_agent_map field is a mapping of properties like "device family" and "browser family" that can be extracted with reasonable certainty from the User Agent string found in a webrequest record.
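
As an illustration, a query along these lines would break an hour of views down by those extracted properties; the map keys 'device_family' and 'browser_family' are assumed from the description above and should be checked against the Schema tab.

  -- Sketch: group one hour of views by properties extracted into user_agent_map.
  -- Map key names are assumptions; see the Schema tab for the exact keys.
  SELECT
    user_agent_map['device_family']  AS device_family,
    user_agent_map['browser_family'] AS browser_family,
    SUM(view_count)                  AS views
  FROM wmf.pageview_hourly
  WHERE year = 2022 AND month = 8 AND day = 15 AND hour = 11
  GROUP BY
    user_agent_map['device_family'],
    user_agent_map['browser_family'];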

An example of annotating data is the "automata" agent_type. We use [https://github.com/wikimedia/analytics-refinery/blob/master/oozie/learning/features/actor/hourly/calculate_features_actor_hourly.hql heuristics] to determine when a specific user agent is acting less like a human and more like an automaton. Well-behaved automatic agents will generally be detected as "spider" by well-established regular expressions. Less well-behaved agents need different [[Analytics/Data_Lake/Traffic/BotDetection|approaches]] (also [https://docs.google.com/document/d/1q14GH7LklhMvDh0jwGaFD4eXvtQ5tLDmw3UeFTmb3KM/edit documented here]).
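
To illustrate only the "spider" idea, and not the actual pageview definition, a toy classification could look like the sketch below. The wmf.webrequest table and user_agent field are assumptions here, and the real regular expressions and automata heuristics linked above are considerably more thorough.

  -- Toy illustration of regex-based "spider" tagging; NOT the real definition.
  -- Table and field names are assumptions.
  SELECT
    agent_type_guess,
    COUNT(*) AS requests
  FROM (
    SELECT
      CASE WHEN LOWER(user_agent) RLIKE 'bot|crawler|spider'
           THEN 'spider'
           ELSE 'user'
      END AS agent_type_guess
    FROM wmf.webrequest
    WHERE year = 2022 AND month = 8 AND day = 15 AND hour = 11
  ) requests_by_type
  GROUP BY agent_type_guess;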

Once we simplify dimensions and reduce entropy, we can aggregate view_counts. This removes detail like which specific IP accessed which article, and keeps buckets like "X users from Spain accessed a specific article".
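
A minimal sketch of that aggregation step, using a hypothetical per-request table rather than the real refinery job, is:

  -- Sketch of the aggregation idea: collapse per-request rows, which still
  -- carry a client IP, into buckets defined only by the retained dimensions.
  -- per_request_pageviews is a hypothetical intermediate table.
  SELECT
    project,
    page_title,
    country,
    agent_type,
    COUNT(*) AS view_count   -- individual IPs do not survive this step
  FROM per_request_pageviews
  WHERE year = 2022 AND month = 8 AND day = 15 AND hour = 11
  GROUP BY project, page_title, country, agent_type;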

Each field has a detailed explanation of any transformations that apply to it; see the Schema tab.

Examples

See Also