
Analytics/Data Lake/Traffic/Webrequest

[[File:Pageview_@_Wikimedia_(WMF_Analytics_lightning_talk,_June_2015).pdf|thumb|350px|page=6|<code>wmf.webrequest</code> and <code>wmf_raw.webrequest</code> within the Wikimedia Foundation's pageview data pipeline (Hive/Hadoop is the rounded box at the bottom)]]
The '''webrequest''' [[:en:Data_stream|stream]] contains data on all the [[:en:Hit_(Internet)|hits]] to Wikimedia's servers. This includes requests for page HTML, images, CSS, and Javascript, as well as requests to the API.<ref>It doesn't include some internal (within-cluster) API requests which are not routed through the Varnish servers.</ref> For privacy reasons, this data is purged after 90 days.


There are two webrequest tables available on [[Analytics/Cluster/Hive|Hive]]. The <code>wmf_raw.webrequest</code> table contains JSON data directly imported from Kafka. <code>wmf.webrequest</code> contains the 'refined' webrequest data. The raw table is purged on a shorter schedule than 90 days.


For many purposes, it is preferable to query the much smaller [[Analytics/Data/Pageview hourly|pageview_hourly]] and [[Analytics/Data/Projectview hourly|projectview_hourly]] tables. These are derived from webrequest by filtering to keep only [[meta:Research:Page view|pageviews]] and aggregating over a predefined set of dimensions.
 
== Pipeline ==
The stream originates from the [[Varnish|Varnish caching servers]], which emit log data in their [[cache log format]]. A [[Analytics/Systems/Varnishkafka|varnishkafka]] instance ([https://github.com/wikimedia/puppet/blob/2053c98638efc97ef35dd8a7d17ad5751088a432/modules/profile/manifests/cache/kafka/webrequest.pp#L131 Puppet code]) pipes the Varnish log data into [[Kafka]], in the <code>webrequest_text</code> and <code>webrequest_upload</code> topics.
 
[[Analytics/Systems/Cluster/Gobblin|Gobblin]] then loads the data from Kafka into [[Analytics/Systems/Cluster|HDFS]], where it is stored as [[w:Snappy (compression)|Snappy]]-compressed [http://wiki.apache.org/hadoop/SequenceFile SequenceFiles] in JSON format. A [[Analytics/Systems/Cluster/Hive|Hive]] table with hourly partitions is mapped on top of these data files. This is <code>wmf_raw.webrequest</code>.
 
An [[Analytics/Systems/Cluster/Oozie|Oozie]] job ([https://github.com/wikimedia/analytics-refinery/tree/master/oozie/webrequest/load code]) then checks for missing or duplicated data and creates a "refined" copy which has additional fields and is stored as a Parquet file. Parquet is a [[w:Column-oriented DBMS|column-based data storage]] format, which is much more efficient to query than the "raw" JSON data (see [[Analytics/Cluster/Data_Format_Experiments#Results|Data Format Experiments#Results]] for more info). This refined copy is <code>wmf.webrequest</code>.  
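For illustration, here is a toy Python sketch of what "refinement" means: deriving an extra field (<code>x_analytics_map</code>, described in the schema below) from a raw JSON record. The real job is the Hive query linked above invoking Java UDFs; this sketch is deliberately simplified and is not the production logic.

```python
import json

def refine(raw_json):
    """Toy version of the refine step: parse a raw JSON webrequest record
    and derive x_analytics_map, one of the extra fields in wmf.webrequest.
    Illustration only -- not the production UDF logic."""
    rec = dict(json.loads(raw_json))
    # Split "k1=v1;k2=v2" into a dict, skipping malformed pieces.
    rec["x_analytics_map"] = dict(
        kv.split("=", 1) for kv in rec.get("x_analytics", "").split(";") if "=" in kv
    )
    return rec

rec = refine(json.dumps({
    "uri_host": "en.wikipedia.org",
    "x_analytics": "ns=-1;special=Userlogin;https=1",
}))
# rec["x_analytics_map"] == {"ns": "-1", "special": "Userlogin", "https": "1"}
```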


== wmf.webrequest ==
The <code>wmf.webrequest</code> table is a "refined" version of the raw data: it contains the same rows (requests) as <code>wmf_raw.webrequest</code>, but with extra fields added during refinement, and it is stored in a format that is much faster to query. If you want to query the webrequest stream, you almost always want this table. (The field descriptions below are deliberately brief; for precise definitions you can [https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/fc5699f16d1f1be757e9c60468351b6b044a70d1/oozie/webrequest/load/refine_webrequest.hql#70 look at the refine step], which is a simple Hive query invoking UDFs, and follow that through by [https://codesearch.wmcloud.org/analytics/ searching refinery-source].)


=== Current Schema ===
| dt || string || Timestamp at cache in [[W:en:ISO 8601|ISO 8601]] format
|-
| time_firstbyte || double || Time to first byte in seconds (a sample value is 0.000243): the time from when the request was first received by Varnish until it started sending response bytes back to the client.
|-
| ip || string || IP of packet at cache
| content_type || string || Content-Type header of response
|-
| referer || string || [[:en:HTTP_referer|Referer header]] of the request. "unknown" indicates that the URL in the referer was not parseable or was defective; "-" indicates no referer was received.
|-
| x_forwarded_for || string || X-Forwarded-For header of request
| accept_language || string || Accept-Language header of request
|-
| x_analytics || string || [[X-Analytics]] header of response, looks like: <code>ns=-1;special=Userlogin;WMF-Last-Access=09-May-2017;WMF-Last-Access-Global=09-May-2017;https=1</code>
|-
| range || string || Range header of response
| user_agent_map || map<string,string> || User-agent map with browser_family, browser_major, device_family, os_family, os_major, os_minor and wmf_app_version keys and associated values
|-
| x_analytics_map || map<string,string> || The [[X-Analytics]] header broken up into key-value pairs
|-
| ts || timestamp || [[W:en:Unix time|Unix timestamp]] in milliseconds extracted from dt
|-
| access_method || string || Client type used to access the site (<code>mobile app</code>, <code>mobile web</code>, or <code>desktop</code>). Mobile app requests are identified by the user agent including <code>WikipediaApp</code> or <code>Wikipedia/5.0</code>. Mobile web requests are identified by the hostname containing a subdomain of <code>m</code>, <code>zero</code>, <code>wap</code>, or <code>mobile</code>. Any other request is classified as <code>desktop</code>. This is implemented in the [https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Webrequest.java#L277 <code>getAccessMethod</code> function in refinery-source].
|-
| agent_type || string || Categorises the agent making the webrequest as either <code>user</code> or <code>spider</code> (automata to be added).
| is_zero || boolean || Indicates if the webrequest is accessed through a zero provider
|-
| referer_class || string || The type of site found in the [[w:en:HTTP referer|referer header]]: <code>internal</code>, <code>external (search engine)</code>, <code>external</code>, <code>none</code>, or <code>unknown</code>.
|-
| normalized_host || struct<
| struct containing project_class/project_family (such as wikipedia or wikidata; project_class is to be deprecated, use project_family instead), project (such as en or commons), qualifiers (a list of in-between values, such as m and/or zero) and tld (most often org)
|-
| pageview_info || map<string,string> || Map containing <code>project</code> (e.g. <code>en.wikipedia</code>), <code>language_variant</code>, and <code>page_title</code> (e.g. <code>Karan_Arjun</code> or <code>Special:Search</code>).  Populated only for pageviews (<code>is_pageview = TRUE</code>). <code>page_title</code> is the page_title requested, not the page_title shown to the user after redirects, which is different than the logic for <code>page_id</code>.
|-
| page_id || int || MediaWiki page_id for the page served by this webrequest. This is not set for pages requested through the MediaWiki API.  For MediaWiki redirects, this is set to the redirected-to page_id, i.e. the page_id of the content shown to the user, not the page_id of the originally requested title.
|List containing tags qualifying the request, e.g. [portal, wikidata]. Will be used to split webrequest into smaller subsets.
|-
| isp_data || map<string, string> || Internet Service Provider data in a map with keys <code>isp</code>, <code>organization</code>, <code>autonomous_system_organization</code> and <code>autonomous_system_number</code>. Ex: <code>{"organization":"blah-some","autonomous_system_organization":"some","isp":"Some","autonomous_system_number":"some number"}</code>
|-
|accept
|Accept header of request
|-
|tls
|string
|TLS information of request
|-
|tls_map
|string
|Map view of TLS information (keys are vers, keyx, auth and ciph)
|-
|ch_ua
|string
|Value of the Sec-CH-UA request header
|-
|ch_ua_mobile
|string
|Value of the Sec-CH-UA-Mobile request header
|-
|ch_ua_platform
|string
|Value of the Sec-CH-UA-Platform request header
|-
| webrequest_source || string || Which [[Varnish]] cluster handled the request. Current values are <code>text</code> and <code>upload</code> (partition field)
|-
| year || int || Unpadded year of request (partition field)
|}
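The <code>access_method</code> rules described in the table above can be sketched in Python (an illustration of the classification logic, not the actual <code>getAccessMethod</code> implementation in refinery-source, which handles more edge cases):

```python
def access_method(user_agent, uri_host):
    """Sketch of the access_method classification: app user agents win,
    then mobile-web subdomains, else desktop. Simplified relative to the
    real getAccessMethod in refinery-source."""
    if "WikipediaApp" in user_agent or "Wikipedia/5.0" in user_agent:
        return "mobile app"
    # Mobile web: hostname contains an m/zero/wap/mobile subdomain.
    subdomains = uri_host.lower().split(".")[:-2]  # drop domain + tld
    if any(q in subdomains for q in ("m", "zero", "wap", "mobile")):
        return "mobile web"
    return "desktop"

access_method("Mozilla/5.0 (Android)", "en.m.wikipedia.org")   # -> "mobile web"
access_method("WikipediaApp/2.7", "en.wikipedia.org")          # -> "mobile app"
access_method("Mozilla/5.0 (X11)", "en.wikipedia.org")         # -> "desktop"
```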


=== Usage notes ===
Note the <code>webrequest_source</code>, <code>year</code>, <code>month</code>, <code>day</code>, and <code>hour</code> fields. These are Hive [https://web.archive.org/web/20130707130128/https://www.brentozar.com/archive/2013/03/introduction-to-hive-partitioning/ partitions], and are explicit mappings to hourly imports in HDFS. You must include at least one partition predicate in the <code>where</code> clause of your queries (even if it is just <code>year > 0</code>). Partitions allow you to reduce the amount of data that Hive must parse and process before it returns you results. For example, if you are only interested in data from a particular day, you could add <code>where year = 2014 and month = 1 and day = 12</code>. This instructs Hive to only process data for partitions that match that partition predicate. You may use partition fields as you would any normal field, even though the field values are not actually stored in the data files.
 
The [[metawiki:Research:Page_view/Tags#Spider|spider/user classification]] recorded in the <code>agent_type</code> field is based on the user agent, evaluated against some regexes (the standard spider detection provided by ua-parser augmented by a custom bot regex, see [https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Webrequest.java code] for detail).


Queries over the full table will sometimes fail with a <code>java.io.FileNotFoundException: File does not exist</code> error during execution, in case the purging of the oldest data (after 90 days) happens before the query accesses those partitions. This can be avoided by manually excluding the oldest available dates in the WHERE clause.
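One way to build such a lower bound (a Python sketch; the 90-day retention figure comes from this page, while the two-day safety margin and the helper name are illustrative choices, not official recommendations):

```python
from datetime import date, timedelta

def oldest_safe_day(today, retention_days=90, margin_days=2):
    """Oldest partition day that should still exist by the time a long
    query reaches it. margin_days is an arbitrary safety buffer against
    the purge racing the query."""
    return today - timedelta(days=retention_days - margin_days)

d = oldest_safe_day(date(2022, 5, 10))
# d == date(2022, 2, 11); e.g. bound the WHERE clause with:
# (year > 2022) OR (year = 2022 AND month > 2)
#   OR (year = 2022 AND month = 2 AND day >= 11)
```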


See also [[Cache log format]] and [https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Webrequest.java refinery code] for more detail about the provenance and format of some of the fields.

=== Sample queries ===
====Distinct IPs====
<syntaxhighlight lang="SQL">
SELECT
  COUNT(DISTINCT ip) AS hits
FROM
  webrequest
WHERE
  year = 2014
  AND month = 5
  AND day = 5
  AND hour = 13
  AND uri_host = "en.wikipedia.org"
  AND http_status = 200;
</syntaxhighlight>
====Top raw referrals for an article ====
<syntaxhighlight lang="SQL">
SELECT
  referer,
  COUNT(DISTINCT ip) AS hits
FROM
  webrequest
WHERE
  year = 2014
  AND month = 5
  AND day = 5
  AND hour = 13
  AND uri_path = "/wiki/London"
  AND uri_host = "en.wikipedia.org"
  AND http_status = 200
GROUP BY referer
ORDER BY hits DESC
LIMIT 50;
</syntaxhighlight>
====Top internal referrals for an article====
<syntaxhighlight lang="SQL">
SELECT
  SUBSTR(referer,30) AS source,
  COUNT(DISTINCT ip) AS hits
FROM
  webrequest
WHERE
  year = 2014
  AND month = 5
  AND day = 5
  AND hour = 13
  AND uri_path = "/wiki/London"
  AND uri_host = "en.wikipedia.org"
  AND referer LIKE "http://en.wikipedia.org/wiki/%"
  AND http_status = 200
GROUP BY
  SUBSTR(referer,30)
ORDER BY hits DESC
LIMIT 50;
</syntaxhighlight>
====Top outbound clicks from an article====
<syntaxhighlight lang="SQL">
SELECT
  SUBSTR(uri_path,7) AS target,
  COUNT(DISTINCT ip) AS hits
FROM
  webrequest
WHERE
  year = 2014
  AND month = 5
  AND day = 5
  AND hour = 13
  AND uri_host = "en.wikipedia.org"
  AND referer LIKE "http://en.wikipedia.org/wiki/London%"
  AND http_status = 200
GROUP BY SUBSTR(uri_path,7)
ORDER BY hits DESC
LIMIT 50;
</syntaxhighlight>


=== Changes and known problems since 2015-03-04 ===
|{{Phabricator|188776}}
|
|Update pageview definition to include foundation.wikimedia.org host
|-
|2018-09-09
| onwards
| [[phab: T211077| task T211077]]
|
| Referrer class change for Chrome Mobile  
|-
|2018-10-09
| Issues with data coming from upload that might result in 'fake' traffic for some webp files
|-
|2019-01-07
|
|[[phab:T212862|Task T212862]], [[phab:T153821|Task T153821]]
|
| Updated IPs for Labs. Wikitech added to PageviewDefinition.
|-
|2019-04-01
|
|[[phab:T219842|Task: T219842]]
|
| Kafka jumbo failover produces some data loss for hour 22 UTC.
|-
|2019-06-04
|
|{{Phabricator|T224451}}
|
| From 2019-04-25 until 2019-06-04, pages with "+" in the title were omitted from pageviews.
|-
|2019-06-05
|
|{{Phabricator|T224187}}
|
| From now on, user agent strings longer than 400 chars are not parsed.
|-
|2019-06-28
|
|{{Phabricator|T225792}}
|
|doc.wikimedia.org host is not flagged as pageview anymore
|-
|2019-09-18
|
|{{Phabricator|T212854}}
|0.0.22
|Update user-agent parser to up-to-date version. Main differences are <code>os_family</code> bug fixes (<code>Windows 98</code> is now <code>Windows</code> for instance), <code>os_major</code> bug fixes (<code>8.1</code> is now <code>8</code>), <code>Google Search App</code> on iOS is not falsely counted at <code>Safari</code> anymore, and various <code>device_family</code> improvement.
|-
|2019-11-07
|
|{{Phabricator|T233661}}
|
|Added tls and tls_map fields
|-
|2020-01-13
|
|{{Phabricator|T239625}}
|
|Added several common search engines to the list of hard-coded search engine referers. Added a predictive referer classifier to classify non-hard-coded search engines. Introduced support for Android app referers.
|-
|2020-07-01
|
|{{Phabricator|T256514}}
|0.0.23
|Update pageview definition for new android mobile content consumption.
|-
|2020-10-20
|
|{{Phabricator|T236740}}
|0.0.24
|Remove latitude, longitude, and postal code from <code>geocode_data</code> map. These fields were inaccurate since they are based on IP address only and were unused.
|-
|2020-11-16
|
|{{Phabricator|T267008}}
|
|Grow the number of buckets by which the table is bucketed from 64 to 256. Also apply a change supposedly applied in February 2017: make <code>page_id</code> a <code>Bigint</code> instead of an <code>Integer</code>.
|-
|2022-01-26
|
|{{Phabricator|T299397}}
|
|Add low entropy Sec-CH-UA request headers: <code>ch_ua</code>, <code>ch_ua_mobile</code>, <code>ch_ua_platform</code>.
|-
|2022-01-26
|
|{{Phabricator|T300164}}
|
|Data loss has been incurred from June 4th 2021 to January 26th 2022 due to a varnishkafka install problem on some of the caching hosts.
Summary of the data loss analysis:
* Between 2021-06-04 and 2021-11-03 we have lost on average 2.80% of webrequest-text, statv and eventlogging data (pageview impact, eqiad datacenter only)
* Between 2021-11-04 and 2022-01-27 we have lost on average 4.34% of webrequest-text, statv and eventlogging data (pageview impact, eqiad + ulsfo datacenters only)
* Between 2021-10-13 and 2021-11-03 we have lost on average 1.01% of webrequest-upload data (mediarequest impact, ulsfo datacenter only)
* Between 2021-11-04 and 2022-01-27 we have lost on average 2.19% of webrequest-upload data (mediarequest impact, ulsfo datacenter only)
|}
 
== wmf_raw.webrequest (Hive) ==




In almost all cases, you '''should not query''' this "unrefined" data, because the "refined" data in <code>wmf.webrequest</code> is richer and much faster to query.

If you do need to query this raw data in Hive, you will first need to add the JSON "serde" (serializer-deserializer) to your path: <code>ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar</code>

You can do this in [[Analytics/Systems/Cluster/Spark|Spark]] by invoking the Spark command line tool as follows: <code>/usr/bin/spark2-shell --jars /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar</code>.
== Data statistics ==


In addition to the core data, two tables are created with detailed statistics about the hourly partition loaded.
* <code>wmf_raw.webrequest_sequence_stats</code>: contains details for each varnish-kafka source of duplicates / missing data.
* <code>wmf_raw.webrequest_sequence_stats_hourly</code>: contains hourly aggregated percent_missing and percent_loss for each webrequest source.
The statistics in those two tables are computed using a 'sequence' number: a counter that each varnishkafka process increments for each line it outputs. This sequence number makes it possible to detect duplicates and missing data with simple arithmetic (max, min, distinct). For details, see the code for [https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics.hql webrequest_sequence_stats] and [https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics_hourly.hql webrequest_sequence_stats_hourly].
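The idea behind this arithmetic can be sketched in Python (an illustration of the max/min/distinct logic, not the actual HQL, which also handles per-host sequence resets that this toy version ignores):

```python
def sequence_stats(seqs):
    """Estimate duplicate and missing rates from varnishkafka sequence
    numbers for one host/hour, using max/min/distinct as in the HQL jobs.
    Toy version: assumes the counter does not reset within the hour."""
    expected = max(seqs) - min(seqs) + 1   # rows the counter says were produced
    distinct = len(set(seqs))
    duplicates = len(seqs) - distinct      # same sequence number seen twice
    missing = expected - distinct          # gaps in the sequence
    return {
        "percent_duplicate": 100.0 * duplicates / expected,
        "percent_lost": 100.0 * missing / expected,
    }

stats = sequence_stats([1, 2, 2, 4, 5])   # seq 3 missing, seq 2 duplicated
# stats == {"percent_duplicate": 20.0, "percent_lost": 20.0}
```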
[[Category:Data stream]]


=== Querying the statistics ===


A typical query of these statistics is:<syntaxhighlight lang="sql">
SELECT
    year,
    month,
    day,
    hour,
    webrequest_source,
    percent_lost,
    percent_duplicate
FROM wmf_raw.webrequest_sequence_stats_hourly
WHERE
    -- Update the restrictions to your needs.
    year = XXXX
    AND month = XX
    AND day = XX
    AND hour = XX
    AND webrequest_source IN (XXX)
ORDER BY
    year, month, day, hour, webrequest_source
LIMIT 1000;
</syntaxhighlight>For an example of querying to diagnose a data loss issue, see [[phab:T211000|T211000]].
== wmf_raw.webrequest (Kafka) ==
For quick sampling or searching through recent requests, the data can also be queried directly from Kafka. These Kafka topics are used to populate the <code>wmf_raw.webrequest</code> tables in Hive.
 
Current:
* webrequest_text
* webrequest_upload
* webrequest_misc
* webrequest_canary
 
Usage example:
 
<pre>
user@stat1004$ kafkacat -C -b kafka-jumbo1005.eqiad.wmnet -c10 -t webrequest_text
</pre>
 
See also:
* [[Analytics/Data access#Stats machines|Stats machines]] (use a shell server with Hadoop access)
* [[Kafka#jumbo (eqiad)|Kafka#jumbo]] (use a kafka server from the <code>jumbo</code> group)


== API requests ==
If you want to look at API use, it makes more sense to use the [[Analytics/Data Lake/Traffic/mediawiki api request|Mediawiki API request data stream]], which is logged directly from MediaWiki when it responds to API requests. The benefits of that data include:
* Includes internal API requests not routed through the Varnish servers
* Data on API requests is not mixed in with data on regular web requests.
* Contains detailed data on the content of POST API requests (they are logged in webrequests, but without the request bodies).


== RAW IP usage for IP Data  ==
[[Analytics/Data_Lake/Traffic/Webrequest/RawIPUsage]]


===Distinct IPs===
== Derived streams ==
<source lang="SQL" enclose="div">
The following data streams (available as [[Analytics/Data Lake/Traffic|Data Lake]] tables) are derived from webrequest:
SELECT
* [[Analytics/Data/Pageview hourly|Pageview hourly]]
  COUNT(DISTINCT ip) AS hits
* [[Analytics/Data Lake/Traffic/Pageview actor|Pageview actor]]
FROM
* [[Analytics/Data/Projectview hourly|Projectview hourly]]
  webrequest
* [[Analytics/Data Lake/Traffic/Mediacounts|Mediacounts]]
WHERE
* [[Analytics/Data Lake/Traffic/Unique Devices|Uniques devices]]
  year = 2014
*[[Analytics/Data Lake/Traffic/Browser general|Browser general]]
  AND month = 5
*[[Analytics/Data Lake/Traffic/ApiAction|Mediawiki API request]]
  AND day = 5
*[[Analytics/Data Lake/Traffic/mobile apps session metrics|Mobile apps session metrics]]
  AND hour = 13
*[[Analytics/Data Lake/Traffic/mobile apps uniques|Mobile apps uniques]]
  AND uri_host = "en.wikipedia.org"
  AND http_status = 200;
</source>
===Top raw referrals for an article ===
<source lang="SQL" enclose="div">
SELECT
  referer,
  COUNT(DISTINCT ip) AS hits
FROM
  webrequest
WHERE
  year = 2014
  AND month = 5
  AND day = 5
  AND hour = 13
  AND uri_path = "/wiki/London"
  AND uri_host = "en.wikipedia.org"
  AND http_status = 200
GROUP BY referer
ORDER BY hits DESC
LIMIT 50;
</source>
===Top internal referrals for an article===
<source lang="SQL" enclose="div">
SELECT
  SUBSTR(referer,30) AS source,
  COUNT(DISTINCT ip) AS hits
FROM
  webrequest
WHERE
  year = 2014
  AND month = 5
  AND day = 5
  AND hour = 13
  AND uri_path = "/wiki/London"
  AND uri_host = "en.wikipedia.org"
  AND referer LIKE "http://en.wikipedia.org/wiki/%"
  AND http_status = 200
GROUP BY
  SUBSTR(referer,30)
ORDER BY hits DESC
LIMIT 50;
</source>
===Top outbound clicks from an article===
<source lang="SQL" enclose="div">
SELECT
  SUBSTR(uri_path,7) AS target,
  COUNT(DISTINCT ip) AS hits
FROM
  webrequest
WHERE
  year = 2014
  AND month = 5
  AND day = 5
  AND hour = 13
  AND uri_host = "en.wikipedia.org"
  AND referer LIKE "http://en.wikipedia.org/wiki/London%"
AND
  http_status = 200
GROUP BY SUBSTR(uri_path,7)
ORDER BY hits
DESC LIMIT 50;
</source>
== Hive tables derived from webrequest ==
The following [[Analytics/Data Lake/Traffic|Data Lake]] tables (accessible via Hive) are derived from the Webrequest table:
* [[Analytics/Data/Pageview hourly|pageview_hourly]]
* [[Analytics/Data/Projectview hourly|projectview_hourly]]
* [[Analytics/Data Lake/Traffic/Mediacounts|mediacounts]]
* [[Analytics/Data Lake/Traffic/Unique Devices|Unique Devices]]
* [[Analytics/Data Lake/Traffic/Browser general|Browser General]]
* [[Analytics/Data Lake/Traffic/ApiAction|ApiAction]]
* [[Analytics/Data Lake/Traffic/mobile apps session metrics|Mobile apps session metrics]]
* [[Analytics/Data Lake/Traffic/mobile apps uniques|Mobile apps uniques]]
* [[Analytics/Data Lake/Traffic/Interlanguage|Interlanguage navigation]]
See also [[Analytics/Data Lake/Traffic#Datasets]].
== See also ==
*...
* Research into referrer counts: {{Phabricator|T195880}}

== Notes ==
<references />

Latest revision as of 06:51, 10 May 2022


For many purposes, it is preferable to query the much smaller pageview_hourly and projectview_hourly tables. These are derived from webrequest by filtering to keep only pageviews and aggregating over a predefined set of dimensions.
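For example, one day of pageviews to a single article, which would otherwise require a full scan of webrequest, can be read from the pre-aggregated table. A sketch (see the pageview_hourly documentation for the authoritative schema; the date and title here are illustrative):

<source lang="SQL">
-- Pageviews by real users for one article on one day,
-- read from the pre-aggregated table rather than raw webrequest.
SELECT SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE year = 2022
  AND month = 1
  AND day = 1
  AND project = 'en.wikipedia'
  AND page_title = 'London'
  AND agent_type = 'user';
</source>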

== Pipeline ==

The stream originates from the Varnish caching servers, which emit log data in their cache log format. A varnishkafka instance (Puppet code) pipes the Varnish log data into Kafka, in the webrequest_text and webrequest_upload topics.

Gobblin then loads the data from Kafka into HDFS, where it is stored as Snappy-compressed SequenceFiles containing JSON. A Hive table with hourly partitions is mapped on top of these data files; this is wmf_raw.webrequest.

An Oozie job (code) then checks for missing or duplicated data and creates a "refined" copy which has additional fields and is stored as a Parquet file. Parquet is a column-based data storage format, which is much more efficient to query than the "raw" JSON data (see Data Format Experiments#Results for more info). This refined copy is wmf.webrequest.

== wmf.webrequest ==

If you want to query the webrequest stream, you almost always want the "refined" version available in wmf.webrequest, which has been enriched with extra fields and is stored in a format that is much faster to query. (Note: the field descriptions below are deliberately brief; for precise definitions, look at the refine step, which is a simple Hive query invoking UDFs, and follow it through by searching refinery-source.)

=== Current Schema ===

col_name data_type comment
hostname string Source node hostname
sequence bigint Per-host sequence number
dt string Timestamp at cache in ISO 8601 format
time_firstbyte double Time to first byte in seconds (a sample value is .000243): the time from when the request was first received by Varnish until it started sending response bytes back to the client.
ip string IP of packet at cache
cache_status string Cache status
http_status string HTTP status of response
response_size bigint Response size
http_method string HTTP method of request
uri_host string Host of request
uri_path string Path of request
uri_query string Query of request
content_type string Content-Type header of response
referer string Referer header of the request. An "Unknown" value indicates that the URL in the referer was not parseable or was defective; "-" indicates that no referer was received.
x_forwarded_for string X-Forwarded-For header of request
user_agent string User-Agent header of request
accept_language string Accept-Language header of request
x_analytics string X-Analytics header of response, looks like: ns=-1;special=Userlogin;WMF-Last-Access=09-May-2017;WMF-Last-Access-Global=09-May-2017;https=1
range string Range header of response
is_pageview boolean Indicates if this record was marked as a pageview during refinement
record_version string Keeps track of changes in the table content definition
client_ip string Best guess client IP address from the X-Client-IP header, set by varnish using X-Forwarded-For and other information.
geocoded_data map<string,string> Geocoded data computed during refinement from client_ip using the MaxMind database. Contains continent, country_code, country, subdivision and city, with nulls where information is not available.
x_cache string X-Cache header of response
user_agent_map map<string,string> User-agent map with browser_family, browser_major, device_family, os_family, os_major, os_minor and wmf_app_version keys and associated values
x_analytics_map map<string,string> The X-Analytics header broken up into key-value pairs
ts timestamp Unix timestamp in milliseconds extracted from dt
access_method string Client type used to access the site (mobile app, mobile web, or desktop). Mobile app requests are identified by the user agent including WikipediaApp or Wikipedia/5.0. Mobile web requests are identified by the hostname containing a subdomain of m, zero, wap, or mobile. Any other request is classified as desktop. This is implemented in the getAccessMethod function in refinery-source.
agent_type string Categorizes the agent making the webrequest as either user or spider (further automated-agent categories may be added).
is_zero boolean Indicates if the webrequest is accessed through a zero provider
referer_class string The type of site found in the referer header: internal, external (search engine), external, none, or unknown.
normalized_host struct<project_class:string, project_family:string, project:string, qualifiers:array<string>, tld:string> Struct containing project_class/project_family (such as wikipedia or wikidata; project_class is to be deprecated, use project_family instead), project (such as en or commons), qualifiers (a list of in-between values, such as m and/or zero) and tld (most often org)
pageview_info map<string,string> Map containing project (e.g. en.wikipedia), language_variant, and page_title (e.g. Karan_Arjun or Special:Search). Populated only for pageviews (is_pageview = TRUE). page_title is the page_title requested, not the page_title shown to the user after redirects, which is different than the logic for page_id.
page_id int MediaWiki page_id for the page served by this webrequest. This is not set for pages requested through the MediaWiki API. For MediaWiki redirects, this is set to the redirected-to page_id, i.e. the page_id of the content shown to the user, not the page_id of the originally requested title.
namespace_id int MediaWiki namespace_id for the page served by this webrequest. This is set with the same logic as page_id.
tags array<string> List containing tags qualifying the request, ex: [portal, wikidata]. Will be used to split webrequest into smaller subsets.
isp_data map<string, string> Internet Service Provider data in a map with keys isp, organization, autonomous_system_organization and autonomous_system_number. Ex: {"organization":"blah-some","autonomous_system_organization":"some","isp":"Some","autonomous_system_number":"some number"}
accept string Accept header of request
tls string TLS information of request
tls_map string Map view of TLS information (keys are vers, keyx, auth and ciph)
ch_ua string Value of the Sec-CH-UA request header
ch_ua_mobile string Value of the Sec-CH-UA-Mobile request header
ch_ua_platform string Value of the Sec-CH-UA-Platform request header
webrequest_source string Which Varnish cluster handled the request. Current values are text and upload (partition field)
year int Unpadded year of request (partition field)
month int Unpadded month of request (partition field)
day int Unpadded day of request (partition field)
hour int Unpadded hour of request (partition field)
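As a sketch of how the map and struct columns above are read in HiveQL (the partition chosen is illustrative):

<source lang="SQL">
-- Requests per country, browser family and project family for one hour;
-- maps are indexed with ['key'], structs with dot notation.
SELECT
  geocoded_data['country']         AS country,
  user_agent_map['browser_family'] AS browser_family,
  normalized_host.project_family   AS project_family,
  COUNT(*)                         AS requests
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND year = 2022 AND month = 1 AND day = 1 AND hour = 0
GROUP BY
  geocoded_data['country'],
  user_agent_map['browser_family'],
  normalized_host.project_family;
</source>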

=== Usage notes ===

Note the webrequest_source, year, month, day, and hour fields. These are Hive partitions, and are explicit mappings to hourly imports in HDFS. You must include at least one partition predicate in the WHERE clause of your queries (even if it is just year > 0). Partitions allow you to reduce the amount of data that Hive must parse and process before it returns results. For example, if you are only interested in data from a particular day, you could add WHERE year = 2014 AND month = 1 AND day = 12. This instructs Hive to process only the partitions that match that predicate. You may use partition fields as you would any normal field, even though their values are not actually stored in the data files.
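For instance, a minimal query restricted to a single day (the filters on webrequest_source, year, month and day are all partition predicates):

<source lang="SQL">
-- Scans only the 24 hourly partitions of 2014-01-12 on the text cluster.
SELECT COUNT(*) AS requests
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND year = 2014 AND month = 1 AND day = 12
  AND uri_host = 'en.wikipedia.org';
</source>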

The spider/user classification recorded in the agent_type field is based on the user agent, evaluated against a set of regexes (the standard spider detection provided by ua-parser, augmented by a custom bot regex; see the code for details).

Queries over the full table will sometimes fail during execution with a java.io.FileNotFoundException: File does not exist error, if the purge of the oldest data (after 90 days) happens before the query accesses those partitions. This can be avoided by manually excluding the oldest available dates in the WHERE clause.
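One way to do that, sketched here with an illustrative cutoff date, is to bound the query away from the oldest partitions still on disk:

<source lang="SQL">
-- Skip the oldest few days still present so the 90-day purge cannot
-- delete a partition while the query is running (cutoff is illustrative).
SELECT COUNT(*) AS requests
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND (year > 2022
       OR (year = 2022 AND month > 3)
       OR (year = 2022 AND month = 3 AND day > 15));
</source>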


== Changes and known problems since 2015-03-04 ==

See also m:Research:Page view#Change log for changes to the page view definition itself, as well as Analytics/Data_Lake/Traffic/Pageview_hourly#Changes and known problems since 2015-06-16
Date from Date until Task record_version Details
2015-01-01 * Create table with data from 'text', 'mobile' and 'upload' request sources.
2015-01-26 * Add 'misc' request source.
2015-02-12 * Add 'bits' request source.
2015-02-27 task T90725, task T89401 0.0.1 Add record_version, client_ip and geocoded_data fields to the schema. An error is raised when trying to access those fields for data before the release date.
2015-03-30 task T91749, task T91793, task T89396 0.0.2 Add x_cache, user_agent_map and x_analytics_map fields to the schema. An error is raised when trying to access those fields for data before the release date. Note: user_agent_map and x_analytics_map are to replace the original user_agent and x_analytics fields, but field replacement breaks schema backward compatibility. We will deploy this change once we are confident that no more backward-incompatible changes will be needed soon.
2015-04-10 task T95178, task T94584 0.0.3 Add ts, access_method, agent_type and is_zero fields to the schema. An error is raised when trying to access those fields for data before the release date. Note: The agent_type field only contains spider and user values for the moment. A detailed analysis of user agents and usage patterns should be undertaken to provide new/refined values.
2015-05-07 0.0.4 Add referer_class field to the schema, with possible values internal, external or unknown. An error is raised when trying to access those fields for data before the release date. Agent_type bug is corrected, and some improvements are made in the computation of is_pageview and access_method fields.
2015-06-11 task T99932, task T99918, task T96044 0.0.5 Add normalized_host and pageview_info fields to the schema, and update user_agent_map adding wmf_app_version. normalized_host is created splitting the uri host as a struct with project_class (wikipedia, wikidata for instance), project (en, commons for instance), qualifiers (an array of in-between qualifiers, like m or zero for instance), and tld (org most of the time). pageview_info is set only when is_pageview is true, and is a map with keys project (en.wikipedia, commons.wikimedia for instance), language_variant (zh-hant for instance), and page_title when possible.
2015-08-13 task T107764, task T108833, task T108717 0.0.6 Correct bug in the use of webrequest uri hosts to compute whether a request is a pageview. The change affects 0.2% of pageviews by removing false positives. The 'bits' webrequest_source partition and associated automatic jobs have been removed (all data that was in bits is now in the text partition), and a new 'maps' webrequest_source partition has been created (data for this partition starts on 2015-08-11T00:00:00).
2015-08-03, 2015-08-10 task T106581 Two separate Kafka upgrade attempts caused significant data loss. A bug in Kafka caused data not to be compressed properly. Once the bug was fixed, we spent a lot of time patching and repairing what data we could. Several hourly webrequest partitions had significant loss and could not be restored. This loss was either caused by Kafka being too overloaded to accept produce requests, or by a recovery attempt in which a single broker for whom no replicas were in sync was restarted.
Webrequest los 2015-08.png
http://debugging.wmflabs.org/#refinery-graphs-tab. See also: Incident_documentation/20150803-Kafka, Incident_documentation/20150810-Kafka
2015-08-27 task T109256, task T108866 0.0.7 Correct two bugs:
  • In is_pageview definition: outreach.wikimedia.org and donate.wikimedia.org are not pageviews anymore.
  • In page_title extraction (for pageview_info['page_title']): Consistently convert spaces into underscores (previously only special cases were correctly treated).
2015-08-31 0.0.8 arbcom-*.wikimedia.org are not pageviews anymore.
2015-09-16 task T109383, task T106134, task T108598 0.0.9 Update ua-parser to an up-to-date version (better user-agent classification), improve spider tagging by enhancing the regular expression used in addition of ua-parser, update pageview definition to prevent requests having 'preview' set in x_analytics header to be counted as pageviews or appPageviews.
2015-11-19 task T119054 0.0.10 Improve spider tagging adding user agent being .*http.* flagged as spider.
2015-12-01 task T116023 0.0.11 Add the mediawiki page_id for this hit when available. For redirects this could be the page_id of the redirect or the page_id of the target. This may not always be set, even if the page is actually a pageview. Upgrade pageview_definition to better filter non-interesting path.
2015-12-15 There was an irrecoverable raw_webrequest data loss for the hour 17 UTC of: 25.6% (misc), 19.5% (mobile), 19.1% (text), 39.1% (upload). The loss was due to the enabling of IPSec, which encrypts varnishkafka traffic between caches in remote datacenters and the Kafka brokers in eqiad. For a period of about 40 minutes, no webrequest logs from remote datacenters were successfully produced to Kafka.
2015-12-18 There was another irrecoverable raw_webrequest data loss between 21:54 and 22:15. A puppet change was merged in which an important firewall rule dealing with IPSec was lost. This kept all varnishkafkas in remote datacenters from producing to Kafka during this time.
2016-02-03 task T115919 0.0.12 Referer classification has an updated version providing external search engine details (when any).
2016-03-09 task T125731 0.0.13 Improve spider tagging regular expression to match the bots that follow the conventions specified in https://meta.wikimedia.org/wiki/User-Agent_policy. Remove tagging user agents containing "WikimediaBot" as bot.
2016-03-23 task T130399 0.0.14 Add x_analytics header inclusive filtering (consider pageview if header contains <code>pageview=1</code>), update ua-parser to an up-to-date version, correct a bug in search engine referer classification.
2016-04-11 task T131824, task T128612 0.0.15 Correct x_analytics header inclusive filtering (only apply pageview=1 tagging in the mobile app pageview definition, and ensure correct status_code and mime_type before checking the flag). Update the mobile app pageview accepted user agent to WikipediaApp|Wikipedia/5.0. to cover the user-agent change in the iOS app 5.0 release. Mobile app pageviews for iOS are incorrect from mid-March (the launch of the 5.0 version) until about April 11, when the user agent was updated.
2016-05-16 task T135168 0.0.16 Correct pageview definition to flag mobile app pageviews when x_analytics header has pageview=1 set without checking for requests path (data will be backfilled in pageview table from 2015-05-02 21:00 onward). Add page title extraction for mobile apps restbase accessed pages.
2016-08-12 task T142410 0.0.16 Correct cache_status header
2017-02-09 task T156628, task T155141, task T157528 0.0.17 Update pageview definition to remove previews (POST with "action=submit" in query is now excluded from pageviews). Add DSXS (self-identified bot) to bot regex. Add namespace_id field.
2017-03-21 task T142410 0.0.18 Update pageview definition to stop filtering out test[2].wikipedia.org from pageviews.
2017-06-13 Task T118557 0.0.18 Removed x_forwarded_for from webrequest logs. This field will be NULL from here on out in webrequest data in Hive. Use client_ip.
2017-08-23 Task T168874, Task T171760 0.0.19 Add project_family in normalized_host struct in preparation for deprecation of project_class (they have the same value). Addition of the tags column, containing an array of strings (tags) allowing for easier identification of rows. First usage of tags will be to create temporary smaller tables in order to reduce the full table scans needed by many jobs.
2017-11-03 Task T177199 Changed cache_status possible values to include cache-layer information (eg: hit -> hit-front). This change is coming from the raw data and not the refinery process, therefore no record-version bump is required.
2018-02-15 task T167907 0.0.20 Add ISP data from MaxMind GeoIP2 ISP database to webrequest.
2018-04-10 Task T191940 For about 54min (from 22:46 - 23:40) on April 10, a significant portion of global traffic was not reaching our sites (incident documentation). The validity of the webrequest data itself appears to not have been affected.
2018-04 task T187014 From February 6 to April 16, 2018, the geolocation data for traffic from Opera browsers on mobile web was incorrect (wrongly labeled them as coming from the US, "Unknown" and some other countries rather than their true origin).
2018-05-02 task T191714 Add Ecosia and Startpage to list of search engines.
2018-05-21 Task T193578 Updated UA parsing source regexes. Mainly Windows NT (Windows 2000 to the present) requests now appear in the same OS family, as "Windows". Improves identification of page previews from browsers and devices, and correctly identifies devices previously marked as "Generic Android". Many requests that were marked wrongly as IE7 before are now (mostly) correctly identified as IE11.
2018-06-21 task T192779 Wikipedia Android app starts sending referer data (production release: 2018-06-27)
2018-07-26 task T188776 Update pageview definition to include foundation.wikimedia.org host
2018-09-09 onwards task T211077 Referrer class change for Chrome Mobile
2018-10-09 Task T170606 Add accept header
2018-11-05 Task T208752 Issues with data coming from upload that might result in 'fake' traffic for some webp files
2019-01-07 Task T212862, Task T153821 Updated IPs for Labs. Wikitech added to PageviewDefinition.
2019-04-01 Task: T219842 Kafka jumbo failover produces some data loss for hour 22 UTC.
2019-06-04 task T224451 From 2019-04-25 until 2019-06-04, pages with "+" in the title were omitted from pageviews.
2019-06-05 task T224187 From now on, user agent strings longer than 400 chars are not parsed.
2019-06-28 task T225792 doc.wikimedia.org host is not flagged as pageview anymore
2019-09-18 task T212854 0.0.22 Update user-agent parser to up-to-date version. Main differences are os_family bug fixes (Windows 98 is now Windows for instance), os_major bug fixes (8.1 is now 8), Google Search App on iOS is not falsely counted at Safari anymore, and various device_family improvement.
2019-11-07 task T233661 Added tls and tls_map fields
2020-01-13 task T239625 Added several common search engines to the list of hard-coded search engine referers. Added a predictive referer classifier to classify non-hard-coded search engines. Introduced support for Android app referers.
2020-07-01 task T256514 0.0.23 Update pageview definition for new android mobile content consumption.
2020-10-20 task T236740 0.0.24 Remove latitude, longitude, and postal code from geocode_data map. These fields were inaccurate since they are based on IP address only and were unused.
2020-11-16 task T267008 Grow the number of buckets by which the table is bucketed from 64 to 256. Also apply a change supposedly applied in February 2017: make page_id a Bigint instead of an Integer.
2022-01-26 task T299397 Add low entropy Sec-CH-UA request headers: ch_ua, ch_ua_mobile, ch_ua_platform.
2022-01-26 task T300164 Data loss has been incurred from June 4th 2021 to January 26th 2022 due to a varnishkafka install problem on some of the caching hosts.

Summary of the data loss analysis:

  • Between 2021-06-04 and 2021-11-03 we have lost on average 2.80% of webrequest-text, statv and eventlogging data (pageview impact, eqiad datacenter only)
  • Between 2021-11-04 and 2022-01-27 we have lost on average 4.34% of webrequest-text, statv and eventlogging data (pageview impact, eqiad + ulsfo datacenters only)
  • Between 2021-10-13 and 2021-11-03 we have lost on average 1.01% of webrequest-upload data (mediarequest impact, ulsfo datacenter only)
  • Between 2021-11-04 and 2022-01-27 we have lost on average 2.19% of webrequest-upload data (mediarequest impact, ulsfo datacenter only)

== wmf_raw.webrequest ==

In almost all cases, you should not query this "unrefined" data, because the "refined" data in wmf.webrequest is richer and much faster to query.

If you do need to query this raw data in Hive, you will first need to add the JSON serde (serializer/deserializer) to your session: <code>ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar</code>

You can do this in Spark by invoking the Spark command line tool as follows: <code>/usr/bin/spark2-shell --jars /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar</code>.
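Once the serde is available, the raw table can be queried like any other Hive table; a sketch (the hour chosen is illustrative):

<source lang="SQL">
ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;

-- Row count for one hourly partition of the raw JSON data.
SELECT COUNT(*) AS requests
FROM wmf_raw.webrequest
WHERE webrequest_source = 'text'
  AND year = 2022 AND month = 1 AND day = 1 AND hour = 0;
</source>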

=== Data statistics ===

In addition to the core data, two tables are created with detailed statistics about the hourly partition loaded.

  • wmf_raw.webrequest_sequence_stats: contains details for each varnish-kafka source of duplicates / missing data.
  • wmf_raw.webrequest_sequence_stats_hourly: contains hourly aggregated percent_missing and percent_loss for each webrequest source.

The statistics in those two tables are computed using a 'sequence' number: a counter generated by each varnishkafka process, incremented for each line it outputs. This sequence number makes it possible to detect duplicates and missing data using simple arithmetic (max, min, distinct). For details, see the code for webrequest_sequence_stats and webrequest_sequence_stats_hourly.
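The arithmetic can be sketched directly against the raw table (an illustration of the idea, not the production job, which lives in the refinery code):

<source lang="SQL">
-- Per-host completeness check for one hour: each varnishkafka process
-- emits a monotonically increasing sequence number, so the expected row
-- count is max - min + 1, and distinct counts reveal duplicates.
SELECT
  hostname,
  MAX(sequence) - MIN(sequence) + 1   AS expected_rows,
  COUNT(*)                            AS actual_rows,
  COUNT(*) - COUNT(DISTINCT sequence) AS duplicate_rows
FROM wmf_raw.webrequest
WHERE webrequest_source = 'text'
  AND year = 2022 AND month = 1 AND day = 1 AND hour = 0
GROUP BY hostname;
</source>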

==== Querying the statistics ====

A typical query of these statistics is:

SELECT
    year,
    month,
    day,
    hour,
    webrequest_source,
    percent_lost,
    percent_duplicate
FROM wmf_raw.webrequest_sequence_stats_hourly
WHERE
    year = XXXX AND
    month = XX AND
    day = XX AND
    hour = XX AND
    webrequest_source IN (XXX)
ORDER BY
    year,
    month,
    day,
    hour,
    webrequest_source
LIMIT 1000;

For an example of querying to diagnose a data loss issue, see T211000.

== API requests ==

If you want to look at API use, it makes more sense to use the Mediawiki API request data stream which is logged directly from MediaWiki when it responds to API requests. The benefits of that data include:

  • Includes internal API requests not routed through the Varnish servers
  • Data on API requests is not mixed in with data on regular web requests.
  • Contains detailed data on the content of POST API requests (they are logged in webrequests, but without the request bodies).

== RAW IP usage for IP Data ==

[[Analytics/Data_Lake/Traffic/Webrequest/RawIPUsage]]

