You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org
There are two webrequest tables available on Hive. The
wmf_raw.webrequest table is JSON data directly imported from Kafka.
wmf.webrequest contains the 'refined' webrequest data.
For many purposes, it is preferable to instead query the much smaller tables pageview_hourly or projectview_hourly, which contain 'pre-aggregated' webrequest data, filtered to keep only pageviews, and aggregated over a predefined set of dimension.
wmf.webrequest table is 'refined' webrequest data. It contains the same data as the
wmf_raw.webrequest, as well as additional information that has been added during the refinement phase. It is stored in the Parquet format. Parquet is a column-based data storage format. When querying
wmf.webrequest, Hive will not have to parse JSON, and it will only have to read in data from disk for columns that you are explicitly using in your query. This should increase performance of most Hive queries. See Data Format Experiments#Results for more info.
|hostname||string||Source node hostname|
|sequence||bigint||Per-host sequence number|
|dt||string||Timestamp at cache in ISO 8601 format|
|time_firstbyte||double||Time to first byte|
|ip||string||IP of packet at cache|
|http_status||string||HTTP status of response|
|http_method||string||HTTP method of request|
|uri_host||string||Host of request|
|uri_path||string||Path of request|
|uri_query||string||Query of request|
|content_type||string||Content-Type header of response|
|referer||string||Referer header of request|
|x_forwarded_for||string||X-Forwarded-For header of request|
|user_agent||string||User-Agent header of request|
|accept_language||string||Accept-Language header of request|
|x_analytics||string||X-Analytics header of response|
|range||string||Range header of response|
|is_pageview||boolean||Indicates if this record was marked as a pageview during refinement|
|record_version||string||Keeps track of changes in the table content definition|
|client_ip||string||Client IP computed during refinement using ip and x_forwarded_for|
|geocoded_data||map<string,string>||Geoceded data computed during refinement using computed client ip and maxmind database|
|x_cache||string||X-Cache header of response|
|user_agent_map||map<string,string>||User-agent map with browser_name, browser_major, device, os_name, os_minor, os_major keys and associated values|
|x_analytics_map||map<string,string>||X_analytics map view of the x_analytics field|
|ts||timestamp||Unix timestamp in milliseconds extracted from dt|
|access_method||string||Method used to accessing the site (mobile app|mobile web|desktop)|
|agent_type||string||Categorise the agent making the webrequest as either user or spider (automatas to be added).|
|is_zero||boolean||Indicates if the webrequest is accessed through a zero provider|
|referer_class||string||Indicates if a referer is internal, external or unknown.|
|normalized_host||struct<project_class:string,project:string,qualifiers:array<string>,tld:string>||struct containing project_class (such as wikipedia or wikidata for instance), project (such as en or commons), qualifiers (a list of in-between values, such as m and/or zero) and tld (org most often)|
|pageview_info||map<string,string>||map containing project, language_variant and page_title values only when is_pageview = TRUE.|
|page_id||int||MediaWiki page_id for this page title. For redirects this could be the page_id of the redirect or the page_id of the target. This may not always be set, even if the page is actually a pageview.|
|namespace_id||int||MediaWiki namespace_id for this page title. This may not always be set, even if the page is actually a pageview.|
|webrequest_source||string||Source cluster (partition field)|
|year||int||Unpadded year of request (partition field)|
|month||int||Unpadded month of request (partition field)|
|day||int||Unpadded day of request (partition field)|
|hour||int||Unpadded hour of request (partition field)|
hour fields. These are Hive partitions, and are explicit mappings to hourly imports in HDFS. You must include at least one partition predicate in the
where clause of your queries (even if it is just
year > 0). Partitions allow you to reduce the amount of data that Hive must parse and process before it returns you results. For example, if are only interested in data during a particular day, you could add
where year = 2014 and month = 1 and day = 12. This will instruct Hive to only process data for partitions that match that partition predicate. You may use partition fields as you would any normal field, even though the field values are not actually stored in the data files.
Changes and known problems since 2015-03-04
- See also m:Research:Page view#Change log for changes to the page view definition itself
|Date from||Date until||Task||record_version||Details|
|2015-01-01||*||Create table with data from 'text', 'mobile' and 'upload' request sources.|
|2015-01-26||*||Add 'misc' request source.|
|2015-02-12||*||Add 'bits' request source.|
|2015-02-27||task T90725, task T89401||0.0.1||Add record_version, client_ip and geocoded_data fields to the schema. An error is raised when trying to access those fields for data before the release date.|
|2015-03-30||task T91749, task T91793, task T89396||0.0.2||Add x_cache, user_agent_map and x_analytics_map fields to the schema. An error is raised when trying to access those fields for data before the release date. Note: user_agent_map and x_analytics_map are to replace original user_agent and x_analytics fields, but field replacement breaks schema backward compatibility. We will deploy this chamge when we'll be confident enough that no more backward incompatible change are needed soon.|
|2015-04-10||task T95178, task T94584||0.0.3||Add ts, access_method, agent_type and is_zero fields to the schema. An error is raised when trying to access those fields for data before the release date. Note: The agent_type on.y contains spider and user values for the moment. A detailed analysis on user_agent and usage patterns should be overtaken to provide new/refined values. |
|2015-05-07||0.0.4||Add referer_class field to the schema, with possible values internal, external or unknown. An error is raised when trying to access those fields for data before the release date. Agent_type bug is corrected, and some improvements are made in the computation of is_pageview and access_method fields.|
|2015-06-11||task T99932, task T99918, task T96044||0.0.5||Add normalized_host and pageview_info fields to the schema, and update user_agent_map adding wmf_app_version. normalized_host is created splitting the uri host as a struct with project_class (wikipedia, wikidata for instance), project (en, commons for instance), qualifiers (an array of in-between qualifiers, like m or zero for instance), and tld (org most of the time). pageview_info is set only when is_pageview is true, and is a map with keys project (en.wikipedia, commons.wikimedia for instance), language_variant (zh-hant for instance), and page_title when possible.|
|2015-08-13||task T107764, task T108833, task T108717||0.0.6||Correct bug in the use of webrequest uri hosts to compute if pageview or not. The change affect 0.2% of pageviews by removing false positives. The 'bits' webrequest_source partition and associated automatic jobs have been removed (every data that was in bits is now text partition), and a new 'maps' webrequest_source partition has been created (data for this partition starts on 2015-08-11T00:00:00).|
|2015-08-03, 2015-08-10||task T106581||Two separate Kafka upgrade attempts caused significant data loss. A bug in Kafka caused data not to be compressed properly. Once the bug was fixed, we spent a lot of time patching and repairing what data we could. Several hourly webrequest partitions had significant loss and could not be restored. This loss was either caused by Kafka being too overloaded to accept produce requests, or by a recovery attempt in which a single broker for whom no replicas were in sync was restarted. http://debugging.wmflabs.org/#refinery-graphs-tab. See also: Incident_documentation/20150803-Kafka,Incident_documentation/20150810-Kafka|
|2015-08-27||task T109256, task T108866||0.0.7||Correct two bugs:
|2015-08-31||0.0.8||arbcom-*.wikimedia.org are not pageviews anymore.|
|2015-09-16||task T109383, task T106134, task T108598||0.0.9||Update ua-parser to an up-to-date version (better user-agent classification), improve spider tagging by enhancing the regular expression used in addition of ua-parser, update pageview definition to prevent requests having 'preview' set in x_analytics header to be counted as pageviews or appPageviews. |
|2015-11-19||task T119054||0.0.10||Improve spider tagging adding user agent being |
|2015-12-01||task T116023||0.0.11||Add the mediawiki page_id for this hit when available. For redirects this could be the page_id of the redirect or the page_id of the target. This may not always be set, even if the page is actually a pageview. Upgrade pageview_definition to better filter non-interesting path.|
|2015-12-15||There was an irrecoverable raw_webrequest data loss for the hour 17 UTC of: 25.6% (misc), 19.5% (mobile), 19.1% (text), 39.1% (upload). the loss was due to the enabling of IPSec. This encrypts varniskafka traffic between caches in remote datacenters and the Kafka brokers in eqiad. During a period of about 40ish minutes, no webrequest logs from remote datacenters were successfully produced to Kafka.|
|2015-12-18||There was another irrecoverable raw_webrequest data loss between 21:54 and 22:15. A puppet change was merged in which an important firewall rule dealing with IPSec was lost. This kept all varnishkafkas in remote datacenters from producing to Kafka during this time.|
|2016-02-03||task T115919||0.0.12||Referer classification has an updated version providing external search engine details (when any).|
|2016-03-09||task T125731||0.0.13||Improve spider tagging regular expression to match the bots that follow the conventions specified in https://meta.wikimedia.org/wiki/User-Agent_policy. Remove tagging user agents containing "WikimediaBot" as bot.|
|2016-03-23||task T130399||0.0.14||Add x_analytics header inclusive filtering (consider pageview if header contains <code>pageview=1</code>), update ua-parser to an up-to-date version, correct a bug in search engine referer classification.|
|2016-04-11||task T131824, task T128612||0.0.15||Correct x_analytics header inclusive filtering (only apply |
|2016-05-16||task T135168||0.0.16||Correct pageview definition to flag mobile app pageviews when x_analytics header has |
|2016-08-12||task T142410||0.0.16||Correct cache_status header|
|2017-02-09||task T156628, task T155141, task T157528||0.0.17||Update pageview definition to remove previews (POST with "action=submit" in query is now excluded from pageviews). Add DSXS (self-identified bot) to bot regex. Add namespace_id field.|
|2017-03-21||task T142410||0.0.18||Update pageview definition to stop filtering out test.wikipedia.org from pageviews.|
Queries over the full table will sometimes fail with a
java.io.FileNotFoundException: File does not exist error during execution, in case the purging of the oldest data (after 60 days) happens before the query accesses those partitions. This can be avoided by manually excluding the oldest available dates in the WHERE clause.
Likely you do not want to query the raw tables (your queries will be slow) but the processed ones (see above) that have a more friendly format when it comes to compression.
If you do query these tables you need to add the json serde to your path:
ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar ;
Load job (import)
Raw Webrequest logs are currently imported using Camus from Kafka. A Hive table with hourly partitions is mapped on top of this data. Webrequest logs are Snappy compressed SequenceFiles in JSON format. Using SequenceFiles makes it possible to use MapReduce jobs with compressed data.
In addition to the core data, two tables are created with detailed statistics about the hourly partition loaded.
- wmf_raw.webrequest_sequence_stats -- Contains details for each varnish-kafka source of duplicates / missing data.
- wmf_raw.webrequest_sequence_stats_hourly -- Contains hourly aggregated percent_missing and percent_loss for each webrequest source.
The statistics in those two tables are computed using a 'sequence' number, a counter generated by each varnish-kafka process incrementing its value for each line it outputs. This sequence number allows to track for duplicates and missing data using some arithmetics (max, min, distinct). For details look at the code for webrequest_sequence_stats and webrequest_sequence_stats_hourly).
An usual request for checking high level statistics for load jobs is
SELECT year, month, day, hour, webrequest_source, percent_lost, percent_duplicate FROM wmf_raw.webrequest_sequence_stats_hourly WHERE -- Update the restrictions to your need. year = XXXX AND month = XX AND day = XX AND hour = XX AND webrequest_source IN (XXX) ORDER BY year, month, day, hour, webrequest_source LIMIT 1000;
If you want to look at API use, it makes more sense to API Action data, which is logged directly from MediaWiki when it responds to API requests. The benefits of that data include:
- Includes internal API requests not routed through the Varnish servers
- Data on API requests is not mixed in with data on regular web requests.
- Contains detailed data on the content of POST API requests (they are logged in webrequests, but without the request bodies).
- It doesn't include some internal (within-cluster) API requests which are not routed through the Varnish servers.