Analytics/Data Lake/Traffic/Caching
This is a restricted public snapshot of the Webrequest dataset.
The most recent release is composed of caching metrics for image web requests from one cache of upload.wikimedia.org. The data is expected to be primarily used for caching research.
Data Updates & Format
The data is updated manually and irregularly upon request. The request for the most recent release of this data can be found here.
The current release of this data, published in November 2019, is available here. It was released as a set of gzip-compressed tsv files. It includes a total of 21 compressed files, each holding exactly 24 hours of consecutive data, about 2GB of data compressed or 7GB of data decompressed.
Each decompressed file has the following columns:
Column Name | Data Type | Notes |
---|---|---|
relative_unix | int | Seconds since start timestamp of dataset |
hashed_path_query | bigint | Salted hash of path and query of request |
image_type | string | Image type from Content-Type header of response |
response_size | int | Response size in bytes |
time_firstbyte | double | Seconds to first byte |
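As an illustration of how the columns above map onto typed records, here is a minimal Python sketch for streaming one file. It assumes the files have no header row and that the columns appear in the order of the table; neither is stated on this page, so check a file before relying on it. The names `Request`, `parse_rows`, and `read_trace` are hypothetical.

```python
import csv
import gzip
from typing import Iterable, Iterator, NamedTuple


class Request(NamedTuple):
    relative_unix: int       # seconds since the dataset's start timestamp
    hashed_path_query: int   # salted hash of path and query (bigint)
    image_type: str          # image type from the response Content-Type header
    response_size: int       # response size in bytes
    time_firstbyte: float    # seconds to first byte


def parse_rows(lines: Iterable[str]) -> Iterator[Request]:
    """Parse tab-separated lines into typed Request records.

    Assumes no header row and the column order of the table above.
    """
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 5:
            continue  # skip blank lines or rows with an unexpected field count
        ts, hashed, image_type, size, ttfb = row
        yield Request(int(ts), int(hashed), image_type, int(size), float(ttfb))


def read_trace(path: str) -> Iterator[Request]:
    """Stream records from one gzip-compressed TSV file, one line at a time."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        yield from parse_rows(handle)
```

Streaming with a generator avoids holding a whole decompressed file in memory, which matters at the sizes quoted above.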
Privacy
Because this data raises several privacy concerns, this public release applies the following changes so that the data reveals less while still providing value to a public audience:
Relative timestamps
Timestamps in this data are relative rather than absolute. The relative_unix field is the number of seconds elapsed between a fixed, randomly selected start timestamp and the time at which the web request occurred. This approach makes it significantly more difficult for bad actors to map this data to other publicly released WMF data sets, while preserving the utility of timestamps for caching research as a means of determining the frequency of these web requests.
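To see why relative timestamps remain useful for frequency analysis, note that the gap between two requests does not depend on the start timestamp. The following Python sketch (the function name `interarrival_times` and the sample values are made up for illustration) computes the gaps between consecutive requests for each hashed object and shows that shifting every timestamp by a constant leaves them unchanged.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def interarrival_times(requests: Iterable[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Map each hashed object to the gaps, in seconds, between its consecutive requests.

    `requests` holds (relative_unix, hashed_path_query) pairs in any order.
    """
    times: Dict[int, List[int]] = defaultdict(list)
    for relative_unix, key in requests:
        times[key].append(relative_unix)

    gaps: Dict[int, List[int]] = {}
    for key, stamps in times.items():
        stamps.sort()
        gaps[key] = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
    return gaps


# Made-up sample: object 1 is requested three times, object 2 once.
sample = [(0, 1), (10, 1), (25, 1), (7, 2)]
gaps = interarrival_times(sample)

# Adding an arbitrary offset (standing in for the undisclosed start timestamp)
# changes every timestamp but none of the gaps.
shifted = [(ts + 1_000_000, key) for ts, key in sample]
assert interarrival_times(shifted) == gaps
```

This sketch keeps every timestamp in memory, so for a full 24-hour file a streaming or sampled variant would likely be needed.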
Hashed path and query
We securely hash the path and query fields of the web requests with a salt to effectively anonymize the content. This both makes it more difficult to map this data between data sets and minimizes the general privacy concerns related to the combination of content, timestamps, and content sizes.
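This page does not say which hash function or salt handling was used for the release. Purely as a sketch of the general idea, the Python example below uses keyed BLAKE2b from the standard library, truncated to 64 bits so the result fits a bigint column. The function name `salted_hash`, the choice of BLAKE2b, the concatenation of path and query, and the example paths are all assumptions, not the method used to produce the data.

```python
import hashlib
import secrets


def salted_hash(path: str, query: str, salt: bytes) -> int:
    """Return a signed 64-bit keyed hash of the request path and query.

    Assumption: BLAKE2b with the salt as its key; the real release may differ.
    """
    data = (path + query).encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=8, key=salt).digest()
    return int.from_bytes(digest, byteorder="big", signed=True)


# The salt is generated once, kept secret, and never published.
salt = secrets.token_bytes(32)

# Hypothetical path and query, for illustration only.
first = salted_hash("/example/path/image.jpg", "?width=200", salt)
second = salted_hash("/example/path/image.jpg", "?width=200", salt)
assert first == second  # the same request always maps to the same value
```

Because the same path and query always produce the same value under one salt, repeated requests for an object can still be counted, while the original URL cannot be looked up without the salt.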
Previous Releases
2016
Release: analytics.wikimedia.org
Request: T128132
Description: A more detailed but more privacy-conscious iteration of the 2007 release, covering July 1st, 2016 to July 10th, 2016
Fields: hashed host and path, uri query, content type, response size, time to first byte, X-Cache
2008
Release: wikibench.eu
Description: A trace of 10% of all user requests issued to Wikipedia in all languages between September 19th, 2007 and January 2nd, 2008
Fields: request count, unix timestamp, request url, database update flag