Analytics/Data Lake/Traffic/Caching
This dataset is a restricted public snapshot of the <code>[https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest#wmf.webrequest wmf.webrequest]</code> table intended for caching research.


The most recent release is composed of caching data for:
# upload (image) web requests from one cache of [[Upload.wikimedia.org|upload.wikimedia.org]] ([https://commons.wikimedia.org Wikimedia Commons]), and
# text (HTML pageview) web requests from one cache of [https://www.wikipedia.org Wikipedia].


== Data Updates & Format ==
The data is updated manually upon request. The request for the most recent release of this data can be found [[phab:T225538|here]].


The current release of this data, released in November 2019, is available [https://analytics.wikimedia.org/published/datasets/archive/public-datasets/analytics/caching/ here]. It was released as a set of <code>gzip</code>-compressed <code>tsv</code> files. It includes a total of 42 compressed files: 21 upload data files and 21 text data files.
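
The files can be fetched programmatically. The sketch below is a minimal Python example, assuming the release directory serves a standard auto-generated HTML index and that the archive names end in <code>.gz</code>; the exact file names are not listed on this page, so adjust the pattern as needed.

<syntaxhighlight lang="python">
import re
import urllib.request

BASE = "https://analytics.wikimedia.org/published/datasets/archive/public-datasets/analytics/caching/"

# Fetch the directory index and collect links to the gzip archives.
# Assumes a standard auto-generated HTML index with relative hrefs.
index_html = urllib.request.urlopen(BASE).read().decode("utf-8")
archives = sorted(set(re.findall(r'href="([^"]+\.gz)"', index_html)))

for name in archives:
    print("downloading", name)
    urllib.request.urlretrieve(BASE + name, name)
</syntaxhighlight>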


=== Upload Data ===
 
Each upload data file, denoted <code>cache-u</code>, contains exactly 24 hours of consecutive data. These files are each roughly 1.5GB in size and hold roughly 4GB of decompressed data each.
 
Each decompressed upload data file has the following columns:
{| class="wikitable"
!Column Name
!Data Type
!Notes
|-
|relative_unix
|int
|Seconds since start timestamp of dataset
|-
|hashed_path_query
|bigint
|Salted hash of path and query of request
|-
|image_type
|string
|Image type from Content-Type header of response
|-
|response_size
|int
|Response size in bytes
|-
|time_firstbyte
|double
|Seconds to first byte
|}
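
Since each file is a <code>gzip</code>-compressed <code>tsv</code> with the columns above, it can be streamed without decompressing it to disk first. A minimal Python sketch, assuming a hypothetical file name <code>cache-u-01.tsv.gz</code> and that the files carry no header row (adjust if they do):

<syntaxhighlight lang="python">
import csv
import gzip

# Column order as documented above; assumes no header row in the file.
COLUMNS = ["relative_unix", "hashed_path_query", "image_type",
           "response_size", "time_firstbyte"]

def read_upload_trace(path):
    """Yield one dict per web request from a cache-u file."""
    with gzip.open(path, mode="rt", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            rec = dict(zip(COLUMNS, row))
            rec["relative_unix"] = int(rec["relative_unix"])
            rec["hashed_path_query"] = int(rec["hashed_path_query"])
            rec["response_size"] = int(rec["response_size"])
            rec["time_firstbyte"] = float(rec["time_firstbyte"])
            yield rec

# Hypothetical file name; substitute one of the published archives.
for rec in read_upload_trace("cache-u-01.tsv.gz"):
    print(rec)
    break
</syntaxhighlight>
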
=== Text Data ===
Each text data file, denoted <code>cache-t</code>, contains exactly 24 hours of consecutive data. These files are each roughly 100MB in size and hold roughly 300MB of decompressed data each.
Each decompressed text data file has the following columns:
{| class="wikitable"
!Column Name
!Data Type
!Notes
|-
|relative_unix
|int
|Seconds since start timestamp of dataset
|-
|hashed_host_path_query
|bigint
|Salted hash of host, path, and query of request
|-
|response_size
|int
|Response size in bytes
|-
|time_firstbyte
|double
|Seconds to first byte
|}
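
Because each record carries a stable object key (<code>hashed_host_path_query</code>) and a size (<code>response_size</code>), a text file can directly drive a cache simulation. A minimal sketch of an LRU simulator, assuming a hypothetical file name <code>cache-t-01.tsv.gz</code> and exactly the four documented columns; this is an illustration for caching research, not a methodology used by WMF:

<syntaxhighlight lang="python">
import csv
import gzip
from collections import OrderedDict

def lru_hit_rate(path, capacity_bytes=1 << 30):
    """Replay a cache-t trace through a byte-capacity LRU cache."""
    cache = OrderedDict()   # object key -> object size in bytes
    used = hits = total = 0
    with gzip.open(path, mode="rt", newline="") as f:
        for relative_unix, key, size, time_firstbyte in csv.reader(f, delimiter="\t"):
            size = int(size)
            total += 1
            if key in cache:
                hits += 1
                cache.move_to_end(key)  # mark as most recently used
                continue
            if size > capacity_bytes:
                continue  # larger than the whole cache: serve without caching
            # Miss: evict least recently used objects until the new one fits.
            while used + size > capacity_bytes:
                _, evicted_size = cache.popitem(last=False)
                used -= evicted_size
            cache[key] = size
            used += size
    return hits / total if total else 0.0

# Hypothetical file name; substitute one of the published text archives.
print(lru_hit_rate("cache-t-01.tsv.gz"))
</syntaxhighlight>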


== Privacy ==
Since this data has many privacy concerns, this public release applies several changes to make the data reveal less while providing value for a public audience.


=== Relative timestamps ===
This data uses relative rather than absolute timestamps. The <code>relative_unix</code> field is equal to the seconds that have elapsed between the timestamp at which the web request occurred and a fixed, randomly selected start timestamp. This makes it significantly more difficult for bad actors to map this data to other publicly released WMF data sets, while maintaining the utility of timestamps for caching research as a means to determine the frequencies of these web requests.
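
For instance, request frequencies can still be computed directly from <code>relative_unix</code> even though the absolute start time is withheld. A small Python sketch that buckets requests into hourly counts; the record iterator is assumed to come from a reader like the one sketched above:

<syntaxhighlight lang="python">
from collections import Counter

def hourly_request_counts(records):
    """Count requests per hour of the trace using only relative timestamps."""
    counts = Counter()
    for rec in records:
        counts[rec["relative_unix"] // 3600] += 1
    return counts

# Inline sample records; real input would come from a trace reader.
sample = [{"relative_unix": 12}, {"relative_unix": 3601}, {"relative_unix": 3700}]
print(hourly_request_counts(sample))  # Counter({1: 2, 0: 1})
</syntaxhighlight>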


=== Hashed host, path, and query ===
We securely hash the host, path, and query fields of the web requests with a salt to effectively anonymize the content. This both makes it more difficult to map this data between data sets and minimizes the general privacy concerns related to the combination of content, timestamps, and content sizes.
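
The exact hash function and salt are not published. As an illustration only, a salted hash of this kind could be produced as below, truncating an HMAC-SHA-256 digest to a signed 64-bit integer so that it fits the <code>bigint</code> column; the function, the salt, and the message layout are all assumptions, not the WMF implementation:

<syntaxhighlight lang="python">
import hashlib
import hmac

SALT = b"example-secret-salt"  # assumption: the real salt is private

def hash_host_path_query(host, path, query):
    """Illustrative salted hash, truncated to a signed 64-bit integer."""
    message = f"{host}{path}?{query}".encode("utf-8")
    digest = hmac.new(SALT, message, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

print(hash_host_path_query("en.wikipedia.org", "/wiki/Cache", ""))
</syntaxhighlight>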


== Previous Releases ==

=== 2016 ===
* Release: analytics.wikimedia.org
* Request: [[phab:T128132|T128132]]
* Description: A more detailed, but more privacy-conscious, iteration of the 2007 release, covering July 1st, 2016 through July 10th, 2016
* Fields: hashed host and path, uri query, content type, response size, time to first byte, X-Cache

=== 2008 ===
* Release: wikibench.eu
* Description: A trace of 10% of all user requests issued to Wikipedia in all languages between September 19th, 2007 and January 2nd, 2008
* Fields: request count, unix timestamp, request url, database update flag