Analytics/Data Lake/Edits/Mediawiki history reduced
This page describes the mediawiki history reduced dataset. It lives in the Analytics Hadoop cluster and in the public Druid cluster, and is accessible via the Hive/Beeline external table wmf.mediawiki_history_reduced
. It is a transformation of the mediawiki history dataset that makes it smaller and reshapes it to allow fast Druid querying for AQS Wikistats 2 queries (see Analytics/Systems/Cluster/Mediawiki history reduced algorithm). Like its parent, this dataset is updated every month, with a new snapshot=YYYY-MM
partition added to Hive and a new datasource mediawiki_history_reduced_YYYY_MM
added to Druid. It is important to note that snapshots are NOT incremental, so when querying the table you should always specify one. Also visit Analytics/Data access if you don't know how to access this dataset.
Schema
You can get the canonical version of the schema by running describe wmf.mediawiki_history_reduced;
from the beeline command line.
Note that the snapshot
field is a Hive partition. It explicitly maps to snapshot folders in HDFS. Since the full data is present in every snapshot up to the latest snapshot date, you should always pick a single snapshot in the where
clause of your query.
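As a sketch, a query respecting this rule could look like the following (the snapshot value, project, and timestamp prefix are illustrative, and the timestamp comparison assumes event_timestamp is an ISO-like string; substitute the latest available snapshot):

```sql
-- Count June 2018 edits on English Wikipedia, restricted to one snapshot.
SELECT COUNT(1) AS edits
FROM wmf.mediawiki_history_reduced
WHERE snapshot = '2018-06'           -- always pick a single snapshot
  AND project = 'en.wikipedia'
  AND event_entity = 'revision'      -- see Important Fields below
  AND event_type = 'create'
  AND event_timestamp LIKE '2018-06%';
```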
col_name | data_type | comment |
---|---|---|
project | string | The project this event belongs to (en.wikipedia or wikidata for instance) |
event_entity | string | revision, user or page |
event_type | string | create, move, delete, etc with specific digest types. Detailed explanation in the docs under #Event_types |
event_timestamp | string | When this event occurred |
user_id | string | The user_id of the user performing the event if the user is registered, their user_text (IP) otherwise |
user_type | string | anonymous, group_bot, name_bot or user |
page_id | bigint | The page_id of the event |
page_namespace | int | The page namespace of the event |
page_type | string | content or non_content based on namespace being in content space or not |
other_tags | array<string> | Can contain: deleted (and deleted_day, deleted_month, deleted_year if deleted within the given time period), reverted and revert (for revisions), self_created (for users), user_first_24_hours if a revision is made during the first 24 hours after a user's registration, redirect (for pages) |
text_bytes_diff | bigint | The text-bytes difference of the event (or sum in case of digests) |
text_bytes_diff_abs | bigint | The absolute value of text-bytes difference for the event (or sum in case of digests) |
revisions | bigint | 1 if the event is entity revision, or sum of revisions in case of digests |
snapshot | string | Versioning information to keep multiple datasets (YYYY-MM for regular labs imports) |
Important Fields
Due to the denormalization of the history data, filtering by event_entity
is mandatory to avoid mixing incompatible data.
Similarly, event_type
filtering can or must be applied depending on the analysis.
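For instance, a sketch of counting newly created content pages per project (the snapshot value is illustrative, and the field values are taken from the schema above):

```sql
-- New content pages per project, from a single snapshot.
SELECT project,
       COUNT(1) AS new_content_pages
FROM wmf.mediawiki_history_reduced
WHERE snapshot = '2018-06'       -- illustrative; use the latest snapshot
  AND event_entity = 'page'      -- never mix entities in one aggregate
  AND event_type = 'create'
  AND page_type = 'content'
GROUP BY project
ORDER BY new_content_pages DESC
LIMIT 10;
```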
Entity | Event type | Meaning |
---|---|---|
revision | create | When a revision is created, i.e. when an edit happens. |
page | create | When the first edit to a page is done. |
page | move | When a page is moved, changing its title. |
page | delete | When a page is deleted (no occurrences as of now due to a bug in history reconstruction) |
page | daily_digest | Daily pre-computation (with dimension explosion) facilitating by-page activity level filtering |
page | monthly_digest | Monthly pre-computation (with dimension explosion) facilitating by-page activity level filtering |
user | create | When a new user is registered. |
user | rename | When the name of a user is changed. |
user | altergroups | When the groups (rights) of a user are changed. |
user | alterblocks | When the blocks of a user are changed. |
user | daily_digest | Daily pre-computation (with dimension explosion) facilitating by-user activity level filtering |
user | monthly_digest | Monthly pre-computation (with dimension explosion) facilitating by-user activity level filtering |
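The digest rows make it possible to filter by activity level without re-aggregating raw revisions. A hedged sketch of using the user monthly digests (the snapshot, month, and edit threshold are illustrative; because digests are exploded along dimensions, the exact dimension handling should be checked against the Mediawiki history reduced algorithm page before trusting counts):

```sql
-- Users with at least 5 edits in May 2018, per project,
-- read from the pre-computed monthly digests.
SELECT project,
       COUNT(1) AS active_editors
FROM wmf.mediawiki_history_reduced
WHERE snapshot = '2018-06'                      -- illustrative snapshot
  AND event_entity = 'user'
  AND event_type = 'monthly_digest'
  AND substr(event_timestamp, 1, 7) = '2018-05' -- digest month
  AND revisions >= 5                            -- illustrative threshold
GROUP BY project;
```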
Changes and known problems
Date | Phab Task | Snapshot version | Details |
---|---|---|---|
To come | task T200270 | 2018-06 | Update page_namespace field to be an int - Previous snapshots updated |
2018-06-21 | task T192483 | 2018-06 | Make table use parquet storage instead of json (made possible thanks to druid-parquet extension) - previous snapshots backfilled |
2018-04-01 | task T192482 | 2018-04 | Make the table permanent and available in Hive (was temporary before) |