You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org
Analytics/Data Lake/Schemas/Mediawiki history: Difference between revisions
imported>Joal m (Joal moved page Analytics/Data Lake/Mediawiki history to Analytics/Data Lake/Schemas/Mediawiki history: Organizing doc before first internal production release.) |
imported>Joal (Update for first internal productionisation) |
||
Line 1: | Line 1: | ||
This page describes the data set that stores the '''denormalized edit history''' of WMF's wikis. It lives in Analytic's Hadoop cluster and is accessible via | This page describes the data set that stores the '''denormalized edit history''' of WMF's wikis. It lives in [[Analytics/Cluster|Analytic's Hadoop cluster]] and is accessible via the Hive/Beeline external table <code>wmf.mediawiki_history</code>. For more detail of the purpose of this data set, please read [[Analytics/Data Lake/Denormalization and historification|Analytics/Data_lake/Denormalization_and_historification]]. Also visit [[Analytics/Data access]] if you don't know how to access this data set. | ||
=== Schema === | === Schema === | ||
<syntaxhighlight> | |||
col_name data_type comment | |||
wiki_db string enwiki, dewiki, eswiktionary, etc. | |||
event_entity string revision, user or page | |||
event_type string create, move, delete, etc. Detailed explanation in the docs under #Event_types | |||
event_timestamp string When this event ocurred, in YYYYMMDDHHmmss format | |||
event_comment string Comment related to this event, sourced from log_comment, rev_comment, etc. | |||
event_user_id bigint Id of the user that caused the event | |||
event_user_text string Historical text of the user that caused the event | |||
event_user_text_latest string Current text of the user that caused the event | |||
event_user_blocks array<string> Historical blocks of the user that caused the event | |||
event_user_blocks_latest array<string> Current blocks of the user that caused the event | |||
event_user_groups array<string> Historical groups of the user that caused the event | |||
event_user_groups_latest array<string> Current groups of the user that caused the event | |||
event_user_is_created_by_self boolean Whether the event_user created their own account | |||
event_user_is_created_by_system boolean Whether the event_user account was created by mediawiki (eg. centralauth) | |||
event_user_is_created_by_peer boolean Whether the event_user account was created by another user | |||
event_user_is_anonymous boolean Whether the event_user is not registered | |||
event_user_is_bot_by_name boolean Whether the event_user's name matches patterns we use to identify bots | |||
event_user_creation_timestamp string Registration timestamp of the user that caused the event | |||
page_id bigint In revision/page events: id of the page | |||
page_title string In revision/page events: historical title of the page | |||
page_title_latest string In revision/page events: current title of the page | |||
page_namespace int In revision/page events: historical namespace of the page. | |||
page_namespace_is_content boolean In revision/page events: historical namespace of the page is categorized as content | |||
page_namespace_latest int In revision/page events: current namespace of the page | |||
page_namespace_is_content_latest boolean In revision/page events: current namespace of the page is categorized as content | |||
page_is_redirect_latest boolean In revision/page events: whether the page is currently a redirect | |||
page_creation_timestamp string In revision/page events: creation timestamp of the page | |||
user_id bigint In user events: id of the user | |||
user_text string In user events: historical user text | |||
user_text_latest string In user events: current user text | |||
user_blocks array<string> In user events: historical user blocks | |||
user_blocks_latest array<string> In user events: current user blocks | |||
user_groups array<string> In user events: historical user groups | |||
user_groups_latest array<string> In user events: current user groups | |||
user_is_created_by_self boolean In user events: whether the user created their own account | |||
user_is_created_by_system boolean In user events: whether the user account was created by mediawiki | |||
user_is_created_by_peer boolean In user events: whether the user account was created by another user | |||
user_is_anonymous boolean In user events: whether the user is not registered | |||
user_is_bot_by_name boolean In user events: whether the user's name matches patterns we use to identify bots | |||
user_creation_timestamp string In user events: registration timestamp of the user. | |||
revision_id bigint In revision events: id of the revision | |||
revision_parent_id bigint In revision events: id of the parent revision | |||
revision_minor_edit boolean In revision events: whether it is a minor edit or not | |||
revision_text_bytes bigint In revision events: number of bytes of revision | |||
revision_text_bytes_diff bigint In revision events: change in bytes relative to parent revision (can be negative). | |||
revision_text_sha1 string In revision events: sha1 hash of the revision | |||
revision_content_model string In revision events: content model of revision | |||
revision_content_format string In revision events: content format of revision | |||
revision_is_deleted boolean In revision events: whether this revision has been deleted (moved to archive table) | |||
revision_deleted_timestamp string In revision events: the timestamp when the revision was deleted | |||
revision_is_identity_reverted boolean In revision events: whether this revision was reverted by another future revision | |||
revision_first_identity_reverting_revision_id bigint In revision events: id of the revision that reverted this revision | |||
revision_first_identity_revert_timestamp string In revision events: timestamp of the revision that reverted this revision | |||
revision_is_productive boolean In revision events: whether this revision was reverted within 1 day | |||
revision_is_identity_revert boolean In revision events: whether this revision reverts other revisions | |||
snapshot string Versioning information to keep multiple datasets (YYYY-MM for regular labs imports) | |||
# Partition Information | |||
# col_name data_type comment | |||
snapshot string Versioning information to keep multiple datasets (YYYY-MM for regular labs imports) | |||
</syntaxhighlight>Note the <code>snapshot</code> field: It is a [https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-DataUnits Hive partitions]. It explicitly maps to snapshot folders in HDFS. Since the full data is present in every snapshot up to the snapshot date, you should always specify a snapshot partition predicate in the <code>where</code> clause of your queries. | |||
=== Important Fields === | |||
Due to the denormalization of the history data, filtering by <code>event_entity</code> is mandatory not to mix incompatible data. | |||
Similarly, <code>event_types</code> filtering can/must be used depending of the analysis. | |||
{| class="wikitable" | {| class="wikitable" | ||
!Entity | !Entity | ||
Line 273: | Line 117: | ||
|n/a | |n/a | ||
|The dataset contains data for simplewiki and enwiki until september 2016. Still we need to productionize the automatic updates to that table and import all the wikis. | |The dataset contains data for simplewiki and enwiki until september 2016. Still we need to productionize the automatic updates to that table and import all the wikis. | ||
| | |||
|- | |||
|2017/03/01 | |||
|n/a | |||
|Add the <code>snapshot</code> partition, allowing to keep multiple versions of the history. Data starts to flow regularly (every month) from labs. | |||
| | | | ||
|} | |} |
Revision as of 14:32, 28 March 2017
This page describes the data set that stores the denormalized edit history of WMF's wikis. It lives in Analytic's Hadoop cluster and is accessible via the Hive/Beeline external table wmf.mediawiki_history
. For more detail of the purpose of this data set, please read Analytics/Data_lake/Denormalization_and_historification. Also visit Analytics/Data access if you don't know how to access this data set.
Schema
col_name data_type comment
wiki_db string enwiki, dewiki, eswiktionary, etc.
event_entity string revision, user or page
event_type string create, move, delete, etc. Detailed explanation in the docs under #Event_types
event_timestamp string When this event ocurred, in YYYYMMDDHHmmss format
event_comment string Comment related to this event, sourced from log_comment, rev_comment, etc.
event_user_id bigint Id of the user that caused the event
event_user_text string Historical text of the user that caused the event
event_user_text_latest string Current text of the user that caused the event
event_user_blocks array<string> Historical blocks of the user that caused the event
event_user_blocks_latest array<string> Current blocks of the user that caused the event
event_user_groups array<string> Historical groups of the user that caused the event
event_user_groups_latest array<string> Current groups of the user that caused the event
event_user_is_created_by_self boolean Whether the event_user created their own account
event_user_is_created_by_system boolean Whether the event_user account was created by mediawiki (eg. centralauth)
event_user_is_created_by_peer boolean Whether the event_user account was created by another user
event_user_is_anonymous boolean Whether the event_user is not registered
event_user_is_bot_by_name boolean Whether the event_user's name matches patterns we use to identify bots
event_user_creation_timestamp string Registration timestamp of the user that caused the event
page_id bigint In revision/page events: id of the page
page_title string In revision/page events: historical title of the page
page_title_latest string In revision/page events: current title of the page
page_namespace int In revision/page events: historical namespace of the page.
page_namespace_is_content boolean In revision/page events: historical namespace of the page is categorized as content
page_namespace_latest int In revision/page events: current namespace of the page
page_namespace_is_content_latest boolean In revision/page events: current namespace of the page is categorized as content
page_is_redirect_latest boolean In revision/page events: whether the page is currently a redirect
page_creation_timestamp string In revision/page events: creation timestamp of the page
user_id bigint In user events: id of the user
user_text string In user events: historical user text
user_text_latest string In user events: current user text
user_blocks array<string> In user events: historical user blocks
user_blocks_latest array<string> In user events: current user blocks
user_groups array<string> In user events: historical user groups
user_groups_latest array<string> In user events: current user groups
user_is_created_by_self boolean In user events: whether the user created their own account
user_is_created_by_system boolean In user events: whether the user account was created by mediawiki
user_is_created_by_peer boolean In user events: whether the user account was created by another user
user_is_anonymous boolean In user events: whether the user is not registered
user_is_bot_by_name boolean In user events: whether the user's name matches patterns we use to identify bots
user_creation_timestamp string In user events: registration timestamp of the user.
revision_id bigint In revision events: id of the revision
revision_parent_id bigint In revision events: id of the parent revision
revision_minor_edit boolean In revision events: whether it is a minor edit or not
revision_text_bytes bigint In revision events: number of bytes of revision
revision_text_bytes_diff bigint In revision events: change in bytes relative to parent revision (can be negative).
revision_text_sha1 string In revision events: sha1 hash of the revision
revision_content_model string In revision events: content model of revision
revision_content_format string In revision events: content format of revision
revision_is_deleted boolean In revision events: whether this revision has been deleted (moved to archive table)
revision_deleted_timestamp string In revision events: the timestamp when the revision was deleted
revision_is_identity_reverted boolean In revision events: whether this revision was reverted by another future revision
revision_first_identity_reverting_revision_id bigint In revision events: id of the revision that reverted this revision
revision_first_identity_revert_timestamp string In revision events: timestamp of the revision that reverted this revision
revision_is_productive boolean In revision events: whether this revision was reverted within 1 day
revision_is_identity_revert boolean In revision events: whether this revision reverts other revisions
snapshot string Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
# Partition Information
# col_name data_type comment
snapshot string Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
Note the snapshot
field: It is a Hive partitions. It explicitly maps to snapshot folders in HDFS. Since the full data is present in every snapshot up to the snapshot date, you should always specify a snapshot partition predicate in the where
clause of your queries.
Important Fields
Due to the denormalization of the history data, filtering by event_entity
is mandatory not to mix incompatible data.
Similarly, event_types
filtering can/must be used depending of the analysis.
Entity | Event type | Meaning |
---|---|---|
revision | create | When a revision is created, when an edit happens. |
page | create | When the first edit to a page is done. |
move | When moving a page, changing its title. | |
delete | When deleting a page. | |
user | create | When a new user is registered. |
rename | When the name of a user is changed. | |
altergroups | When the groups (rights) of a user are changed. | |
alterblocks | When the blocks of a user are changed. |
Changes and known problems
Date | Schema version | Details | Phab
Task |
---|---|---|---|
2016/10/06 | n/a | The dataset contains data for simplewiki and enwiki until september 2016. Still we need to productionize the automatic updates to that table and import all the wikis. | |
2017/03/01 | n/a | Add the snapshot partition, allowing to keep multiple versions of the history. Data starts to flow regularly (every month) from labs.
|