Analytics/Data Lake/Schemas/Mediawiki history: Difference between revisions

imported>Joal
(Update for first internal productionisation)
imported>Milimetric
This page describes the data set that stores the '''denormalized edit history''' of WMF's wikis. It lives in the [[Analytics/Cluster|Analytics Hadoop cluster]] and is accessible via the Hive/Beeline external table <code>wmf.mediawiki_history</code>. For more detail on the purpose of this data set, see [[Analytics/Data Lake/Denormalization and historification|Analytics/Data_lake/Denormalization_and_historification]]. If you don't know how to access this data set, see [[Analytics/Data access]].
 
=== Schema ===
<syntaxhighlight>
 
col_name data_type comment
wiki_db            string              enwiki, dewiki, eswiktionary, etc.
event_entity        string              revision, user or page
event_type          string              create, move, delete, etc.  Detailed explanation in the docs under #Event_types
event_timestamp    string              When this event occurred, in YYYYMMDDHHmmss format
event_comment      string              Comment related to this event, sourced from log_comment, rev_comment, etc.
event_user_id      bigint              Id of the user that caused the event
event_user_text    string              Historical text of the user that caused the event
event_user_text_latest string              Current text of the user that caused the event
event_user_blocks  array<string>      Historical blocks of the user that caused the event
event_user_blocks_latest array<string>      Current blocks of the user that caused the event
event_user_groups  array<string>      Historical groups of the user that caused the event
event_user_groups_latest array<string>      Current groups of the user that caused the event
event_user_is_created_by_self boolean            Whether the event_user created their own account
event_user_is_created_by_system boolean            Whether the event_user account was created by mediawiki (e.g. centralauth)
event_user_is_created_by_peer boolean            Whether the event_user account was created by another user
event_user_is_anonymous boolean            Whether the event_user is not registered
event_user_is_bot_by_name boolean            Whether the event_user's name matches patterns we use to identify bots
event_user_creation_timestamp string              Registration timestamp of the user that caused the event
page_id            bigint              In revision/page events: id of the page
page_title          string              In revision/page events: historical title of the page
page_title_latest  string              In revision/page events: current title of the page
page_namespace      int                In revision/page events: historical namespace of the page.
page_namespace_is_content boolean            In revision/page events: historical namespace of the page is categorized as content
page_namespace_latest int                In revision/page events: current namespace of the page
page_namespace_is_content_latest boolean            In revision/page events: current namespace of the page is categorized as content
page_is_redirect_latest boolean            In revision/page events: whether the page is currently a redirect
page_creation_timestamp string              In revision/page events: creation timestamp of the page
user_id            bigint              In user events: id of the user
user_text          string              In user events: historical user text
user_text_latest    string              In user events: current user text
user_blocks        array<string>      In user events: historical user blocks
user_blocks_latest  array<string>      In user events: current user blocks
user_groups        array<string>      In user events: historical user groups
user_groups_latest  array<string>      In user events: current user groups
user_is_created_by_self boolean            In user events: whether the user created their own account
user_is_created_by_system boolean            In user events: whether the user account was created by mediawiki
user_is_created_by_peer boolean            In user events: whether the user account was created by another user
user_is_anonymous  boolean            In user events: whether the user is not registered
user_is_bot_by_name boolean            In user events: whether the user's name matches patterns we use to identify bots
user_creation_timestamp string              In user events: registration timestamp of the user.
revision_id        bigint              In revision events: id of the revision
revision_parent_id  bigint              In revision events: id of the parent revision
revision_minor_edit boolean            In revision events: whether it is a minor edit or not
revision_text_bytes bigint              In revision events: number of bytes of revision
revision_text_bytes_diff bigint              In revision events: change in bytes relative to parent revision (can be negative).
revision_text_sha1  string              In revision events: sha1 hash of the revision
revision_content_model string              In revision events: content model of revision
revision_content_format string              In revision events: content format of revision
revision_is_deleted boolean            In revision events: whether this revision has been deleted (moved to archive table)
revision_deleted_timestamp string              In revision events: the timestamp when the revision was deleted
revision_is_identity_reverted boolean            In revision events: whether this revision was reverted by another future revision
revision_first_identity_reverting_revision_id bigint              In revision events: id of the revision that reverted this revision
revision_first_identity_revert_timestamp string              In revision events: timestamp of the revision that reverted this revision
revision_is_productive boolean            In revision events: whether this revision was reverted within 1 day
revision_is_identity_revert boolean            In revision events: whether this revision reverts other revisions
snapshot            string              Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
# Partition Information
# col_name            data_type          comment           
snapshot            string              Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
 
</syntaxhighlight>Note the <code>snapshot</code> field: it is a [https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-DataUnits Hive partition] and maps directly to snapshot folders in HDFS. Because every snapshot contains the full data up to its snapshot date, always specify a snapshot partition predicate in the <code>where</code> clause of your queries. Without it, a query reads every snapshot and counts the same events several times.
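
As a minimal sketch of the snapshot predicate described above, the statements below first list the available partitions and then count edits per wiki within a single snapshot. The snapshot value <code>'2017-03'</code> is an assumption used for illustration; replace it with a partition that <code>SHOW PARTITIONS</code> actually returns.
<syntaxhighlight lang="sql">
-- List the snapshot partitions that currently exist for the table.
SHOW PARTITIONS wmf.mediawiki_history;

-- Count edits (revision create events) per wiki in one snapshot.
-- '2017-03' is a hypothetical snapshot value in the YYYY-MM format.
SELECT
    wiki_db,
    COUNT(*) AS revisions
FROM wmf.mediawiki_history
WHERE snapshot = '2017-03'            -- restrict the query to one partition
  AND event_entity = 'revision'
  AND event_type = 'create'
GROUP BY wiki_db;
</syntaxhighlight>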
 
=== Important Fields ===
Because the history data is denormalized, rows for revisions, pages and users share one table. Always filter by <code>event_entity</code> so that incompatible data is not mixed.
 
Similarly, filter by <code>event_type</code> when the analysis requires it.
{| class="wikitable"
!Entity
!Event type
!Meaning
|-
|revision
|create
|When a revision is created, i.e. when an edit happens.
|-
| rowspan="3" |page
|create
|When the first edit to a page is made.
|-
|move
|When moving a page, changing its title.
|-
|delete
|When deleting a page.
|-
| rowspan="4" |user
|create
|When a new user is registered.
|-
|rename
|When the name of a user is changed.
|-
|altergroups
|When the groups (rights) of a user are changed.
|-
|alterblocks
|When the blocks of a user are changed.
|}
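
The entity and event type filters in the table above can be sketched as follows. Both queries are illustrative only: the snapshot value <code>'2017-03'</code> and the choice of <code>simplewiki</code> are assumptions, and each query reads only the columns that are documented for the entity it filters on.
<syntaxhighlight lang="sql">
-- Page moves: compare the historical title with the current one.
-- '2017-03' is a hypothetical snapshot value.
SELECT
    page_id,
    page_title,          -- title at the time of the event
    page_title_latest,   -- title as of the snapshot
    event_timestamp
FROM wmf.mediawiki_history
WHERE snapshot = '2017-03'
  AND wiki_db = 'simplewiki'
  AND event_entity = 'page'
  AND event_type = 'move'
LIMIT 10;

-- User renames: the user_* columns are documented for user events.
SELECT
    user_id,
    user_text,           -- user name at the time of the event
    user_text_latest,    -- user name as of the snapshot
    event_timestamp
FROM wmf.mediawiki_history
WHERE snapshot = '2017-03'
  AND wiki_db = 'simplewiki'
  AND event_entity = 'user'
  AND event_type = 'rename'
LIMIT 10;
</syntaxhighlight>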
 
===Changes and known problems===
{| class="wikitable"
!Date
!Schema version
!Details
!Phab task
|-
|2016/10/06
|n/a
|The dataset contains data for simplewiki and enwiki until September 2016. We still need to productionize the automatic updates to that table and import all the wikis.
|
|-
|2017/03/01
|n/a
|Added the <code>snapshot</code> partition, which allows keeping multiple versions of the history. Data starts to flow regularly (every month) from labs.
|
|}

Revision as of 14:09, 7 April 2017