
Analytics/Data Lake/Schemas/Mediawiki history

Revision as of 12:45, 24 March 2017 by imported>Joal (Joal moved page Analytics/Data Lake/Mediawiki history to Analytics/Data Lake/Schemas/Mediawiki history: Organizing doc before first internal production release.)

This page describes the data set that stores the denormalized edit history of WMF's wikis. It lives in the Analytics Hadoop cluster and is accessible via a Hive/Beeline external table. For more detail on the purpose of this data set, see Analytics/Data_lake/Denormalization_and_historification. If you don't know how to access this data set, see Analytics/Data access.

Schema

col_name data_type comment
wiki_db string enwiki, dewiki, eswiktionary, etc.
event_entity string revision, user or page
event_type string create, move, delete, etc. Detailed explanation in the docs under #Event_types
event_timestamp string When this event occurred, in YYYYMMDDHHmmss format
event_comment string Comment related to this event, sourced from log_comment, rev_comment, etc.
event_user_id bigint Id of the user that caused the event
event_user_text string Historical text of the user that caused the event
event_user_text_latest string Current text of the user that caused the event
event_user_blocks array<string> Historical blocks of the user that caused the event
event_user_blocks_latest array<string> Current blocks of the user that caused the event
event_user_groups array<string> Historical groups of the user that caused the event
event_user_groups_latest array<string> Current groups of the user that caused the event
event_user_is_created_by_self boolean Whether the event_user created their own account
event_user_is_created_by_system boolean Whether the event_user account was created by MediaWiki (e.g. CentralAuth)
event_user_is_created_by_peer boolean Whether the event_user account was created by another user
event_user_is_anonymous boolean Whether the event_user is not registered
event_user_is_bot_by_name boolean Whether the event_user's name matches patterns we use to identify bots
event_user_creation_timestamp string Registration timestamp of the user that caused the event
page_id bigint In revision/page events: id of the page
page_title string In revision/page events: historical title of the page
page_title_latest string In revision/page events: current title of the page
page_namespace int In revision/page events: historical namespace of the page
page_namespace_is_content boolean In revision/page events: historical namespace of the page is categorized as content
page_namespace_latest int In revision/page events: current namespace of the page
page_namespace_is_content_latest boolean In revision/page events: current namespace of the page is categorized as content
page_is_redirect_latest boolean In revision/page events: whether the page is currently a redirect
page_creation_timestamp string In revision/page events: creation timestamp of the page
user_id bigint In user events: id of the user
user_text string In user events: historical user text
user_text_latest string In user events: current user text
user_blocks array<string> In user events: historical user blocks
user_blocks_latest array<string> In user events: current user blocks
user_groups array<string> In user events: historical user groups
user_groups_latest array<string> In user events: current user groups
user_is_created_by_self boolean In user events: whether the user created their own account
user_is_created_by_system boolean In user events: whether the user account was created by MediaWiki
user_is_created_by_peer boolean In user events: whether the user account was created by another user
user_is_anonymous boolean In user events: whether the user is not registered
user_is_bot_by_name boolean In user events: whether the user's name matches patterns we use to identify bots
user_creation_timestamp string In user events: registration timestamp of the user
revision_id bigint In revision events: id of the revision
revision_parent_id bigint In revision events: id of the parent revision
revision_minor_edit boolean In revision events: whether it is a minor edit or not
revision_text_bytes bigint In revision events: number of bytes of revision
revision_text_bytes_diff bigint In revision events: change in bytes relative to parent revision (can be negative).
revision_text_sha1 string In revision events: sha1 hash of the revision
revision_content_model string In revision events: content model of revision
revision_content_format string In revision events: content format of revision
revision_is_deleted boolean In revision events: whether this revision has been deleted (moved to archive table)
revision_deleted_timestamp string In revision events: the timestamp when the revision was deleted
revision_is_identity_reverted boolean In revision events: whether this revision was reverted by another future revision
revision_first_identity_reverting_revision_id bigint In revision events: id of the revision that reverted this revision
revision_first_identity_revert_timestamp string In revision events: timestamp of the revision that reverted this revision
revision_is_productive boolean In revision events: whether this revision is productive, i.e. was not reverted within 1 day
revision_is_identity_revert boolean In revision events: whether this revision reverts other revisions
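
All timestamp columns in the schema above are stored as strings, and event_timestamp is documented as using the YYYYMMDDHHmmss format. The following is a minimal Python sketch of how such a value could be turned into a datetime once it has been pulled out of Hive. The helper name parse_event_timestamp is hypothetical, and treating the values as UTC is an assumption based on how MediaWiki stores timestamps; neither is stated on this page.

```python
from datetime import datetime, timezone

# strptime pattern matching the documented YYYYMMDDHHmmss format.
EVENT_TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"


def parse_event_timestamp(value):
    """Parse a YYYYMMDDHHmmss string; return None for missing values.

    Hypothetical helper, not part of the data set or its tooling.
    """
    if not value:
        return None
    parsed = datetime.strptime(value, EVENT_TIMESTAMP_FORMAT)
    # Assumption: timestamps are UTC, as in MediaWiki's own tables.
    return parsed.replace(tzinfo=timezone.utc)


if __name__ == "__main__":
    # Made-up value, for illustration only.
    print(parse_event_timestamp("20160930235959").isoformat())
```

The same helper would presumably apply to the other string timestamp columns (page_creation_timestamp, user_creation_timestamp, and so on) if they share the format, which this page does not confirm.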

Event types

Entity Event type Meaning
revision create When a revision is created, i.e. when an edit happens.
page create When the first edit to a page is made.
page move When a page is moved, changing its title.
page delete When a page is deleted.
user create When a new user is registered.
user rename When the name of a user is changed.
user altergroups When the groups (rights) of a user are changed.
user alterblocks When the blocks of a user are changed.
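
Because every row carries both event_entity and event_type, the pairs listed above can serve as a sanity check and as a grouping key. The following Python sketch illustrates this on rows represented as plain dictionaries. The names EVENT_TYPES, is_known_event and count_events are hypothetical, and the sample rows are made up for illustration rather than taken from the data set.

```python
from collections import Counter

# Entity -> event types, transcribed from the Event types table.
EVENT_TYPES = {
    "revision": {"create"},
    "page": {"create", "move", "delete"},
    "user": {"create", "rename", "altergroups", "alterblocks"},
}


def is_known_event(event_entity, event_type):
    """Return True if the pair appears in the Event types table."""
    return event_type in EVENT_TYPES.get(event_entity, set())


def count_events(rows):
    """Count rows per (event_entity, event_type), skipping unknown pairs."""
    return Counter(
        (row["event_entity"], row["event_type"])
        for row in rows
        if is_known_event(row["event_entity"], row["event_type"])
    )


# Made-up rows carrying only the two columns used here.
sample_rows = [
    {"event_entity": "revision", "event_type": "create"},
    {"event_entity": "revision", "event_type": "create"},
    {"event_entity": "page", "event_type": "move"},
    {"event_entity": "user", "event_type": "rename"},
    {"event_entity": "page", "event_type": "rename"},  # not a listed pair
]

if __name__ == "__main__":
    for (entity, event_type), n in sorted(count_events(sample_rows).items()):
        print(entity, event_type, n)
```

Skipping unknown pairs silently is only one possible design choice; raising an error instead may be preferable when the goal is to detect schema changes.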

Changes and known problems

Date Schema version Details Phab Task
2016/10/06 n/a The dataset contains data for simplewiki and enwiki until September 2016. We still need to productionize the automatic updates to that table and import all the wikis.