You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org

Analytics/Data Lake/Schemas/Mediawiki history

From Wikitech-static
< Analytics‎ | Data Lake
Revision as of 14:32, 28 March 2017 by imported>Joal (Update for first internal productionisation)
Jump to navigation Jump to search

This page describes the data set that stores the denormalized edit history of WMF's wikis. It lives in Analytic's Hadoop cluster and is accessible via the Hive/Beeline external table wmf.mediawiki_history. For more detail of the purpose of this data set, please read Analytics/Data_lake/Denormalization_and_historification. Also visit Analytics/Data access if you don't know how to access this data set.

Schema

col_name	data_type	comment
wiki_db             	string              	enwiki, dewiki, eswiktionary, etc.
event_entity        	string              	revision, user or page
event_type          	string              	create, move, delete, etc.  Detailed explanation in the docs under #Event_types
event_timestamp     	string              	When this event ocurred, in YYYYMMDDHHmmss format
event_comment       	string              	Comment related to this event, sourced from log_comment, rev_comment, etc.
event_user_id       	bigint              	Id of the user that caused the event
event_user_text     	string              	Historical text of the user that caused the event
event_user_text_latest	string              	Current text of the user that caused the event
event_user_blocks   	array<string>       	Historical blocks of the user that caused the event
event_user_blocks_latest	array<string>       	Current blocks of the user that caused the event
event_user_groups   	array<string>       	Historical groups of the user that caused the event
event_user_groups_latest	array<string>       	Current groups of the user that caused the event
event_user_is_created_by_self	boolean             	Whether the event_user created their own account
event_user_is_created_by_system	boolean             	Whether the event_user account was created by mediawiki (eg. centralauth)
event_user_is_created_by_peer	boolean             	Whether the event_user account was created by another user
event_user_is_anonymous	boolean             	Whether the event_user is not registered
event_user_is_bot_by_name	boolean             	Whether the event_user's name matches patterns we use to identify bots
event_user_creation_timestamp	string              	Registration timestamp of the user that caused the event
page_id             	bigint              	In revision/page events: id of the page
page_title          	string              	In revision/page events: historical title of the page
page_title_latest   	string              	In revision/page events: current title of the page
page_namespace      	int                 	In revision/page events: historical namespace of the page.
page_namespace_is_content	boolean             	In revision/page events: historical namespace of the page is categorized as content
page_namespace_latest	int                 	In revision/page events: current namespace of the page
page_namespace_is_content_latest	boolean             	In revision/page events: current namespace of the page is categorized as content
page_is_redirect_latest	boolean             	In revision/page events: whether the page is currently a redirect
page_creation_timestamp	string              	In revision/page events: creation timestamp of the page
user_id             	bigint              	In user events: id of the user
user_text           	string              	In user events: historical user text
user_text_latest    	string              	In user events: current user text
user_blocks         	array<string>       	In user events: historical user blocks
user_blocks_latest  	array<string>       	In user events: current user blocks
user_groups         	array<string>       	In user events: historical user groups
user_groups_latest  	array<string>       	In user events: current user groups
user_is_created_by_self	boolean             	In user events: whether the user created their own account
user_is_created_by_system	boolean             	In user events: whether the user account was created by mediawiki
user_is_created_by_peer	boolean             	In user events: whether the user account was created by another user
user_is_anonymous   	boolean             	In user events: whether the user is not registered
user_is_bot_by_name 	boolean             	In user events: whether the user's name matches patterns we use to identify bots
user_creation_timestamp	string              	In user events: registration timestamp of the user.
revision_id         	bigint              	In revision events: id of the revision
revision_parent_id  	bigint              	In revision events: id of the parent revision
revision_minor_edit 	boolean             	In revision events: whether it is a minor edit or not
revision_text_bytes 	bigint              	In revision events: number of bytes of revision
revision_text_bytes_diff	bigint              	In revision events: change in bytes relative to parent revision (can be negative).
revision_text_sha1  	string              	In revision events: sha1 hash of the revision
revision_content_model	string              	In revision events: content model of revision
revision_content_format	string              	In revision events: content format of revision
revision_is_deleted 	boolean             	In revision events: whether this revision has been deleted (moved to archive table)
revision_deleted_timestamp	string              	In revision events: the timestamp when the revision was deleted
revision_is_identity_reverted	boolean             	In revision events: whether this revision was reverted by another future revision
revision_first_identity_reverting_revision_id	bigint              	In revision events: id of the revision that reverted this revision
revision_first_identity_revert_timestamp	string              	In revision events: timestamp of the revision that reverted this revision
revision_is_productive	boolean             	In revision events: whether this revision was reverted within 1 day
revision_is_identity_revert	boolean             	In revision events: whether this revision reverts other revisions
snapshot            	string              	Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
	 	 
snapshot            	string              	Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)

Note the snapshot field: It is a Hive partitions. It explicitly maps to snapshot folders in HDFS. Since the full data is present in every snapshot up to the snapshot date, you should always specify a snapshot partition predicate in the where clause of your queries.

Important Fields

Due to the denormalization of the history data, filtering by event_entity is mandatory not to mix incompatible data.

Similarly, event_types filtering can/must be used depending of the analysis.

Entity Event type Meaning
revision create When a revision is created, when an edit happens.
page create When the first edit to a page is done.
move When moving a page, changing its title.
delete When deleting a page.
user create When a new user is registered.
rename When the name of a user is changed.
altergroups When the groups (rights) of a user are changed.
alterblocks When the blocks of a user are changed.

Changes and known problems

Date Schema version Details Phab

Task

2016/10/06 n/a The dataset contains data for simplewiki and enwiki until september 2016. Still we need to productionize the automatic updates to that table and import all the wikis.
2017/03/01 n/a Add the snapshot partition, allowing to keep multiple versions of the history. Data starts to flow regularly (every month) from labs.