Analytics/Data Lake/Edits/Mediawiki page history

This page describes the data set that stores the page history of WMF's wikis. It lives in the Analytics Hadoop cluster and is accessible via the Hive/Beeline external table wmf.mediawiki_page_history. For more detail on the purpose of this data set, please read Analytics/Data Lake/Page and user history reconstruction. Also visit Analytics/Data access if you don't know how to access this data set.
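
As a quick orientation, once you have access to the cluster you can inspect the table from a Beeline (or Hive) session. A minimal sketch; session setup details depend on your access path:

  -- Inspect the table schema from Beeline or Hive
  USE wmf;
  DESCRIBE mediawiki_page_history;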

Schema

col_name                            data_type   comment
wiki_db                             string      enwiki, dewiki, eswiktionary, etc.
page_id                             bigint      ID of the page, as in the page table.
page_id_artificial                  string      Generated ID for deleted pages without a real ID.
page_creation_timestamp             string      Timestamp of the page's first revision.
page_title                          string      Historical page title.
page_title_latest                   string      Page title as of today.
page_namespace                      int         Historical namespace.
page_namespace_is_content           boolean     Whether the historical namespace is categorized as content.
page_namespace_latest               int         Namespace as of today.
page_namespace_is_content_latest    boolean     Whether the current namespace is categorized as content.
page_is_redirect_latest             boolean     In revision/page events: whether the page is currently a redirect.
start_timestamp                     string      Timestamp from which this state applies (inclusive).
end_timestamp                       string      Timestamp to which this state applies (exclusive).
caused_by_event_type                string      Event that caused this state (create, move, delete or restore).
caused_by_user_id                   bigint      ID of the user that caused this state.
inferred_from                       string      If non-NULL, some fields have been inferred from an inconsistency in the source data.
snapshot                            string      Versioning information to keep multiple datasets (YYYY-MM for regular labs imports).

# Partition Information
# col_name                          data_type   comment
snapshot                            string      Versioning information to keep multiple datasets (YYYY-MM for regular labs imports).
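
Because each row represents a state that is valid from start_timestamp (inclusive) to end_timestamp (exclusive), you can reconstruct a page's state at any point in time. A minimal sketch, assuming the page_id, snapshot value, and timestamp format shown are illustrative, and that a page's current state carries a NULL end_timestamp:

  -- Title and namespace of a page as of a given moment
  SELECT page_title, page_namespace
  FROM wmf.mediawiki_page_history
  WHERE snapshot = '2017-03'                        -- illustrative snapshot value
    AND wiki_db = 'enwiki'
    AND page_id = 12345                             -- hypothetical page id
    AND start_timestamp <= '2016-06-01 00:00:00'    -- assumed timestamp string format
    AND (end_timestamp > '2016-06-01 00:00:00'
         OR end_timestamp IS NULL);                 -- assumed: NULL marks the current state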

Note the snapshot field: it is a Hive partition that maps explicitly to snapshot folders in HDFS. Since every snapshot contains the full data up to the snapshot date, you should always specify a snapshot partition predicate in the WHERE clause of your queries.
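
For example, a query that counts page states per wiki should pin a single snapshot (the snapshot value 2017-03 is illustrative):

  -- Count page states per wiki within one snapshot
  SELECT wiki_db, COUNT(*) AS page_states
  FROM wmf.mediawiki_page_history
  WHERE snapshot = '2017-03'   -- omitting this predicate would scan every copy of the data
  GROUP BY wiki_db;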

Changes and known problems

Date        Schema version   Details                                                                   Phab Task
2016/10/06  n/a              The dataset contains data for simplewiki and enwiki until September 2016. We still need to productionize the automatic updates to the table and import all the wikis.
2017/03/01  n/a              Added the snapshot partition, allowing multiple versions of the page history to be kept. Data started flowing regularly (every month) from labs.