Analytics/Data Lake/Edits/Mediawiki page history
This page describes the data set that stores the page history of WMF's wikis. It lives in the Analytics Hadoop cluster and is accessible via the Hive/Beeline external table
wmf.mediawiki_page_history. For more detail on the purpose of this data set, please read Analytics/Data Lake/Page and user history reconstruction. If you don't know how to access this data set, see Analytics/Data access.
|col_name||data_type||comment|
|wiki_db||string||enwiki, dewiki, eswiktionary, etc.|
|page_id||bigint||Id of the page, as in the page table.|
|page_id_artificial||string||Generated Id for deleted pages without real Id.|
|page_creation_timestamp||string||Timestamp of the page's first revision.|
|page_title||string||Historical page title.|
|page_title_latest||string||Page title as of today.|
|page_namespace||int||Historical namespace.|
|page_namespace_is_content||boolean||Whether the historical namespace is categorized as content|
|page_namespace_latest||int||Namespace as of today.|
|page_namespace_is_content_latest||boolean||Whether the current namespace is categorized as content|
|page_is_redirect_latest||boolean||In revision/page events: whether the page is currently a redirect|
|start_timestamp||string||Timestamp from where this state applies (inclusive).|
|end_timestamp||string||Timestamp to where this state applies (exclusive).|
|caused_by_event_type||string||Event that caused this state (create, move, delete or restore).|
|caused_by_user_id||bigint||ID from the user that caused this state.|
|inferred_from||string||If non-NULL, some fields have been inferred from an inconsistency in the source data.|
|snapshot||string||Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)|
# Partition Information
|col_name||data_type||comment|
|snapshot||string||Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)|
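Each row describes one state of a page, valid over the half-open interval from start_timestamp (inclusive) to end_timestamp (exclusive). The sketch below illustrates how that interval could be used to find the state of a page at a given moment. It is plain Python over made-up rows, not a query against the real table. The PageState type, the state_at helper and the sample titles are hypothetical. It also assumes that timestamps are strings that sort chronologically and that a missing bound means the state is open-ended; neither is confirmed by this page.

```python
from typing import List, NamedTuple, Optional


class PageState(NamedTuple):
    """Hypothetical subset of the columns of one page history row."""
    page_id: int
    page_title: str
    start_timestamp: Optional[str]  # inclusive; None assumed to mean "no lower bound"
    end_timestamp: Optional[str]    # exclusive; None assumed to mean "still current"


def state_at(states: List[PageState], page_id: int, ts: str) -> Optional[PageState]:
    """Return the state of page_id that applies at timestamp ts, if any."""
    for s in states:
        if s.page_id != page_id:
            continue
        starts_ok = s.start_timestamp is None or s.start_timestamp <= ts
        ends_ok = s.end_timestamp is None or ts < s.end_timestamp
        if starts_ok and ends_ok:
            return s
    return None


# Made-up example: page 1 is created as "Foo", then moved to "Bar".
states = [
    PageState(1, "Foo", "20150101000000", "20160301000000"),
    PageState(1, "Bar", "20160301000000", None),
]

before_move = state_at(states, 1, "20160229235959")
at_move = state_at(states, 1, "20160301000000")
before_creation = state_at(states, 1, "20140101000000")
```

Because the end bound is exclusive, the moment of the move belongs to the new state ("Bar") and not to the old one, so consecutive states never overlap.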
snapshot field: This is a Hive partition that maps explicitly to snapshot folders in HDFS. Since every snapshot contains the full data up to the snapshot date, you should always specify a snapshot partition predicate in the
WHERE clause of your queries. Otherwise the query reads every snapshot and returns the same history once per snapshot.
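The effect of the snapshot predicate can be shown with a small, self-contained sketch. It uses Python's built-in sqlite3 module as a stand-in for Hive, with an in-memory database attached under the name wmf so that the table name matches. The rows and the snapshot values 2016-08 and 2016-09 are invented for illustration, and only a few of the columns are included. In Hive or Beeline you would write the snapshot value as a literal instead of the ? placeholder.

```python
import sqlite3

# In-memory stand-in for the Hive table (assumption: sqlite3 replaces Hive here).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS wmf")
conn.execute("""
    CREATE TABLE wmf.mediawiki_page_history (
        wiki_db TEXT,
        page_id INTEGER,
        page_title TEXT,
        caused_by_event_type TEXT,
        snapshot TEXT
    )
""")

# Invented rows: the 2016-09 snapshot repeats everything in 2016-08 and adds one move.
rows = [
    ("enwiki", 1, "Foo", "create", "2016-08"),
    ("enwiki", 2, "Baz", "create", "2016-08"),
    ("enwiki", 1, "Foo", "create", "2016-09"),
    ("enwiki", 2, "Baz", "create", "2016-09"),
    ("enwiki", 1, "Bar", "move", "2016-09"),
]
conn.executemany(
    "INSERT INTO wmf.mediawiki_page_history VALUES (?, ?, ?, ?, ?)", rows
)

# Query restricted to a single snapshot partition.
QUERY = """
    SELECT caused_by_event_type, COUNT(*) AS states
    FROM wmf.mediawiki_page_history
    WHERE wiki_db = 'enwiki'
      AND snapshot = ?
    GROUP BY caused_by_event_type
    ORDER BY caused_by_event_type
"""
with_predicate = conn.execute(QUERY, ("2016-09",)).fetchall()

# Same table without the snapshot predicate: states are counted once per snapshot.
without_predicate = conn.execute(
    "SELECT COUNT(*) FROM wmf.mediawiki_page_history WHERE wiki_db = 'enwiki'"
).fetchone()[0]

print(with_predicate)
print(without_predicate)
```

In this toy data the filtered query sees three states, while the unfiltered count is five because the two states present in the older snapshot are counted again.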
Changes and known problems
|2016/10/06||n/a||The dataset contains data for simplewiki and enwiki until September 2016. We still need to productionize the automatic updates to that table and import all the wikis.|