Analytics/Data Lake/Content/Mediawiki wikitext history
This page describes the dataset on HDFS and Hive that stores the full historical wikitext revision history of WMF's wikis, as provided through the monthly XML Dumps. It lives in the Analytics Hadoop cluster and is accessible via the Hive/Beeline/Spark external table wmf.mediawiki_wikitext_history. A new monthly snapshot is produced around the 20th of each month (the last XML dump becomes available on the 16th); to check whether it is ready to be queried, look at the status of the mediawiki-wikitext-history-coord Oozie job. Also visit Analytics/Data access if you don't know how to access this dataset.
Since the 2019-10 snapshot, the underlying data is stored in the Avro file format instead of Parquet. This barely changes data size or processing time, and it prevents memory errors caused by vectorized columnar reading in Parquet. Data is stored on HDFS at paths following this pattern: hdfs:///wmf/data/wmf/mediawiki/wikitext/history/snapshot=YYYY-MM/wiki_db=WIKI_DB
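As a minimal sketch, one way to see which snapshot and wiki_db partitions (and thus which HDFS folders) currently exist is to list the table's partitions from Hive or Spark SQL; the snapshot value below is only an example, not necessarily the latest one:

-- List all snapshot/wiki_db partitions of the table (the output can be long).
SHOW PARTITIONS wmf.mediawiki_wikitext_history;

-- Restrict to a single snapshot, e.g. to check whether it has landed yet
-- ('2019-10' is an example value).
SHOW PARTITIONS wmf.mediawiki_wikitext_history PARTITION (snapshot='2019-10');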
Schema
You can get the canonical version of the schema by running describe wmf.mediawiki_wikitext_history from the Hive/Beeline/Spark command line.
Note: The snapshot and wiki_db fields are Hive partitions. They explicitly map to snapshot folders in HDFS. Since the full data is present in every snapshot up to the latest snapshot date, you should always pick a single snapshot in the where clause of your query (see the example query after the schema table below).
col_name | data_type | comment |
---|---|---|
page_id | bigint | id of the page |
page_namespace | int | namespace of the page |
page_title | string | title of the page |
page_redirect_title | string | title of the redirected-to page |
page_restrictions | array<string> | restrictions of the page |
user_id | bigint | id of the user that made the revision (or null/0 if anonymous) |
user_text | string | text of the user that made the revision (either username or IP) |
revision_id | bigint | id of the revision |
revision_parent_id | bigint | id of the parent revision |
revision_timestamp | string | timestamp of the revision (ISO8601 format) |
revision_minor_edit | boolean | whether this revision is a minor edit or not |
revision_comment | string | Comment made with revision |
revision_text_bytes | bigint | bytes number of the revision text |
revision_text_sha1 | string | sha1 hash of the revision text |
revision_text | string | text of the revision |
revision_content_model | string | content model of the revision |
revision_content_format | string | content format of the revision |
snapshot | string | Versioning information to keep multiple datasets (YYYY-MM for regular imports) |
wiki_db | string | The wiki_db project |
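For illustration, here is a minimal query sketch against this schema. The snapshot ('2019-10') and wiki ('simplewiki') values are example assumptions; substitute the snapshot and wiki_db you actually need:

-- Count revisions and sum revision text bytes per namespace
-- for a single snapshot and wiki (always filter on both partitions).
SELECT
  page_namespace,
  COUNT(*)                 AS revision_count,
  SUM(revision_text_bytes) AS total_text_bytes
FROM wmf.mediawiki_wikitext_history
WHERE snapshot = '2019-10'      -- example snapshot, pick the one you need
  AND wiki_db = 'simplewiki'    -- example wiki
GROUP BY page_namespace
ORDER BY revision_count DESC;

Note that revision_text holds the full wikitext of every revision, so selecting it across a whole wiki can be very expensive; restrict to the pages or revisions you actually need.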
Changes and known problems
Date | Phab Task | Snapshot version | Details |
---|---|---|---|
2019-11-01 | task T236687 | 2019-10 | Change underlying file format from parquet to avro to prevent memory issues at read time. |
2018-09-01 | task T202490 | 2018-09 | Creation of the table. Data starts to flow regularly (every month). |
XML-Dumps Raw Data
The mediawiki_wikitext_history dataset is computed from the pages_meta_history XML dumps. Those are imported every month onto HDFS and stored in folders following this pattern: hdfs:///wmf/data/raw/mediawiki/dumps/pages_meta_history/YYYYMMDD/WIKI_DB
Note: There is a one-month difference between the snapshot value of the Avro-converted data and that of the raw data. This is because, by convention, in Hive we use the date of the data that is present (for instance, 2019-11 means that November 2019 data is included), while the raw dump folders are named after the generation date (20191201 means data generation started on 2019-12-01, and therefore contains 2019-11 data but not 2019-12 data).
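As a rough sanity check of this convention (an illustrative sketch, not an official verification; '2019-11' and 'simplewiki' are example values, and a small wiki is chosen to keep the scan cheap), one can look at the latest revision timestamp contained in a given snapshot:

-- Latest revision timestamp present in the 2019-11 snapshot for a small wiki;
-- per the convention above, it should contain 2019-11 data but not 2019-12 data.
SELECT MAX(revision_timestamp) AS latest_revision
FROM wmf.mediawiki_wikitext_history
WHERE snapshot = '2019-11'     -- example snapshot
  AND wiki_db = 'simplewiki';  -- example (small) wiki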