Analytics/Data Lake/Page history

Revision as of 13:37, 6 October 2016 by imported>Mforns (initial version)

This page describes the data set that stores the page history of WMF's wikis. It lives in the Analytics Hadoop cluster and is accessible via a Hive/Beeline external table. For more detail on the purpose of this data set, please read Analytics/Data Lake/Page and user history reconstruction. If you don't know how to access this data set, see Analytics/Data access.

Schema

`start_timestamp`           string    // Timestamp from which this state applies (inclusive).
`end_timestamp`             string    // Timestamp until which this state applies (exclusive).
`wiki_db`                   string    // enwiki, dewiki, eswiktionary, etc.
`page_id`                   bigint    // ID of the page, as in the page table.
`page_id_artificial`        string    // Generated ID for deleted pages without real ID.
`page_creation_timestamp`   string    // Timestamp of the page's first revision.
`page_title`                string    // Historical page title.
`page_title_latest`         string    // Page title as of today.
`page_namespace`            int       // Historical namespace.
`page_namespace_latest`     int       // Namespace as of today.
`caused_by_event_type`      string    // Event that caused this state (create, move, delete or restore).
`caused_by_user_id`         bigint    // ID of the user that caused this state.
`inferred_from`             string    // If non-NULL, indicates that some of this state's fields have been inferred
                                      // after an inconsistency in the source data.
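
Because each row describes one state of a page over the half-open interval from `start_timestamp` (inclusive) to `end_timestamp` (exclusive), a point-in-time lookup selects the single row whose interval contains the timestamp. The following is a minimal sketch of that lookup, not the actual Hive setup: it uses an in-memory SQLite table so that it runs anywhere. The table name `page_history`, the helper `title_at`, the sample rows, the fixed-width `YYYYMMDDHHMMSS` timestamp strings, and the use of NULL in `end_timestamp` for the current state are all assumptions made for illustration; check them against the real table before reusing the query.

```python
import sqlite3

# Hypothetical table name and made-up rows, for illustration only.
# Only a subset of the schema's columns is reproduced here.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE page_history (
        start_timestamp      TEXT,
        end_timestamp        TEXT,
        wiki_db              TEXT,
        page_id              INTEGER,
        page_title           TEXT,
        page_namespace       INTEGER,
        caused_by_event_type TEXT
    )
""")
conn.executemany(
    "INSERT INTO page_history VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        # State created by the page's creation, ended by a later move.
        ("20150101000000", "20160301000000", "simplewiki", 1, "Old_title", 0, "create"),
        # Assumption: the current state has a NULL end_timestamp.
        ("20160301000000", None, "simplewiki", 1, "New_title", 0, "move"),
    ],
)


def title_at(conn, wiki_db, page_id, ts):
    """Return the page title in effect at ts, or None if no state matches.

    Assumes fixed-width timestamp strings, so that string comparison
    agrees with chronological order.
    """
    row = conn.execute(
        """
        SELECT page_title
        FROM page_history
        WHERE wiki_db = ?
          AND page_id = ?
          AND start_timestamp <= ?                          -- inclusive start
          AND (end_timestamp IS NULL OR ? < end_timestamp)  -- exclusive end
        """,
        (wiki_db, page_id, ts, ts),
    ).fetchone()
    return row[0] if row else None


print(title_at(conn, "simplewiki", 1, "20151231235959"))  # Old_title
print(title_at(conn, "simplewiki", 1, "20160301000000"))  # New_title
```

The second call falls exactly on the boundary between the two states. Since `end_timestamp` is exclusive and `start_timestamp` is inclusive, the boundary belongs to the newer state, so the intervals never overlap and at most one row matches.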

Changes and known problems

Date        | Schema version | Details | Phab Task
2016/10/06  | n/a            | The dataset contains data for simplewiki and enwiki up to September 2016. The automatic updates to the table still need to be productionized, and all the other wikis still need to be imported. |