You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Analytics/Data Lake/Schemas/Mediawiki history: Difference between revisions

From Wikitech-static
Jump to navigation Jump to search
imported>Joal
m (Joal moved page Analytics/Data Lake/Mediawiki history to Analytics/Data Lake/Schemas/Mediawiki history: Organizing doc before first internal production release.)
 
imported>Joal
(Update for first internal productionisation)
Line 1: Line 1:
This page describes the data set that stores the '''denormalized edit history''' of WMF's wikis. It lives in Analytic's Hadoop cluster and is accessible via a Hive/Beeline external table. For more detail of the purpose of this data set, please read [[Analytics/Data Lake/Denormalization and historification|Analytics/Data_lake/Denormalization_and_historification]]. Also visit [[Analytics/Data access]] if you don't know how to access this data set.
This page describes the data set that stores the '''denormalized edit history''' of WMF's wikis. It lives in [[Analytics/Cluster|Analytic's Hadoop cluster]] and is accessible via the Hive/Beeline external table <code>wmf.mediawiki_history</code>. For more detail of the purpose of this data set, please read [[Analytics/Data Lake/Denormalization and historification|Analytics/Data_lake/Denormalization_and_historification]]. Also visit [[Analytics/Data access]] if you don't know how to access this data set.


=== Schema ===
=== Schema ===
{| class="wikitable sortable"
<syntaxhighlight>
!col_name
 
!data_type
col_name data_type comment
!comment
wiki_db             string             enwiki, dewiki, eswiktionary, etc.
|-
event_entity       string             revision, user or page
|wiki_db
event_type         string             create, move, delete, etc. Detailed explanation in the docs under #Event_types
|string
event_timestamp     string             When this event ocurred, in YYYYMMDDHHmmss format
|enwiki, dewiki, eswiktionary, etc.
event_comment       string             Comment related to this event, sourced from log_comment, rev_comment, etc.
|-
event_user_id       bigint             Id of the user that caused the event
|event_entity
event_user_text     string             Historical text of the user that caused the event
|string
event_user_text_latest string             Current text of the user that caused the event
|revision, user or page
event_user_blocks   array<string>       Historical blocks of the user that caused the event
|-
event_user_blocks_latest array<string>       Current blocks of the user that caused the event
|event_type
event_user_groups   array<string>       Historical groups of the user that caused the event
|string
event_user_groups_latest array<string>       Current groups of the user that caused the event
|create, move, delete, etc. Detailed explanation in the docs under #Event_types
event_user_is_created_by_self boolean             Whether the event_user created their own account
|-
event_user_is_created_by_system boolean             Whether the event_user account was created by mediawiki (eg. centralauth)
|event_timestamp
event_user_is_created_by_peer boolean             Whether the event_user account was created by another user
|string
event_user_is_anonymous boolean             Whether the event_user is not registered
|When this event ocurred, in YYYYMMDDHHmmss format
event_user_is_bot_by_name boolean             Whether the event_user's name matches patterns we use to identify bots
|-
event_user_creation_timestamp string             Registration timestamp of the user that caused the event
|event_comment
page_id             bigint             In revision/page events: id of the page
|string
page_title         string             In revision/page events: historical title of the page
|Comment related to this event, sourced from log_comment, rev_comment, etc.
page_title_latest   string             In revision/page events: current title of the page
|-
page_namespace     int                 In revision/page events: historical namespace of the page.
|event_user_id
page_namespace_is_content boolean             In revision/page events: historical namespace of the page is categorized as content
|bigint
page_namespace_latest int                 In revision/page events: current namespace of the page
|Id of the user that caused the event
page_namespace_is_content_latest boolean             In revision/page events: current namespace of the page is categorized as content
|-
page_is_redirect_latest boolean             In revision/page events: whether the page is currently a redirect
|event_user_text
page_creation_timestamp string             In revision/page events: creation timestamp of the page
|string
user_id             bigint             In user events: id of the user
|Historical text of the user that caused the event
user_text           string             In user events: historical user text
|-
user_text_latest   string             In user events: current user text
|event_user_text_latest
user_blocks         array<string>       In user events: historical user blocks
|string
user_blocks_latest array<string>       In user events: current user blocks
|Current text of the user that caused the event
user_groups         array<string>       In user events: historical user groups
|-
user_groups_latest array<string>       In user events: current user groups
|event_user_blocks
user_is_created_by_self boolean             In user events: whether the user created their own account
|array<string>
user_is_created_by_system boolean             In user events: whether the user account was created by mediawiki
|Historical blocks of the user that caused the event
user_is_created_by_peer boolean             In user events: whether the user account was created by another user
|-
user_is_anonymous   boolean             In user events: whether the user is not registered
|event_user_blocks_latest
user_is_bot_by_name boolean             In user events: whether the user's name matches patterns we use to identify bots
|array<string>
user_creation_timestamp string             In user events: registration timestamp of the user.
|Current blocks of the user that caused the event
revision_id         bigint             In revision events: id of the revision
|-
revision_parent_id bigint             In revision events: id of the parent revision
|event_user_groups
revision_minor_edit boolean             In revision events: whether it is a minor edit or not
|array<string>
revision_text_bytes bigint             In revision events: number of bytes of revision
|Historical groups of the user that caused the event
revision_text_bytes_diff bigint             In revision events: change in bytes relative to parent revision (can be negative).
|-
revision_text_sha1 string             In revision events: sha1 hash of the revision
|event_user_groups_latest
revision_content_model string             In revision events: content model of revision
|array<string>
revision_content_format string             In revision events: content format of revision
|Current groups of the user that caused the event
revision_is_deleted boolean             In revision events: whether this revision has been deleted (moved to archive table)
|-
revision_deleted_timestamp string             In revision events: the timestamp when the revision was deleted
|event_user_is_created_by_self
revision_is_identity_reverted boolean             In revision events: whether this revision was reverted by another future revision
|boolean
revision_first_identity_reverting_revision_id bigint             In revision events: id of the revision that reverted this revision
|Whether the event_user created their own account
revision_first_identity_revert_timestamp string             In revision events: timestamp of the revision that reverted this revision
|-
revision_is_productive boolean             In revision events: whether this revision was reverted within 1 day
|event_user_is_created_by_system
revision_is_identity_revert boolean             In revision events: whether this revision reverts other revisions
|boolean
snapshot            string              Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
|Whether the event_user account was created by mediawiki (eg. centralauth)
|-
# Partition Information
|event_user_is_created_by_peer
# col_name            data_type          comment           
|boolean
|Whether the event_user account was created by another user
snapshot            string              Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
|-
 
|event_user_is_anonymous
</syntaxhighlight>Note the <code>snapshot</code> field: It is a [https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-DataUnits Hive partitions]. It explicitly maps to snapshot folders in HDFS. Since the full data is present in every snapshot up to the snapshot date, you should always specify a snapshot partition predicate in the <code>where</code> clause of your queries.
|boolean
 
|Whether the event_user is not registered
=== Important Fields ===
|-
Due to the denormalization of the history data, filtering by <code>event_entity</code> is mandatory not to mix incompatible data.
|event_user_is_bot_by_name
|boolean
|Whether the event_user's name matches patterns we use to identify bots
|-
|event_user_creation_timestamp
|string
|Registration timestamp of the user that caused the event
|-
|page_id
|bigint
|In revision/page events: id of the page
|-
|page_title
|string
|In revision/page events: historical title of the page
|-
|page_title_latest
|string
|In revision/page events: current title of the page
|-
|page_namespace
|int
|In revision/page events: historical namespace of the page.
|-
|page_namespace_is_content
|boolean
|In revision/page events: historical namespace of the page is categorized as content
|-
|page_namespace_latest
|int
|In revision/page events: current namespace of the page
|-
|page_namespace_is_content_latest
|boolean
|In revision/page events: current namespace of the page is categorized as content
|-
|page_is_redirect_latest
|boolean
|In revision/page events: whether the page is currently a redirect
|-
|page_creation_timestamp
|string
|In revision/page events: creation timestamp of the page
|-
|user_id
|bigint
|In user events: id of the user
|-
|user_text
|string
|In user events: historical user text
|-
|user_text_latest
|string
|In user events: current user text
|-
|user_blocks
|array<string>
|In user events: historical user blocks
|-
|user_blocks_latest
|array<string>
|In user events: current user blocks
|-
|user_groups
|array<string>
|In user events: historical user groups
|-
|user_groups_latest
|array<string>
|In user events: current user groups
|-
|user_is_created_by_self
|boolean
|In user events: whether the user created their own account
|-
|user_is_created_by_system
|boolean
|In user events: whether the user account was created by mediawiki
|-
|user_is_created_by_peer
|boolean
|In user events: whether the user account was created by another user
|-
|user_is_anonymous
|boolean
|In user events: whether the user is not registered
|-
|user_is_bot_by_name
|boolean
|In user events: whether the user's name matches patterns we use to identify bots
|-
|user_creation_timestamp
|string
|In user events: registration timestamp of the user.
|-
|revision_id
|bigint
|In revision events: id of the revision
|-
|revision_parent_id
|bigint
|In revision events: id of the parent revision
|-
|revision_minor_edit
|boolean
|In revision events: whether it is a minor edit or not
|-
|revision_text_bytes
|bigint
|In revision events: number of bytes of revision
|-
|revision_text_bytes_diff
|bigint
|In revision events: change in bytes relative to parent revision (can be negative).
|-
|revision_text_sha1
|string
|In revision events: sha1 hash of the revision
|-
|revision_content_model
|string
|In revision events: content model of revision
|-
|revision_content_format
|string
|In revision events: content format of revision
|-
|revision_is_deleted
|boolean
|In revision events: whether this revision has been deleted (moved to archive table)
|-
|revision_deleted_timestamp
|string
|In revision events: the timestamp when the revision was deleted
|-
|revision_is_identity_reverted
|boolean
|In revision events: whether this revision was reverted by another future revision
|-
|revision_first_identity_reverting_revision_id
|bigint
|In revision events: id of the revision that reverted this revision
|-
|revision_first_identity_revert_timestamp
|string
|In revision events: timestamp of the revision that reverted this revision
|-
|revision_is_productive
|boolean
|In revision events: whether this revision was reverted within 1 day
|-
|revision_is_identity_revert
|boolean
|In revision events: whether this revision reverts other revisions
|}


=== Event types ===
Similarly, <code>event_types</code> filtering can/must be used depending of the analysis.
{| class="wikitable"
{| class="wikitable"
!Entity
!Entity
Line 273: Line 117:
|n/a
|n/a
|The dataset contains data for simplewiki and enwiki until september 2016. Still we need to productionize the automatic updates to that table and import all the wikis.
|The dataset contains data for simplewiki and enwiki until september 2016. Still we need to productionize the automatic updates to that table and import all the wikis.
|
|-
|2017/03/01
|n/a
|Add the <code>snapshot</code> partition, allowing to keep multiple versions of the history. Data starts to flow regularly (every month) from labs.
|
|
|}
|}

Revision as of 14:32, 28 March 2017

This page describes the data set that stores the denormalized edit history of WMF's wikis. It lives in Analytic's Hadoop cluster and is accessible via the Hive/Beeline external table wmf.mediawiki_history. For more detail of the purpose of this data set, please read Analytics/Data_lake/Denormalization_and_historification. Also visit Analytics/Data access if you don't know how to access this data set.

Schema

col_name	data_type	comment
wiki_db             	string              	enwiki, dewiki, eswiktionary, etc.
event_entity        	string              	revision, user or page
event_type          	string              	create, move, delete, etc.  Detailed explanation in the docs under #Event_types
event_timestamp     	string              	When this event ocurred, in YYYYMMDDHHmmss format
event_comment       	string              	Comment related to this event, sourced from log_comment, rev_comment, etc.
event_user_id       	bigint              	Id of the user that caused the event
event_user_text     	string              	Historical text of the user that caused the event
event_user_text_latest	string              	Current text of the user that caused the event
event_user_blocks   	array<string>       	Historical blocks of the user that caused the event
event_user_blocks_latest	array<string>       	Current blocks of the user that caused the event
event_user_groups   	array<string>       	Historical groups of the user that caused the event
event_user_groups_latest	array<string>       	Current groups of the user that caused the event
event_user_is_created_by_self	boolean             	Whether the event_user created their own account
event_user_is_created_by_system	boolean             	Whether the event_user account was created by mediawiki (eg. centralauth)
event_user_is_created_by_peer	boolean             	Whether the event_user account was created by another user
event_user_is_anonymous	boolean             	Whether the event_user is not registered
event_user_is_bot_by_name	boolean             	Whether the event_user's name matches patterns we use to identify bots
event_user_creation_timestamp	string              	Registration timestamp of the user that caused the event
page_id             	bigint              	In revision/page events: id of the page
page_title          	string              	In revision/page events: historical title of the page
page_title_latest   	string              	In revision/page events: current title of the page
page_namespace      	int                 	In revision/page events: historical namespace of the page.
page_namespace_is_content	boolean             	In revision/page events: historical namespace of the page is categorized as content
page_namespace_latest	int                 	In revision/page events: current namespace of the page
page_namespace_is_content_latest	boolean             	In revision/page events: current namespace of the page is categorized as content
page_is_redirect_latest	boolean             	In revision/page events: whether the page is currently a redirect
page_creation_timestamp	string              	In revision/page events: creation timestamp of the page
user_id             	bigint              	In user events: id of the user
user_text           	string              	In user events: historical user text
user_text_latest    	string              	In user events: current user text
user_blocks         	array<string>       	In user events: historical user blocks
user_blocks_latest  	array<string>       	In user events: current user blocks
user_groups         	array<string>       	In user events: historical user groups
user_groups_latest  	array<string>       	In user events: current user groups
user_is_created_by_self	boolean             	In user events: whether the user created their own account
user_is_created_by_system	boolean             	In user events: whether the user account was created by mediawiki
user_is_created_by_peer	boolean             	In user events: whether the user account was created by another user
user_is_anonymous   	boolean             	In user events: whether the user is not registered
user_is_bot_by_name 	boolean             	In user events: whether the user's name matches patterns we use to identify bots
user_creation_timestamp	string              	In user events: registration timestamp of the user.
revision_id         	bigint              	In revision events: id of the revision
revision_parent_id  	bigint              	In revision events: id of the parent revision
revision_minor_edit 	boolean             	In revision events: whether it is a minor edit or not
revision_text_bytes 	bigint              	In revision events: number of bytes of revision
revision_text_bytes_diff	bigint              	In revision events: change in bytes relative to parent revision (can be negative).
revision_text_sha1  	string              	In revision events: sha1 hash of the revision
revision_content_model	string              	In revision events: content model of revision
revision_content_format	string              	In revision events: content format of revision
revision_is_deleted 	boolean             	In revision events: whether this revision has been deleted (moved to archive table)
revision_deleted_timestamp	string              	In revision events: the timestamp when the revision was deleted
revision_is_identity_reverted	boolean             	In revision events: whether this revision was reverted by another future revision
revision_first_identity_reverting_revision_id	bigint              	In revision events: id of the revision that reverted this revision
revision_first_identity_revert_timestamp	string              	In revision events: timestamp of the revision that reverted this revision
revision_is_productive	boolean             	In revision events: whether this revision was reverted within 1 day
revision_is_identity_revert	boolean             	In revision events: whether this revision reverts other revisions
snapshot            	string              	Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
	 	 
snapshot            	string              	Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)

Note the snapshot field: It is a Hive partitions. It explicitly maps to snapshot folders in HDFS. Since the full data is present in every snapshot up to the snapshot date, you should always specify a snapshot partition predicate in the where clause of your queries.

Important Fields

Due to the denormalization of the history data, filtering by event_entity is mandatory not to mix incompatible data.

Similarly, event_types filtering can/must be used depending of the analysis.

Entity Event type Meaning
revision create When a revision is created, when an edit happens.
page create When the first edit to a page is done.
move When moving a page, changing its title.
delete When deleting a page.
user create When a new user is registered.
rename When the name of a user is changed.
altergroups When the groups (rights) of a user are changed.
alterblocks When the blocks of a user are changed.

Changes and known problems

Date Schema version Details Phab

Task

2016/10/06 n/a The dataset contains data for simplewiki and enwiki until september 2016. Still we need to productionize the automatic updates to that table and import all the wikis.
2017/03/01 n/a Add the snapshot partition, allowing to keep multiple versions of the history. Data starts to flow regularly (every month) from labs.