
Analytics/Data Lake/Edits/Mediawiki user history

Revision as of 14:10, 7 April 2017 by imported>Milimetric (Milimetric moved page Analytics/Data Lake/Schemas/Mediawiki user history to Analytics/Data Lake/Edits/Mediawiki user history: Reorganizing documentation)

This page describes the data set that stores the user history of WMF's wikis. It lives in the Analytics Hadoop cluster and is accessible via the Hive/Beeline external table wmf.mediawiki_user_history. For more detail on the purpose of this data set, please read Analytics/Data Lake/Page and user history reconstruction. If you don't know how to access this data set, see Analytics/Data access.

Schema

col_name	data_type	comment
wiki_db             	string              	enwiki, dewiki, eswiktionary, etc.
user_id             	bigint              	ID of the user, as in the user table.
user_name           	string              	Historical user name.
user_name_latest    	string              	User name as of today.
user_groups         	array<string>       	Historical user groups.
user_groups_latest  	array<string>       	User groups as of today.
user_blocks         	array<string>       	Historical user blocks.
user_blocks_latest  	array<string>       	User blocks as of today.
user_registration_timestamp	string              	When the user account was registered, in YYYYMMDDHHmmss format.
created_by_self     	boolean             	Whether the user created their own account
created_by_system   	boolean             	Whether the user account was created by MediaWiki (e.g. CentralAuth)
created_by_peer     	boolean             	Whether the user account was created by another user
anonymous           	boolean             	Whether the user is not registered
is_bot_by_name      	boolean             	Whether the user's name matches patterns we use to identify bots
start_timestamp     	string              	Timestamp from where this state applies (inclusive).
end_timestamp       	string              	Timestamp to where this state applies (exclusive).
caused_by_event_type	string              	Event that caused this state (create, move, delete or restore).
caused_by_user_id   	bigint              	ID of the user that caused this state.
caused_by_block_expiration	string              	Block expiration timestamp, if any.
inferred_from       	string              	If non-NULL, indicates that some of this state's fields have been inferred after an inconsistency in the source data.
snapshot            	string              	Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
	 	 
# Partition Information	 	 
# col_name            	data_type           	comment             
	 	 
snapshot            	string              	Versioning information to keep multiple datasets (YYYY-MM for regular labs imports)
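
Each row is one state of a user, valid over the half-open interval from start_timestamp (inclusive) to end_timestamp (exclusive). The following is a minimal Python sketch of a point-in-time lookup over such rows. It assumes that both columns use the same fixed-width YYYYMMDDHHmmss format as user_registration_timestamp, and that a NULL bound (None here) means the interval is open on that side; neither assumption is confirmed by the schema above. The rows and the helper name state_at are made up for illustration.

```python
from typing import Optional

# Made-up example states for one hypothetical user; not real data.
# Assumption: timestamps are fixed-width YYYYMMDDHHmmss strings, so plain
# string comparison orders them chronologically.
STATES = [
    {"user_name": "OldName", "start_timestamp": "20100101000000", "end_timestamp": "20150601120000"},
    {"user_name": "NewName", "start_timestamp": "20150601120000", "end_timestamp": None},
]


def state_at(states: list, timestamp: str) -> Optional[dict]:
    """Return the state whose [start, end) interval contains timestamp."""
    for state in states:
        start = state["start_timestamp"]
        end = state["end_timestamp"]
        # Assumption: None means the interval is unbounded on that side.
        after_start = start is None or start <= timestamp
        before_end = end is None or timestamp < end
        if after_start and before_end:
            return state
    return None
```

Because the end bound is exclusive, a timestamp that falls exactly on a boundary belongs to the later state, so consecutive states never overlap.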

Note the snapshot field: it is a Hive partition that maps explicitly to snapshot folders in HDFS. Since every snapshot contains the full data up to the snapshot date, you should always specify a snapshot partition predicate in the WHERE clause of your queries; otherwise the same history is read once per snapshot.
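
The effect of the snapshot predicate can be sketched with a small runnable example. It uses an in-memory SQLite table as a stand-in for wmf.mediawiki_user_history, so it is only an illustration and not how the data is queried on the cluster. The rows, the snapshot values 2017-02 and 2017-03, and the helper count_states are all made up.

```python
import sqlite3

# In-memory SQLite stand-in for wmf.mediawiki_user_history (assumption: only a
# few columns are modelled, and the rows below are made up for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mediawiki_user_history ("
    "wiki_db TEXT, user_id INTEGER, user_name TEXT, snapshot TEXT)"
)
conn.executemany(
    "INSERT INTO mediawiki_user_history VALUES (?, ?, ?, ?)",
    [
        ("simplewiki", 1, "ExampleUser", "2017-02"),
        ("simplewiki", 1, "ExampleUser", "2017-03"),  # same state, newer snapshot
        ("simplewiki", 2, "AnotherUser", "2017-03"),
    ],
)

# Query restricted to a single snapshot partition.
FILTERED = """
SELECT COUNT(*)
FROM mediawiki_user_history
WHERE snapshot = ?
  AND wiki_db = ?
"""

# Same query without the snapshot predicate, for comparison only.
UNFILTERED = """
SELECT COUNT(*)
FROM mediawiki_user_history
WHERE wiki_db = ?
"""


def count_states(wiki_db: str, snapshot: str = None) -> int:
    """Count state rows for a wiki, optionally restricted to one snapshot."""
    if snapshot is None:
        return conn.execute(UNFILTERED, (wiki_db,)).fetchone()[0]
    return conn.execute(FILTERED, (snapshot, wiki_db)).fetchone()[0]
```

With these made-up rows, restricting to one snapshot counts each state once, while omitting the predicate also counts the copy of ExampleUser's state held in the older snapshot. In Hive the equivalent predicate would be written directly in the query, for example WHERE snapshot = '2017-03', with the value following the YYYY-MM pattern described in the schema.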

Changes and known problems

Date	Schema version	Details	Phab Task
2016/10/06	n/a	The dataset contains data for simplewiki and enwiki until September 2016. We still need to productionize the automatic updates to this table and import all the wikis.
2017/03/01	n/a	Added the snapshot partition, which allows keeping multiple versions of the user history. Data starts to flow regularly (every month) from labs.