You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org

Analytics/Data Lake/Schemas/Mediawiki history: Difference between revisions

From Wikitech-static
Jump to navigation Jump to search
imported>Joal
m (Joal moved page Analytics/Data Lake/Mediawiki history to Analytics/Data Lake/Schemas/Mediawiki history: Organizing doc before first internal production release.)
 
imported>MABot
m (Bot: Fixing double redirect to Analytics/Data Lake/Edits/MediaWiki history)
 
(2 intermediate revisions by 2 users not shown)
Line 1: Line 1:
This page describes the data set that stores the '''denormalized edit history''' of WMF's wikis. It lives in Analytic's Hadoop cluster and is accessible via a Hive/Beeline external table. For more detail of the purpose of this data set, please read [[Analytics/Data Lake/Denormalization and historification|Analytics/Data_lake/Denormalization_and_historification]]. Also visit [[Analytics/Data access]] if you don't know how to access this data set.
#REDIRECT [[Analytics/Data Lake/Edits/MediaWiki history]]
 
=== Schema ===
{| class="wikitable sortable"
!col_name
!data_type
!comment
|-
|wiki_db
|string
|enwiki, dewiki, eswiktionary, etc.
|-
|event_entity
|string
|revision, user or page
|-
|event_type
|string
|create, move, delete, etc. Detailed explanation in the docs under #Event_types
|-
|event_timestamp
|string
|When this event ocurred, in YYYYMMDDHHmmss format
|-
|event_comment
|string
|Comment related to this event, sourced from log_comment, rev_comment, etc.
|-
|event_user_id
|bigint
|Id of the user that caused the event
|-
|event_user_text
|string
|Historical text of the user that caused the event
|-
|event_user_text_latest
|string
|Current text of the user that caused the event
|-
|event_user_blocks
|array<string>
|Historical blocks of the user that caused the event
|-
|event_user_blocks_latest
|array<string>
|Current blocks of the user that caused the event
|-
|event_user_groups
|array<string>
|Historical groups of the user that caused the event
|-
|event_user_groups_latest
|array<string>
|Current groups of the user that caused the event
|-
|event_user_is_created_by_self
|boolean
|Whether the event_user created their own account
|-
|event_user_is_created_by_system
|boolean
|Whether the event_user account was created by mediawiki (eg. centralauth)
|-
|event_user_is_created_by_peer
|boolean
|Whether the event_user account was created by another user
|-
|event_user_is_anonymous
|boolean
|Whether the event_user is not registered
|-
|event_user_is_bot_by_name
|boolean
|Whether the event_user's name matches patterns we use to identify bots
|-
|event_user_creation_timestamp
|string
|Registration timestamp of the user that caused the event
|-
|page_id
|bigint
|In revision/page events: id of the page
|-
|page_title
|string
|In revision/page events: historical title of the page
|-
|page_title_latest
|string
|In revision/page events: current title of the page
|-
|page_namespace
|int
|In revision/page events: historical namespace of the page.
|-
|page_namespace_is_content
|boolean
|In revision/page events: historical namespace of the page is categorized as content
|-
|page_namespace_latest
|int
|In revision/page events: current namespace of the page
|-
|page_namespace_is_content_latest
|boolean
|In revision/page events: current namespace of the page is categorized as content
|-
|page_is_redirect_latest
|boolean
|In revision/page events: whether the page is currently a redirect
|-
|page_creation_timestamp
|string
|In revision/page events: creation timestamp of the page
|-
|user_id
|bigint
|In user events: id of the user
|-
|user_text
|string
|In user events: historical user text
|-
|user_text_latest
|string
|In user events: current user text
|-
|user_blocks
|array<string>
|In user events: historical user blocks
|-
|user_blocks_latest
|array<string>
|In user events: current user blocks
|-
|user_groups
|array<string>
|In user events: historical user groups
|-
|user_groups_latest
|array<string>
|In user events: current user groups
|-
|user_is_created_by_self
|boolean
|In user events: whether the user created their own account
|-
|user_is_created_by_system
|boolean
|In user events: whether the user account was created by mediawiki
|-
|user_is_created_by_peer
|boolean
|In user events: whether the user account was created by another user
|-
|user_is_anonymous
|boolean
|In user events: whether the user is not registered
|-
|user_is_bot_by_name
|boolean
|In user events: whether the user's name matches patterns we use to identify bots
|-
|user_creation_timestamp
|string
|In user events: registration timestamp of the user.
|-
|revision_id
|bigint
|In revision events: id of the revision
|-
|revision_parent_id
|bigint
|In revision events: id of the parent revision
|-
|revision_minor_edit
|boolean
|In revision events: whether it is a minor edit or not
|-
|revision_text_bytes
|bigint
|In revision events: number of bytes of revision
|-
|revision_text_bytes_diff
|bigint
|In revision events: change in bytes relative to parent revision (can be negative).
|-
|revision_text_sha1
|string
|In revision events: sha1 hash of the revision
|-
|revision_content_model
|string
|In revision events: content model of revision
|-
|revision_content_format
|string
|In revision events: content format of revision
|-
|revision_is_deleted
|boolean
|In revision events: whether this revision has been deleted (moved to archive table)
|-
|revision_deleted_timestamp
|string
|In revision events: the timestamp when the revision was deleted
|-
|revision_is_identity_reverted
|boolean
|In revision events: whether this revision was reverted by another future revision
|-
|revision_first_identity_reverting_revision_id
|bigint
|In revision events: id of the revision that reverted this revision
|-
|revision_first_identity_revert_timestamp
|string
|In revision events: timestamp of the revision that reverted this revision
|-
|revision_is_productive
|boolean
|In revision events: whether this revision was reverted within 1 day
|-
|revision_is_identity_revert
|boolean
|In revision events: whether this revision reverts other revisions
|}
 
=== Event types ===
{| class="wikitable"
!Entity
!Event type
!Meaning
|-
|revision
|create
|When a revision is created, when an edit happens.
|-
| rowspan="3" |page
|create
|When the first edit to a page is done.
|-
|move
|When moving a page, changing its title.
|-
|delete
|When deleting a page.
|-
| rowspan="4" |user
|create
|When a new user is registered.
|-
|rename
|When the name of a user is changed.
|-
|altergroups
|When the groups (rights) of a user are changed.
|-
|alterblocks
|When the blocks of a user are changed.
|}
 
===Changes and known problems===
{| class="wikitable"
!Date
!Schema version
!Details
!Phab
Task
|-
|2016/10/06
|n/a
|The dataset contains data for simplewiki and enwiki until september 2016. Still we need to productionize the automatic updates to that table and import all the wikis.
|
|}

Latest revision as of 07:17, 17 January 2020