You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org
The Analytics Data Lake (ADL) refers to the collection, processing, and publishing of data from Wikimedia projects. At first, the Data Lake focuses on collecting historical data about editing, including revisions, pages, users, and making it available in an analytics-friendly way for everyone, publicly. As the Data Lake matures, we will add any and all data that can be safely made public. The infrastructure will support public releases of datasets, out of the box.
Consolidating Editing Data
Millions of people edit our projects. Information about the knowledge they generate and improve is trapped in hundreds of separate mysql databases and large XML dump files. We will create analytics-friendly schemas and transform this separated data to fit those schemas. HDFS is the best storage solutions for this, so that's what we'll use. We will make the schemas and the data extraction using an append-only style, so actions like deleting pages and supressing usertext can be first class citizens. This will allow us to create redacted streams of data that can be published safely.
It will of course be important to keep this data up to date. To accomplish this we will connect to real-time systems like Event Bus to get the latest data. From time to time, we'll compare to make sure we have no replication gaps.
When storing to HDFS, we will create well documented, unified tables on top of this data. This will be useful for any batch or really long running queries.
Druid and any other Online Analytics Processing (OLAP) systems we use will serve this data to internal and maybe external users as well. This data serving layer allows us to run complicated queries that would otherwise consume massive resources in a relational database. If we're able to properly redact and re-load this data on a regular basis, we will be able to open this layer to the public.
Analytics Query Service / Dumps
We will continue and push slices of this data out to the world through our query service (AQS) which currently hosts our Pageview and Unique Devices data. We will also make the most useful forms of this data available in static file dumps. These dumps will contain strictly metadata and shouldn't be confused with the "right to fork"-oriented richer dumps. Those may be easier to generate using this system as well, see below.
Pleasant Side Effects
One potential use of this technology will be to help replace the aging Dumps process. Incremental dumps, more accurately redacted dumps, reliable re-runnable dumps should all be much easier to achieve with the Data Lake, and the data streams that feed into it, than they are with the current set of dumps scripts and manual intervention.
Various experiences on gathering and computing on full edit data history has shown that it's a bad idea to rebuild a full edit data set on regular basis in opposition to incrementally update it.
In order to get there, two core systems are needed:
- Historical data extraction system: It extracts historical data from either the mediawiki databases and/or the XML dumps and convert and refine it to the schema used (see below for schema description).
- Incremental data update system: It handles events flowing through a streaming system and updates an already existing data set by transforming and refining the events into the needed schema.
Once those two systems are built and tested, a date needs to decided upon which the data set will be built, from historical system before D, and from incremental system after D. We also plan to maintain the historical system even if its use is less regular than the incremental one, to ensure new data could be extracted historically in the future.
The plan is to use Hadoop to both store data and compute the various ETL / refinement steps (cheap, reliable and already in place).
- Mediawiki Database schema --
- Intermediate schema -- Fed transforming and enhancing raw Media wiki data. It contains
User_changetables [Still WIP]. Rows contain entity state at a moment in time, the next state change time, and event oriented information, such as who did the change, and what type of change it is.
- Event Bus schemas -- An update to this schema is being discussed and will be merged as a v1 when mediawiki code gets updated to populate those new event types.
Query data in Hadoop
- Fully denormalized schema containing event oriented rows with null values in fields not related to the current event. In hive syntax:
CREATE EXTERNAL TABLE edit_history (
-- Generic event information -- Populated for every event wiki_db string, -- revision, page, user event_entity string, -- create, move, delete, rename, etc. event_type string, -- mediawiki format yyyyMMddHHmmss event_timestamp string, -- from the revision, log, archive, etc. event_comment string, -- fields with _latest are the most recent values for this user -- the other fields are the values at the time of the event event_user_id bigint, event_user_text string, event_user_text_latest string, event_user_blocks array<string>, event_user_blocks_latest array<string>, event_user_groups array<string>, event_user_groups_latest array<string>, event_user_creation_timestamp string,
-- For page and revision events, the page that's being acted on (new revision, moved, deleted, etc.) -- fields with _latest are the most recent values for this page -- the other fields are the values at the time of the event page_id bigint, page_title string, page_title_latest string, page_namespace int, page_namespace_latest int, page_creation_timestamp string,
-- For user events, the user that's being acted on (blocked, renamed, etc.) -- fields with _latest are the most recent values for this user -- the other fields are the values at the time of the event user_id bigint, user_text string, user_text_latest string, user_blocks array<string>, user_blocks_latest array<string>, user_groups array<string>, user_groups_latest array<string>, user_creation_timestamp string,
-- For revision events, the revision that's being acted on (created, visibility chagned, etc.) revision_id bigint, revision_parent_id bigint, revision_minor_edit boolean, revision_text_bytes bigint, -- joined by rev_parent_id and got the byte difference (not all records joined cleanly) revision_text_bytes_diff bigint, revision_text_sha1 string, revision_content_model string, revision_content_format string, -- a best guess as to when this revision was deleted (if it came from the archive table) -- if this was not available, we used the timestamp of the last revision for the archived page revision_deleted_timestamp string, -- These last two are not vetted yet, but they'd be valuable for revert work (we'll give them better names too) revision_most_recent_revert_timestamp string, revision_sha1_revert boolean
-- If you'd like to actually create the table, add this before the semicolonː ROW FORMAT SERDE
STORED AS INPUTFORMAT
Query data in Druid
- Same denormalized schema as in hadoop enhanced with precomputed immutable flags if Druid Query-Time lookups can handle them.
- Schema update -- task T134502
- Mediawiki update to handle schema update -- task T137287
- New event schema to come after this set of patches
Historical data sourcing
- Hive schema creation and test using simplewiki and a set of test queries on dump generated data -- task T134793
- ETL for transforming MediaWiki database data to Hive schema for simplewiki -- task T134790
- Scalability tests to come after pipeline is built
Details not to Forget
- At page rename, there sometimes is a new page created which has the renamed page original title and redirects to the renamed page. We have left those on the side for the moment.
- There are user rename log lines that can't be linked back to an actual user. It could be because of deletions, but we're not sure. We should investigate a bit.