
Analytics/Data Lake/Edits

The Analytics Data Lake (ADL) is a large, analytics-oriented repository of data, both raw and aggregated, about Wikimedia projects (in industry terms, a [[:en:Data_lake|data lake]]). It currently includes data on pageviews and a beta set of historical data about editing, including revisions, pages, and users. As the Data Lake matures, we will add any and all data that can be safely made public. The infrastructure will support public releases of datasets out of the box.
This page links to detailed information about '''Edits datasets''' in the [[Analytics/Data Lake|Data Lake]].


In comparison to the [[Analytics/Data Lake/Traffic|traffic datasets]], these datasets are not continuously updated. They are instead refreshed regularly by fully re-importing/re-building them, which creates a new '''<code>snapshot</code>'''.

This '''<code>snapshot</code>''' notion is key when querying the Edits datasets, since including multiple snapshots doesn't make sense for most queries. As of 2017-04, snapshots are provided monthly.

== Initial Scope ==

=== Consolidating Editing Data ===
Millions of people edit our projects. Information about the knowledge they generate and improve is trapped in hundreds of separate MySQL databases and large XML dump files. We will create analytics-friendly schemas and transform this separated data to fit them. HDFS is the best storage solution for this, so that's what we'll use. We will design the schemas and the data extraction in an append-only style, so actions like deleting pages and suppressing user text can be first-class citizens. This will allow us to create redacted streams of data that can be published safely.
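To make the append-only idea concrete, here is a minimal sketch of what such an event-style Hive table could look like. The table and column names are purely illustrative, not the actual schema; see the schema pages linked further down for the real field lists.
<pre>
-- A minimal, hypothetical sketch of an append-only event table (not the
-- actual schema): deletions and suppressions are recorded as new rows with
-- flags rather than as destructive updates, so a redacted public view can
-- simply filter on the latest visibility flags.
CREATE TABLE IF NOT EXISTS sketch_page_events (
    wiki_db           STRING  COMMENT 'Wiki database name, e.g. simplewiki',
    page_id           BIGINT  COMMENT 'MediaWiki page id',
    event_type        STRING  COMMENT 'create / move / delete / restore',
    event_timestamp   STRING  COMMENT 'When the event happened',
    caused_by_user_id BIGINT  COMMENT 'User responsible for the event',
    is_deleted        BOOLEAN COMMENT 'Page is deleted as of this event',
    is_suppressed     BOOLEAN COMMENT 'Associated user text is suppressed'
)
STORED AS PARQUET;
</pre>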


It will of course be important to keep this data up to date. To accomplish this we will connect to real-time systems like Event Bus to get the latest data. From time to time, we'll compare against the source databases to make sure we have no replication gaps.
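As a rough illustration of such a check (the table and column names below are assumptions, based on the raw MediaWiki <code>revision</code> table being imported into Hive with <code>snapshot</code> and <code>wiki_db</code> partitions), one could compare per-month revision counts between two consecutive snapshots and flag months that lost rows:
<pre>
-- Illustrative only: months whose revision count shrank between two
-- snapshots of the raw revision table are candidates for replication gaps.
SELECT
    cur.month,
    prev.revision_count AS previous_snapshot_count,
    cur.revision_count  AS current_snapshot_count
FROM (
    SELECT substr(rev_timestamp, 1, 6) AS month, COUNT(*) AS revision_count
    FROM wmf_raw.mediawiki_revision    -- assumed table name
    WHERE snapshot = '2017-04' AND wiki_db = 'simplewiki'
    GROUP BY substr(rev_timestamp, 1, 6)
) cur
JOIN (
    SELECT substr(rev_timestamp, 1, 6) AS month, COUNT(*) AS revision_count
    FROM wmf_raw.mediawiki_revision
    WHERE snapshot = '2017-03' AND wiki_db = 'simplewiki'
    GROUP BY substr(rev_timestamp, 1, 6)
) prev ON cur.month = prev.month
WHERE cur.revision_count < prev.revision_count;
</pre>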
== Datasets ==


=== Hive Tables ===
When storing to HDFS, we will create well-documented, unified tables on top of this data. This will be useful for batch or really long-running queries.

=== Mediawiki raw data ===
These are copies of MediaWiki MySQL tables (an example query follows the list):
* archive
* ipblocks
* logging
* page
* revision
* user
* user_groups
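Assuming these copies live in Hive as tables like <code>wmf_raw.mediawiki_page</code>, partitioned by <code>snapshot</code> and <code>wiki_db</code> (check the actual table definitions on the cluster before relying on these names), queries should always pin a single snapshot, for example:
<pre>
-- Pages per namespace for one wiki, reading a single snapshot only.
SELECT
    page_namespace,
    COUNT(*) AS page_count
FROM wmf_raw.mediawiki_page    -- assumed table name
WHERE snapshot = '2017-04'     -- always restrict to one snapshot
  AND wiki_db  = 'simplewiki'
GROUP BY page_namespace
ORDER BY page_count DESC
LIMIT 20;
</pre>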


=== Druid ===
Druid and any other Online Analytical Processing (OLAP) systems we use will serve this data to internal and possibly external users. This data serving layer allows us to run complicated queries that would otherwise consume massive resources in a relational database. If we're able to properly redact and re-load this data on a regular basis, we will be able to open this layer to the public.

=== Processed Data ===
* [[Analytics/Data Lake/Edits/Mediawiki user history|Mediawiki user history]] -- Dataset providing reconstructed history events of mediawiki users
* [[Analytics/Data Lake/Edits/Mediawiki page history|Mediawiki page history]] -- Dataset providing reconstructed history events of mediawiki pages
* [[Analytics/Data Lake/Edits/Mediawiki history|Mediawiki history]] -- Fully denormalized dataset containing user, page and revision processed data
* [[Analytics/Data Lake/Edits/Metrics|Metrics]] -- Dataset providing precomputed metrics over edits data
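As an example of working with the '''Mediawiki history''' dataset above, the sketch below counts revisions per month for one wiki. It assumes a Hive table named <code>wmf.mediawiki_history</code> with <code>snapshot</code>, <code>wiki_db</code>, <code>event_entity</code>, <code>event_type</code> and <code>event_timestamp</code> fields; the linked schema page is the authoritative reference.
<pre>
-- Monthly revision counts for one wiki from the denormalized history.
SELECT
    substr(event_timestamp, 1, 7) AS month,   -- assumes 'YYYY-MM-DD ...' timestamps
    COUNT(*)                      AS revisions
FROM wmf.mediawiki_history                    -- assumed table name
WHERE snapshot     = '2017-04'                -- one snapshot per query
  AND wiki_db      = 'simplewiki'
  AND event_entity = 'revision'
  AND event_type   = 'create'
GROUP BY substr(event_timestamp, 1, 7)
ORDER BY month
LIMIT 1000;
</pre>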


=== Analytics Query Service / Dumps ===
We will continue to push slices of this data out to the world through our query service (AQS), which currently hosts our Pageview and Unique Devices data. We will also make the most useful forms of this data available as static file dumps. These dumps will contain strictly metadata and shouldn't be confused with the "right to fork"-oriented richer dumps; those may be easier to generate using this system as well (see below).

== Access ==
Some of the data above is made public through different systems (see the [[Analytics]] main page), but any data in the Data Lake is private by default. For access details, see [[Analytics/Data access]].
== Pleasant Side Effects ==
 
One potential use of this technology will be to help replace the aging Dumps process. Incremental dumps, more accurately redacted dumps, and reliably re-runnable dumps should all be much easier to achieve with the Data Lake and the data streams that feed into it than they are with the current set of dump scripts and manual intervention.
 
== Project Documentation ==
 
=== Architecture ===
 
==== Systems ====
Various experiences<ref>Two historical big projects are [[m:Data_dumps|dumps generation]] and [[stats:|wikistats]], and two newer internal projects are [[Analytics/DataWarehouse|DataWarehouse]] and [[m:Research:Measuring_edit_productivity|measuring edit productivity]].</ref> with gathering and computing over the full edit data history have shown that it's a bad idea to rebuild a full edit data set on a regular basis, as opposed to updating it incrementally.
 
In order to get there, two core systems are needed:
* '''Historical data extraction system:''' It extracts historical data from the mediawiki databases and/or the XML dumps and converts and refines it into the schema used (see below for the schema description).
* '''Incremental data update system:''' It handles events flowing through a streaming system and updates an already existing data set by transforming and refining the events into the needed schema.
Once those two systems are built and tested, a cut-over date D needs to be decided upon: the data set will be built from the historical system for data before D, and from the incremental system for data after D. We also plan to maintain the historical system even if it is used less regularly than the incremental one, to ensure new data can still be extracted historically in the future.
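A sketch of the cut-over idea, with purely hypothetical table names standing in for the two systems' outputs:
<pre>
-- Hypothetical union of the two sources around a cut-over date D
-- (here D = 2017-05-01): the historical reconstruction provides everything
-- strictly before D, the incremental pipeline provides D and later.
SELECT * FROM (
    SELECT * FROM historical_edit_events     -- hypothetical table
    WHERE event_timestamp <  '2017-05-01 00:00:00'
    UNION ALL
    SELECT * FROM incremental_edit_events    -- hypothetical table
    WHERE event_timestamp >= '2017-05-01 00:00:00'
) merged_history;
</pre>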
 
==== Stack ====
The plan is to use [[Analytics/Cluster/Hadoop|Hadoop]] to both store the data and run the various ETL / refinement steps (it is cheap, reliable and already in place).
 
Feeding systems will be [[mediawikiwiki:Manual:Database_layout|MariaDB]] for historical needs, since it contains more and better quality data than the XML dumps, and [[mediawikiwiki:Extension:EventBus|Kafka through EventBus]] for streaming input data.
 
Querying systems are planned to be Druid for usual / simple metrics, [[Analytics/Cluster/Hive|Hive]] and/or [[Analytics/Cluster/Spark|Spark]] for complex queries, and possibly the [[Analytics/AQS|Analytics Query Service]] to provide metrics externally.
 
=== History reconstruction ===
Some progress has been made and we have a working edit history reconstruction pipeline (still with some caveats, but mostly functional). The following pages describe each of the steps of the pipeline and related topics.
 
==== Data pipeline ====
* [[Analytics/Data Lake/Data loading]]
* [[Analytics/Data Lake/Page and user history reconstruction]]
* [[Analytics/Data Lake/Denormalization and historification]]
* [[Analytics/Data Lake/Serving layer]]
 
==== Schemas ====
* [[Analytics/Data Lake/Mediawiki page history]]
* [[Analytics/Data Lake/Mediawiki user history]]
* [[Analytics/Data Lake/Mediawiki history]]
* [[Analytics/Data Lake/Metric results]]
 
==== Optimizations ====
* [[Analytics/Data Lake/History reconstruction algorithm and optimizations]]
 
==== Incremental data ====
* [https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema/mediawiki Event Bus schemas] -- An [[phab:T134502|update to this schema]] is being discussed and will be merged as a v1 when mediawiki code gets updated to populate those new event types.
 
==== Query data in Druid ====
 
===== Pivot sample queries =====
* [https://pivot.wikimedia.org/#mediawiki-history-beta/line-chart/2/EQUQLgxg9AqgKgYWAGgN7APYAdgC5gQAWAhgJYB2KwApgB5YBO1Azs6RpbutnsEwGZVyxALbVeAfQlhSY4AF9kwYhBkdmeANroVazsApU6jFmw55uOfETKUlxpq3adLvAUNHj8IhUt3OLZVUA/BkxACVicgBzcSUAEwBXBmI9XgAFAGYAWSpmMGorAFoARnlytCC0/Cj4o3pHMxdMKwISQ3sG0xDXfHclYTl8agA3anIwCXGZMABPXyqexfN8DDGGABtiHE6TJxXe4A3SApSNqhHiDcSvdGZqMDhZrC9gAGU4cIBJADkAcSMG2oYgmGlwmj4o1ITWAAF1FMA5i9eG8QHAFOV5PDkNoaF19s0eH1qIIBp5eAB3UgAa1IEniACMFswMAwwABBYIHZb6Fls+p7GFcFpuEkeIaQkbQ5zSOiTBmzArMemkfj8CTkRI+BHxUhMLn6eIsCDjXUxDGVBzdblEyGk4CDV5SMLiBEMxIQakPTnVHQG3gugBCHq9YCoSRS1WA6RKABFmayOf7hf4VsA+WHdo0lrb+g7yfhnbJXQk9dRk8pmCbyGbohjYchNRsNkomFKmrLaPLFSwVWqNVqUBDRtMNA3NA3JdKOJ3u0q++rNT5KrqQULgFTafSmUpjiITngAKzyIA=== Bytes added broken down by wiki for 3 months]
 
* Same denormalized schema as in Hadoop, enhanced with precomputed immutable flags<ref>For instance <code>is_new_editor</code>, <code>is_new_productive_editor</code>, and <code>is_new_surviving_editor</code> for users, and <code>is_productive</code>, <code>is_reverted</code> and <code>is_deleted</code> for revisions.</ref> if [http://druid.io/docs/latest/querying/lookups.html Druid Query-Time lookups] can handle them.
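For instance, if those flags end up as boolean columns on revision rows (the column and table names below are illustrative, reusing the flag names from the footnote rather than a confirmed schema), a per-wiki revert ratio could be computed as:
<pre>
-- Illustrative: share of reverted revisions per wiki, assuming a boolean
-- is_reverted flag on revision rows of the denormalized history.
SELECT
    wiki_db,
    SUM(CASE WHEN is_reverted THEN 1 ELSE 0 END) / COUNT(*) AS reverted_ratio
FROM wmf.mediawiki_history        -- assumed table name
WHERE snapshot     = '2017-04'
  AND event_entity = 'revision'
  AND event_type   = 'create'
GROUP BY wiki_db
ORDER BY reverted_ratio DESC
LIMIT 20;
</pre>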
 
=== Ongoing Work ===
 
==== EventBus ====
* Schema update -- {{Phabricator|T134502}}
* Mediawiki update to handle schema update -- {{Phabricator|T137287}}
* New event schema to come after this set of patches
 
==== Historical data sourcing ====
* Hive schema creation and test using simplewiki and a set of test queries on dump generated data -- {{Phabricator|T134793}}
* ETL for transforming MediaWiki database data to Hive schema for simplewiki -- {{Phabricator|T134790}}
* Scalability tests to come after pipeline is built
 
==== Details not to Forget ====
* When a page is renamed, there is sometimes a new page created that keeps the renamed page's original title and redirects to the renamed page. We have left those aside for the moment (a possible starting query is sketched after this list).
* There are user rename log lines that can't be linked back to an actual user. It could be because of deletions, but we're not sure. We should investigate a bit.
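For the first point, a possible starting query, assuming the raw <code>logging</code> and <code>page</code> tables are available in Hive under the names below (the move-log column semantics should be double-checked against the [[mediawikiwiki:Manual:Database_layout|MediaWiki database layout]]):
<pre>
-- Illustrative: redirect pages whose title matches the source title of a
-- page move, i.e. candidate "redirect left behind on rename" pages.
SELECT
    l.log_timestamp,
    l.log_namespace,
    l.log_title,
    p.page_id AS leftover_redirect_page_id
FROM wmf_raw.mediawiki_logging l   -- assumed table names
JOIN wmf_raw.mediawiki_page    p
  ON  p.page_namespace = l.log_namespace
  AND p.page_title     = l.log_title
WHERE l.snapshot = '2017-04'
  AND p.snapshot = '2017-04'
  AND l.wiki_db  = 'simplewiki'
  AND p.wiki_db  = 'simplewiki'
  AND l.log_type = 'move'
  AND p.page_is_redirect = 1
LIMIT 100;
</pre>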
 
==== Full Text ====
* To measure a lot of important metrics like community backlogs, template use, category graphs, etc., we need to parse and analyze the full revision text of articles.
* To get this content, we can either:
** Look in dumps (Joseph is working this way now)
** Get stuff from the databases.  Quick reminder on that: on tin, do <pre>sql metawiki -h 10.64.16.186</pre> <pre>select blob_text from blobs_cluster24 where blob_id = ...</pre>.  Here you get cluster24 from the link in the text table and that IP from looking up cluster24 on https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php.
 