
Analytics/Data Lake

Revision as of 21:55, 5 April 2016 by Milimetric

The Analytics Data Lake (ADL) refers to the collection, processing, and publishing of data from Wikimedia projects. At first, the Data Lake focuses on collecting historical data about editing, including revisions, pages, and users, and making it available publicly in an analytics-friendly way for everyone. As the Data Lake matures, we will add any and all data that can be safely made public. The infrastructure will support public releases of datasets out of the box.


Initial Scope

Consolidating Editing Data

Millions of people edit our projects. Information about the knowledge they generate and improve is trapped in hundreds of separate MySQL databases and large XML dump files. We will create analytics-friendly schemas and transform this separated data to fit those schemas. HDFS is the best storage solution for this, so that's what we'll use. We will design the schemas and the data extraction in an append-only style, so that actions like deleting pages and suppressing user text can be first-class citizens. This will allow us to create redacted streams of data that can be published safely.
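To illustrate the append-only idea, here is a minimal sketch (the event names, fields, and `Event`/`redacted_stream` helpers are hypothetical, not the actual ADL schema): deletions and suppressions are recorded as new events rather than destructive updates, so a publicly safe stream can be derived by replaying the log.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Event:
    # Hypothetical event types: "revision-create", "page-delete",
    # "user-text-suppress"; real ADL schemas will differ.
    event_type: str
    page_id: int
    rev_id: Optional[int] = None
    user_text: Optional[str] = None

def redacted_stream(events):
    """Replay the append-only log and yield only publicly safe revisions.

    Deletes and suppressions are events in the same log, so redaction is
    a pure function of the log rather than an in-place mutation.
    """
    deleted_pages = {e.page_id for e in events if e.event_type == "page-delete"}
    suppressed = {e.rev_id for e in events if e.event_type == "user-text-suppress"}
    for e in events:
        if e.event_type != "revision-create" or e.page_id in deleted_pages:
            continue  # drop revisions of deleted pages entirely
        if e.rev_id in suppressed:
            # keep the revision but blank the suppressed user text
            e = Event(e.event_type, e.page_id, e.rev_id, user_text=None)
        yield e
```

Because the log is never rewritten, the redacted stream can be regenerated from scratch whenever new suppression events arrive.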

It will of course be important to keep this data up to date. To accomplish this we will connect to real-time systems like Event Bus to get the latest data. From time to time, we'll compare against the source databases to make sure we have no replication gaps.
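The gap check described above can be sketched as a simple set comparison of revision ids between the source databases and the lake (a hypothetical helper, assuming revision ids are the unit of comparison):

```python
def find_replication_gaps(source_rev_ids, lake_rev_ids):
    """Compare revision ids seen in the source database against those
    loaded into the Data Lake.

    Returns (missing_from_lake, unexpected_in_lake): the first list is a
    replication gap to backfill; the second suggests stale or duplicate
    loads to investigate.
    """
    source, lake = set(source_rev_ids), set(lake_rev_ids)
    return sorted(source - lake), sorted(lake - source)
```

In practice such a check would run per wiki and per time window rather than over the full history at once.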

Hive Tables

When storing to HDFS, we will create well-documented, unified tables on top of this data. This will be useful for batch jobs and very long-running queries.

Druid

Druid and any other Online Analytical Processing (OLAP) systems we use will serve this data to internal and possibly external users as well. This data serving layer allows us to run complicated queries that would otherwise consume massive resources in a relational database. If we're able to properly redact and re-load this data on a regular basis, we will be able to open this layer to the public.
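The reason OLAP serving is cheap is pre-aggregation: events are rolled up into counts per dimension tuple at ingestion time, the way Druid builds rolled-up segments, so queries scan aggregates instead of raw rows. A minimal sketch of that rollup step (the `wiki`/`day` dimensions are illustrative, not the actual ADL dimensions):

```python
from collections import Counter

def rollup(events, dimensions):
    """Pre-aggregate an event stream into counts keyed by a tuple of
    dimension values, mimicking OLAP-style rollup at ingestion time."""
    counts = Counter(tuple(e[d] for d in dimensions) for e in events)
    return dict(counts)
```

A "how many edits per wiki per day" query against the rolled-up data is then a dictionary lookup rather than a scan over every revision row.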

Analytics Query Service / Dumps

We will continue to push slices of this data out to the world through our query service (AQS), which currently hosts our Pageview and Unique Devices data. We will also make the most useful forms of this data available in static file dumps. These dumps will contain strictly metadata and shouldn't be confused with the "right to fork"-oriented richer dumps. Those may be easier to generate using this system as well; see below.
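As a concrete example of the AQS serving layer, here is a small helper that builds a request URL for the per-article pageviews endpoint (a sketch; the default parameter values are assumptions, and the endpoint path should be checked against the current AQS documentation):

```python
AQS_BASE = "https://wikimedia.org/api/rest_v1/metrics"

def pageviews_url(project, article, start, end,
                  access="all-access", agent="all-agents", granularity="daily"):
    """Build a URL for the AQS per-article pageviews endpoint.

    `start` and `end` are YYYYMMDD timestamps; `project` is a domain
    like "en.wikipedia.org".
    """
    return (f"{AQS_BASE}/pageviews/per-article/{project}/{access}/{agent}/"
            f"{article}/{granularity}/{start}/{end}")
```

Fetching the resulting URL returns a JSON document with one entry per granularity bucket in the requested range.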


Pleasant Side Effects

One potential use of this technology will be to help replace the aging Dumps process. Incremental dumps, more accurately redacted dumps, and reliably re-runnable dumps should all be much easier to achieve with the Data Lake, and the data streams that feed into it, than they are with the current set of dumps scripts and manual intervention.