Analytics/Data Lake

Revision as of 09:01, 5 March 2018 by imported>Neil P. Quinn-WMF (Copyedit and expand documentation)

The Analytics Data Lake (ADL), or the Data Lake for short, is a large, analytics-oriented repository of data about Wikimedia projects (in industry terms, a data lake). All of the data it contains can be joined together.

Data available

  • Traffic data -- webrequest, pageviews, unique devices ...
  • Edits data -- historical data about revisions, pages, and users (in beta as of 2017-04-07).

Currently, you need production data access to use this data, but as of March 2018, work is underway to make the edit history data publicly available as part of the Data Services provided to Cloud Services users (T169572).

As the Data Lake matures, we plan to add more datasets and to make as much of the data public as possible.

Technical architecture

Main article: Analytics/Systems/Data Lake

The Analytics Hadoop cluster is the primary backend for the Data Lake.