You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org

Analytics/Data Lake: Difference between revisions

From Wikitech-static
imported>Neil P. Quinn-WMF
(Remove link to placeholder page)
imported>Joal
No edit summary
Line 4:
* [[Analytics/Data Lake/Traffic|Traffic data]]  -- webrequest, pageviews, unique devices ...
* [[Analytics/Data Lake/Edits|Edits data]] -- Historical data about revisions, pages, and users [in beta as of 2017-04-07].
*[[Analytics/Data Lake/ORES|ORES scores]] -- Machine learning predictions
* [[Event Platform|Events data]] -- Eventlogging, eventbus and eventstreams data (raw, refined, sanitized)
**[[Analytics/Data Lake/ORES|ORES scores]] -- Machine learning predictions


Currently, you need [[Analytics/Data access#Production access|production data access]] to use this data, but as of March 2018, work is underway to make the edit history data publicly available as part of the [[Portal:Data Services|Data Services provided to Cloud Services users]] ([[phab:T169572|T169572]]).

Revision as of 13:09, 25 February 2020

The Analytics Data Lake (ADL), or the Data Lake for short, is a large, analytics-oriented repository of data about Wikimedia projects (in industry terms, a data lake). All of the data it contains can be joined together.

Data available

  • Traffic data -- webrequest, pageviews, unique devices ...
  • Edits data -- Historical data about revisions, pages, and users [in beta as of 2017-04-07].
  • Events data -- EventLogging, EventBus, and EventStreams data (raw, refined, sanitized)

Currently, you need production data access to use this data, but as of March 2018, work is underway to make the edit history data publicly available as part of the Data Services provided to Cloud Services users (T169572).

As the Data Lake matures, we will keep adding data to it and try to make as much of that data public as possible.

Technical architecture

The Analytics cluster, which consists of Hadoop servers and related components, provides the infrastructure for the Data Lake.