Analytics/Data Lake
Revision as of 17:13, 27 February 2020 by imported>Nuria
The Analytics Data Lake (ADL), or the Data Lake for short, is a large, analytics-oriented repository of data about Wikimedia projects (in industry terms, a data lake). All of the data it contains can be joined together.
Data available
- Traffic data -- webrequest, pageviews, unique devices ...
- Edits data -- Historical data about revisions, pages, and users [in beta as of 2017-04-07].
- Events data -- EventLogging, EventBus, and EventStreams data (raw, refined, sanitized)
- ORES scores -- Machine learning predictions [available as events as of 2020-02-27]
Currently, you need production data access to use some of this data. Much of it is publicly available at http://dumps.wikimedia.org.
As the Data Lake matures, we will add any and all data and try to make as much of it public as possible.
Technical architecture
The Analytics cluster, which consists of Hadoop servers and related components, provides the infrastructure for the Data Lake.