
Analytics/Data Lake

The '''Analytics Data Lake''' (ADL), or the '''Data Lake''' for short, is a large, analytics-oriented repository of data about Wikimedia projects (in industry terms, a [[w:data lake|data lake]]).
 
Technically, data in the Data Lake is stored in HDFS (the Hadoop Distributed File System), usually in the Parquet file format. The [https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cdh_ig_hms.html Hive metastore] is a centralized repository for metadata about these data files, and all three SQL query engines we use (Presto, Spark SQL, and Hive) rely on it.
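
The Hive metastore matters because it lets you refer to data by table name instead of by file path. As a rough illustration (assuming a PySpark session on an analytics client; the HDFS path and table name below are placeholders, not guaranteed to exist), the same Parquet data can be read either directly from HDFS or through the metastore:

<syntaxhighlight lang="python">
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parquet-example")
    .enableHiveSupport()          # needed for the metastore route below
    .getOrCreate()
)

# Route 1: read Parquet files directly from an HDFS path (placeholder path).
# Spark takes the schema from the Parquet footers, so no metastore lookup is involved.
df_from_files = spark.read.parquet("hdfs:///wmf/data/wmf/example_dataset")

# Route 2: resolve a table name through the Hive metastore (placeholder table name).
df_from_table = spark.table("wmf.example_table")

df_from_files.printSchema()
df_from_table.printSchema()
</syntaxhighlight>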
 
== Querying ==
Data in the Data Lake can be accessed directly through the [https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html <code>hdfs</code> command line tool].
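
For example, a quick way to see what lives under an HDFS directory is <code>hdfs dfs -ls</code>. The sketch below simply wraps that command in Python for illustration (the directory is a placeholder; use any path you have read access to):

<syntaxhighlight lang="python">
import subprocess

# Equivalent to typing `hdfs dfs -ls /wmf/data` in a shell on an analytics client.
# "/wmf/data" is a placeholder; substitute the directory you are interested in.
listing = subprocess.run(
    ["hdfs", "dfs", "-ls", "/wmf/data"],
    capture_output=True,
    text=True,
    check=True,
)
print(listing.stdout)
</syntaxhighlight>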
 
As of September 2020, you have a choice of three engines that can run SQL queries against the Data Lake: [[Analytics/Systems/Presto|Presto]], [[Hive]], and [[Analytics/Systems/Cluster/Spark|Spark]]. If you're not sure which to choose, Hive is good to start with. All three engines can be used from the [[Analytics/Systems/Clients|Analytics clients]].
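
As a rough sketch of what such a query looks like (run here through Spark SQL in PySpark; <code>wmf.pageview_hourly</code> is used as an illustrative table name, so check the traffic data documentation for the actual schema and partition fields):

<syntaxhighlight lang="python">
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sql-example")
    .enableHiveSupport()   # resolve table names through the Hive metastore
    .getOrCreate()
)

# Top ten most-viewed pages for one day; year/month/day are partition columns,
# so filtering on them limits how much data is scanned.
top_pages = spark.sql("""
    SELECT project, page_title, SUM(view_count) AS views
    FROM wmf.pageview_hourly
    WHERE year = 2020 AND month = 9 AND day = 1
    GROUP BY project, page_title
    ORDER BY views DESC
    LIMIT 10
""")
top_pages.show(truncate=False)
</syntaxhighlight>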


== Data available ==
* [[Analytics/Data Lake/Traffic|Traffic data]] -- webrequest, pageviews, unique devices ...
* [[Analytics/Data Lake/Edits|Edits data]] -- Historical data about revisions, pages, and users (a query sketch follows this list)
* [[Event Platform|Events data]] -- Eventlogging, eventbus and eventstreams data (raw, refined, sanitized)
** [[Analytics/Data Lake/ORES|ORES scores]] -- Machine learning predictions [available as events as of 2020-02-27]
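
As an example of working with the edits data, the sketch below counts revision creations per wiki from the <code>wmf.mediawiki_history</code> table (assuming a PySpark session on an analytics client; the snapshot value is illustrative, so pick one that actually exists):

<syntaxhighlight lang="python">
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("edits-example")
    .enableHiveSupport()
    .getOrCreate()
)

# mediawiki_history is partitioned by snapshot; event_entity and event_type
# narrow the rows down to revision creations.
revisions_per_wiki = spark.sql("""
    SELECT wiki_db, COUNT(*) AS revision_count
    FROM wmf.mediawiki_history
    WHERE snapshot = '2020-08'
      AND event_entity = 'revision'
      AND event_type = 'create'
    GROUP BY wiki_db
    ORDER BY revision_count DESC
    LIMIT 10
""")
revisions_per_wiki.show()
</syntaxhighlight>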


Currently, you need [[Analytics/Data access#Production access|production data access]] to use some of this data. A lot of it is available publicly at http://dumps.wikimedia.org.


== Technical architecture ==
The [[Analytics/Systems/Cluster|Analytics cluster]], which consists of Hadoop servers and related components, provides the infrastructure for the Data Lake.
