The '''Analytics Data Lake''' (ADL), or the '''Data Lake''' for short, is a large, analytics-oriented repository of data about Wikimedia projects (in industry terms, a [[w:data lake|data lake]]).


== Data available ==
{{Note|content=Currently, you need [[Analytics/Data access#Production access|production data access]] to use some of this data. A lot of it is available publicly at [https://dumps.wikimedia.org/ dumps.wikimedia.org].|type=notice}}
 
;[[Analytics/Data Lake/Traffic|Traffic data]]
:[[Analytics/Data Lake/Traffic/Webrequest|Webrequest]], [[Analytics/Data Lake/Traffic/Pageviews|pageviews]], and [[Analytics/Data Lake/Traffic/Unique Devices|unique devices]]
;[[Analytics/Data Lake/Edits|Edits data]]
:Historical data about revisions, pages, and users (e.g. [[Analytics/Data Lake/Edits/MediaWiki history|MediaWiki History]])
;[[Analytics/Data Lake/Content|Content data]]
:Wikitext ([[Analytics/Data Lake/Content/Mediawiki wikitext current|latest]] & [[Analytics/Data Lake/Content/Mediawiki wikitext history|historical]]) and Wikidata entities
;[[Analytics/Data Lake/Events|Events data]]
:[[Analytics/Systems/EventLogging|EventLogging]], EventBus, and event stream data (raw, refined, [[Analytics/Systems/Event Sanitization|sanitized]])
;[[Analytics/Data Lake/ORES|ORES scores]]
:Machine learning predictions (available as events as of 2020-02-27)
 
Some of these datasets (such as webrequest) are only available in Hive, while others (such as pageviews) are ''also'' available as [[mw:Wikimedia Product/Data dictionary|data cubes]] (usually in a more aggregated form).
 
== Access ==
The main way to access the data in the Data Lake is to run queries using one of the three available SQL engines: [[Analytics/Systems/Presto|Presto]], [[Analytics/Systems/Cluster/Hive|Hive]], and [[Analytics/Systems/Cluster/Spark|Spark]]. If you're not sure which to choose, Hive is a good one to start with.
 
You can access these engines through several different routes:


* [[Analytics/Systems/Superset|'''Superset''']] has a graphical SQL editor where you can run [[Analytics/Systems/Presto|Presto]] queries
* '''[[Analytics/Cluster/Hue|Hue]]''' has a graphical SQL editor where you can run [[Analytics/Systems/Cluster/Hive|Hive]] queries
* '''Custom code''' on one of the [[Analytics/Systems/Clients|analytics clients]] (the easiest way to do this is to use our [[Analytics/Systems/Jupyter|Jupyter service]]); see the sketch after this list
** for '''Python''', use the [https://github.com/wikimedia/wmfdata-python wmfdata-python] package
** for '''R''', use the [https://github.com/wikimedia/wmfdata-r wmfdata-r] package
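
For example, here is a minimal sketch using the [https://github.com/wikimedia/wmfdata-python wmfdata-python] package (the <code>hive.run</code> call and its pandas DataFrame return value are assumptions based on that package's documentation; <code>wmf.pageview_hourly</code> is an existing Data Lake table):

<syntaxhighlight lang="python">
# Minimal sketch: query the Data Lake from a notebook or script on an
# analytics client. wmfdata's run() functions are assumed to return a
# pandas DataFrame, per the package README.
import wmfdata as wmf

pageviews = wmf.hive.run("""
    SELECT project, SUM(view_count) AS views
    FROM wmf.pageview_hourly
    WHERE year = 2021 AND month = 9 AND day = 1
      AND project = 'en.wikipedia'
    GROUP BY project
""")

print(pageviews)
</syntaxhighlight>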


All three engines also have command-line programs which you can use on one of the [[Analytics/Systems/Clients|analytics clients]]. This is probably the least convenient way, but if you want to use it, consult the engine's documentation page.


Data in the Data Lake can also be accessed directly through the [https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html <code>hdfs</code> command line tool].

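If you need to script around the <code>hdfs</code> tool, a minimal Python sketch looks like the following (this assumes you are on an analytics client with <code>hdfs</code> on the path; <code>/wmf/data</code> is an example path, adjust as needed):

<syntaxhighlight lang="python">
# Minimal sketch: list a Data Lake directory by shelling out to the hdfs CLI.
import subprocess

result = subprocess.run(
    ["hdfs", "dfs", "-ls", "/wmf/data"],  # example path; adjust as needed
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
</syntaxhighlight>
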
=== Differences between the SQL engines ===
For the most part, Presto, Hive, and Spark work the same way, but they have some differences in SQL syntax and processing power.


==== Syntax differences ====
* Spark and Hive use <code>STRING</code> as the keyword for string data, while Presto uses <code>VARCHAR</code>.
* In Spark and Hive, you use the <code>SIZE</code> function to get the length of an array, while in Presto you use <code>CARDINALITY</code>.
* In Spark and Hive, double-quoted text (like <code>"foo"</code>) is interpreted as a string, while in Presto it is interpreted as a column name. It's easiest to use single-quoted text (like <code>'foo'</code>) for strings, since all three engines interpret it the same way.
* Spark and Hive have a <code>CONCAT_WS</code> ("concatenate with separator") function, but Presto does not.
* Presto has no <code>FIRST</code> or <code>LAST</code> functions.
* If you need to use a keyword like <code>DATE</code> as a column name, use backticks (<code>`date`</code>) in Spark and Hive, but double quotes (<code>"date"</code>) in Presto.
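
To make these differences concrete, here is a sketch of the same logical query phrased for each dialect, run through wmfdata-python as above (<code>some_table</code> and its <code>date</code> and <code>tags</code> columns are hypothetical):

<syntaxhighlight lang="python">
# The same logical query, phrased for each engine's SQL dialect.
import wmfdata as wmf

# Hive/Spark dialect: SIZE() for array length, backticks around the
# reserved word `date`, single quotes for string literals.
hive_query = """
    SELECT `date`, SIZE(tags) AS tag_count
    FROM some_table
    WHERE category = 'example'
"""

# Presto dialect: CARDINALITY() instead of SIZE(), double quotes around
# the reserved word "date"; single-quoted strings work here too.
presto_query = """
    SELECT "date", CARDINALITY(tags) AS tag_count
    FROM some_table
    WHERE category = 'example'
"""

wmf.hive.run(hive_query)
wmf.presto.run(presto_query)
</syntaxhighlight>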


== Technical architecture ==
Data Lake datasets which are available in Hive are stored in the [[Analytics/Systems/Cluster/Hadoop|Hadoop]] Distributed File System (HDFS), usually in the [[En:Apache Parquet|Parquet]] file format. The [https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+Administration Hive metastore] is a centralized repository for metadata about these data files, and all three SQL query engines we use (Presto, Spark SQL, and Hive) rely on it.
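
As a minimal illustration (standard PySpark API, not specific to the wmfdata package), the shared metastore is what lets Spark resolve a bare table name to the underlying Parquet files in HDFS:

<syntaxhighlight lang="python">
# Minimal PySpark sketch: resolve a Data Lake table through the Hive metastore.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("data-lake-example")
    .enableHiveSupport()  # attach to the shared Hive metastore
    .getOrCreate()
)

# No file paths needed: the metastore maps the table name to its
# Parquet files in HDFS.
df = spark.table("wmf.pageview_hourly")
df.printSchema()
</syntaxhighlight>
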
Some Data Lake datasets are available in [[Analytics/Systems/Druid|Druid]], which is separate from Hive and HDFS, and allows quick exploration and dashboarding of those datasets in [[Analytics/Systems/Turnilo|Turnilo]] and [[Analytics/Systems/Superset|Superset]].

The [[Analytics/Systems/Cluster|Analytics cluster]], which consists of Hadoop servers and related components, provides the infrastructure for the Data Lake.
