The Analytics Data Lake (ADL) refers to the collection, processing, and publishing of data from Wikimedia projects. At first, the Data Lake focuses on collecting historical data about editing, including revisions, pages, and users, and making it available in an analytics-friendly way for everyone, publicly. As the Data Lake matures, we will add any and all data that can be safely made public. The infrastructure will support public releases of datasets out of the box.
== Initial Scope ==
=== Consolidating Editing Data ===
Millions of people edit our projects. Information about the knowledge they generate and improve is trapped in hundreds of separate MySQL databases and large XML dump files. We will create analytics-friendly schemas and transform this scattered data to fit them. HDFS is the best storage solution for this, so that's what we'll use. We will design the schemas and the data extraction in an append-only style, so actions like deleting pages and suppressing user text can be first-class citizens. This will allow us to create redacted streams of data that can be published safely.
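To illustrate the append-only style, here is a minimal sketch of a hypothetical revision-events table; the table and column names are illustrative, not the actual production schema:

<syntaxhighlight lang="sql">
-- Hypothetical append-only table: every action is a new row, so
-- deletions and suppressions become events rather than destructive edits.
CREATE TABLE revision_events (
  event_type      STRING,  -- 'create', 'delete', 'suppress', ...
  event_timestamp STRING,  -- when the action happened
  wiki_db         STRING,  -- e.g. 'enwiki'
  page_id         BIGINT,
  revision_id     BIGINT,
  user_text       STRING   -- editor name or IP at the time of the event
)
STORED AS PARQUET;
</syntaxhighlight>

A redacted, publishable stream can then be derived by folding the suppression events over the raw stream, for example by nulling out user_text for any revision that has a later 'suppress' event.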
It will, of course, be important to keep this data up to date. To accomplish this, we will connect to real-time systems like Event Bus to get the latest data, and from time to time we'll run comparisons to make sure we have no replication gaps.
=== Hive Tables ===
When storing to HDFS, we will create well-documented, unified tables on top of this data. These will be useful for batch jobs and other long-running queries.
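For example, a batch query over such a table might look like the sketch below. It assumes the wmf.mediawiki_history table and a '2021-12' snapshot partition; substitute a snapshot that actually exists when running it:

<syntaxhighlight lang="sql">
-- Count revisions created on English Wikipedia during 2021.
SELECT COUNT(*) AS revision_count
FROM wmf.mediawiki_history
WHERE snapshot = '2021-12'          -- partition: one import of the history
  AND wiki_db = 'enwiki'
  AND event_entity = 'revision'
  AND event_type = 'create'
  AND event_timestamp LIKE '2021%';
</syntaxhighlight>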
=== Druid ===
Druid and any other Online Analytical Processing (OLAP) systems we use will serve this data to internal and possibly external users as well. This data-serving layer allows us to run complicated queries that would otherwise consume massive resources in a relational database. If we're able to properly redact and re-load this data on a regular basis, we will be able to open this layer to the public.
=== Analytics Query Service / Dumps ===
We will continue to push slices of this data out to the world through our Analytics Query Service (AQS), which currently hosts our Pageview and Unique Devices data. We will also make the most useful forms of this data available as static file dumps. These dumps will contain strictly metadata and shouldn't be confused with the "right to fork"-oriented richer dumps; those may become easier to generate using this system as well (see below).
== Pleasant Side Effects ==
One potential use of this technology will be to help replace the aging dumps process. Incremental dumps, more accurately redacted dumps, and reliably re-runnable dumps should all be much easier to achieve with the Data Lake, and the data streams that feed into it, than they are with the current set of dump scripts and manual intervention.
The Analytics Data Lake (ADL), or the Data Lake for short, is a large, analytics-oriented repository of data about Wikimedia projects (in industry terms, a data lake).
- Traffic data
  - Webrequest, pageviews, and unique devices
- Edits data
  - Historical data about revisions, pages, and users (e.g. MediaWiki History)
- Content data
  - Wikitext (latest & historical) and wikidata-entities
- Events data
  - EventLogging, EventBus, and event streams data (raw, refined, sanitized)
- ORES scores
  - Machine learning predictions (available as events as of 2020-02-27)
Some of these datasets (such as webrequests) are only available in Hive, while others (such as pageviews) are also available as data cubes (usually in a more aggregated form).
The main way to access the data in the Data Lake is to run queries using one of the three available SQL engines: Presto, Hive, and Spark.
You can access these engines through several different routes:
All three engines also have command-line programs which you can use on one of the analytics clients. This is probably the least convenient way, but if you want to use it, consult the engine's documentation page.
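As a quick illustration, a simple aggregate query like the following runs unchanged in all three engines; it assumes the wmf.pageview_hourly table with its year/month/day/hour partition fields:

<syntaxhighlight lang="sql">
-- Total pageviews on English Wikipedia for one hour.
SELECT SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE year = 2021 AND month = 11 AND day = 1 AND hour = 0
  AND project = 'en.wikipedia';
</syntaxhighlight>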
== Differences between the SQL engines ==
For the most part, Presto, Hive, and Spark work the same way, but they have some differences in SQL syntax and processing power.
- Spark and Hive use STRING as the keyword for string data, while Presto uses VARCHAR.
- In Spark and Hive, you use the SIZE function to get the length of an array, while in Presto you use CARDINALITY.
- In Spark and Hive, double quoted text (like "foo") is interpreted as a string, while in Presto it is interpreted as a column name. It's easiest to use single quoted text (like 'foo') for strings, since all three engines interpret it the same way.
- Spark and Hive have a CONCAT_WS ("concatenate with separator") function, but Presto does not.
- Spark supports both FLOAT and REAL as keywords for the 32-bit floating-point number data type, while Presto supports only REAL.
- Presto has no FIRST and LAST functions.
- If you need to use a keyword like DATE as a column name, you use backticks (`date`) in Spark and Hive, but double quotes ("date") in Presto.
- To convert an ISO 8601 timestamp string (e.g. "2021-11-01T01:23:02Z") to an SQL timestamp, the syntax differs by engine; see the sketch after this list.
Data Lake datasets which are available in Hive are stored in the Hadoop Distributed File System (HDFS), usually in the Parquet file format. The Hive metastore is a centralized repository for metadata about these data files, and all three SQL query engines we use (Presto, Spark SQL, and Hive) rely on it.
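Because all three engines share the same metastore, you can inspect a table's schema from any of them; the DESCRIBE statement works in Hive, Spark, and Presto, though the output format differs slightly:

<syntaxhighlight lang="sql">
-- Show the columns (including partition fields) of a Data Lake table.
DESCRIBE wmf.webrequest;
</syntaxhighlight>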
Some Data Lake datasets are available in Druid, which is separate from Hive and HDFS, and allows quick exploration and dashboarding of those datasets in Turnilo and Superset.
The Analytics cluster, which consists of Hadoop servers and related components, provides the infrastructure for the Data Lake.