Analytics/Data Lake

The '''Analytics Data Lake''' (ADL), or the '''Data Lake''' for short, is a large, analytics-oriented repository of data about Wikimedia projects (in industry terms, a [[w:data lake|data lake]]).

At first, the Data Lake focuses on collecting historical data about editing (revisions, pages, and users) and making it available in an analytics-friendly way for everyone, publicly. As the Data Lake matures, we will add any and all data that can be safely made public, and the infrastructure will support public releases of datasets out of the box.


== Data available ==
{{Note|content=Currently, you need [[Analytics/Data access#Production access|production data access]] to use some of this data. A lot of it is available publicly at [https://dumps.wikimedia.org/ dumps.wikimedia.org].|type=notice}}


;[[Analytics/Data Lake/Traffic|Traffic data]]
:[[Analytics/Data Lake/Traffic/Webrequest|Webrequest]], [[Analytics/Data Lake/Traffic/Pageviews|pageviews]], and [[Analytics/Data Lake/Traffic/Unique Devices|unique devices]]
;[[Analytics/Data Lake/Edits|Edits data]]
:Historical data about revisions, pages, and users (e.g. [[Analytics/Data Lake/Edits/MediaWiki history|MediaWiki History]])
;[[Analytics/Data Lake/Content|Content data]]
:Wikitext ([[Analytics/Data Lake/Content/Mediawiki wikitext current|latest]] & [[Analytics/Data Lake/Content/Mediawiki wikitext history|historical]]) and wikidata-entities
;[[Analytics/Data Lake/Events|Events data]]
:[[Analytics/Systems/EventLogging|EventLogging]], EventBus and event streams data (raw, refined, [[Analytics/Systems/Event Sanitization|sanitized]])
;[[Analytics/Data Lake/ORES|ORES scores]]
:Machine learning predictions (available as events as of 2020-02-27)


Some of these datasets (such as webrequests) are only available in Hive, while others (such as pageviews) are ''also'' available as [[mw:Wikimedia Product/Data dictionary|data cubes]] (usually in a more aggregated form).


== Access ==
The main way to access the data in the Data Lake is to run queries using one of the three available SQL engines: [[Analytics/Systems/Presto|Presto]], [[Hive]], and [[Analytics/Systems/Cluster/Spark|Spark]].


You can access these engines through several different routes:
* [[Analytics/Systems/Superset|'''Superset''']] has a graphical SQL editor where you can run [[Analytics/Systems/Presto|Presto]] queries
* '''[[Analytics/Cluster/Hue|Hue]]''' has a graphical SQL editor where you can run [[Analytics/Systems/Cluster/Hive|Hive]] queries
* '''Custom code''' on one of the [[Analytics/Systems/Clients|analytics clients]] (the easiest way to do this is to use our [[Analytics/Systems/Jupyter|Jupyter service]])
** for '''Python''', use the [https://github.com/wikimedia/wmfdata-python wmfdata-python] package
** for '''R''', use the [https://github.com/wikimedia/wmfdata-r wmfdata-r] package


All three engines also have command-line programs which you can use on one of the [[Analytics/Systems/Clients|analytics clients]]. This is probably the least convenient way, but if you want to use it, consult the engine's documentation page.
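For example, here is a minimal sketch of a query that should run in any of the three engines. It assumes the <code>wmf.pageview_hourly</code> table described on the [[Analytics/Data Lake/Traffic/Pageviews|pageviews]] page; check that page for the exact field names.

<pre>
-- Total pageviews by human users of English Wikipedia on one day (illustrative sketch)
SELECT SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE year = 2022
  AND month = 1
  AND day = 1
  AND project = 'en.wikipedia'
  AND agent_type = 'user';
</pre>

Filtering on the partition fields (<code>year</code>, <code>month</code>, <code>day</code>) keeps the engines from scanning the entire dataset.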


=== Differences between the SQL engines ===
For the most part, Presto, Hive, and Spark work the same way, but they have some differences in SQL syntax and processing power.


==== Syntax differences ====
* Spark and Hive use <code>STRING</code> as the keyword for string data, while Presto uses <code>VARCHAR</code>.
*In Spark and Hive, you use the <code>SIZE</code> function to get the length of an array, while in Presto you use <code>CARDINALITY</code>.
*In Spark and Hive, double quoted text (like <code>"foo"</code>) is interpreted as a string, while in Presto it is interpreted as a column name. It's easiest to use single quoted text (like <code>'foo'</code>) for strings, since all three engines interpret it the same way.
*Spark and Hive have a <code>CONCAT_WS</code> ("concatenate with separator") function, but Presto does not.
* Spark supports both <code>FLOAT</code> and <code>REAL</code> as keywords for the 32-bit floating-point number data type, while Presto supports only <code>REAL</code>.
* Presto has no <code>FIRST</code> and <code>LAST</code> functions.
* If you need to use a keyword like <code>DATE</code> as a column name, you use backticks (<code>`date`</code>) in Spark and Hive, but double quotes (<code>"date"</code>) in Presto.
*To convert an ISO 8601 timestamp string (e.g. <code>"2021-11-01T01:23:02Z"</code>) to an SQL timestamp:
**Spark: <code>TO_TIMESTAMP(dt)</code>
**Presto: <code>FROM_ISO8601_TIMESTAMP(dt)</code>
**Hive: <code>FROM_UNIXTIME(UNIX_TIMESTAMP(dt, "yyyy-MM-dd'T'HH:mm:ss'Z'"))</code>
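To make a few of these differences concrete, here is the same small query written for Spark and Hive and then for Presto. This is only a sketch against a hypothetical table <code>t</code> with an ISO 8601 string column <code>dt</code>, an array column <code>tags</code>, and a column named <code>date</code>; <code>ARRAY_JOIN</code> is shown as one Presto alternative to <code>CONCAT_WS</code>.

<pre>
-- Spark / Hive version
SELECT
  SIZE(tags) AS tag_count,
  -- in Spark you can also simply write TO_TIMESTAMP(dt)
  FROM_UNIXTIME(UNIX_TIMESTAMP(dt, "yyyy-MM-dd'T'HH:mm:ss'Z'")) AS ts,
  CONCAT_WS('-', 'a', 'b') AS joined
FROM t
WHERE `date` = '2021-11-01';

-- Presto version
SELECT
  CARDINALITY(tags) AS tag_count,
  FROM_ISO8601_TIMESTAMP(dt) AS ts,
  ARRAY_JOIN(ARRAY['a', 'b'], '-') AS joined  -- Presto has no CONCAT_WS
FROM t
WHERE "date" = '2021-11-01';
</pre>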


== Technical architecture ==
Data Lake datasets which are available in Hive are stored in the [[Analytics/Systems/Cluster/Hadoop|Hadoop]] Distributed File System (HDFS), usually in the [[w:Apache Parquet|Apache Parquet]] file format. The [https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+Administration Hive metastore] is a centralized repository for metadata about these data files, and all three SQL query engines we use (Presto, Spark SQL, and Hive) rely on it.
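For example, to see how a Hive table maps onto columns, partitions, and Parquet files in HDFS, you can ask the metastore. A minimal sketch in Hive syntax, assuming the <code>wmf.webrequest</code> table described on the [[Analytics/Data Lake/Traffic/Webrequest|webrequest]] page:

<pre>
-- Show columns, partition keys, and the HDFS location registered in the Hive metastore
DESCRIBE FORMATTED wmf.webrequest;

-- List the partitions (and therefore the HDFS directories) that make up the table
SHOW PARTITIONS wmf.webrequest;
</pre>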
Some Data Lake datasets are available in [[Analytics/Systems/Druid|Druid]], which is separate from Hive and HDFS, and allows quick exploration and dashboarding of those datasets in [[Analytics/Systems/Turnilo|Turnilo]] and [[Analytics/Systems/Superset|Superset]].


The [[Analytics/Systems/Cluster|Analytics cluster]], which consists of Hadoop servers and related components, provides the infrastructure for the Data Lake.

== Initial Scope ==

=== Consolidating Editing Data ===
Millions of people edit our projects. Information about the knowledge they generate and improve is trapped in hundreds of separate MySQL databases and in large XML dump files. We will create analytics-friendly schemas and transform this separated data to fit those schemas. HDFS is the best storage solution for this, so that's what we'll use. We will design the schemas and the data extraction in an append-only style, so that actions like deleting pages and suppressing user text can be first-class citizens. This will allow us to create redacted streams of data that can be published safely.

It will of course be important to keep this data up to date. To accomplish this we will connect to real-time systems like EventBus to get the latest data. From time to time, we'll compare against the source databases to make sure we have no replication gaps.

=== Hive Tables ===
When storing to HDFS, we will create well-documented, unified tables on top of this data. These tables will be useful for any batch or very long-running queries.

=== Druid ===
Druid and any other Online Analytical Processing (OLAP) systems we use will serve this data to internal users, and perhaps to external users as well. This data-serving layer allows us to run complicated queries that would otherwise consume massive resources in a relational database. If we're able to properly redact and re-load this data on a regular basis, we will be able to open this layer to the public.

=== Analytics Query Service / Dumps ===
We will continue to push slices of this data out to the world through our query service (AQS), which currently hosts our Pageview and Unique Devices data. We will also make the most useful forms of this data available in static file dumps. These dumps will contain strictly metadata and shouldn't be confused with the richer, "right to fork"-oriented dumps; those may be easier to generate using this system as well (see below).

== Pleasant Side Effects ==
One potential use of this technology will be to help replace the aging Dumps process. Incremental dumps, more accurately redacted dumps, and reliably re-runnable dumps should all be much easier to achieve with the Data Lake, and the data streams that feed into it, than they are with the current set of dumps scripts and manual intervention.

== Project Documentation ==
 
=== Architecture ===
 
==== Systems ====
Various experiences<ref>Two big historical projects are [[m:Data_dumps|dumps generation]] and [[stats:|wikistats]], and two newer internal projects are [[Analytics/DataWarehouse|DataWarehouse]] and [[m:Research:Measuring_edit_productivity|measuring edit productivity]].</ref> with gathering and computing over the full edit history have shown that it's a bad idea to rebuild a full edit data set on a regular basis rather than updating it incrementally.
 
In order to get there, two core systems are needed:
* '''Historical data extraction system:''' It extracts historical data from the MediaWiki databases and/or the XML dumps and converts and refines it into the schema used (see below for the schema description).
* '''Incremental data update system:''' It handles events flowing through a streaming system and updates an already existing data set by transforming and refining the events into the needed schema.
Once those two systems are built and tested, a date D needs to be decided upon: the data set will be built from the historical system before D and from the incremental system after D. We also plan to maintain the historical system, even if it is used less regularly than the incremental one, to ensure that new data can still be extracted historically in the future.
 
==== Stack ====
The plan is to use [[Analytics/Cluster/Hadoop|Hadoop]] both to store the data and to run the various ETL / refinement steps (it is cheap, reliable, and already in place).
 
Feeding systems will be [[mediawikiwiki:Manual:Database_layout|MariaDB]] for historical needs, since it contains more and better-quality data than the XML dumps, and [[mediawikiwiki:Extension:EventBus|Kafka through EventBus]] for streaming input data.
 
Querying systems are planned to be Druid for usual / simple metrics, [[Analytics/Cluster/Hive|Hive]] and/or [[Analytics/Cluster/Spark|Spark]] for complex queries, and possibly the [[Analytics/AQS|Analytics Query Service]] to provide metrics externally.
 
=== Data schema ===
 
==== Historical data ====
* [https://upload.wikimedia.org/wikipedia/commons/f/f7/MediaWiki_1.24.1_database_schema.svg Mediawiki Database schema]  --  <code>Revision</code>, <code>Archive</code>, <code>Page</code>, <code>User</code>, and <code>Logging</code> tables.
 
* Intermediate schema -- Fed by transforming and enhancing raw MediaWiki data. It contains <code>Revision_change</code>, <code>Page_change</code>, and <code>User_change</code> tables (still a work in progress). Rows contain the entity state at a moment in time, the time of the next state change, and event-oriented information such as who made the change and what type of change it is; a hypothetical sketch is given below.
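Purely as an illustration of the shape described above (the design is still in flux, and every field name here is made up for the example), a <code>Page_change</code> table might look something like this in Hive syntax:

<pre>
-- Hypothetical sketch only; these are not the real field names
CREATE EXTERNAL TABLE page_change (
    wiki_db                 string,
    page_id                 bigint,
    -- state of the page during this interval
    page_title              string,
    page_namespace          int,
    -- when this state started and when the next state change happened
    start_timestamp         string,
    end_timestamp           string,
    -- event-oriented information about the change that created this state
    caused_by_event_type    string,  -- create, move, delete, restore, ...
    caused_by_user_id       bigint,
    caused_by_user_text     string
);
</pre>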
 
==== Incremental data ====
* [https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema/mediawiki Event Bus schemas] -- An [[phab:T134502|update to this schema]] is being discussed and will be merged as a v1 when the MediaWiki code gets updated to populate those new event types.
 
==== Query data in Hadoop ====
* Fully denormalized schema containing event-oriented rows, with null values in fields not related to the current event. In Hive syntax:
 
<pre>
CREATE EXTERNAL TABLE edit_history (
 
    -- Generic event information
    -- Populated for every event
    wiki_db                                string,
    -- revision, page, user
    event_entity                            string,
    -- create, move, delete, rename, etc.
    event_type                              string,
    -- mediawiki format yyyyMMddHHmmss
    event_timestamp                        string,
    -- from the revision, log, archive, etc.
    event_comment                          string,
    -- fields with _latest are the most recent values for this user
    -- the other fields are the values at the time of the event
    event_user_id                          bigint,
    event_user_text                        string,
    event_user_text_latest                  string,
    event_user_blocks                      array<string>,
    event_user_blocks_latest                array<string>,
    event_user_groups                      array<string>,
    event_user_groups_latest                array<string>,
    event_user_creation_timestamp          string,
 
    -- For page and revision events, the page that's being acted on (new revision, moved, deleted, etc.)
    -- fields with _latest are the most recent values for this page
    -- the other fields are the values at the time of the event
    page_id                                bigint,
    page_title                              string,
    page_title_latest                      string,
    page_namespace                          int,
    page_namespace_latest                  int,
    page_creation_timestamp                string,
 
    -- For user events, the user that's being acted on (blocked, renamed, etc.)
    -- fields with _latest are the most recent values for this user
    -- the other fields are the values at the time of the event
    user_id                                bigint,
    user_text                              string,
    user_text_latest                        string,
    user_blocks                            array<string>,
    user_blocks_latest                      array<string>,
    user_groups                            array<string>,
    user_groups_latest                      array<string>,
    user_creation_timestamp                string,
 
    -- For revision events, the revision that's being acted on (created, visibility changed, etc.)
    revision_id                            bigint,
    revision_parent_id                      bigint,
    revision_minor_edit                    boolean,
    revision_text_bytes                    bigint,
    -- computed by joining on rev_parent_id and taking the byte difference (not all records joined cleanly)
    revision_text_bytes_diff                bigint,
    revision_text_sha1                      string,
    revision_content_model                  string,
    revision_content_format                string,
    -- a best guess as to when this revision was deleted (if it came from the archive table)
    -- if this was not available, we used the timestamp of the last revision for the archived page
    revision_deleted_timestamp              string,
    -- These last two are not vetted yet, but they'd be valuable for revert work (we'll give them better names too)
    revision_most_recent_revert_timestamp  string,
    revision_sha1_revert                    boolean
)
;
 
-- If you'd like to actually create the table, add this before the semicolon:
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://analytics-hadoop/user/mforns/denormalizedFixed'
</pre>
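As an example of how this event-oriented schema can be queried, the sketch below counts page creations per wiki during 2015. It uses only the fields defined above; since <code>event_timestamp</code> is a MediaWiki-format <code>yyyyMMddHHmmss</code> string, plain string comparison is enough.

<pre>
SELECT
    wiki_db,
    COUNT(*) AS pages_created
FROM edit_history
WHERE event_entity = 'page'
  AND event_type = 'create'
  AND event_timestamp >= '20150101000000'
  AND event_timestamp <  '20160101000000'
GROUP BY wiki_db
ORDER BY pages_created DESC
LIMIT 20;
</pre>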
 
==== Query data in Druid ====
* Same denormalized schema as in Hadoop, enhanced with precomputed immutable flags<ref>For instance <code>is_new_editor</code>, <code>is_new_productive_editor</code>, and <code>is_new_surviving_editor</code> for users, and <code>is_productive</code>, <code>is_reverted</code>, and <code>is_deleted</code> for revisions.</ref> if [http://druid.io/docs/latest/querying/lookups.html Druid query-time lookups] can handle them.
 
=== Ongoing Work ===
 
==== EventBus ====
* Schema update -- {{Phabricator|T134502}}
* Mediawiki update to handle schema update -- {{Phabricator|T137287}}
* New event schema to come after this set of patches
 
==== Historical data sourcing ====
* Hive schema creation and test using simplewiki and a set of test queries on dump generated data -- {{Phabricator|T134793}}
* ETL for transforming MediaWiki database data to Hive schema for simplewiki -- {{Phabricator|T134790}}
* Scalability tests to come after pipeline is built
 
==== Details not to Forget ====
* At page rename, there is sometimes a new page created which has the renamed page's original title and redirects to the renamed page. We have left those aside for the moment.
* There are user rename log lines that can't be linked back to an actual user. It could be because of deletions, but we're not sure. We should investigate a bit.
 
 