Analytics/Data Lake/Edits
The [[Analytics/Data Lake|Analytics Data Lake]] contains a number of '''editing datasets'''.


To access this data, see [[SRE/Production access|how to request and set up access]]. For the guidelines governing access, see the [[Analytics/Data access guidelines|Data Access Guidelines]] and [[Analytics/Data access|accessing sensitive data]]. For recipes that work with large amounts of data, see [[Analytics/Data Lake/Cookbook]].


'''Note''': In contrast to [[Analytics/Data Lake/Traffic|traffic datasets]], edit datasets are ''not'' continuously updated. They are refreshed by fully re-importing and re-building them, with each run producing a new '''<code>snapshot</code>'''. This <code>snapshot</code> notion is key when querying the edits datasets, since mixing multiple snapshots doesn't make sense for most queries. As of 2017-04, snapshots are provided monthly. When we import, we grab all the data available from all tables except the <code>revision</code> table, which we filter with <code>where rev_timestamp <= <<snapshot-date>></code>. If the snapshot is a little late because of processing problems, then by the time it finishes it may contain more data in tables like logging, archive, etc. This should not affect history reconstruction, because we base everything on revisions, but it will affect any queries you run on those tables separately.
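As a minimal sketch of what this means in practice, a Hive query against the <code>mediawiki_history</code> table described below should always pin the <code>snapshot</code> partition to a single value. The snapshot, wiki, and date values here are illustrative; field names follow the [[Analytics/Data Lake/Edits/Mediawiki history|mediawiki_history]] schema, so check that page for the current schema:

<pre>
-- Count revisions created on English Wikipedia during August 2021,
-- reading only the 2021-08 snapshot partition (example values).
SELECT COUNT(*) AS revision_count
FROM wmf.mediawiki_history
WHERE snapshot = '2021-08'          -- always restrict to exactly one snapshot
  AND wiki_db = 'enwiki'
  AND event_entity = 'revision'
  AND event_type = 'create'
  AND event_timestamp >= '2021-08-01'
  AND event_timestamp <  '2021-09-01';
</pre>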


The pipeline used to generate these edits datasets is described at [[Analytics/Systems/Data Lake/Edits/Pipeline]].


== Datasets ==


=== Reference Data ===
* <code>[[Analytics/Data Lake/Edits/Mediawiki project namespace map|wmf_raw.mediawiki_project_namespace_map]]</code>


=== Raw Mediawiki data ===
These are unprocessed copies of the [[MariaDB]] [[mw:Manual:Database layout|application tables]] (most of them publicly available) that back our MediaWiki installations. They are stored in the <code>wmf_raw</code> database. The main difference from the original tables is that the import bundles all wikis together in every table, facilitating cross-wiki queries: each table contains an extra field, <code>wiki_db</code>, that selects which wikis to query. This field is also a [[Analytics/Systems/Cluster/Hive/Queries#Always restrict queries to a date range (partitioning)|partition]] in the Hive sense, so restricting on it makes queries much faster because data from other wikis is never read (see the example query after the table list below).
* <code>[[mw:Manual:Archive table|mediawiki_archive]]</code>
* <code>[[mw:Extension:CheckUser/cu changes table|mediawiki_cu_changes]]</code> (from the [[mw:Extension:CheckUser|CheckUser]] extension)
* <code>[[mw:Manual:ipblocks table|mediawiki_ipblocks]]</code>
* <code>[[mw:Manual:logging table|mediawiki_logging]]</code>
* <code>[[mw:Manual:page table|mediawiki_page]]</code>
* <code>[[mw:Manual:pagelinks table|mediawiki_pagelinks]]</code>
* <code>[[mw:Manual:redirect table|mediawiki_redirect]]</code>
* <code>[[mw:Manual:revision table|mediawiki_revision]]</code>
* <code>[[mw:Manual:user table|mediawiki_user]]</code>
* <code>[[mw:Manual:user groups table|mediawiki_user_groups]]</code>
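As a sketch (the table and field names come from the MediaWiki tables listed above, while the <code>snapshot</code> value and chosen wikis are examples), a cross-wiki query restricted by the <code>wiki_db</code> and <code>snapshot</code> partitions might look like:

<pre>
-- Count logging-table entries per wiki for two wikis, reading only the
-- partitions for those wikis and for one snapshot.
SELECT wiki_db,
       COUNT(*) AS log_entries
FROM wmf_raw.mediawiki_logging
WHERE snapshot = '2021-08'               -- example snapshot
  AND wiki_db IN ('frwiki', 'itwiki')    -- only the wikis of interest
GROUP BY wiki_db;
</pre>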


=== Processed data ===
These tables contain preprocessed data, usually stored in Parquet format and sometimes including additional computed fields. They can be found in the <code>wmf</code> database.


* <code>[[Analytics/Data Lake/Edits/Mediawiki history|mediawiki_history]]</code>: a fully denormalized dataset containing processed user, page, and revision data (see the example queries above and after this list)
* <code>[[Analytics/Data Lake/Edits/Mediawiki history_dumps|mediawiki_history dumps]]</code>: TSV dumps of the fully denormalized Mediawiki-History dataset, available for download from [https://dumps.wikimedia.org/other/mediawiki_history Mediawiki Dumps]
* <code>[[Analytics/Data Lake/Edits/Mediawiki history reduced|mediawiki_history_reduced]]</code>: a reduced version of <code>mediawiki_history</code>, with fewer fields and specific precomputed events, so that the [[Analytics/Systems/Druid|Druid]] datastore can compute by-page and by-user activity levels.
* <code>[[Analytics/Data Lake/Edits/Mediawiki user history|mediawiki_user_history]]</code>: a subset of <code>mediawiki_history</code> containing only user events
* <code>[[Analytics/Data Lake/Edits/Mediawiki page history|mediawiki_page_history]]</code>: a subset of <code>mediawiki_history</code> containing only page events
* <code>[[Analytics/Data Lake/Edits/Metrics|mediawiki_metrics]]</code>: Dataset providing precomputed metrics over edits data (e.g. monthly new registered users or daily edits by anonymous users)
* <code>[[Analytics/Data_Lake/Content/XMLDumps/Mediawiki_wikitext_current|mediawiki_wikitext_current]]</code>: Avro version of the current-pages XML dumps (updated monthly, in the middle of the month). It contains the text of each page's latest revision as well as some page and user information.
* <code>[[Analytics/Data_Lake/Content/XMLDumps/Mediawiki_wikitext_history|mediawiki_wikitext_history]]</code>: Avro version of the full revision-history XML dumps (updated monthly, late in the month). It contains the text of each non-deleted revision as well as some page and user information.
* <code>[[Analytics/Data Lake/Edits/Edit hourly|edit_hourly]]</code>: cube-like dataset focused on edits. Its structure resembles that of [[Analytics/Data Lake/Traffic/Pageview hourly|pageview_hourly]]. It has hourly granularity and is partitioned by snapshot (as it is computed from [[Analytics/Data Lake/Edits/Mediawiki history|mediawiki_history]]).
* [[Analytics/Data_Lake/Edits/Geoeditors|Geoeditors]]: Counts of editors by project by country at different activity levels.  For reference, this is migrated from the old [[Analytics/Systems/Geowiki]].
** <code>mediawiki_geoeditors_daily</code>
** <code>mediawiki_geoeditors_monthly</code>
** <code>mediawiki_geoeditors_edits_monthly</code>
** [[Analytics/Data_Lake/Edits/Geoeditors/Public|Public bucketed version of geoeditors monthly]]
* <code>[[Analytics/Data_Lake/Edits/Wikidata_entity|Wikidata entity]]</code>: a Parquet version of the Wikidata JSON dumps. Updated weekly, partitioned by snapshot.
* <code>[[Analytics/Data_Lake/Edits/Wikidata_item_page_link|Wikidata item page link]]</code>: links between Wikidata items and wiki pages (wiki_db, page_id). This is computed every week from the <code>wikidata_entity</code>, <code>mediawiki_page_history</code> and <code>project_namespace_map</code> tables. Warning: the page-history table is only updated monthly, so as the month progresses, item-to-page links become less precise.
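As an illustration of the kind of metric that <code>mediawiki_metrics</code> precomputes, here is a sketch written directly against <code>mediawiki_history</code>. Field names follow the [[Analytics/Data Lake/Edits/Mediawiki history|mediawiki_history]] schema; the snapshot, wiki, and date values are examples only:

<pre>
-- Daily edits by anonymous users on German Wikipedia during July 2021
-- (example values), computed from one mediawiki_history snapshot.
SELECT TO_DATE(event_timestamp) AS day,
       COUNT(*)                 AS anonymous_edits
FROM wmf.mediawiki_history
WHERE snapshot = '2021-08'                 -- one snapshot only
  AND wiki_db = 'dewiki'
  AND event_entity = 'revision'
  AND event_type = 'create'
  AND event_user_is_anonymous = TRUE
  AND event_timestamp >= '2021-07-01'
  AND event_timestamp <  '2021-08-01'
GROUP BY TO_DATE(event_timestamp)
ORDER BY day;
</pre>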


=== Public dataset ===
Download from https://dumps.wikimedia.org/other/analytics/


=== Limitations of the historical datasets ===
Users of this data should be aware that the reconstruction process is not perfect. The resulting data is not 100% complete throughout all wiki-history. In some specific slices/dices of the data set, some fields may be missing (null) or approximated (inferred value).


==== Why? ====
* MediaWiki databases are not meant to store history (revisions, yes, of course; but not user history or page history). They hold part of the history in the logging table, but it is incomplete and formatted in many different ways depending on the software version. This makes reconstructing MediaWiki history a really complex task. Sometimes the data simply is not there and cannot be reconstructed.
* The data is very large. The reconstruction algorithm needs to reprocess the whole set of databases, from the beginning of time, at every run, because MediaWiki constantly updates old records in the logging table. This presents hard performance challenges for the reconstruction job and makes the code much more complex. We need to balance the complexity of the job against data quality: at some point, adding a lot of complexity would only "maybe" improve quality for a small percentage of the data. For example, if only 0.5% of pages have field X missing and recovering that information would make reconstruction twice as complex, the field will not be corrected but rather documented as missing. This is a balance of requirements, so please let us know if we are missing something you need.


==== How much/Which data is missing? ====
After vetting the data for some time, we estimated that the recoverable data we did not manage to recover represents less than 1%. This data corresponds mostly to the earlier years of reconstructed history (2007-2009), and relates especially to deleted pages. We do not yet have an in-depth analysis of the completeness of the data; it is in our backlog, see [[phab:T155507]].


==== Will there be improvements in the future to correct this missing data? ====
Yes, if we know that the improvement will bring enough benefit. The task mentioned above would help in measuring that.


==== Examples ====
 
''History of deleted pages that are (re)created:'' Correctly identifying a page as deleted and then recreated might be straightforward for a small set of pages, and is simpler when "recreated" does not mean the page was undeleted by an administrator. But, as mentioned above, the way MediaWiki logs data changes over time, which further complicates the identification process, particularly at the scale of "across all wikis". You may therefore find pages that were recreated with the same page ID, namespace, and title, whose creation and deletion timestamps in the history table appear to be incorrect. If you want to analyze those kinds of cases, narrowing the dataset further (e.g. by time) may allow them to be processed correctly.
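For instance, one hedged way to surface candidate delete-and-recreate cases within a narrowed time range is to look for pages with more than one page-create event. The event type names, snapshot, wiki, and dates below are assumptions to check against the <code>mediawiki_history</code> schema:

<pre>
-- Pages on English Wikipedia with more than one page-create event during 2020,
-- i.e. candidates for the delete-and-recreate pattern described above.
SELECT page_id,
       page_title,
       COUNT(*) AS create_events
FROM wmf.mediawiki_history
WHERE snapshot = '2021-08'           -- example snapshot
  AND wiki_db = 'enwiki'
  AND event_entity = 'page'
  AND event_type = 'create'
  AND event_timestamp >= '2020-01-01'
  AND event_timestamp <  '2021-01-01'
GROUP BY page_id, page_title
HAVING COUNT(*) > 1;
</pre>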