Difference between revisions of "Analytics/Data Lake/Edits"

From Wikitech-static
imported>Joal
(Remove redirect and provide information)
imported>HaeB
(→‎Processed Data: two examples to give an idea what kind of metrics we're talking about)

Revision as of 00:41, 20 May 2017

This page links to detailed information about Edits datasets in the Data Lake.

Unlike the traffic datasets, these datasets are not continuously updated. Instead, they are periodically re-imported and rebuilt in full, and each rebuild creates a new snapshot.

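This snapshot notion is key when querying the Edits datasets, since including multiple snapshots doesn't make sense for most queries: each snapshot is a complete rebuild, so the same data appears in every snapshot that covers it. As of 2017-04, snapshots are provided monthly.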
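
To illustrate why a query should normally be restricted to one snapshot, here is a minimal Python sketch over in-memory rows. The field names (`snapshot`, `revision_id`), the `YYYY-MM` snapshot label format, and the helper functions are assumptions made for illustration only; they are not taken from the actual dataset schemas.

```python
# Hypothetical rows: because each snapshot is a full rebuild, the same
# revision appears once in every snapshot that includes it.
rows = [
    {"snapshot": "2017-03", "revision_id": 1},
    {"snapshot": "2017-03", "revision_id": 2},
    {"snapshot": "2017-04", "revision_id": 1},
    {"snapshot": "2017-04", "revision_id": 2},
    {"snapshot": "2017-04", "revision_id": 3},
]


def latest_snapshot(rows):
    """Return the most recent snapshot label.

    Assumes labels are zero-padded YYYY-MM strings, which sort
    chronologically when compared as plain strings.
    """
    return max(row["snapshot"] for row in rows)


def count_revisions(rows, snapshot=None):
    """Count rows, optionally restricted to a single snapshot."""
    return sum(
        1 for row in rows
        if snapshot is None or row["snapshot"] == snapshot
    )


# Counting across all snapshots counts revisions 1 and 2 twice.
naive_count = count_revisions(rows)

# Restricting to one snapshot counts each revision once.
single_snapshot_count = count_revisions(rows, latest_snapshot(rows))

print(naive_count, single_snapshot_count)
```

The same idea applies to a real query: filter on a single snapshot value first, then aggregate.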

Datasets

Mediawiki raw data

These are copies of MediaWiki MySQL tables:

  • archive
  • ipblocks
  • logging
  • page
  • revision
  • user
  • user_groups

Processed Data

  • Mediawiki user history -- Dataset providing reconstructed history events of mediawiki users
  • Mediawiki page history -- Dataset providing reconstructed history events of mediawiki pages
  • Mediawiki history -- Fully denormalized dataset containing user, page and revision processed data
  • Metrics -- Dataset providing precomputed metrics over edits data (e.g. monthly new registered users or daily edits by anonymous users)

Access

Some of the data above is made public through different systems (see Analytics main page), but any data on the Data Lake is private by default. For information on how to get access, see Analytics/Data access.