
Analytics/Data Lake/Doc proposal: Difference between revisions

imported>Milimetric
mNo edit summary
 
imported>Joal
(Undo revision 1794899 by Test9999 (talk))
 
The Analytics Data Lake (ADL) is a large, analytics-oriented repository of raw and aggregated data about Wikimedia projects (in industry terms, a [[:en:Data_lake|data lake]]). It currently includes pageview data and a beta set of historical editing data covering revisions, pages, and users. As the Data Lake matures, we will add any and all data that can safely be made public. The infrastructure will support public releases of datasets out of the box.
#REDIRECT [[Analytics/Doc proposal]]
 
<syntaxhighlight>
Proposal for subpages organisation:
Analytics/Data Lake/
  - Traffic
    - Data Pipeline
      - Data Ingestion
      - Data Refinement
      - Data serving
    - Datasets
      - webrequest
      - pageview_hourly
      - projectview_hourly
      - pagecounts (legacy)
      - unique_devices
  - Edits
    - Data Pipeline
      - Data Ingestion
      - Data Refinement
      - Data serving
    - Datasets
      - Mediawiki Tables
      - Rebuilt history
      - Metrics
</syntaxhighlight>
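To make the proposed hierarchy concrete, the outline above can be expanded into full page titles. The following is a minimal sketch, assuming each outline entry becomes a wiki subpage whose title is the path joined with <code>/</code>; the names <code>OUTLINE</code>, <code>ROOT</code> and <code>subpage_titles</code> are illustrative and not part of the proposal.

```python
# Outline copied from the proposal; nesting mirrors its indentation.
OUTLINE = {
    "Traffic": {
        "Data Pipeline": ["Data Ingestion", "Data Refinement", "Data serving"],
        "Datasets": [
            "webrequest",
            "pageview_hourly",
            "projectview_hourly",
            "pagecounts (legacy)",
            "unique_devices",
        ],
    },
    "Edits": {
        "Data Pipeline": ["Data Ingestion", "Data Refinement", "Data serving"],
        "Datasets": ["Mediawiki Tables", "Rebuilt history", "Metrics"],
    },
}

# Assumption: every entry becomes a subpage under this root.
ROOT = "Analytics/Data Lake"


def subpage_titles(outline, root=ROOT):
    """Return every page title implied by the outline, parents before children."""
    titles = []
    for section, groups in outline.items():
        titles.append(f"{root}/{section}")
        for group, leaves in groups.items():
            titles.append(f"{root}/{section}/{group}")
            for leaf in leaves:
                titles.append(f"{root}/{section}/{group}/{leaf}")
    return titles


if __name__ == "__main__":
    for title in subpage_titles(OUTLINE):
        print(title)
```

Under that assumption the outline yields 20 pages, for example <code>Analytics/Data Lake/Traffic/Datasets/webrequest</code>.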
 
== Traffic history ==
Traffic history is currently usually named <code>pageviews</code>. Before 2015, it was named <code>pagecounts</code> and was mostly extracted from sampled data.
 
=== Data Pipeline ===
* Data ingestion (Kafka + Camus)
* Data refinement and extraction (webrequest + pageview_hourly + projectview_hourly + unique_devices)
* Data serving (Hive, Druid, AQS)
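
As an illustration of the refinement and extraction step, the sketch below rolls request-level records up into hourly per-page counts and then into hourly per-project counts. It is a toy in-memory version under stated assumptions, not the production jobs: the record fields (<code>dt</code>, <code>project</code>, <code>page_title</code>, <code>is_pageview</code>) and the function names are simplified placeholders chosen for this example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical, heavily simplified request records. The real webrequest
# data has many more fields; these names are illustrative assumptions.
REQUESTS = [
    {"dt": "2018-06-18T18:05:12", "project": "en.wikipedia", "page_title": "Data_lake", "is_pageview": True},
    {"dt": "2018-06-18T18:40:03", "project": "en.wikipedia", "page_title": "Data_lake", "is_pageview": True},
    {"dt": "2018-06-18T18:41:55", "project": "en.wikipedia", "page_title": "Main_Page", "is_pageview": True},
    {"dt": "2018-06-18T18:59:30", "project": "en.wikipedia", "page_title": "Data_lake", "is_pageview": False},
    {"dt": "2018-06-18T19:01:07", "project": "de.wikipedia", "page_title": "Hauptseite", "is_pageview": True},
]


def hour_bucket(dt):
    """Truncate an ISO-like timestamp string to its hour."""
    return datetime.strptime(dt, "%Y-%m-%dT%H:%M:%S").strftime("%Y-%m-%dT%H")


def pageviews_hourly(records):
    """Count pageview requests per (hour, project, page_title)."""
    counts = Counter()
    for record in records:
        if record["is_pageview"]:
            key = (hour_bucket(record["dt"]), record["project"], record["page_title"])
            counts[key] += 1
    return counts


def projectviews_hourly(page_counts):
    """Sum per-page hourly counts into per-project hourly counts."""
    counts = Counter()
    for (hour, project, _title), views in page_counts.items():
        counts[(hour, project)] += views
    return counts


if __name__ == "__main__":
    per_page = pageviews_hourly(REQUESTS)
    per_project = projectviews_hourly(per_page)
    for key, views in sorted(per_project.items()):
        print(key, views)
```

The second aggregation reads only the output of the first, which mirrors the idea of deriving each coarser dataset from the finer one rather than rescanning raw requests.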
 
=== Datasets ===
* webrequest
* pageview_hourly
* projectview_hourly
* pagecounts (legacy)
* unique_devices
 
== Edits history ==
 
=== Data Pipeline ===
* Data ingestion (Sqoop)
* Data refinement (page + user and denormalize)
* Data serving (Hive, Druid ???)
 
=== Datasets ===
* MediaWiki tables (sqooped tables)
* Recomputed history (page, user, denormalized)
* Metrics

Latest revision as of 18:20, 18 June 2018