Analytics/Data Lake/Doc proposal

Revision as of 19:20, 30 March 2017 by imported>Milimetric

The Analytics Data Lake (ADL) is a large, analytics-oriented repository of data, both raw and aggregated, about Wikimedia projects (in industry terms, a data lake). It currently includes data on pageviews and a beta set of historical data about editing, including revisions, pages, and users. As the Data Lake matures, we will add any and all data that can be safely made public. The infrastructure will support public releases of datasets out of the box.

Proposed subpage organisation:
Analytics/Data Lake/
   - Traffic
     - Data Pipeline
       - Data Ingestion
       - Data Refinement
       - Data Serving
     - Datasets
       - webrequest
       - pageview_hourly
       - projectview_hourly
       - pagecounts (legacy)
       - unique_devices
   - Edits
     - Data Pipeline
       - Data Ingestion
       - Data Refinement
       - Data Serving
     - Datasets
       - Mediawiki Tables
       - Rebuilt history
       - Metrics

Traffic history

Traffic history is now usually referred to as pageviews. Before 2015, it was named pagecounts and was mostly extracted from sampled data.

Data Pipeline

  • Data ingestion (Kafka + Camus)
  • Data refinement and extraction (webrequest + pageview_hourly + projectview_hourly + unique_devices)
  • Data serving (Hive, Druid, AQS)

Datasets

  • webrequest
  • pageview_hourly
  • projectview_hourly
  • pagecounts (legacy)
  • unique_devices

Edits history

Data Pipeline

  • Data ingestion (Sqoop)
  • Data refinement (page+user and denormalize)
  • Data serving (Hive, Druid ???)

Datasets

  • Mediawiki tables (sqooped tables)
  • Recomputed history (page, user, denormalized)
  • Metrics