
Analytics/Data Lake/Doc proposal: Difference between revisions

imported>Milimetric
mNo edit summary
 
imported>Joal
(Update moving pages)
The Analytics Data Lake (ADL) is a large, analytics-oriented repository of data, both raw and aggregated, about Wikimedia projects (in industry terms, a [[:en:Data_lake|data lake]]). It currently includes data on pageviews and a beta set of historical data about editing, including revisions, pages, and users. As the Data Lake matures, we will add any and all data that can be safely made public. The infrastructure will support public releases of datasets, out of the box.

== Moving pages proposal ==
<syntaxhighlight>
Proposal for subpages organisation:
Analytics/Data Lake/
  - Traffic
    - Data Pipeline
      - Data Ingestion
      - Data Refinement
      - Data serving
    - Datasets
      - webrequest
      - pageview_hourly
      - projectview_hourly
      - pagecounts (legacy)
      - unique_devices
  - Edits
    - Data Pipeline
      - Data Ingestion
      - Data Refinement
      - Data serving
    - Datasets
      - Mediawiki Tables
      - Rebuilt history
      - Metrics
</syntaxhighlight>

== Traffic history ==
Traffic history is now usually called <code>pageviews</code>. Before 2015, it was named <code>pagecounts</code> and was mostly extracted from sampled data.

=== Data Pipeline ===
* Data ingestion (kafka + camus)
* Data refinement and extraction (webrequest + pageview_hourly + projectview_hourly + unique_devices)
* Data serving (hive, druid, AQS)

=== Datasets ===
* webrequest
* pageview_hourly
* projectview_hourly
* pagecounts (legacy)
* unique devices

== Edits History ==

=== Data Pipeline ===
* Data ingestion (Sqoop)
* Data refinement (page+user and denormalize)
* Data serving (hive, druid ???)

=== Datasets ===
* Mediawiki tables (sqooped tables)
* Recomputed history (page, user, denormalized)
* Metrics

Revision as of 16:38, 4 April 2017

The Analytics Data Lake (ADL) is a large, analytics-oriented repository of data, both raw and aggregated, about Wikimedia projects (in industry terms, a data lake). It currently includes data on pageviews and a beta set of historical data about editing, including revisions, pages, and users. As the Data Lake matures, we will add any and all data that can be safely made public. The infrastructure will support public releases of datasets, out of the box.

Moving pages proposal

Analytics/Archives/

  2015 data warehouse experiments
  2015 data warehouse experiments/2014-12-02 verifications
  2015 data warehouse experiments/2015-01-14 verifications
  2015 data warehouse experiments/2015-02-03 verifications
  
  Cluster/ETL
  Cluster/Logging Solutions Overview
  Cluster/Logging Solutions Recommendation
  Cluster/Streaming -- Rename Hadoop Streaming for xml dumps?
  Cluster/Webrequest partitions

  Dashboards (archived)

  Data/Mobile requests stream
  Data/Webrequests sampled
  Data/Zero webrequests

  Data/Pagecounts-all-sites
  Data/Pagecounts-ez
  Data/Pagecounts-raw

  Kraken/Meetings
  Kraken/Meetings/ArchitectureReview
  Kraken/Meetings/SecurityReview

  Global-Dev Dashboard

  Limn

  Mingle

  Pageviews/Aggregation

  Pentaho
  Products

  TOC

  Webstatscollector

  Wikipedia Zero

  Wikistats2.0
  Wikistats2.0/Design

  gp.wmflabs.org

  statsv -- delete?


Analytics/Tutorials

  Dashboards


Analytics/Data Lake
  analytics.wikimedia.org

Analytics/Data Lake/Traffic

  Data access

  Data/Pageviews --> Pageviews 
  Data/Redirects --> Pageviews/Redirects
  Bots --> Pageviews/Bots
  
  Data/UserRetention --> User Retention

  Datasets/  <-- core page?
    Cluster/BrowserReports --> Browser Reports + enhance
  
  Monitoring (to create)

  -- Remove data below
  Data/ApiAction 
  Data/Browser general -- How does it relate to BrowserReports??
  Data/Cirrus -- Discuss with Discovery, delete?
  Data/Mediacounts
  Data/Pageview hourly
  Data/Pageview hourly/Fingerprinting Over Time
  Data/Pageview hourly/Identity reconstruction analysis
  Data/Pageview hourly/K Anonymity Threshold Analysis
  Data/Pageview hourly/Sanitization
  Data/Pageview hourly/Sanitization algorithm proposal
  Data/Projectview hourly
  Data/Unique Devices
  Data/Webrequest
  Data/Webrequest/RawIPUsage
  Data/mobile apps session metrics

  Pageviews

  PageviewAPI
  LegacyPageviewAPI

  PageviewAPI/Capacity - Delete ?
  PageviewAPI/DataStore
  PageviewAPI/RESTBase

  Unique Devices
  Unique Devices/Last access solution
  Unique clients/Last access solution/BotResearch
  Unique clients/Last access solution/Validation



Analytics/Data Lake/Edits
  Data Lake <-- Core page
  -- Remove Data Lake/Schemas/ below
  Data Lake/Schemas/Mediawiki history
  Data Lake/Schemas/Mediawiki page history
  Data Lake/Schemas/Mediawiki user history
  Data Lake/Schemas/Metric results


Analytics/Systems

  AQS
  AQS/Scaling
  AQS/Scaling/2016/Hardware Refresh
  AQS/Scaling/2017/Cluster Expansion
  AQS/Scaling/LoadTesting
  
  Archiva

  Cluster/Druid --> Druid
  Cluster/Druid/Load test --> Druid/Load test

  Cluster
  Cluster/Access
  Cluster/Beeline
  Cluster/Camus
  Cluster/Data Format Experiments
  Cluster/Deploy/Refinery
  Cluster/Deploy/Refinery-source
  Cluster/Deploy a fix to incorrect camus partitionning
  Cluster/Geotagging
  Cluster/Hadoop -- Update ??
  Cluster/Hadoop/Administration
  Cluster/Hadoop/Load
  Cluster/Hardware -- Update ??
  Cluster/Hive  --- Update (tables)
  Cluster/Hive/Avro
  Cluster/Hive/Compression
  Cluster/Hive/Counting uniques
  Cluster/Hive/Mediawiki --> Empty - To delete?
  Cluster/Hive/Queries
  Cluster/Hive/Queries/Wikidata
  Cluster/Hive/QueryUsingUDF
  Cluster/Hive/Schemas -- Replaced by datalake dataset - Delete?
  Cluster/Kafka/Capacity
  Kafka Udp2log --> Cluster/Kafka/Kafka Udp2log
  Cluster/Logstash -- Really ???
  Cluster/MediaWiki Avro Logging -- Split to DataLake??
  Cluster/Oozie
  Cluster/Oozie/Administration
  Cluster/Ports
  Cluster/Puppet
  Cluster/Spark -- Update oozie part
  Geolocation --> Cluster/Geolocation

  Conferences
  Conferences/Apache Big Data Europe - November 2016

  Dashiki

  Datastores/Evaluation

  -- Update Data Lake/Pipeline to mediawiki history pipeline
  Data Lake/Pipeline/Data loading
  Data Lake/Pipeline/Denormalization and historification
  Data Lake/Pipeline/Page and user history reconstruction
  Data Lake/Pipeline/Page and user history reconstruction algorithm and optimizations
  Data Lake/Pipeline/Serving layer
  -- Similarly as above, create traffic pipeline

  EventLogging
  EventLogging/Administration
  EventLogging/Architecture
  EventLogging/Backfilling
  EventLogging/Data representations
  EventLogging/Data retention and auto-purging
  EventLogging/Monitoring
  EventLogging/New pipeline
  EventLogging/Outages
  EventLogging/Performance
  EventLogging/Publishing
  EventLogging/Sanitization vs Aggregation
  EventLogging/Sensitive Fields
  EventLogging/TestingOnBetaCluster

  EventStreams <-- In systems really??

  Wikimetrics
  Wikimetrics/Adding New Features
  Wikimetrics/Adding New Features/CentralAuth Cohorts
  Wikimetrics/Adding New Features/Tag Cohorts
  Wikimetrics/Global metrics

  Geowiki

  Reportupdater

  Siege

  Varnishkafka

  Vital Signs

  Wikistats

  piwik

OTHERS:

  DataRequests --> In FAQ? 
  DataResearch --> milimetric/DataResearch ?
  DataResearch/VisualEditor --> milimetric/DataResearch/VisualEditor ?
  Data Lake/Doc proposal -- Delete
  
  Datasets << Removed?

  FAQ
  MailingList - In FAQ ?

  Onboarding

  Oncall

  Team -- Move to main?
  Tier2 -- Move to main
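
The `-->` annotations in the listing above read as proposed page moves (old title on the left, new title on the right, relative to the section heading they sit under). A minimal sketch of how those lines could be collected into a move list, assuming that reading is correct. The function name `parse_moves` and the `SAMPLE` excerpt are hypothetical and only for illustration; they are not part of any Wikitech tooling.

```python
import re

# Assumption: a proposed move is written as "Old title --> New title".
# Lines using "--" or "<--" are free-form notes and are deliberately skipped.
MOVE_RE = re.compile(r"^\s*(?P<old>.+?)\s+-->\s+(?P<new>.+?)\s*$")


def parse_moves(listing):
    """Return {old_title: new_title} for every 'old --> new' line."""
    moves = {}
    for line in listing.splitlines():
        match = MOVE_RE.match(line)
        if match:
            moves[match.group("old")] = match.group("new")
    return moves


# Hypothetical excerpt copied from the proposal listing.
SAMPLE = """\
  Data/Pageviews --> Pageviews
  Data/Redirects --> Pageviews/Redirects
  Bots --> Pageviews/Bots
  Cluster/Druid --> Druid
  Cluster/Hadoop -- Update ??
"""

if __name__ == "__main__":
    for old, new in parse_moves(SAMPLE).items():
        print(f"{old} -> {new}")
```

Some right-hand sides in the proposal are questions or remarks rather than titles (for example `DataRequests --> In FAQ?`), so the output of such a script would still need a manual review before any page is actually moved.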