Analytics/Data Lake/Doc proposal

Revision as of 16:38, 4 April 2017 by imported>Joal (Update moving pages)

The Analytics Data Lake (ADL) is a large, analytics-oriented repository of data, both raw and aggregated, about Wikimedia projects (in industry terms, a data lake). It currently includes pageview data and a beta set of historical data about editing, covering revisions, pages, and users. As the Data Lake matures, we will add any data that can safely be made public. The infrastructure will support public releases of datasets out of the box.

Moving pages proposal

Analytics/Archives/

  2015 data warehouse experiments
  2015 data warehouse experiments/2014-12-02 verifications
  2015 data warehouse experiments/2015-01-14 verifications
  2015 data warehouse experiments/2015-02-03 verifications
  
  Cluster/ETL
  Cluster/Logging Solutions Overview
  Cluster/Logging Solutions Recommendation
  Cluster/Streaming -- Rename to Hadoop Streaming for XML dumps?
  Cluster/Webrequest partitions

  Dashboards (archived)

  Data/Mobile requests stream
  Data/Webrequests sampled
  Data/Zero webrequests

  Data/Pagecounts-all-sites
  Data/Pagecounts-ez
  Data/Pagecounts-raw

  Kraken/Meetings
  Kraken/Meetings/ArchitectureReview
  Kraken/Meetings/SecurityReview

  Global-Dev Dashboard

  Limn

  Mingle

  Pageviews/Aggregation

  Pentaho
  Products

  TOC

  Webstatscollector

  Wikipedia Zero

  Wikistats2.0
  Wikistats2.0/Design

  gp.wmflabs.org

  statsv -- delete?


Analytics/Tutorials

  Dashboards


Analytics/Data Lake
  analytics.wikimedia.org

Analytics/Data Lake/Traffic

  Data access

  Data/Pageviews --> Pageviews 
  Data/Redirects --> Pageviews/Redirects
  Bots --> Pageviews/Bots
  
  Data/UserRetention --> User Retention

  Datasets/  <-- core page?
    Cluster/BrowserReports --> Browser Reports + enhance
  
  Monitoring (to create)

  -- Remove data below
  Data/ApiAction 
  Data/Browser general -- How does it relate to BrowserReports?
  Data/Cirrus -- Discuss with Discovery, delete?
  Data/Mediacounts
  Data/Pageview hourly
  Data/Pageview hourly/Fingerprinting Over Time
  Data/Pageview hourly/Identity reconstruction analysis
  Data/Pageview hourly/K Anonymity Threshold Analysis
  Data/Pageview hourly/Sanitization
  Data/Pageview hourly/Sanitization algorithm proposal
  Data/Projectview hourly
  Data/Unique Devices
  Data/Webrequest
  Data/Webrequest/RawIPUsage
  Data/mobile apps session metrics

  Pageviews

  PageviewAPI
  LegacyPageviewAPI

  PageviewAPI/Capacity -- Delete?
  PageviewAPI/DataStore
  PageviewAPI/RESTBase

  Unique Devices
  Unique Devices/Last access solution
  Unique clients/Last access solution/BotResearch
  Unique clients/Last access solution/Validation



Analytics/Data Lake/Edits
  Data Lake <-- Core page
  -- Remove Data Lake/Schemas/ below
  Data Lake/Schemas/Mediawiki history
  Data Lake/Schemas/Mediawiki page history
  Data Lake/Schemas/Mediawiki user history
  Data Lake/Schemas/Metric results


Analytics/Systems

  AQS
  AQS/Scaling
  AQS/Scaling/2016/Hardware Refresh
  AQS/Scaling/2017/Cluster Expansion
  AQS/Scaling/LoadTesting
  
  Archiva

  Cluster/Druid --> Druid
  Cluster/Druid/Load test --> Druid/Load test

  Cluster
  Cluster/Access
  Cluster/Beeline
  Cluster/Camus
  Cluster/Data Format Experiments
  Cluster/Deploy/Refinery
  Cluster/Deploy/Refinery-source
  Cluster/Deploy a fix to incorrect camus partitionning
  Cluster/Geotagging
  Cluster/Hadoop -- Update ??
  Cluster/Hadoop/Administration
  Cluster/Hadoop/Load
  Cluster/Hardware -- Update ??
  Cluster/Hive -- Update (tables)
  Cluster/Hive/Avro
  Cluster/Hive/Compression
  Cluster/Hive/Counting uniques
  Cluster/Hive/Mediawiki --> Empty - To delete?
  Cluster/Hive/Queries
  Cluster/Hive/Queries/Wikidata
  Cluster/Hive/QueryUsingUDF
  Cluster/Hive/Schemas -- Replaced by datalake dataset - Delete?
  Cluster/Kafka/Capacity
  Kafka Udp2log --> Cluster/Kafka/Kafka Udp2log
  Cluster/Logstash -- Really ???
  Cluster/MediaWiki Avro Logging -- Split to DataLake??
  Cluster/Oozie
  Cluster/Oozie/Administration
  Cluster/Ports
  Cluster/Puppet
  Cluster/Spark -- Update oozie part
  Geolocation --> Cluster/Geolocation

  Conferences
  Conferences/Apache Big Data Europe - November 2016

  Dashiki

  Datastores/Evaluation

  -- Update Data Lake/Pipeline to mediawiki history pipeline
  Data Lake/Pipeline/Data loading
  Data Lake/Pipeline/Denormalization and historification
  Data Lake/Pipeline/Page and user history reconstruction
  Data Lake/Pipeline/Page and user history reconstruction algorithm and optimizations
  Data Lake/Pipeline/Serving layer
  -- Similarly to the above, create a traffic pipeline

  EventLogging
  EventLogging/Administration
  EventLogging/Architecture
  EventLogging/Backfilling
  EventLogging/Data representations
  EventLogging/Data retention and auto-purging
  EventLogging/Monitoring
  EventLogging/New pipeline
  EventLogging/Outages
  EventLogging/Performance
  EventLogging/Publishing
  EventLogging/Sanitization vs Aggregation
  EventLogging/Sensitive Fields
  EventLogging/TestingOnBetaCluster

  EventStreams <-- In systems really??

  Wikimetrics
  Wikimetrics/Adding New Features
  Wikimetrics/Adding New Features/CentralAuth Cohorts
  Wikimetrics/Adding New Features/Tag Cohorts
  Wikimetrics/Global metrics

  Geowiki

  Reportupdater

  Siege

  Varnishkafka

  Vital Signs

  Wikistats

  piwik

OTHERS:

  DataRequests --> In FAQ? 
  DataResearch --> milimetric/DataResearch ?
  DataResearch/VisualEditor --> milimetric/DataResearch/VisualEditor ?
  Data Lake/Doc proposal -- Delete
  
  Datasets << Removed?

  FAQ
  MailingList - In FAQ ?

  Onboarding

  Oncall

  Team -- Move to main?
  Tier2 -- Move to main