
Analytics/Data Lake/Doc proposal: Difference between revisions

imported>Joal
(Update moving pages)
imported>Joal
The Analytics Data Lake (ADL) is a large, analytics-oriented repository of raw and aggregated data about Wikimedia projects (in industry terms, a [[:en:Data_lake|data lake]]). It currently includes pageview data and a beta set of historical editing data covering revisions, pages, and users. As the Data Lake matures, we will add all data that can safely be made public. The infrastructure will support public releases of datasets out of the box.
#REDIRECT [[Analytics/Doc proposal]]
 
== Moving pages proposal ==
<syntaxhighlight>
Analytics/Archives/
 
  2015 data warehouse experiments
  2015 data warehouse experiments/2014-12-02 verifications
  2015 data warehouse experiments/2015-01-14 verifications
  2015 data warehouse experiments/2015-02-03 verifications
 
  Cluster/ETL
  Cluster/Logging Solutions Overview
  Cluster/Logging Solutions Recommendation
  Cluster/Streaming -- Rename Hadoop Streaming for xml dumps?
  Cluster/Webrequest partitions
 
  Dashboards (archived)
 
  Data/Mobile requests stream
  Data/Webrequests sampled
  Data/Zero webrequests
 
  Data/Pagecounts-all-sites
  Data/Pagecounts-ez
  Data/Pagecounts-raw
 
  Kraken/Meetings
  Kraken/Meetings/ArchitectureReview
  Kraken/Meetings/SecurityReview
 
  Global-Dev Dashboard
 
  Limn
 
  Mingle
 
  Pageviews/Aggregation
 
  Pentaho
  Products
 
  TOC
 
  Webstatscollector
 
  Wikipedia Zero
 
  Wikistats2.0
  Wikistats2.0/Design
 
  gp.wmflabs.org
 
  statsv -- delete?
 
 
Analytics/Tutorials
 
  Dashboards
 
 
Analytics/Data Lake
  analytics.wikimedia.org
 
Analytics/Data Lake/Traffic
 
  Data access
 
  Data/Pageviews --> Pageviews
  Data/Redirects --> Pageviews/Redirects
  Bots --> Pageviews/Bots
 
  Data/UserRetention --> User Retention
 
  Datasets/  <-- core page?
    Cluster/BrowserReports --> Browser Reports + enhance
 
  Monitoring (to create)
 
  -- Remove data below
  Data/ApiAction
  Data/Browser general -- How does it relate to BrowserReports?
  Data/Cirrus -- Discuss with Discovery, delete?
  Data/Mediacounts
  Data/Pageview hourly
  Data/Pageview hourly/Fingerprinting Over Time
  Data/Pageview hourly/Identity reconstruction analysis
  Data/Pageview hourly/K Anonymity Threshold Analysis
  Data/Pageview hourly/Sanitization
  Data/Pageview hourly/Sanitization algorithm proposal
  Data/Projectview hourly
  Data/Unique Devices
  Data/Webrequest
  Data/Webrequest/RawIPUsage
  Data/mobile apps session metrics
 
  Pageviews
 
  PageviewAPI
  LegacyPageviewAPI
 
  PageviewAPI/Capacity - Delete ?
  PageviewAPI/DataStore
  PageviewAPI/RESTBase
 
  Unique Devices
  Unique Devices/Last access solution
  Unique clients/Last access solution/BotResearch
  Unique clients/Last access solution/Validation
 
 
 
Analytics/Data Lake/Edits
  Data Lake <-- Core page
  -- Remove Data Lake/Schemas/ below
  Data Lake/Schemas/Mediawiki history
  Data Lake/Schemas/Mediawiki page history
  Data Lake/Schemas/Mediawiki user history
  Data Lake/Schemas/Metric results
 
 
Analytics/Systems
 
  AQS
  AQS/Scaling
  AQS/Scaling/2016/Hardware Refresh
  AQS/Scaling/2017/Cluster Expansion
  AQS/Scaling/LoadTesting
 
  Archiva
 
  Cluster/Druid --> Druid
  Cluster/Druid/Load test --> Druid/Load test
 
  Cluster
  Cluster/Access
  Cluster/Beeline
  Cluster/Camus
  Cluster/Data Format Experiments
  Cluster/Deploy/Refinery
  Cluster/Deploy/Refinery-source
  Cluster/Deploy a fix to incorrect camus partitionning
  Cluster/Geotagging
  Cluster/Hadoop -- Update ??
  Cluster/Hadoop/Administration
  Cluster/Hadoop/Load
  Cluster/Hardware -- Update ??
  Cluster/Hive  --- Update (tables)
  Cluster/Hive/Avro
  Cluster/Hive/Compression
  Cluster/Hive/Counting uniques
  Cluster/Hive/Mediawiki --> Empty - To delete?
  Cluster/Hive/Queries
  Cluster/Hive/Queries/Wikidata
  Cluster/Hive/QueryUsingUDF
  Cluster/Hive/Schemas -- Replaced by datalake dataset - Delete?
  Cluster/Kafka/Capacity
  Kafka Udp2log --> Cluster/Kafka/Kafka Udp2log
  Cluster/Logstash -- Really ???
  Cluster/MediaWiki Avro Logging -- Split to DataLake??
  Cluster/Oozie
  Cluster/Oozie/Administration
  Cluster/Ports
  Cluster/Puppet
  Cluster/Spark -- Update oozie part
  Geolocation --> Cluster/Geolocation
 
  Conferences
  Conferences/Apache Big Data Europe - November 2016
 
  Dashiki
 
  Datastores/Evaluation
 
  -- Update Data Lake/Pipeline to mediawiki history pipeline
  Data Lake/Pipeline/Data loading
  Data Lake/Pipeline/Denormalization and historification
  Data Lake/Pipeline/Page and user history reconstruction
  Data Lake/Pipeline/Page and user history reconstruction algorithm and optimizations
  Data Lake/Pipeline/Serving layer
  -- Similarly to the above, create a traffic pipeline
 
  EventLogging
  EventLogging/Administration
  EventLogging/Architecture
  EventLogging/Backfilling
  EventLogging/Data representations
  EventLogging/Data retention and auto-purging
  EventLogging/Monitoring
  EventLogging/New pipeline
  EventLogging/Outages
  EventLogging/Performance
  EventLogging/Publishing
  EventLogging/Sanitization vs Aggregation
  EventLogging/Sensitive Fields
  EventLogging/TestingOnBetaCluster
 
  EventStreams <-- In systems really??
 
  Wikimetrics
  Wikimetrics/Adding New Features
  Wikimetrics/Adding New Features/CentralAuth Cohorts
  Wikimetrics/Adding New Features/Tag Cohorts
  Wikimetrics/Global metrics
 
  Geowiki
 
  Reportupdater
 
  Siege
 
  Varnishkafka
 
  Vital Signs
 
  Wikistats
 
  piwik
 
OTHERS:
 
  DataRequests --> In FAQ?
  DataResearch --> milimetric/DataResearch ?
  DataResearch/VisualEditor --> milimetric/DataResearch/VisualEditor ?
  Data Lake/Doc proposal -- Delete
 
  Datasets << Removed?
 
  FAQ
  MailingList - In FAQ ?
 
  Onboarding
 
  Oncall
 
  Team -- Move to main?
  Tier2 -- Move to main
</syntaxhighlight>

Revision as of 14:35, 7 April 2017