Analytics/Data Lake/Doc proposal: Difference between revisions
imported>Milimetric m (No edit summary)
imported>Joal (Update moving pages)
Revision as of 16:38, 4 April 2017
The Analytics Data Lake (ADL) is a large, analytics-oriented repository of data, both raw and aggregated, about Wikimedia projects (in industry terms, a data lake). It currently includes data on pageviews and a beta set of historical data about editing, including revisions, pages, and users. As the Data Lake matures, we will add any and all data that can be safely made public, and the infrastructure will support public releases of datasets out of the box.
Moving pages proposal
Analytics/Archives/
2015 data warehouse experiments
2015 data warehouse experiments/2014-12-02 verifications
2015 data warehouse experiments/2015-01-14 verifications
2015 data warehouse experiments/2015-02-03 verifications
Cluster/ETL
Cluster/Logging Solutions Overview
Cluster/Logging Solutions Recommendation
Cluster/Streaming -- Rename to "Hadoop Streaming for XML dumps"?
Cluster/Webrequest partitions
Dashboards (archived)
Data/Mobile requests stream
Data/Webrequests sampled
Data/Zero webrequests
Data/Pagecounts-all-sites
Data/Pagecounts-ez
Data/Pagecounts-raw
Kraken/Meetings
Kraken/Meetings/ArchitectureReview
Kraken/Meetings/SecurityReview
Global-Dev Dashboard
Limn
Mingle
Pageviews/Aggregation
Pentaho
Products
TOC
Webstatscollector
Wikipedia Zero
Wikistats2.0
Wikistats2.0/Design
gp.wmflabs.org
statsv -- delete?
Analytics/Tutorials
Dashboards
Analytics/Data Lake
analytics.wikimedia.org
Analytics/Data Lake/Traffic
Data access
Data/Pageviews --> Pageviews
Data/Redirects --> Pageviews/Redirects
Bots --> Pageviews/Bots
Data/UserRetention --> User Retention
Datasets/ <-- core page?
Cluster/BrowserReports --> Browser Reports + enhance
Monitoring (to create)
-- Remove data below
Data/ApiAction
Data/Browser general -- How does it relate to BrowserReports?
Data/Cirrus -- Discuss with Discovery, delete?
Data/Mediacounts
Data/Pageview hourly
Data/Pageview hourly/Fingerprinting Over Time
Data/Pageview hourly/Identity reconstruction analysis
Data/Pageview hourly/K Anonymity Threshold Analysis
Data/Pageview hourly/Sanitization
Data/Pageview hourly/Sanitization algorithm proposal
Data/Projectview hourly
Data/Unique Devices
Data/Webrequest
Data/Webrequest/RawIPUsage
Data/mobile apps session metrics
Pageviews
PageviewAPI
LegacyPageviewAPI
PageviewAPI/Capacity - Delete ?
PageviewAPI/DataStore
PageviewAPI/RESTBase
Unique Devices
Unique Devices/Last access solution
Unique clients/Last access solution/BotResearch
Unique clients/Last access solution/Validation
Analytics/Data Lake/Edits
Data Lake <-- Core page
-- Remove Data Lake/Schemas/ below
Data Lake/Schemas/Mediawiki history
Data Lake/Schemas/Mediawiki page history
Data Lake/Schemas/Mediawiki user history
Data Lake/Schemas/Metric results
Analytics/Systems
AQS
AQS/Scaling
AQS/Scaling/2016/Hardware Refresh
AQS/Scaling/2017/Cluster Expansion
AQS/Scaling/LoadTesting
Archiva
Cluster/Druid --> Druid
Cluster/Druid/Load test --> Druid/Load test
Cluster
Cluster/Access
Cluster/Beeline
Cluster/Camus
Cluster/Data Format Experiments
Cluster/Deploy/Refinery
Cluster/Deploy/Refinery-source
Cluster/Deploy a fix to incorrect camus partitionning
Cluster/Geotagging
Cluster/Hadoop -- Update ??
Cluster/Hadoop/Administration
Cluster/Hadoop/Load
Cluster/Hardware -- Update ??
Cluster/Hive --- Update (tables)
Cluster/Hive/Avro
Cluster/Hive/Compression
Cluster/Hive/Counting uniques
Cluster/Hive/Mediawiki --> Empty - To delete?
Cluster/Hive/Queries
Cluster/Hive/Queries/Wikidata
Cluster/Hive/QueryUsingUDF
Cluster/Hive/Schemas -- Replaced by Data Lake datasets - Delete?
Cluster/Kafka/Capacity
Kafka Udp2log --> Cluster/Kafka/Kafka Udp2log
Cluster/Logstash -- Really ???
Cluster/MediaWiki Avro Logging -- Split to DataLake??
Cluster/Oozie
Cluster/Oozie/Administration
Cluster/Ports
Cluster/Puppet
Cluster/Spark -- Update oozie part
Geolocation --> Cluster/Geolocation
Conferences
Conferences/Apache Big Data Europe - November 2016
Dashiki
Datastores/Evaluation
-- Update Data Lake/Pipeline to mediawiki history pipeline
Data Lake/Pipeline/Data loading
Data Lake/Pipeline/Denormalization and historification
Data Lake/Pipeline/Page and user history reconstruction
Data Lake/Pipeline/Page and user history reconstruction algorithm and optimizations
Data Lake/Pipeline/Serving layer
-- As above, create a traffic pipeline
EventLogging
EventLogging/Administration
EventLogging/Architecture
EventLogging/Backfilling
EventLogging/Data representations
EventLogging/Data retention and auto-purging
EventLogging/Monitoring
EventLogging/New pipeline
EventLogging/Outages
EventLogging/Performance
EventLogging/Publishing
EventLogging/Sanitization vs Aggregation
EventLogging/Sensitive Fields
EventLogging/TestingOnBetaCluster
EventStreams <-- In systems really??
Wikimetrics
Wikimetrics/Adding New Features
Wikimetrics/Adding New Features/CentralAuth Cohorts
Wikimetrics/Adding New Features/Tag Cohorts
Wikimetrics/Global metrics
Geowiki
Reportupdater
Siege
Varnishkafka
Vital Signs
Wikistats
piwik
OTHERS:
DataRequests --> In FAQ?
DataResearch --> milimetric/DataResearch ?
DataResearch/VisualEditor --> milimetric/DataResearch/VisualEditor ?
Data Lake/Doc proposal -- Delete
Datasets << Removed?
FAQ
MailingList - In FAQ ?
Onboarding
Oncall
Team -- Move to main?
Tier2 -- Move to main