Difference between revisions of "Analytics Engineering"

From Wikitech-static
imported>ODimitrijevic
(ODimitrijevic moved page Analytics Engineering to Data Engineering)
 
The Wikimedia Foundation's '''Analytics Engineering team''' is part of the [[mw:Wikimedia Technology|Technology department]].
#REDIRECT [[Data Engineering]]
 
The Analytics Engineering Team's primary responsibility is to "empower and support data informed decision making across the Foundation and the Community".
 
We make Wikimedia-related data available for querying and analysis to both WMF and the different wiki communities and stakeholders.
 
We develop infrastructure so all our users, both within the Foundation and within the different communities, can access data in a self-service fashion that is consistent with the values of the movement.
 
We keep all our documentation here on Wikitech. See also [[Analytics/FAQ|this FAQ]].
== About us - [[Analytics/Team]] ==
 
=== Contact ===
If you have questions about our work or the infrastructure we provide, you can contact us in any of the following ways:
 
* on our public mailing list, [mailto:analytics@lists.wikimedia.org '''analytics@lists.wikimedia.org'''] ([[mail:analytics|subscribe, archives]])
* in our public [[meta:IRC|IRC channel]], {{Irc|wikimedia-analytics}}. You can use the keyword ''a-team'' to ping us, so we notice your question.
* during our [[Analytics/Team/Office_Hours|office hours]], which we have hosted on the second Monday of every month since 14 January 2019. [https://calendar.google.com/event?action=TEMPLATE&tmeid=M2Q4NXUxdDRoMGg3Z3RqcG9qNDlwbHU2bmdfMjAxOTAxMTRUMTUwMDAwWiB3aWtpbWVkaWEub3JnX2NiMzdtdTQ4Y25odGQ3aHJuYThzMjdvbmFvQGc&tmsrc=wikimedia.org_cb37mu48cnhtd7hrna8s27onao%40group.calendar.google.com&scp=ALL Add to your calendar], or let us know if that time is too early for you, and we can hold a second session when needed.
 
=== Work organization ===
The Analytics team uses '''Phabricator''' to track its projects:
* https://phabricator.wikimedia.org/tag/analytics/ for '''backlog''' triage
* https://phabricator.wikimedia.org/tag/analytics-kanban/ for '''in progress''' tasks
 
=== Prioritization ===
 
== Datasets ==
 
* [[Analytics/Data_Lake/Traffic/Webrequest|Webrequests]] [Traffic logs] and [[Analytics/Data_Lake/Traffic/Webrequest#Hive_tables_derived_from_webrequest|derived tables]], including:
** [[Analytics/Pageviews|Pageviews]] [Filtered traffic logs] [TODO - Revamp and add various systems and key differences in schema and usage]
** [[Analytics/Data_Lake/Traffic/Interlanguage|Inter-language]] [Traffic between different languages of the same project family]
** [[Analytics/Data_Lake/Traffic/Unique_Devices|Unique Devices]] [Estimates of unique devices at the project or project family level]
* MediaWiki raw databases
* EventLogging (in the event database in Hive)
* [[Analytics/Data Lake/Edits/Mediawiki history|Edits history]], [[Analytics/Data Lake/Edits/Mediawiki page history|Page history]], [[Analytics/Data Lake/Edits/Mediawiki user history|User history]]
* Other reports
* [[metawiki:Research:Wikipedia_clickstream#Releases|Clickstream]]
 
== [[Analytics/Systems|Systems]] ==
We maintain various systems that allow our datasets to be queried in different ways.
{| class="wikitable sortable"
!System name and link
!Type
![[Analytics/Data access|Accessibility]]
|-
|[[Analytics/Systems/Archiva|Archiva]]
|Repository for Java archives
|Private
|-
|[[Analytics/Systems/AQS|AQS - <u>A</u>nalytics <u>Q</u>uery <u>S</u>ervice]]
|REST API for analytics data
|Public
|-
|[[Analytics/Systems/Clients|Clients (stat100X)]]
|Analytics client nodes to access Hadoop and various services
|Private
|-
|[[Analytics/Systems/Cluster|Cluster (Hadoop, Gobblin, Hive, Oozie, Spark...)]]
|Hadoop
|Private
|-
|[[Analytics/Systems/Dashiki|Dashiki]]
|Framework for building dashboards
|Public
|-
|[[Analytics/Systems/Druid|Druid]]
|Data storage engine optimized for exploratory analytics
|Private
|-
|[[Analytics/Systems/EventLogging|EventLogging]]
|Ad-hoc streaming pipeline
|Private
|-
|[[Analytics/Systems/EventStreams|EventStreams]]
|MediaWiki event streams
|Public
|-
|[[Analytics/Cluster/Hue|Hue]]
|Web interface for Hive, Oozie, and other Cluster services
|Private
|-
|[[Kafka]]
|Data transport and streaming system
|Private
|-
|[[Analytics/Systems/MariaDB|MariaDB]]
|Data storage for MediaWiki replicas and EventLogging
|Private
|-
|[[Analytics/Systems/Piwik|Matomo]] (formerly known as Piwik)
|Small-scale web analytics platform
|Private
|-
|[[Analytics/Systems/Reportupdater|ReportUpdater]]
|Job Scheduler
|Private
|-
|[[Analytics/Systems/Superset|Superset]]
|Web interface for data visualization and exploration
|Private
|-
|[[Analytics/Systems/Jupyter|Jupyter]]
|Hosted notebooks for data analysis
|Private
|-
|[[Analytics/Systems/Turnilo-Pivot|Turnilo]]
|Web interface for exploring data stored in Druid
|Private
|-
|[[Analytics/Systems/Wikistats|Wikistats]] (1 and 2)
|Community Dashboard with high-level metrics
|Public
|}
The list of scheduled manual maintenance tasks is documented [[Analytics/Systems/Manual maintenance|here]].
 
== Try it out! [[Analytics/Tutorials]] ==
We'd love for you to have fun with our data :)
 
Please check the link above for something that might help you, and let us know if you don't find what you're after.
 
== Table of Contents ==
 
Go to the [[Analytics/TOC]] page for a list of all pages under Analytics.

Latest revision as of 20:24, 22 October 2021

Redirect to: