
Difference between revisions of "Data Engineering"

imported>ODimitrijevic
imported>ODimitrijevic
 
Line 1: Line 1:
The Wikimedia Foundation's '''Data Engineering team''' is part of the [[mw:Wikimedia Technology|Technology department]].


We provide the Wikimedia analytics data platform, making Wikimedia related data available for querying and analysis to both WMF and the different Wiki communities and stakeholders.  
We provide the Wikimedia analytics data platform, making Wikimedia related data available for querying and analysis to both WMF and the different Wiki communities and stakeholders. We develop infrastructure so all our users can access data in a self-service fashion that is consistent with the values of the movement.
 
We develop infrastructure so all our users, both within the Foundation and within the different communities, can access data in a self-service fashion that is consistent with the values of the movement.


We keep all our documentation here on Wikitech. See also [[Analytics/FAQ|this FAQ]].  
== About us - [[Analytics/Team]] ==
== About us - [[Data Engineering/Team]] ==


=== Our Mission ===
Line 17: Line 15:
*on our public mailing list, [Mailto:analytics@lists.wikimedia.org '''analytics@lists.wikimedia.org'''] ([[mail:analytics|subscribe, archives]])
*in our public [[meta:IRC|IRC channel]], {{Irc|wikimedia-analytics}}. You can use the keyword ''a-team'' to ping us, so we notice your question.
*during our [[Analytics/Team/Office_Hours|office hours]], which we host as of 2019, January 14th on the second Monday of every month.  [https://calendar.google.com/event?action=TEMPLATE&tmeid=M2Q4NXUxdDRoMGg3Z3RqcG9qNDlwbHU2bmdfMjAxOTAxMTRUMTUwMDAwWiB3aWtpbWVkaWEub3JnX2NiMzdtdTQ4Y25odGQ3aHJuYThzMjdvbmFvQGc&tmsrc=wikimedia.org_cb37mu48cnhtd7hrna8s27onao%40group.calendar.google.com&scp=ALL Add to your calendar] or let us know if that time is too early for you, and we can hold a second session when needed.


===Work organization===

Latest revision as of 19:26, 12 November 2021

The Wikimedia Foundation's Data Engineering team is part of the Technology department.

We provide the Wikimedia analytics data platform, making Wikimedia-related data available for querying and analysis to the WMF as well as to the various wiki communities and stakeholders. We develop infrastructure so all our users can access data in a self-service fashion that is consistent with the values of the movement.

We keep all our documentation here on Wikitech. See also this FAQ.

About us - Data Engineering/Team

Our Mission

Our team provides a self-service, privacy-aware data platform that empowers people to gain data-driven insights and build better product experiences for Wikimedia communities.

Contact

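If you have questions about our work or the infrastructure we provide, you can contact us in two ways:

* on our public mailing list, analytics@lists.wikimedia.org (subscribe, archives)
* in our public IRC channel, #wikimedia-analytics. You can use the keyword a-team to ping us, so we notice your question.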

Work organization

The analytics team uses Phabricator to track its projects.

Datasets

Systems

We maintain the big data platform including the data lake, ingestion and processing pipelines, as well as a number of systems to explore and visualize the data.

File:WMF Analytics Data Platform 2021 v1.png

System name and link | Type | Accessibility
Archiva | Repository for Java archives | Private
AQS - Analytics Query Service | REST API for analytics data | Public
Clients (stat100X) | Analytics client nodes to access Hadoop and various services | Private
Cluster (Hadoop, Gobblin, Hive, Oozie, Spark...) | Hadoop | Private
Dashiki | Framework for building dashboards | Public
Druid | Data storage engine optimized for exploratory analytics | Private
EventLogging | Ad-hoc streaming pipeline | Private
EventStreams | MediaWiki event streams | Public
Hue | Web interface for Hive, Oozie, and other Cluster services | Private
Kafka | Data transport and streaming system | Private
MariaDB | Data storage for MediaWiki replicas and EventLogging | Private
Matomo (formerly known as Piwik) | Small-scale web analytics platform | Private
Presto | High-performance SQL query engine for big data | Private
ReportUpdater | Job scheduler | Private
Superset | Web interface for data visualization and exploration | Private
Jupyter | Hosted notebooks for data analysis | Private
Turnilo | Web interface for exploring data stored in Druid | Private
Wikistats (1 and 2) | Community dashboard with high-level metrics | Public
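
As an illustration of what "Public" means for AQS in the table above, here is a minimal Python sketch that builds a request URL for per-article pageview counts. This page does not spell out the API, so the base URL, the path layout, and the default parameter values are assumptions based on the public Wikimedia REST API; the helper names are hypothetical. Check the AQS documentation before relying on any of it.

```python
import json
import urllib.request
from urllib.parse import quote

# Assumption: AQS metrics are served under the public Wikimedia REST API.
AQS_BASE = "https://wikimedia.org/api/rest_v1/metrics"


def per_article_pageviews_url(project, article, start, end,
                              access="all-access", agent="user",
                              granularity="daily"):
    """Build a per-article pageviews URL (hypothetical helper).

    start and end are dates in YYYYMMDD form; spaces in the article
    title become underscores and the title is percent-encoded.
    """
    title = quote(article.replace(" ", "_"), safe="")
    return (f"{AQS_BASE}/pageviews/per-article/{project}/{access}/"
            f"{agent}/{title}/{granularity}/{start}/{end}")


def fetch_json(url, user_agent="example-script (you@example.org)"):
    """Fetch and decode a JSON response (requires network access).

    The User-Agent value is a placeholder; replace it with your own
    contact details.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, per_article_pageviews_url("en.wikipedia", "Albert Einstein", "20211101", "20211107") builds the URL for one week of daily counts, and passing that URL to fetch_json retrieves the data. Only the URL builder runs offline.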

The list of scheduled manual maintenance tasks is documented here.


Try it out! Analytics/Tutorials

We'd love for you to have fun with our data :)

Please check the link above for something that might help you, and let us know if you don't find what you're after.

Table of Contents

See the Analytics/TOC page for a list of all the pages we have under Analytics.