
Data Catalog Application Evaluation Rubric

Revision as of 23:26, 17 December 2021 by imported>Razzi (→‎Cloud Services Evaluation)

Evaluating potential data catalogs for https://phabricator.wikimedia.org/T293643.

Read the Data-as-a-Service Execution plan here.


Requirements

Since this is a central and critical part of a healthy data culture, a multitude of requirements with complex inter-relationships factor into our decision. This section attempts to highlight the requirements that we think would most impact us over the first year that the data catalog is available to users.

  • Easily ingest metadata from various parts of our Shared Infrastructure
    • lineage: Airflow, Gobblin, custom ingestion (Flink, Spark)
    • tables and columns: Hive metastore, Druid, Cassandra, Elastic, custom jobs writing to HDFS (Spark; also relevant to our Airflow DAG strategy)
    • lots of other metadata: ownership, location, quality, automatic classifications. All tools have places to store these, but automation here is key, as it saves the souls of Data Stewards.
  • Authentication. Which of our many sign-on mechanisms are we going to use and does the tool support it?
  • Authorization. Are we going to need fine-grained control over data? Are we going to deploy Apache Ranger? Do we need to reflect the way we treat data in the metadata layer? Can we allow anyone to edit any metadata?
  • UX. We are not a mature, data-savvy organization. This will be a lot of people's first experience with our data landscape. We need to make it pleasant in order to deliver on our goal of making data a first-class citizen here at WMF.
  • Search. This is part of UX, but it is a major component of most of the candidates, and some of them differ noticeably here: Atlas, for example, reportedly has weak search. This makes it an important stand-alone consideration.
  • Speed of Ingestion. Do we need real-time updates to our metadata? Are we planning automated responses to certain changes in a way that would be easily centralized on top of the metadata catalog?
  • Privacy (data retention, transformations, compliance). One of the things we pride ourselves on is privacy: retaining only what we need, for only as long as we need it. An overall picture of how compliant we are could differ from our self-assessment, so this seems like an important consideration. More generally, we may need a high-level overview of the data we keep and how it meets or fails to meet a privacy budget.

This list means we're not focusing on some of the other aspects of data governance right now, such as spelling out policies, tracking compliance, and following strict processes. These seem possible to build on top of most of the solutions we're looking at.

General Considerations

  • Ingestion is going to be a big deal. For the parts of our data platform that do not expose metadata in a convenient way, we need to build custom metadata ingestion. Some of the candidate tools make this easier than others.
  • We should carefully survey the list of connectors. Everyone says "we have a flexible connector architecture", "the community builds lots of high quality connectors", etc., but when you look closer you might find poor support for something we need, such as the lack of support for Spark in Atlas.
  • Give extra points for tight integrations, like using DataHub as a lineage backend for Airflow.

Cloud Services Evaluation

We are currently evaluating potential data catalogs on data-catalog-evaluation.analytics.eqiad1.wikimedia.cloud.

See the notes on how the server is configured here.

Some of the data sources we'd like to simulate there:

  • mariadb
  • hive

If you have access to the analytics horizon project, access the host as follows:

ssh data-catalog-evaluation.analytics.eqiad1.wikimedia.cloud

Or if you haven't configured ssh to use the bastion proxy jump, use:

ssh -J bastion.wmcloud.org data-catalog-evaluation.analytics.eqiad1.wikimedia.cloud

If your username on the remote hosts differs from your local username, you'll need to specify your user:

ssh -J myusername@bastion.wmcloud.org myusername@data-catalog-evaluation.analytics.eqiad1.wikimedia.cloud
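
The proxy jump mentioned above can also be set up once in the ssh client configuration, so that the short form of the command works. A minimal sketch, assuming the bastion host shown above; myusername is a placeholder, and the host pattern is an assumption that may need adjusting for your setup:

```text
# ~/.ssh/config (sketch, not an official Wikimedia configuration)

# Route connections to Cloud VPS instances through the bastion shown above.
Host *.wikimedia.cloud
    ProxyJump bastion.wmcloud.org
    # Placeholder; only needed if it differs from your local username.
    User myusername

# The bastion itself is reached directly.
Host bastion.wmcloud.org
    User myusername
```

With an entry like this in place, plain ssh data-catalog-evaluation.analytics.eqiad1.wikimedia.cloud should go through the bastion without the -J option.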

We currently have the following web proxies configured:

https://data-catalog-evaluation.wmcloud.org/ - home page that lists the data catalog services

https://atlas-demo.wmcloud.org/ - Atlas running on port 21000

Candidates

Click on each header name to see the in-depth evaluation for each application as a separate article.

Name | Amundsen | Atlas | DataHub | Egeria | Marquez
Tagline | Open source data discovery and metadata engine | Atlas is a scalable and extensible set of core foundational governance services – enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allows integration with the whole enterprise data ecosystem. | The Metadata Platform for the Modern Data Stack | Open metadata and governance for enterprises - automatically capturing, managing and exchanging metadata between tools and platforms, no matter the vendor. | An open source metadata service for the collection, aggregation, and visualization of a data ecosystem’s metadata. It maintains the provenance of how datasets are consumed and produced, provides global visibility into job runtime and frequency of dataset access, centralization of dataset lifecycle management, and much more.
Release Date | 2019 | 2015 | 2019 | 2018 | 2018
Website | https://www.amundsen.io | https://atlas.apache.org | https://datahubproject.io | https://odpi.github.io/egeria-docs/ | https://marquezproject.github.io/marquez/
Repository | https://github.com/amundsen-io/amundsen | https://github.com/apache/atlas | https://github.com/linkedin/datahub | |
Author | Lyft | Authored by Hortonworks, managed by Apache | LinkedIn | LFAI | WeWork
License | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0
UX Very in depth UI Initially not considered, now basic exploration is available.
Robustness (criteria TBD)
Comment Sits on top of Atlas
  • Very flexible distributed deployment
  • Big names involved (IBM, ING, etc)
  • Good but almost too extensive documentation
  • Great community, but definitely more corporate, more in the Microsoft / IBM open source style
Risks
  • Dependency on other Kafka ecosystem tools like Schema Registry, KafkaStreams, etc. These may not be tightly coupled but LinkedIn doesn't have any incentive to stay away from the Confluent licenses that we can't use, so at any point we could run into a problem here.
  • Authorization seems to be just in the RFC phase, with no LDAP support in the first planned phase.
  • Really solid candidate, lots of stuff like "The OMAG Server Platform is a multi-tenant platform that supports horizontal scale-out in Kubernetes and yet is light enough to run as an edge server on a Raspberry Pi. This platform is used to host the actual metadata integration and automation capabilities."

Other Candidates

With reasons they were not more seriously considered:

Name | Metacat | Select Star | Dataverse | CKAN | Mediawiki
Tagline | Metacat is a unified metadata exploration API service. You can explore Hive, RDS, Teradata, Redshift, S3 and Cassandra. Metacat provides you information about what data you have, where it resides and how to process it. Metadata in the end is really data about the data. So the primary purpose of Metacat is to give a place to describe the data so that we could do more useful things with it. | Beyond a data catalog, Select Star is an intelligent data discovery platform that helps you understand your data. | Open source research data repository software | The world’s leading open source data management system | MediaWiki is a collaboration and documentation platform brought to you by a vibrant community.
Link | https://github.com/Netflix/metacat | https://www.selectstar.com/ | https://dataverse.org/ | https://ckan.org/ (Licensed GNU AGPL 3.0) | https://mediawiki.org
Disqualifying Reasons | Documentation is still in the "TODO" phase, no references to community or the kind of organization that Apache projects enjoy, and somewhat limited scope. | Closed source, useful for comparisons. | This is more of a research sharing tool, not used for generic data governance. See code at https://github.com/IQSS/dataverse and related documentation. | CKAN is meant to work at a very large scale: governments with multiple branches collaborating on data hubs. As such, most of the integrations are meant to be done manually, with only minimal automation support. Details can be found in their docs, but it doesn't seem to meet our requirements. It may be something to consider for something bigger, like an Open Knowledge Data Portal shared with our other open knowledge partners. | While building a metadata ingestion and storage layer on top of Mediawiki would be a fun side project, it is clear that the complexity of the competitors' UIs means this would only work if a lightweight, manual process was viable.


Still, high-level catalog information should be documented on wikitech.