Difference between revisions of "Data Catalog Application Evaluation Rubric"

imported>Milimetric
imported>Razzi
(Move mediawiki to "other options")
Line 10: Line 10:
! [[Data Catalog Application Evaluation Rubric/Amundsen|Amundsen]]!! [[Data Catalog Application Evaluation Rubric/Atlas|Atlas]]!! [[Data Catalog Application Evaluation Rubric/DataHub|DataHub]]
! [[Data Catalog Application Evaluation Rubric/Amundsen|Amundsen]]!! [[Data Catalog Application Evaluation Rubric/Atlas|Atlas]]!! [[Data Catalog Application Evaluation Rubric/DataHub|DataHub]]
![[Data Catalog Application Evaluation Rubric/Egeria|Egeria]]
![[Data Catalog Application Evaluation Rubric/Egeria|Egeria]]
![[Data Catalog Application Evaluation Rubric/Marquez|Marquez]]!! [[Data Catalog Application Evaluation Rubric/Mediawiki|Mediawiki]]
![[Data Catalog Application Evaluation Rubric/Marquez|Marquez]]
|-
|-
|Tagline
|Tagline
Line 18: Line 18:
|Open metadata and governance for enterprises - automatically capturing, managing and exchanging metadata between tools and platforms, no matter the vendor.
|Open metadata and governance for enterprises - automatically capturing, managing and exchanging metadata between tools and platforms, no matter the vendor.
|An open source '''metadata service''' for the '''collection''', '''aggregation''', and '''visualization''' of a data ecosystem’s metadata. It maintains the provenance of how datasets are consumed and produced, provides global visibility into job runtime and frequency of dataset access, centralization of dataset lifecycle management, and much more.
|An open source '''metadata service''' for the '''collection''', '''aggregation''', and '''visualization''' of a data ecosystem’s metadata. It maintains the provenance of how datasets are consumed and produced, provides global visibility into job runtime and frequency of dataset access, centralization of dataset lifecycle management, and much more.
|MediaWiki is a collaboration and documentation platform brought to you by a vibrant community.
|-
|-
|Release Date
|Release Date
Line 26: Line 25:
|2018
|2018
|2018
|2018
|2003
|-
|-
| Website || [https://www.amundsen.io/ https://www.amundsen.io] || https://atlas.apache.org ||[https://datahubproject.io/ https://datahubproject.io]
| Website || [https://www.amundsen.io/ https://www.amundsen.io] || https://atlas.apache.org ||[https://datahubproject.io/ https://datahubproject.io]
|https://odpi.github.io/egeria-docs/
|https://odpi.github.io/egeria-docs/
|https://marquezproject.github.io/marquez/
|https://marquezproject.github.io/marquez/
|https://mediawiki.org
|-
|-
|Repository
|Repository
Line 37: Line 34:
|https://github.com/apache/atlas
|https://github.com/apache/atlas
|https://github.com/linkedin/datahub
|https://github.com/linkedin/datahub
|
|
|
|
|
Line 47: Line 43:
|LFAI
|LFAI
|WeWork
|WeWork
|Wikimedia
|-
|-
| License || Apache 2.0|| Apache 2.0||Apache 2.0
| License || Apache 2.0|| Apache 2.0||Apache 2.0
|Apache 2.0
|Apache 2.0
|Apache 2.0
|Apache 2.0
|
|-
|-
| UX || || ||Very in depth UI
| UX || || ||Very in depth UI
|
|
|
|
|
|-
|-
| Robustness (criteria TBD)
| Robustness (criteria TBD)
|
|
|
|
|
Line 79: Line 71:
* Requires PostgreSQL
* Requires PostgreSQL
* uses [https://openlineage.io/ OpenLineage]
* uses [https://openlineage.io/ OpenLineage]
|The dogfooding approach: run our data catalog on wikitech. Mediawiki by itself could be used and manually updated, and any programmatic data access could be accomplished using mediawiki extensions.
|-
|-
|Risks
|Risks
Line 89: Line 80:
|
|
* Really solid candidate, lots of stuff like "The <abbr>OMAG</abbr> Server Platform is a multi-tenant platform that supports horizontal scale-out in Kubernetes and yet is light enough to run as an edge server on a Raspberry Pi. This platform is used to host the actual metadata integration and automation capabilities."
* Really solid candidate, lots of stuff like "The <abbr>OMAG</abbr> Server Platform is a multi-tenant platform that supports horizontal scale-out in Kubernetes and yet is light enough to run as an edge server on a Raspberry Pi. This platform is used to host the actual metadata integration and automation capabilities."
|
|
|
|}
|}
Line 109: Line 99:
![[Data Catalog Application Evaluation Rubric/Dataverse|Dataverse]]
![[Data Catalog Application Evaluation Rubric/Dataverse|Dataverse]]
![[Data Catalog Application Evaluation Rubric/CKAN|CKAN]]
![[Data Catalog Application Evaluation Rubric/CKAN|CKAN]]
![[Data Catalog Application Evaluation Rubric/Mediawiki|Mediawiki]]
|-
|-
|Tagline
|Tagline
Line 115: Line 106:
|Open source research data repository software
|Open source research data repository software
|The world’s leading open source data management system
|The world’s leading open source data management system
|MediaWiki is a collaboration and documentation platform brought to you by a vibrant community.
|-
|-
|Link
|Link
Line 121: Line 113:
|https://dataverse.org/
|https://dataverse.org/
|https://ckan.org/ (Licensed GNU AGPL 3.0)
|https://ckan.org/ (Licensed GNU AGPL 3.0)
|https://mediawiki.org
|-
|-
|Disqualifying Reasons
|Disqualifying Reasons
Line 127: Line 120:
|This is more of a research sharing tool, not used for generic data governance.  See code at https://github.com/IQSS/dataverse and related documentation.
|This is more of a research sharing tool, not used for generic data governance.  See code at https://github.com/IQSS/dataverse and related documentation.
|CKAN is meant to work at a very large scale, governments with multiple branches collaborating on data hubs.  As such, most of the integrations are meant to be done manually, with only minimal automation support.  Details can be found in their docs, but it doesn't seem to meet our requirements, it's maybe something to consider for something bigger like an Open Knowledge Data Portal shared with our other open knowledge partners.
|CKAN is meant to work at a very large scale, governments with multiple branches collaborating on data hubs.  As such, most of the integrations are meant to be done manually, with only minimal automation support.  Details can be found in their docs, but it doesn't seem to meet our requirements, it's maybe something to consider for something bigger like an Open Knowledge Data Portal shared with our other open knowledge partners.
|While building a metadata ingestion and storage layer on top of Mediawiki would be a fun side project, it is clear that the complexity of the competitors' UIs means this would only work if a lightweight, manual process was viable.
Still, high level catalog information should be documented on wikitech.
|}
|}

Revision as of 09:12, 1 December 2021

Evaluating potential data catalogs for https://phabricator.wikimedia.org/T293643.

Read the Data-as-a-Service Execution plan here.

Click on each header name to see the in-depth evaluation for each application as a separate article.

{| class="wikitable"
! Name !! Amundsen !! Atlas !! DataHub !! Egeria !! Marquez
|-
| Tagline
| Open source data discovery and metadata engine
| Atlas is a scalable and extensible set of core foundational governance services – enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allows integration with the whole enterprise data ecosystem.
| The Metadata Platform for the Modern Data Stack
| Open metadata and governance for enterprises - automatically capturing, managing and exchanging metadata between tools and platforms, no matter the vendor.
| An open source metadata service for the collection, aggregation, and visualization of a data ecosystem’s metadata. It maintains the provenance of how datasets are consumed and produced, provides global visibility into job runtime and frequency of dataset access, centralization of dataset lifecycle management, and much more.
|-
| Release Date || 2019 || 2015 || 2019 || 2018 || 2018
|-
| Website || https://www.amundsen.io || https://atlas.apache.org || https://datahubproject.io || https://odpi.github.io/egeria-docs/ || https://marquezproject.github.io/marquez/
|-
| Repository || https://github.com/amundsen-io/amundsen || https://github.com/apache/atlas || https://github.com/linkedin/datahub || ||
|-
| Author || Lyft || Authored by Hortonworks, managed by Apache || LinkedIn || LFAI || WeWork
|-
| License || Apache 2.0 || Apache 2.0 || Apache 2.0 || Apache 2.0 || Apache 2.0
|-
| UX || || || Very in depth UI || ||
|-
| Robustness (criteria TBD) || || || || ||
|}
Comment Sits on top of Atlas
Risks
  • Dependency on other Kafka ecosystem tools like Schema Registry, KafkaStreams, etc. These may not be tightly coupled but LinkedIn doesn't have any incentive to stay away from the Confluent licenses that we can't use, so at any point we could run into a problem here.
  • Authorization seems to be just in the RFC phase, with no LDAP support in the first planned phase.
  • Really solid candidate, lots of stuff like "The OMAG Server Platform is a multi-tenant platform that supports horizontal scale-out in Kubernetes and yet is light enough to run as an edge server on a Raspberry Pi. This platform is used to host the actual metadata integration and automation capabilities."

General Considerations

  • Ingestion is going to be a big deal. For the parts of our data platform that do not expose metadata in a convenient way, we need to build custom metadata ingestion. Some of the tools above make this easier than others.
  • We should carefully survey the list of connectors. Everyone says "we have a flexible connector architecture", "the community builds lots of high quality connectors", etc., but a closer look may reveal poor support for something we need, such as the lack of Spark support in Atlas.
  • Give extra points for tight integrations, such as using DataHub as a lineage backend for Airflow.

Other Candidates

With reasons they were not more seriously considered:

{| class="wikitable"
! Name !! Metacat !! Select Star !! Dataverse !! CKAN !! Mediawiki
|-
| Tagline
| Metacat is a unified metadata exploration API service. You can explore Hive, RDS, Teradata, Redshift, S3 and Cassandra. Metacat provides you information about what data you have, where it resides and how to process it. Metadata in the end is really data about the data. So the primary purpose of Metacat is to give a place to describe the data so that we could do more useful things with it.
| Beyond a data catalog, Select Star is an intelligent data discovery platform that helps you understand your data.
| Open source research data repository software
| The world’s leading open source data management system
| MediaWiki is a collaboration and documentation platform brought to you by a vibrant community.
|-
| Link
| https://github.com/Netflix/metacat
| https://www.selectstar.com/
| https://dataverse.org/
| https://ckan.org/ (Licensed GNU AGPL 3.0)
| https://mediawiki.org
|-
| Disqualifying Reasons
| Documentation is still in the "TODO" phase, no references to community or the kind of organization that Apache projects enjoy, and somewhat limited scope.
| Closed Source, useful for comparisons
| This is more of a research sharing tool, not used for generic data governance. See code at https://github.com/IQSS/dataverse and related documentation.
| CKAN is meant to work at a very large scale, governments with multiple branches collaborating on data hubs. As such, most of the integrations are meant to be done manually, with only minimal automation support. Details can be found in their docs, but it doesn't seem to meet our requirements, it's maybe something to consider for something bigger like an Open Knowledge Data Portal shared with our other open knowledge partners.
| While building a metadata ingestion and storage layer on top of Mediawiki would be a fun side project, it is clear that the complexity of the competitors' UIs means this would only work if a lightweight, manual process was viable.
Still, high level catalog information should be documented on wikitech.
|}