
Difference between revisions of "Data Catalog Application Evaluation Rubric"

imported>Milimetric
imported>Razzi
Line 6: Line 6:
|-
|-
!'''''Name'''''  
!'''''Name'''''  
! Amundsen !! Atlas !! DataHub !! Mediawiki
! Amundsen !! Atlas !! DataHub  
!Egeria
!Marquez!! Mediawiki
|-
|-
|Tagline
|Tagline
Line 12: Line 14:
|Atlas is a scalable and extensible set of core foundational governance services – enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allows integration with the whole enterprise data ecosystem.
|Atlas is a scalable and extensible set of core foundational governance services – enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allows integration with the whole enterprise data ecosystem.
|The Metadata Platform for the Modern Data Stack
|The Metadata Platform for the Modern Data Stack
|Open metadata and governance for enterprises - automatically capturing, managing and exchanging metadata between tools and platforms, no matter the vendor.
|An open source '''metadata service''' for the '''collection''', '''aggregation''', and '''visualization''' of a data ecosystem’s metadata. It maintains the provenance of how datasets are consumed and produced, provides global visibility into job runtime and frequency of dataset access, centralization of dataset lifecycle management, and much more.
|MediaWiki is a collaboration and documentation platform brought to you by a vibrant community.
|MediaWiki is a collaboration and documentation platform brought to you by a vibrant community.
|-
|Wikipedia Page
|
|
|
|
|-
|-
|Release Date
|Release Date
|
|2019
|
|2015
|
|2019
|
|2018
|2018
|2003
|-
|-
| Website || [https://www.amundsen.io/ https://www.amundsen.io] || https://atlas.apache.org ||[https://datahubproject.io/ https://datahubproject.io]
| Website || [https://www.amundsen.io/ https://www.amundsen.io] || https://atlas.apache.org ||[https://datahubproject.io/ https://datahubproject.io]
|https://odpi.github.io/egeria-docs/
|https://marquezproject.github.io/marquez/
|https://mediawiki.org
|https://mediawiki.org
|-
|-
|Author
|Repository
|https://github.com/amundsen-io/amundsen
|https://github.com/apache/atlas
|https://github.com/linkedin/datahub
|
|
|
|
|
|
|-
|Author
|Lyft
|Authored by Hortonworks, managed by Apache
|LinkedIn
|LFAI
|WeWork
|Wikimedia
|-
| License || Apache 2.0|| Apache 2.0||Apache 2.0
|Apache 2.0
|Apache 2.0
|
|
|-
|-
|Owner
| UX || || ||Very in depth UI
|
|
|
|
|
|
|
|-
|-
| License || || ||
| Robustness (criteria TBD)
|
|
|-
| UX || || ||
|
|
|-
| Robustness (criteria TBD)
|
|
|
|
Line 54: Line 66:
|-
|-
|Comment
|Comment
|Sits on top of Atlas
|
|
|
* [https://datahubproject.io/docs/slack Friendly community]
* [https://datahubproject.io/docs/townhall-history Well organized town halls]
* [https://datahubproject.io/docs/rfc/active/2042-graphql_frontend/queries RFC process]
* [https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained "third generation"] (they argue that event-sourcing metadata is key)
|
|
|
|
* Requires PostgreSQL
* uses [https://openlineage.io/ OpenLineage]
|The dogfooding approach: run our data catalog on wikitech. Mediawiki by itself could be used and manually updated, and any programmatic data access could be accomplished using mediawiki extensions.
|The dogfooding approach: run our data catalog on wikitech. Mediawiki by itself could be used and manually updated, and any programmatic data access could be accomplished using mediawiki extensions.
|}
|}
Some other candidates with reasons they were not more seriously considered:
 
=== General Considerations ===
 
* Ingestion is going to be a big deal.  For the parts of our data platform that do not expose metadata in a convenient way, we need to build custom metadata ingestion.  Some of the tools above make this easier than others.
* We should carefully survey the list of connectors.  Everyone says "we have a flexible connector architecture", "the community builds lots of high quality connectors", etc, but when you look closer you might find poor support for something we need.  Like this [http://mail-archives.apache.org/mod_mbox/atlas-user/202110.mbox/browser lack of support for Spark in Atlas].
*Give extra points for tight integrations like [https://datahubproject.io/docs/metadata-ingestion#setting-up-airflow-to-use-datahub-as-lineage-backend using DataHub as a Lineage backend for AirFlow].
*
 
=== Other Candidates ===
With reasons they were not more seriously considered:
{| class="wikitable"
{| class="wikitable"
|+
|+
!''Name''
!''Name''
!Metacat
!Metacat
!Select Star
|-
|-
|Tagline
|Tagline
|Metacat is a unified metadata exploration API service. You can explore Hive, RDS, Teradata, Redshift, S3 and Cassandra. Metacat provides you information about what data you have, where it resides and how to process it. Metadata in the end is really data about the data. So the primary purpose of Metacat is to give a place to describe the data so that we could do more useful things with it.
|Metacat is a unified metadata exploration API service. You can explore Hive, RDS, Teradata, Redshift, S3 and Cassandra. Metacat provides you information about what data you have, where it resides and how to process it. Metadata in the end is really data about the data. So the primary purpose of Metacat is to give a place to describe the data so that we could do more useful things with it.
|Beyond a data catalog, Select Star is an intelligent data discovery platform that helps you understand your data.
|-
|-
|Link
|Link
|https://github.com/Netflix/metacat
|https://github.com/Netflix/metacat
|https://www.selectstar.com/
|-
|-
|Disqualifying Reasons
|Disqualifying Reasons
|Documentation is still in the "TODO" phase, no references to community or the kind of organization that Apache projects enjoy, and somewhat limited scope.
|Documentation is still in the "TODO" phase, no references to community or the kind of organization that Apache projects enjoy, and somewhat limited scope.
|Closed Source, useful for comparisons
|}
|}

Revision as of 20:01, 16 November 2021

Evaluating potential data catalogs for https://phabricator.wikimedia.org/T293643.

Read the Data-as-a-Service Execution plan here.

{| class="wikitable"
|-
! Name !! Amundsen !! Atlas !! DataHub !! Egeria !! Marquez !! MediaWiki
|-
| Tagline
| Open source data discovery and metadata engine
| Atlas is a scalable and extensible set of core foundational governance services – enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allows integration with the whole enterprise data ecosystem.
| The Metadata Platform for the Modern Data Stack
| Open metadata and governance for enterprises - automatically capturing, managing and exchanging metadata between tools and platforms, no matter the vendor.
| An open source metadata service for the collection, aggregation, and visualization of a data ecosystem’s metadata. It maintains the provenance of how datasets are consumed and produced, provides global visibility into job runtime and frequency of dataset access, centralization of dataset lifecycle management, and much more.
| MediaWiki is a collaboration and documentation platform brought to you by a vibrant community.
|-
| Release Date || 2019 || 2015 || 2019 || 2018 || 2018 || 2003
|-
| Website || https://www.amundsen.io || https://atlas.apache.org || https://datahubproject.io || https://odpi.github.io/egeria-docs/ || https://marquezproject.github.io/marquez/ || https://mediawiki.org
|-
| Repository || https://github.com/amundsen-io/amundsen || https://github.com/apache/atlas || https://github.com/linkedin/datahub || || ||
|-
| Author || Lyft || Authored by Hortonworks, managed by Apache || LinkedIn || LFAI || WeWork || Wikimedia
|-
| License || Apache 2.0 || Apache 2.0 || Apache 2.0 || Apache 2.0 || Apache 2.0 ||
|-
| UX || || || Very in-depth UI || || ||
|-
| Robustness (criteria TBD) || || || || || ||
|-
| Comment || Sits on top of Atlas || || || || || The dogfooding approach: run our data catalog on wikitech. MediaWiki by itself could be used and manually updated, and any programmatic data access could be accomplished using MediaWiki extensions.
|}

General Considerations

  • Ingestion is going to be a big deal. For the parts of our data platform that do not expose metadata in a convenient way, we need to build custom metadata ingestion. Some of the tools above make this easier than others.
  • We should carefully survey each tool's list of connectors. Everyone says "we have a flexible connector architecture", "the community builds lots of high quality connectors", etc., but a closer look may reveal poor support for something we need, such as the lack of support for Spark in Atlas.
  • Give extra points for tight integrations, like using DataHub as a lineage backend for Airflow.
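
One way to make the connector survey and the "extra points" idea concrete is a small scoring script. The sketch below is only an illustration under stated assumptions: the tool names (`tool_a`, `tool_b`), the list of required sources, the connector sets, and the one-point-per-item weighting are all hypothetical placeholders, not findings about any catalog in the table above. Real inputs would have to be verified against each project's documentation.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """One catalog under evaluation; all values used below are placeholders."""
    name: str
    connectors: set[str]  # sources the tool claims to ingest
    tight_integrations: set[str] = field(default_factory=set)


def score(candidate: Candidate, required: set[str], bonus_for: set[str]) -> dict:
    """One point per required source covered, one bonus point per tight integration."""
    covered = required & candidate.connectors
    missing = required - candidate.connectors
    bonus = bonus_for & candidate.tight_integrations
    return {
        "name": candidate.name,
        "covered": sorted(covered),
        "missing": sorted(missing),  # each of these would need custom ingestion
        "bonus": sorted(bonus),
        "points": len(covered) + len(bonus),
    }


# Hypothetical inputs: replace with the sources we actually run and with
# connector lists checked against each project's documentation.
required_sources = {"hive", "kafka", "spark", "airflow"}
bonus_integrations = {"airflow-lineage-backend"}

candidates = [
    Candidate("tool_a", {"hive", "kafka", "airflow"}, {"airflow-lineage-backend"}),
    Candidate("tool_b", {"hive", "spark"}),
]

# Rank candidates from highest to lowest score.
results = sorted(
    (score(c, required_sources, bonus_integrations) for c in candidates),
    key=lambda r: r["points"],
    reverse=True,
)

for r in results:
    print(f'{r["name"]}: {r["points"]} points, missing connectors: {r["missing"]}')
```

The `missing` list is probably the most useful output: every entry is a source for which we would have to build custom ingestion, which is the cost the first bullet warns about.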

Other Candidates

With reasons they were not more seriously considered:

{| class="wikitable"
|-
! Name !! Metacat !! Select Star
|-
| Tagline
| Metacat is a unified metadata exploration API service. You can explore Hive, RDS, Teradata, Redshift, S3 and Cassandra. Metacat provides you information about what data you have, where it resides and how to process it. Metadata in the end is really data about the data. So the primary purpose of Metacat is to give a place to describe the data so that we could do more useful things with it.
| Beyond a data catalog, Select Star is an intelligent data discovery platform that helps you understand your data.
|-
| Link || https://github.com/Netflix/metacat || https://www.selectstar.com/
|-
| Disqualifying Reasons
| Documentation is still in the "TODO" phase, no references to community or the kind of organization that Apache projects enjoy, and somewhat limited scope.
| Closed Source, useful for comparisons
|}