Data Catalog Application Evaluation

imported>Btullis
m (→‎What is a Data Catalog?: Add link to wikipedia page for fair data)
imported>Neil P. Quinn-WMF
(Neil P. Quinn-WMF moved page Data Catalog Application Evaluation to 2021 data catalog selection: Clarify this is not a living document and use title case)
 
 
#REDIRECT [[2021 data catalog selection]]
==Data Catalog Evaluation==
 
===What is a Data Catalog?===
 
A data catalog is an inventory of data asset metadata that allows data consumers to discover and evaluate data for analytical and product uses. Data catalogs address findability, accessibility, interoperability, and reusability, the four principles of [[:en:FAIR_data|FAIR data]], which have proven to be critical bottlenecks in data management when left unaddressed. A data catalog is also a valuable tool for data governance and data management, because it provides an interface for data definitions, provenance, and access control.
 
===Problem Statement===
 
Our data lake has served as the Foundation's primary repository of data, stored in both raw and processed formats. It has enabled us to store and analyze the vast amounts of data that result from our users interacting with our projects. However, simply centralizing and storing our data in the data lake has not solved our critical data management challenges, such as data findability, accessibility, interoperability, and re-use. Currently we try to meet these needs with dataset documentation on Wikitech and metadata descriptions in schemas and in Hive tables. This has become costly to maintain and is ultimately insufficient for enabling the FAIR data principles as we scale our data practices.
 
Interest in the data collected by our systems has been growing dramatically in the past few years with the introduction of new features and an increasing focus on evidencing our decisions using data. This has increased the urgency for an enhanced set of data management tools to address these challenges. One such tool that we are investigating is a Data Catalog.
 
===Impact Hypothesis===
 
By successfully implementing and integrating a catalog solution as part of our data management strategy, we would bring our data ecosystem more in line with the FAIR data principles, which would make more of the organization less reliant on our analytics teams.
 
===Evaluation Candidates===
 
 
 
{| style="border-spacing:0;width:6.5in;"
|- style="border: 1pt solid #000000; padding:0.0694in;"
| align=center| Atlas
| align=center| Amundsen
| align=center| DataHub
| align=center| OpenMetadata
|-
|}
 
 
===WMF Functional Requirements===
 
For a solution to be considered complete, it would need to support us in achieving the following functional requirements, now or in the future. These requirements form the basis of our evaluation. To assess a solution against them, we plan to run a timeboxed MVP that tests the solution and establishes how many of the requirements it can meet.
 
 
{| style="border-spacing:0;width:6.5in;"
|-
| style="border:1.5pt solid #000000;padding:0.0278in;" | '''Functional'''
| align=center style="border: 1.5pt solid #000000; padding:0.0278in;" | '''Key Functionality Requirements'''
| align=center style="border: 1.5pt solid #000000; padding:0.0278in;" | '''Description'''
|-
| align=center style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Ingestion
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Integration with underlying data stores to import metadata through data connectors
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | The Data Catalog should be able to ingest structured or semi-structured metadata, and must support ingesting metadata from the entire organization.
|-
|
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Ability to connect to the catalog via API for integration with automated processes and applications
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Data catalog should support automated discovery and ingestion of data sets, both for initial catalog build and ongoing discovery of new data sets.
|-
|
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Track data lineage
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Ability to trace data from the original source, through analysis and reporting processes
|-
|
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Track data usage
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Should support the ability to collect information about each data set including: Who has used the data set? For what use cases has it been used? How frequently is it used? With what other data sets is it typically used or combined?
|-
|
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Track metadata changes across dataset versions
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Track the changes and provide a version history of any dataset included in the catalog.
|-
| align=center style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Usability
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Dataset Evaluation
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Add annotations, create custom metadata fields, add search terms and tags, identify stewards and SMEs, tag security and compliance sensitive data fields.
|-
|
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Dataset Visibility
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | The ability to manually add, hide or remove datasets.
|-
|
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Capture User Feedback
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Enable social capabilities such as Org-Sourcing of metadata, sharing features, posting of user ratings and reviews, and capture of user feedback.
|-
|
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Ability to search for datasets
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Robust search capabilities include search by facets, keywords, and business terms. Natural language search capabilities are especially valuable for non-technical users. Ranking of search results by relevance and by frequency of use are particularly useful and beneficial features.
|-
|
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Interface Usability
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Include capabilities to preview a dataset, view data profiles, see user ratings, read user reviews and curator annotations, and view data quality information.
|-
| align=center style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Security
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Dataset access management
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Data access should be imposed at dataset level, record/row level, column/field level, and by value.
|-
|
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Fine grained ACL (access control lists) for catalog metadata access including data masking
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | User security should at minimum distinguish between administrative users, analytic users, and data stewards, each of which should have its own security profile
|-
|
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Ability to provide public access to discover all datasets
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Public datasets are useful to our technical community members who build critical infrastructure. Public datasets are also useful to the broader research network outside WMF, and a critical part of the free knowledge ecosystem. Public information about private datasets is useful in delineating what projects can happen without formal WMF collaboration.
|-
|}
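
To make the requirements above more concrete, here is a minimal Python sketch of what a catalog entry and a faceted keyword search over such entries might look like. Every name in it (<code>DatasetEntry</code>, <code>search</code>, the <code>example_db</code> datasets and their numbers) is a hypothetical illustration and is not taken from any of the evaluated products or from our data lake.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Hypothetical catalog record; the field names are illustrative only."""
    name: str
    description: str = ""
    tags: set[str] = field(default_factory=set)
    stewards: list[str] = field(default_factory=list)
    upstream: list[str] = field(default_factory=list)  # lineage: source datasets
    query_count: int = 0  # usage: how often the dataset has been queried
    public: bool = False  # whether the entry may be shown to the public


def search(entries, keyword="", tag=None, public_only=False):
    """Keyword and facet search, ranked by frequency of use (most used first)."""
    keyword = keyword.lower()
    hits = [
        e for e in entries
        if (keyword in e.name.lower() or keyword in e.description.lower())
        and (tag is None or tag in e.tags)
        and (e.public or not public_only)
    ]
    return sorted(hits, key=lambda e: e.query_count, reverse=True)


# Made-up example entries, for illustration only.
catalog = [
    DatasetEntry("example_db.pageviews_hourly", "Hourly pageview counts",
                 tags={"traffic"}, query_count=120, public=True),
    DatasetEntry("example_db.pageviews_raw", "Unaggregated pageview requests",
                 tags={"traffic", "sensitive"}, query_count=300),
    DatasetEntry("example_db.edits_daily", "Daily edit counts",
                 tags={"contributors"}, upstream=["example_db.edits_raw"],
                 query_count=40, public=True),
]

for entry in search(catalog, keyword="pageview"):
    print(entry.name, entry.query_count)
```

The sketch covers only search, tagging, usage ranking, and the public/private distinction. Row-level access control, versioning, and automated ingestion would need considerably more machinery than shown here.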
 
===MVP Goals===
 
'''Scope:''' Deploy a data catalog solution cataloging Hive datasets and Kafka datasets streamed through the Event Platform
 
 
{| style="border-spacing:0;width:6.5in;"
|- style="border:1pt solid #ffffff;padding:0.0694in;"
|| '''Functional Requirements'''
 
[Primary] Searching and filtering options to allow users to quickly find relevant sets of data for analytics or data engineering requirements.
 
[Extended] Provide a way for subject matter experts to contribute business knowledge, e.g. glossary, tags, associations, user-defined annotations, classifications, ratings, etc.
|| '''Technical Requirements:'''
 
[Primary] Have the complete Hive Metastore imported into the Data Catalog
 
[Extended] Event Platform Schemas and Streams imported into the Data Catalog
 
[Stretch] Airflow lineage included
|-
|}
 
<div style="margin-left:0in;margin-right:0in;">'''Milestones:'''</div>
 
 
{| style="border-spacing:0;width:6.3125in;"
|- style="background-color:#cccccc;border:0.75pt solid #cccccc;padding:0.0278in;"
| align=center| '''Milestone'''
| align=center| '''Details'''
|- style="background-color:#cfe2f3;border:0.75pt solid #cccccc;padding:0.0278in;"
|| Complete feature matrix
|| [https://phabricator.wikimedia.org/T299887 https://phabricator.wikimedia.org/T299887]
|- style="background-color:#d9ead3;border:0.75pt solid #cccccc;padding:0.0278in;"
|| Plan for Productionising Complete.
|| [https://phabricator.wikimedia.org/T299888 https://phabricator.wikimedia.org/T299888]
|- style="background-color:#d9ead3;border:0.75pt solid #cccccc;padding:0.0278in;"
|| Have the selected solution deployed and connect one dataset to it.
|| [https://phabricator.wikimedia.org/T299897 https://phabricator.wikimedia.org/T299897]
|- style="background-color:#ead1dc;border:0.75pt solid #cccccc;padding:0.0278in;"
|| Connect remaining data stores and test required functionality
|| [https://phabricator.wikimedia.org/T299899 https://phabricator.wikimedia.org/T299899]
|- style="background-color:#fce5cd;border:0.75pt solid #cccccc;padding:0.0278in;"
|| Demo Solution
|| [https://phabricator.wikimedia.org/T299910 https://phabricator.wikimedia.org/T299910]
|-
|}
 
 
 
 
===Technical MVP Evaluation===
 
{| style="border-spacing:0;width:6.5625in;"
|- style="border:2.25pt solid #000000;padding:0.0278in;"
| colspan="6"  align=center| '''Implementation Considerations'''
|-
| colspan="2"  style="border:2.25pt solid #000000;padding:0.0278in;" | '''Requirement'''
| style="border: 2.25pt solid #000000; padding:0.0278in;" | '''Atlas'''
| style="border: 2.25pt solid #000000; padding:0.0278in;" | '''DataHub'''
| style="border: 2.25pt solid #000000; padding:0.0278in;" | '''OpenMetadata'''
| style="border: 2.25pt solid #000000; padding:0.0278in;" | '''Amundsen'''
|-
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Sync Hive
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | How often can changes get synced
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Continuous'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Continuous'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#980000;" | '''Almost daily'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#ff9900;" | '''Every 2 hours'''
|-
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Sync Airflow
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | How often can changes get pushed
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Continuous'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Continuous'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Continuous'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | [https://github.com/amundsen-io/amundsen/search?p=4&q=airflow Unknown]
|-
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Automated Classifier
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Ingestion and changes to the state are automatically synced
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes*'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#ff9900;" | '''Limited'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#980000;" | '''No'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes*'''
|-
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Months to productionize
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Ready to T2 Service
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | 9 to 12
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | 4 to 6
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | 4 to 6
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | 6 to 8
|-
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Community now
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | How quickly the community responds
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#980000;" | '''Inactive'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Active'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Active'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#ff9900;" | '''Uncertain'''
|- style="border: 0.75pt solid #cccccc; padding:0.0278in;"
| colspan="6"  align=center| '''Search Capabilities'''
|-
| colspan="2"  style="border: 0.75pt solid #cccccc; padding:0.0278in;" | '''Requirement'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | '''Atlas'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | '''DataHub'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | '''OpenMetadata'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | '''Amundsen'''
|-
| colspan="2"  style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Imported metadata fields
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes*'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
|-
| colspan="2"  style="border: 0.75pt solid #cccccc; padding:0.0278in;" | System (eg: classifiers)
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes*'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
|-
| colspan="2"  style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Description text
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes*'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
|-
| colspan="2"  style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Popularity, rating, etc.
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes*'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
|- style="border: 0.75pt solid #cccccc; padding:0.0278in;"
| colspan="6"  align=center| '''Possible from a GUI'''
|-
| colspan="2"  style="border: 0.75pt solid #cccccc; padding:0.0278in;" | '''Requirement'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | '''Atlas'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | '''DataHub'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | '''OpenMetadata'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | '''Amundsen'''
|-
| colspan="2"  style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Manage stewardship
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes*'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | [https://github.com/amundsen-io/amundsen/blob/main/frontend/amundsen_application/api/metadata/v0.py#L189 Yes]
|-
| colspan="2"  style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Report quality issue
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes*'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#ff0000;" | '''No***'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | [https://docs.open-metadata.org/features#data-reliability Yes]
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | [https://github.com/amundsen-io/amundsen/blob/main/frontend/amundsen_application/api/quality/v0.py No?]
|-
| colspan="2"  style="border: 0.75pt solid #cccccc; padding:0.0278in;" | See Quality in Lineage
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes*'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#ff9900;" | '''Limited'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#ff0000;" | '''[https://github.com/open-metadata/OpenMetadata/issues/1311 Planned]'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | [https://github.com/amundsen-io/amundsen/blob/main/frontend/amundsen_application/api/metadata/v0.py#L1025 Yes]
|-
| colspan="2"  style="border: 0.75pt solid #cccccc; padding:0.0278in;" | See Classifiers in Lineage
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes*'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#ff9900;" | '''Limited'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | [https://github.com/amundsen-io/amundsen/blob/main/frontend/amundsen_application/api/metadata/v0.py#L1025 Yes]
|-
| colspan="2"  style="border: 0.75pt solid #cccccc; padding:0.0278in;" | See Dashboards in Lineage
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes*'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
|-
| colspan="2"  style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Glossary: use and update
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes*'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Planned'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#ff9900;" | '''Atlas'''
|-
| colspan="2"  style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Superset integration
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#ff9900;" | '''Limited*'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
|- style="border: 0.75pt solid #cccccc; padding:0.0278in;"
| colspan="6"  align=center| '''MVP Stretch Goals Features'''
|-
| colspan="2"  style="border: 0.75pt solid #cccccc; padding:0.0278in;" | '''Requirement'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | '''Atlas'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | '''DataHub'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | '''OpenMetadata'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;" | '''Amundsen'''
|-
| colspan="2"  style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Metadata Ingestion: MySQL
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#ff0000;" | No
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
|-
| colspan="2"  style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Metadata Ingestion: Hive metastore, Kafka topic metadata, Druid, Cassandra, dashboard metadata
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#ff9900;" | '''Limited'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
|-
| colspan="2"  style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Column-level lineage
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#ff9900;" |
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Planned'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''[https://github.com/open-metadata/OpenMetadata/issues/2931 Planned]'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes'''
|-
| colspan="2"  style="border: 0.75pt solid #cccccc; padding:0.0278in;" | Any access-related requirement
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes*'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#6aa84f;" | '''Yes**'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#ff9900;" | '''Planned'''
| style="border: 0.75pt solid #cccccc; padding:0.0278in;color:#ff9900;" | Atlas
|-
|}
 
 
 
{| style="border-spacing:0;width:6.4167in;"
|- style="border:0.75pt solid #ffffff;padding:0.0278in;"
|| <nowiki>*Not Supported with our current stack</nowiki>
|| <nowiki>**Supports LDAP, fine grained on roadmap</nowiki>
|| <nowiki>***Coming Soon</nowiki>
|-
|}
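
When reading the matrix, note that the asterisks change the meaning of a cell: a "Yes*" is a feature that exists but is not supported with our current stack. As a small illustration, the following Python sketch splits a cell into its rating and its footnote. The helper name <code>parse_rating</code> is hypothetical; the footnote texts and the example row are transcribed from the tables above.

```python
# Footnote markers, transcribed from the legend of the evaluation table.
FOOTNOTES = {
    "*": "Not supported with our current stack",
    "**": "Supports LDAP, fine grained on roadmap",
    "***": "Coming Soon",
}


def parse_rating(cell):
    """Split a cell such as 'Yes**' into (rating, footnote text or None)."""
    rating = cell.rstrip("*")
    marker = cell[len(rating):]
    return rating, FOOTNOTES.get(marker)  # None when the cell has no marker


# The "Any access-related requirement" row, transcribed as-is.
access_row = {
    "Atlas": "Yes*",
    "DataHub": "Yes**",
    "OpenMetadata": "Planned",
    "Amundsen": "Atlas",
}

for candidate, cell in access_row.items():
    rating, note = parse_rating(cell)
    print(f"{candidate}: {rating}" + (f" ({note})" if note else ""))
```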
 
===Notes on the Candidates===
 
 
<div style="color:#434343;">DataHub</div>
 
We chose DataHub for our MVP because it fit best in our current environment. We already have OpenSearch deployments, a MariaDB cluster that DataHub is compatible with, Kafka, and so on. Ingestion for the metadata we care about was easy and flexible. We like that Kafka holds everything together because we have to allow public access to the catalog in some way. Our main hesitation with DataHub is around the pieces of the LinkedIn / Confluent ecosystem that we are not using. Pegasus is used internally for schemas, and we shouldn’t have to interface with it, but JSON Schema would’ve been easier. Confluent Schema Registry or Karapace is currently a dependency. Ideally we wouldn’t have to set either of those up, and there is an open question with the DataHub community as to whether the dependency can be eliminated.
 
 
<div style="color:#434343;">Amundsen</div>
 
Amundsen is a great candidate: a simple set of Python services provides all the functionality and makes deployment simple. Ultimately, the sources we want to ingest metadata from were just a bit harder to configure. But there’s a lot of great UX in Amundsen that’s worth revisiting, from social features to features connected with good data governance practice. There’s a sense in Amundsen that folks experienced with data governance are steering the product in the right direction. This evaluation taught us so much, and we’re thankful for all the valuable content we found, like [https://www.stemma.ai/blog/making-sense-of-metadata-ingestion this great write-up] on snapshot extraction vs data extraction, and how data catalogs tend to fail or succeed.
 
 
<div style="color:#434343;">OpenMetadata</div>
 
OpenMetadata is a really good solution and very easy to deploy, with or without Docker. Ultimately it just didn’t fit as well in our environment. It needs MySQL 8+ and uses features that we don’t have in MariaDB 10.4, so we would have to set up a separate cluster and support it ourselves (we’re a very small team doing much more than the data catalog). Hive ingestion is [https://github.com/open-metadata/OpenMetadata/issues/2533 getting better], but is not quite ready for our use case. There are lots of good things to revisit here: reliance on simpler open standards like JSON Schema, an amazing and responsive community that fixed issues as fast as we reported them, a great UI with a great user experience, and clearly an eye towards a simpler data governance solution.
 
 
<div style="color:#434343;">Atlas</div>
 
Honestly we had high hopes for Atlas, but the community seems mostly unresponsive and they have no backwards compatibility with the version of Hive we use, so the hurdles were too big.
 
 
 
Some more notes from this evaluation process, and on other candidates we looked at, are available here: [[Data_Catalog_Application_Evaluation/Rubric]]

Latest revision as of 18:04, 4 July 2022