2021 data catalog selection/Rubric
Because a data catalog is a central and critical part of a healthy data culture, a multitude of requirements with complex inter-relationships factor into our decision. This section attempts to highlight the requirements that we think would most affect us over the first year that the data catalog is available to users.
- Easily ingest metadata from various parts of our Shared Infrastructure
- lineage: Airflow, Gobblin, custom ingestion (Flink, Spark)
- tables and columns: Hive metastore, Druid, Cassandra, Elastic, custom jobs writing to HDFS (Spark (also relevant to our Airflow dag strategy))
- lots of other metadata: ownership, location, quality, automatic classifications. All tools have places to store these, but automation here is key, as it saves the souls of Data Stewards.
- Authentication. Which of our many sign-on mechanisms are we going to use and does the tool support it?
- Authorization. Will we need fine-grained control over data? Will we deploy Apache Ranger? Do we need to reflect the way we treat data in the metadata layer? Can we allow anyone to edit any metadata?
- UX. We are not a mature, data-savvy organization. This will be a lot of people's first experience with our data landscape. We need to make it pleasant in order to deliver on our goal of making data a first-class citizen here at WMF.
- Search. This is part of UX, but it's a major component of most of the candidates, and some reportedly do it poorly: Atlas, for example, is said to have weak search. This makes it an important stand-alone consideration.
- Speed of Ingestion. Do we need real-time updates to our metadata? Are we planning automated responses to certain changes in a way that would be easily centralized on top of the metadata catalog?
- Privacy (data retention, transformations, compliance). One of the things we pride ourselves on is privacy: retaining only what we need for as long as we need it. An overall picture of how compliant we are could differ from our self-assessment, so this seems like an important consideration. More generally, we need a high-level overview of the data we keep and how it meets or fails to meet a privacy budget.
This list means we're not focusing on some of the other aspects of data governance right now, such as spelling out policies, tracking compliance, and following strict processes. These seem possible to build on top of most of the solutions we're looking at.
- Ingestion is going to be a big deal. For the parts of our data platform that do not expose metadata in a convenient way, we need to build custom metadata ingestion. Some of the tools above make this easier than others.
- We should carefully survey the list of connectors. Everyone says "we have a flexible connector architecture", "the community builds lots of high quality connectors", etc., but when you look closer you might find poor support for something we need. One example is the lack of support for Spark in Atlas.
- Give extra points for tight integrations, like using DataHub as a lineage backend for Airflow.
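The connector survey described above can be sketched as a simple set comparison. This is only an illustration: the required sources are taken from the ingestion requirements earlier on this page, while `connector_gaps`, `tool_a`, `tool_b`, and their connector lists are hypothetical placeholders, not survey results for any real catalog.

```python
# Sources named in the ingestion requirements on this page.
REQUIRED_SOURCES = {
    "airflow", "gobblin", "flink", "spark",
    "hive", "druid", "cassandra", "elastic", "hdfs",
}


def connector_gaps(required, supported_by_tool):
    """Return, per tool, the sorted required sources it has no connector for."""
    return {
        tool: sorted(required - set(connectors))
        for tool, connectors in supported_by_tool.items()
    }


# Placeholder data only (hypothetical tools, NOT real survey results).
example_survey = {
    "tool_a": ["hive", "druid", "cassandra", "airflow", "elastic"],
    "tool_b": ["hive", "spark", "airflow", "hdfs", "flink",
               "gobblin", "druid", "cassandra", "elastic"],
}

gaps = connector_gaps(REQUIRED_SOURCES, example_survey)
for tool, missing in gaps.items():
    print(tool, "is missing:", missing or "nothing")
```

A non-empty gap list marks where custom metadata ingestion would have to be built for that candidate.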
Click on each header name to see the in-depth evaluation for each application as a separate article.
|Tagline||Atlas is a scalable and extensible set of core foundational governance services – enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allows integration with the whole enterprise data ecosystem.||Open source data discovery and metadata engine||The Metadata Platform for the Modern Data Stack||Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right.||Open metadata and governance for enterprises - automatically capturing, managing and exchanging metadata between tools and platforms, no matter the vendor.||An open source metadata service for the collection, aggregation, and visualization of a data ecosystem’s metadata. It maintains the provenance of how datasets are consumed and produced, provides global visibility into job runtime and frequency of dataset access, centralization of dataset lifecycle management, and much more.|
|Author||Authored by Hortonworks, managed by Apache||Lyft||Suresh Srinivas, formerly of Hortonworks and Uber||LFAI||WeWork|
|License||Apache 2.0||Apache 2.0||Apache 2.0||Apache 2.0||Apache 2.0||Apache 2.0|
|UX||Java application via Jetty. New UI is the default in version 2.2.0; the legacy UI is still available.||Flask application
|Java application via Jetty
|Initially not considered, now basic exploration is available.|
|Robustness (criteria TBD)||Community support is lacking||Ingestion components seem unfinished||Difficult to ascertain. No significant issues detected so far.||Fairly nascent project|
|Comment||Has a variety of back-end storage options, including BerkeleyDB, HBase, and Cassandra.||Has a variety of back-end storage options, including RDBMS and Neo4j. Can also make use of Atlas, but it is not a requirement.
|Requires MySQL 8.0||
Other candidates, with the reasons they were not more seriously considered:
|Tagline||Metacat is a unified metadata exploration API service. You can explore Hive, RDS, Teradata, Redshift, S3 and Cassandra. Metacat provides you information about what data you have, where it resides and how to process it. Metadata in the end is really data about the data. So the primary purpose of Metacat is to give a place to describe the data so that we could do more useful things with it.||Beyond a data catalog, Select Star is an intelligent data discovery platform that helps you understand your data.||Open source research data repository software||The world’s leading open source data management system||MediaWiki is a collaboration and documentation platform brought to you by a vibrant community.|
|Link||https://github.com/Netflix/metacat||https://www.selectstar.com/||https://dataverse.org/||https://ckan.org/ (Licensed GNU AGPL 3.0)||https://mediawiki.org|
|Disqualifying Reasons||Documentation is still in the "TODO" phase, there are no references to a community or the kind of organization that Apache projects enjoy, and the scope is somewhat limited.||Closed source; useful for comparisons.||This is more of a research sharing tool and is not used for generic data governance. See code at https://github.com/IQSS/dataverse and related documentation.||CKAN is meant to work at a very large scale, such as governments with multiple branches collaborating on data hubs. As such, most of the integrations are meant to be done manually, with only minimal automation support. Details can be found in their docs, but it doesn't seem to meet our requirements. It may be worth considering for something bigger, like an Open Knowledge Data Portal shared with our other open knowledge partners.||While building a metadata ingestion and storage layer on top of MediaWiki would be a fun side project, the complexity of the competitors' UIs makes it clear that this would only work if a lightweight, manual process were viable.
We decided to create four evaluation deployments of different candidates, in order to assess their suitability for this requirement.
|Ticket||Version(s) Deployed||Location||Backend Components||Hive Ingested||Druid Ingested||Kafka Ingested||Notes|
||No - Hive version incompatible.||Not attempted.||Not attempted.|
||Yes, production. Using a Kerberos-secured connection to the hive-server2 service via pyhive.
|Yes, both public and
|Yes, topics only.
Not associated with schemas yet
||Yes, test. Using a Kerberos-secured connection to the hive-server2 service via pyhive.
|Not attempted.||Not attempted|
||Yes, production. Using a MySQL connection to an-coord1001.
|Yes, analytics cluster.||Not attempted|
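Two of the evaluations above ingested Hive metadata over a Kerberos-secured connection to hive-server2 via pyhive. A minimal sketch of what such a connection and a column listing might look like follows. The helper names (`connect_hive`, `list_columns`), the default port, and the `hive` service name are assumptions for illustration, not the configuration used in the evaluations.

```python
def connect_hive(host, port=10000):
    """Open a Kerberos-authenticated connection to hive-server2.

    Assumes a valid Kerberos ticket already exists (e.g. via kinit) and that
    the service principal name is "hive"; both are assumptions.
    """
    # Imported lazily so list_columns() can be used without pyhive installed.
    from pyhive import hive
    return hive.connect(
        host=host,
        port=port,
        auth="KERBEROS",
        kerberos_service_name="hive",
    )


def list_columns(cursor, database, table):
    """Return (column_name, data_type) pairs for one table via DESCRIBE."""
    cursor.execute(f"DESCRIBE `{database}`.`{table}`")
    columns = []
    for row in cursor.fetchall():
        name, data_type = row[0], row[1]
        # For partitioned tables, DESCRIBE appends a blank row and "# ..."
        # header rows, then repeats the partition columns; stop there.
        if not name or name.startswith("#"):
            break
        columns.append((name.strip(), data_type.strip()))
    return columns
```

Usage would be along the lines of `cursor = connect_hive("hive-server.example.org").cursor()` followed by `list_columns(cursor, "some_db", "some_table")`, where the host, database, and table names are placeholders. Because `list_columns` only relies on the standard DB-API cursor methods, it can be exercised against a stub cursor without a live cluster.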