Data Catalog Application Evaluation Rubric
Evaluating potential data catalogs for https://phabricator.wikimedia.org/T293643.
Read the Data-as-a-Service Execution plan here.
Because a data catalog is a central and critical part of a healthy data culture, many requirements with complex interrelationships factor into our decision. This section highlights the requirements that we think will most affect us during the first year that the data catalog is available to users.
- Easily ingest metadata from various parts of our Shared Infrastructure:
  - lineage: Airflow, Gobblin, custom ingestion (Flink, Spark)
  - tables and columns: Hive metastore, Druid, Cassandra, Elastic, custom jobs writing to HDFS (Spark, which is also relevant to our Airflow DAG strategy)
  - lots of other metadata: ownership, location, quality, automatic classifications. All tools have places to store these, but automation here is key, as it saves the souls of Data Stewards.
- Authentication. Which of our many sign-on mechanisms are we going to use and does the tool support it?
- Authorization. Will we need fine-grained control over data? Will we deploy Apache Ranger? Do we need to reflect the way we treat data in the metadata layer? Can we allow anyone to edit any metadata?
- UX. We are not a mature, data-savvy organization. This will be many people's first experience with our data landscape. We need to make it pleasant in order to deliver on our goal of making data a first-class citizen here at WMF.
- Search. This is part of UX, but it is a major component of most of the candidates and, for example, Atlas reportedly has weak search. This makes it an important stand-alone consideration.
- Speed of Ingestion. Do we need real-time updates to our metadata? Are we planning automated responses to certain changes in a way that would be easily centralized on top of the metadata catalog?
- Privacy (data retention, transformations, compliance). One of the things we pride ourselves on is privacy: retaining only what we need for as long as we need it. An overall picture of how compliant we are could differ from our self-assessment, so this seems like an important consideration. More generally, we may need a high-level overview of the data we keep and how it meets or fails to meet a privacy budget.
This list means we're not focusing on some of the other aspects of data governance right now, such as spelling out policies, tracking compliance, and following strict processes. These seem possible to build on top of most of the solutions we're looking at.
- Ingestion is going to be a big deal. For the parts of our data platform that do not expose metadata in a convenient way, we need to build custom metadata ingestion. Some of the tools above make this easier than others.
- We should carefully survey the list of connectors. Everyone says "we have a flexible connector architecture", "the community builds lots of high quality connectors", etc., but when you look closer you might find poor support for something we need. One example is the lack of support for Spark in Atlas.
- Give extra points for tight integrations, such as using DataHub as a lineage backend for Airflow.
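The connector survey described above could be tracked with a small script. This is only a sketch: the required sources are taken from the requirements list, while `missing_connectors`, `candidate_a`, `candidate_b` and their connector lists are hypothetical placeholders, not findings about any real catalog.

```python
# Sources named in the requirements list (lineage plus tables and columns).
REQUIRED_SOURCES = {
    "airflow", "gobblin", "flink", "spark",
    "hive", "druid", "cassandra", "elastic", "hdfs",
}


def missing_connectors(required, connectors_by_candidate):
    """Map each candidate to the sorted list of required sources it lacks."""
    report = {}
    for candidate, connectors in connectors_by_candidate.items():
        # Compare case-insensitively so "Hive" and "hive" count as the same source.
        supported = {name.lower() for name in connectors}
        report[candidate] = sorted(required - supported)
    return report


# Hypothetical placeholder data, NOT survey results for any real tool.
EXAMPLE_CONNECTORS = {
    "candidate_a": ["Hive", "Airflow", "Druid"],
    "candidate_b": ["Hive", "Spark", "Airflow", "Cassandra", "Elastic"],
}

for candidate, missing in missing_connectors(REQUIRED_SOURCES, EXAMPLE_CONNECTORS).items():
    print(f"{candidate}: missing {', '.join(missing) or 'nothing'}")
```

Each gap in the report is a source for which we would have to build custom metadata ingestion, so the length of a candidate's list is a rough proxy for integration effort.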
Cloud Services Evaluation
We are currently evaluating potential data catalogs on data-catalog-evaluation.analytics.eqiad1.wikimedia.cloud.
Some of the data sources we'd like to simulate there:
If you have access to the analytics horizon project, access the host as follows:
Or if you haven't configured ssh to use the bastion proxy jump, use:
ssh -J bastion.wmcloud.org data-catalog-evaluation.analytics.eqiad1.wikimedia.cloud
If your username is different than your local username, you'll need to specify your user:
ssh -J <username>@bastion.wmcloud.org <username>@data-catalog-evaluation.analytics.eqiad1.wikimedia.cloud
We currently have the following web proxies configured:
https://data-catalog-evaluation.wmcloud.org/ - home page that lists the data catalog services
https://atlas-demo.wmcloud.org/ - Atlas running on port 21000
Click on each header name to see the in-depth evaluation for each application as a separate article.
|Name||Amundsen||Atlas||DataHub||Egeria||Marquez|
|Tagline||Open source data discovery and metadata engine||Atlas is a scalable and extensible set of core foundational governance services – enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allows integration with the whole enterprise data ecosystem.||The Metadata Platform for the Modern Data Stack||Open metadata and governance for enterprises - automatically capturing, managing and exchanging metadata between tools and platforms, no matter the vendor.||An open source metadata service for the collection, aggregation, and visualization of a data ecosystem’s metadata. It maintains the provenance of how datasets are consumed and produced, provides global visibility into job runtime and frequency of dataset access, centralization of dataset lifecycle management, and much more.|
|Author||Lyft||Authored by Hortonworks, managed by Apache||||LFAI||WeWork|
|License||Apache 2.0||Apache 2.0||Apache 2.0||Apache 2.0||Apache 2.0|
|UX||Very in depth UI||Initially not considered, now basic exploration is available.|
|Robustness (criteria TBD)|
|Comment||Sits on top of Atlas||Takes a really long time to start up (~12 minutes). For the first 5 minutes, it outputs a series of dots (which may not print due to buffering, so it looks frozen):
apache-atlas-docker-atlas-server-1 | ..................
Then it prints repeatedly
waiting for atlas to be ready
for another 8 minutes or so. This makes iteration really tedious.
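Because the Atlas container can look frozen while it starts, a small polling helper can make iteration less tedious. The following is a minimal sketch, not part of the evaluation setup: `wait_until_ready` and `_http_probe` are hypothetical names, and treating any HTTP answer on the Atlas port as "ready" is an assumption that may need tightening.

```python
import time
import urllib.error
import urllib.request


def _http_probe(url):
    """Return True once the server answers at all, even with an error status."""
    try:
        with urllib.request.urlopen(url, timeout=5):
            return True
    except urllib.error.HTTPError:
        return True  # the server is up, it just rejected this request
    except (urllib.error.URLError, OSError):
        return False  # connection refused or timed out: still starting


def wait_until_ready(url, timeout_s=900.0, interval_s=10.0, probe=_http_probe):
    """Poll `url` until `probe` succeeds; return the seconds waited.

    Raises TimeoutError if the service is not ready within `timeout_s`.
    """
    start = time.monotonic()
    while True:
        if probe(url):
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            raise TimeoutError(f"{url} not ready after {timeout_s:.0f}s")
        time.sleep(interval_s)
```

For example, `wait_until_ready("http://localhost:21000/")` blocks until something answers on port 21000, the port Atlas runs on in this setup. The 15-minute default timeout is chosen to sit above the roughly 12-minute startup observed here.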
Other candidates, with the reasons they were not more seriously considered:
|Tagline||Metacat is a unified metadata exploration API service. You can explore Hive, RDS, Teradata, Redshift, S3 and Cassandra. Metacat provides you information about what data you have, where it resides and how to process it. Metadata in the end is really data about the data. So the primary purpose of Metacat is to give a place to describe the data so that we could do more useful things with it.||Beyond a data catalog, Select Star is an intelligent data discovery platform that helps you understand your data.||Open source research data repository software||The world’s leading open source data management system||MediaWiki is a collaboration and documentation platform brought to you by a vibrant community.|
|Link||https://github.com/Netflix/metacat||https://www.selectstar.com/||https://dataverse.org/||https://ckan.org/ (Licensed GNU AGPL 3.0)||https://mediawiki.org|
|Disqualifying Reasons||Documentation is still in the "TODO" phase, there are no references to a community or the kind of organization that Apache projects enjoy, and the scope is somewhat limited.||Closed source; useful for comparisons.||This is more of a research sharing tool and is not used for generic data governance. See code at https://github.com/IQSS/dataverse and related documentation.||CKAN is meant to work at a very large scale, such as governments with multiple branches collaborating on data hubs. As such, most of the integrations are meant to be done manually, with only minimal automation support. Details can be found in their docs, but it doesn't seem to meet our requirements. It may be worth considering for something bigger, like an Open Knowledge Data Portal shared with our other open knowledge partners.||While building a metadata ingestion and storage layer on top of MediaWiki would be a fun side project, the complexity of the competitors' UIs suggests this would only work if a lightweight, manual process were viable.|