2021 data catalog selection/Rubric/DataHub
Core Service and Dependency Setup
DataHub was downloaded from https://github.com/linkedin/datahub/ onto stat1008 and tag 0.8.24 was checked out.
The build process required internet access, so the web proxy settings had to be supplied in several places. The build generally used Gradle, with an invocation similar to this:
./gradlew -Dhttp.proxyHost=webproxy -Dhttp.proxyPort=8080 -Dhttps.proxyHost=webproxy -Dhttps.proxyPort=8080 "-Dhttp.nonProxyHosts=127.0.0.1|localhost|*.wmnet" build
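Because the same proxy flags had to be repeated for several build steps, a small helper can assemble them consistently. The following is only a sketch: the function name `gradle_command` and its parameters are hypothetical, and the default values are copied from the command above.

```python
import shlex


def gradle_command(task, proxy_host="webproxy", proxy_port=8080,
                   no_proxy=("127.0.0.1", "localhost", "*.wmnet")):
    """Return the argument list for a Gradle run that goes through the web proxy.

    The helper name and parameters are illustrative; only the flag names and
    default values come from the command shown in the text.
    """
    args = ["./gradlew"]
    # Set the same proxy host and port for both HTTP and HTTPS.
    for scheme in ("http", "https"):
        args.append(f"-D{scheme}.proxyHost={proxy_host}")
        args.append(f"-D{scheme}.proxyPort={proxy_port}")
    # Hosts that bypass the proxy are joined with '|'.
    args.append("-Dhttp.nonProxyHosts=" + "|".join(no_proxy))
    args.append(task)
    return args


if __name__ == "__main__":
    # shlex.join quotes the nonProxyHosts argument so that the shell does not
    # interpret '|' or '*'.
    print(shlex.join(gradle_command("build")))
```

Passing the resulting list to `subprocess.run` avoids shell quoting problems altogether, since no shell is involved.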
Any problems with the build were worked around using a build carried out on a workstation.
DataHub does not have a supported deployment method that avoids containers (i.e. Docker), so to complete the setup of each required component, the steps from each Dockerfile were carried out manually.
Core DataHub Services
All DataHub components have an option to enable a Prometheus JMX exporter, but this was not configured as part of the evaluation.
Metadata Service (GMS)
This runs as a Jetty web application on port 8080. The daemon is managed by a systemd user service that uses the following key configuration.
EnvironmentFile=/home/btullis/src/datahub/datahub/docker/datahub-gms/env/local.env
ExecStart=/usr/bin/java $JAVA_OPTS $JMX_OPTS -jar jetty-runner.jar --jar jetty-util.jar --jar jetty-jmx.jar ./war.war
WorkingDirectory=/home/btullis/src/datahub/datahub/docker/datahub-gms
This service listens on port 8080.
The DataHub frontend is a Play Framework application combined with a React user interface. Like the GMS service, it is controlled by a systemd user service with the following configuration.
EnvironmentFile=/home/btullis/src/datahub/datahub/docker/datahub-frontend/env/local.env
ExecStart=/home/btullis/src/datahub/datahub/docker/datahub-frontend/datahub-frontend/bin/playBinary
WorkingDirectory=/home/btullis/src/datahub/datahub/docker/datahub-frontend
This service listens on port 9000.
It uses JAAS for authentication. Initially it uses a flat file with a fixed datahub username and password, but we could use LDAP for this, or possibly CAS.
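With the services started by hand rather than by containers, a quick check that each one is accepting connections can be useful. The sketch below assumes the two ports given in the text (8080 for GMS, 9000 for the frontend). The labels in `SERVICES` and the function names are illustrative only and are not part of DataHub.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


# Ports taken from the text; the labels are just for display.
SERVICES = {"datahub-gms": 8080, "datahub-frontend": 9000}


def check_services(host="localhost", services=SERVICES):
    """Map each service label to whether its port accepts connections."""
    return {name: is_listening(host, port) for name, port in services.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name}: {'listening' if up else 'not reachable'}")
```

This only confirms that a port is open. It does not verify that the application behind it is healthy.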
Metadata Change Event (MCE) Consumer Job
This is a Kafka consumer that works on the ingestion side for DataHub. It reads jobs from a Kafka topic and applies the change to the persistent storage back-end. It then enqueues another job for the MAE consumer job to pick up.
Metadata Audit Event (MAE) Consumer Job
This is a Kafka consumer that is more related to the serving side of DataHub. It picks up MAE jobs from the Kafka topic and updates the search indices and the graph database.
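The hand-off between the two consumer jobs can be pictured with a toy, in-memory model. This is not DataHub code: plain Python queues stand in for the Kafka topics, dictionaries stand in for the storage back-end and the search index, the graph database is left out, and the event fields and URN are made up for illustration.

```python
from queue import Queue

mce_topic = Queue()   # stands in for the MCE Kafka topic
mae_topic = Queue()   # stands in for the MAE Kafka topic
storage = {}          # stands in for the persistent storage back-end
search_index = {}     # stands in for the search indices


def mce_consumer():
    """Apply each pending change to storage, then enqueue an audit event."""
    while not mce_topic.empty():
        event = mce_topic.get()
        previous = storage.get(event["urn"])
        storage[event["urn"]] = event["aspect"]
        mae_topic.put({"urn": event["urn"], "old": previous, "new": event["aspect"]})


def mae_consumer():
    """Update the serving-side index from each pending audit event."""
    while not mae_topic.empty():
        event = mae_topic.get()
        search_index[event["urn"]] = event["new"]


# A made-up change event flowing through both stages.
mce_topic.put({"urn": "urn:li:dataset:example", "aspect": {"description": "demo"}})
mce_consumer()
mae_consumer()
print(storage)
print(search_index)
```

The point of the split is that the write path (MCE) and the serving path (MAE) are decoupled by a topic, so the search and graph stores are updated asynchronously after the change is persisted.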
Confluent Platform Services
A binary release of the Confluent Platform 5.4 was extracted to
/home/btullis/src/datahub/confluent on stat1008 and this was used to run the following components with a default configuration.
We ran a local ZooKeeper instance on stat1008 using a systemd user service with the following configuration:
Environment="JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64"
ExecStart=/home/btullis/src/datahub/confluent/bin/zookeeper-server-start etc/kafka/zookeeper.properties
ExecStop=/home/btullis/src/datahub/confluent/bin/zookeeper-server-stop
WorkingDirectory=/home/btullis/src/datahub/confluent/
We ran a standalone broker on stat1008 using a systemd user service with the following configuration:
Environment="JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64"
ExecStart=/home/btullis/src/datahub/confluent/bin/kafka-server-start etc/kafka/server.properties
ExecStop=/home/btullis/src/datahub/confluent/bin/kafka-server-stop
WorkingDirectory=/home/btullis/src/datahub/confluent/
The required topics were created using the steps recorded here
A Schema Registry is a required component of DataHub, but the Confluent implementation is a problem for us because of its license. It is released under the Confluent Community License, which is cost-free but not sufficiently permissive for us to be able to use it. We have been discussing with the DataHub team whether there is a workaround for this requirement. They have suggested using Karapace as one workaround, and they have also created a feature request to remove the requirement.
Karapace was initially researched, but for the purposes of this evaluation we proceeded with the Confluent Schema Registry. This was set up as a systemd user unit in the same way as the other Confluent components.
A binary distribution of OpenSearch 1.2.4 was extracted to
/home/btullis/src/datahub/opensearch on stat1008.
This was configured to run as a systemd user service, which simply executed
The only configuration required was to disable security.
Indices were pre-created according to the steps recorded here
The only issue was that the index lifecycle policy could not be applied, but this would not necessarily pose a significant problem for a prototype. We would be able to work out similar settings for a production version.
Graph Database Services
A binary distribution of Neo4j Community Edition version 4.0.6 was extracted to
/home/btullis/src/datahub/neo4j on stat1008.
This was configured to run as a systemd user service, which simply executed bin/neo4j console
The only configuration required was to set the default username/password to
datahub and to enable the bolt authentication mechanism.
Once all of the services were running, we could move on to the ingestion side. The recipes for ingestion are clearly explained and appear well-polished in comparison with the other systems evaluated.
All ingestion components use Python, so a simple conda environment was created and all plugins were installed using pip, for example:
pip install 'acryl-datahub[datahub-rest]'
Hive ingestion required a working pyhive configuration and used the user's own Kerberos ticket.
The following recipe ingested all of our Hive tables, with the exception of one which caused an error and had to be omitted.
source:
  type: "hive"
  config:
    host_port: analytics-hive.eqiad.wmnet:10000
    options:
      connect_args:
        auth: 'KERBEROS'
        kerberos_service_name: hive
    table_pattern:
      deny:
        - 'gage.webrequest_bad_json'
sink:
  type: "datahub-rest"
  config:
    server: 'http://localhost:8080'
This pipeline was then executed with the command
datahub ingest -c hive.yml
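The same recipe can also be expressed as a Python dictionary and run programmatically, which may be convenient if ingestion is later scheduled from code. The sketch below mirrors the recipe above. The helper names `hive_recipe` and `run_recipe` are hypothetical, and the `Pipeline` calls reflect the acryl-datahub Python API as we understand it for releases of this period, so they should be checked against the installed version.

```python
def hive_recipe(deny=("gage.webrequest_bad_json",)):
    """Build the Hive ingestion recipe from the text as a plain dictionary."""
    return {
        "source": {
            "type": "hive",
            "config": {
                "host_port": "analytics-hive.eqiad.wmnet:10000",
                "options": {
                    "connect_args": {
                        "auth": "KERBEROS",
                        "kerberos_service_name": "hive",
                    }
                },
                # Tables matching these patterns are skipped.
                "table_pattern": {"deny": list(deny)},
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }


def run_recipe(recipe):
    """Run a recipe with the DataHub ingestion framework.

    The import is deferred so that the recipe can be built and inspected on a
    machine without acryl-datahub installed.
    """
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()


if __name__ == "__main__":
    # Requires acryl-datahub with the hive plugin and a valid Kerberos ticket.
    run_recipe(hive_recipe())
```

Keeping the deny list as a parameter makes it easy to exclude further tables if more of them turn out to cause errors.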
We ingested Kafka topic names from the kafka-jumbo cluster. No schema information was associated with the topic names, although automatic schema association might be possible if we make more effective use of the schema registry component. The following recipe ingested the Kafka topics.
source:
  type: "kafka"
  config:
    connection:
      bootstrap: "kafka-jumbo1001.eqiad.wmnet:9092"
      schema_registry_url: http://localhost:8081
sink:
  type: "datahub-rest"
  config:
    server: 'http://localhost:8080'
Druid was particularly simple to ingest, given the lack of authentication on the data sources at the moment. We ingested data from both the analytics cluster and from the public cluster.
The result was 27 datasets with full schema information.
File:Datahub Evaluation Progress.png
Progress with DataHub was good. It would have been nice to spend a little more time looking at the Airflow-based ingestion and the support for lineage, but we moved on to other evaluation candidates after successfully ingesting Hive, Kafka, and Druid.
DataHub seems like a well-managed project with a vibrant community and solid backing from a commercial entity (LinkedIn), which has a proven track record of open-source project support.
It's true that it's a pre-1.0 product and that some features such as data quality reporting and granular authorization are not yet finished, but the project's roadmap shows that they are high on the agenda.
The community has been responsive to all of our questions and has offered to make their engineering staff available to us in a private Slack channel in support of our MVP.
DataHub has been proposed as the primary candidate to be taken forward to a full MVP phase and, hopefully, a subsequent production deployment.