Data Catalog Application Evaluation/Rubric/DataHub

=== Core Service and Dependency Setup ===
[[File:Datahub architecture.png|frameless|600x600px|right]]DataHub was downloaded from https://github.com/linkedin/datahub/ onto stat1008 and tag 0.8.24 was checked out.
 
The build process required internet access, so the web proxy settings had to be supplied in several places. The build uses Gradle and the command was something like this:
 
<code>./gradlew -Dhttp.proxyHost=webproxy -Dhttp.proxyPort=8080 -Dhttps.proxyHost=webproxy -Dhttps.proxyPort=8080 "-Dhttp.nonProxyHosts=127.0.0.1|localhost|*.wmnet" build</code>
 
Any problems with the build on stat1008 were worked around by carrying out the build on a workstation instead.
 
DataHub does not have a supported deployment method that doesn't use containers (i.e. Docker), so in order to complete the setup of each of the required components, the steps from each Dockerfile were carried out manually.
 
==== Core DataHub Services ====
All DataHub components have an option to enable a Prometheus JMX exporter, but this was not configured as part of the evaluation.
 
===== Metadata Service (GMS) =====
This runs as a Jetty web application. The daemon is managed by a systemd user service that uses the following key configuration.
EnvironmentFile=/home/btullis/src/datahub/datahub/docker/datahub-gms/env/local.env
ExecStart=/usr/bin/java $JAVA_OPTS $JMX_OPTS -jar jetty-runner.jar --jar jetty-util.jar --jar jetty-jmx.jar ./war.war
WorkingDirectory=/home/btullis/src/datahub/datahub/docker/datahub-gms
This service listens on port 8080.
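The referenced <code>local.env</code> file holds the connection settings that point GMS at the backing services. The following is only a representative sketch using the variable names from the DataHub Docker images; the actual values used on stat1008 are not reproduced here.
KAFKA_BOOTSTRAP_SERVER=localhost:9092
KAFKA_SCHEMAREGISTRY_URL=<nowiki>http://localhost:8081</nowiki>
ELASTICSEARCH_HOST=localhost
ELASTICSEARCH_PORT=9200
NEO4J_URI=bolt://localhost
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=datahub
GRAPH_SERVICE_IMPL=neo4j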
 
===== Frontend Service =====
This is a combination of a [https://www.playframework.com/ Play Framework] application with a React frontend. Similarly to the GMS service, it is controlled by a systemd user service with the following configuration.
EnvironmentFile=/home/btullis/src/datahub/datahub/docker/datahub-frontend/env/local.env
ExecStart=/home/btullis/src/datahub/datahub/docker/datahub-frontend/datahub-frontend/bin/playBinary
WorkingDirectory=/home/btullis/src/datahub/datahub/docker/datahub-frontend
This service listens on port 9000.
 
It uses JAAS for authentication. Initially it uses a flat file with a fixed <code>datahub</code>/<code>datahub</code> username and password, but we could use LDAP for this, or possibly CAS.
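As a rough illustration of the LDAP option, the flat-file login module in the frontend's JAAS configuration could be swapped for the JDK's <code>LdapLoginModule</code>. Everything below is an untested sketch: the login context name, LDAP server, and search base are assumptions rather than configuration we actually applied.
WHZ-Authentication {
  com.sun.security.auth.module.LdapLoginModule sufficient
    userProvider="ldaps://ldap-ro.example.wmnet:636/ou=people,dc=wikimedia,dc=org"
    userFilter="(&(uid={USERNAME})(objectClass=person))"
    useSSL=true;
};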
 
===== Metadata Change Event (MCE) Consumer Job =====
[[File:Datahub-ingestion-architecture.png|thumb|DataHub ingestion architecture]]
This is a Kafka consumer that works on the ingestion side for DataHub. It reads jobs from a Kafka topic and applies the change to the persistent storage back-end. It then enqueues another job for the MAE consumer job to pick up.
 
===== Metadata Audit Event (MAE) Consumer Job =====
This is a Kafka consumer that is more related to the serving side of DataHub. It picks up MAE jobs from the Kafka topic and updates the search indices and the graph database.
[[File:Datahub-serving.png|thumb|DataHub Serving Architecture]]
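Neither consumer job is shown with a unit file above. If they were run in the same style as the GMS service, a unit might look roughly like the following; the jar name, the <code>local.env</code> file, and the paths are assumptions based on the Gradle build layout, not the units that were actually used.
EnvironmentFile=/home/btullis/src/datahub/datahub/docker/datahub-mce-consumer/env/local.env
ExecStart=/usr/bin/java $JAVA_OPTS $JMX_OPTS -jar metadata-jobs/mce-consumer-job/build/libs/mce-consumer-job.jar
WorkingDirectory=/home/btullis/src/datahub/datahub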
 
==== Confluent Platform Services ====
A binary release of the Confluent Platform 5.4 was extracted to <code>/home/btullis/src/datahub/confluent</code> on stat1008 and this was used to run the following components with a default configuration.
 
===== Zookeeper =====
We ran a local ZooKeeper instance on stat1008 as a systemd user service with the following configuration:
Environment="JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64"
ExecStart=/home/btullis/src/datahub/confluent/bin/zookeeper-server-start etc/kafka/zookeeper.properties
ExecStop=/home/btullis/src/datahub/confluent/bin/zookeeper-server-stop
WorkingDirectory=/home/btullis/src/datahub/confluent/
 
===== Kafka =====
We ran a standalone broker on stat1008 using a systemd user service with the following configuration:
Environment="JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64"
ExecStart=/home/btullis/src/datahub/confluent/bin/kafka-server-start etc/kafka/server.properties
ExecStop=/home/btullis/src/datahub/confluent/bin/kafka-server-stop
WorkingDirectory=/home/btullis/src/datahub/confluent/
The required topics were created using the steps recorded [[phab:T299703#7659303|here]].
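For illustration, topic creation with the Confluent tooling follows this pattern; the full topic list for this DataHub version is in the linked task, so the topic names below are examples only.

<code>bin/kafka-topics --create --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1 --topic MetadataChangeEvent_v4</code>

<code>bin/kafka-topics --create --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1 --topic MetadataAuditEvent_v4</code>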
 
===== Schema Registry =====
A ''Schema Registry'' is a required component of DataHub, but we have an issue because the Confluent implementation is released under the Confluent Community License, which is cost-free but not sufficiently permissive for us to be able to use it. We have been discussing with the DataHub developers whether there is a workaround for this requirement. They have suggested one workaround, which is to use [https://github.com/aiven/karapace Karapace], and they have also created a [https://feature-requests.datahubproject.io/b/Developer-Experience/p/remove-required-dependency-on-confluent-schema-registry feature request to obviate the requirement].
 
Karapace was initially researched, but for the purposes of this evaluation we proceeded with the Confluent Schema Registry. This was set up as a systemd user unit in the same way as the other Confluent components.
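For completeness, a unit in the same style as the ZooKeeper and Kafka services would look roughly like this. It is a sketch of what is described above rather than a copy of the unit that was actually used.
Environment="JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64"
ExecStart=/home/btullis/src/datahub/confluent/bin/schema-registry-start etc/schema-registry/schema-registry.properties
ExecStop=/home/btullis/src/datahub/confluent/bin/schema-registry-stop
WorkingDirectory=/home/btullis/src/datahub/confluent/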
 
==== Search Services ====
A binary distribution of OpenSearch 1.2.4 was extracted to <code>/home/btullis/src/datahub/opensearch</code> on stat1008.
 
This was configured to run as a systemd user service, which simply executed <code>bin/elasticsearch</code>
 
The only configuration required was to disable security.
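In OpenSearch this amounts to a single setting in <code>config/opensearch.yml</code> (assuming the bundled security plugin is present; this is the standard OpenSearch option rather than the exact change we recorded):
plugins.security.disabled: true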
 
Indices were pre-created according to the steps recorded [[phab:T299703#7659361|here]].
 
The only issue was that the index lifecycle policy could not be applied, but this would not necessarily pose a significant problem for a prototype. We would be able to work out similar settings for a production version.
 
==== Graph Database Services ====
A binary distribution of Neo4J community edition version 4.0.6 was extracted to <code>/home/btullis/src/datahub/neo4j</code> on stat1008.
 
This was configured to run as a systemd user service, which simply executed <code>bin/neo4j console</code>
 
The only configuration required was to set the default username/password to <code>neo4j</code>/<code>datahub</code> and to enable the bolt authentication mechanism.
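As an illustration of those two changes, the initial password for the default <code>neo4j</code> user can be set with <code>bin/neo4j-admin set-initial-password datahub</code>, and authentication is controlled from <code>conf/neo4j.conf</code>. The settings below simply mirror the description above; the exact configuration used was not recorded here.
dbms.security.auth_enabled=true
dbms.connector.bolt.enabled=true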
 
=== Ingestion Configuration ===
Once all of the services were running, we could move on to the ingestion side. The recipes for ingestion are clearly explained and appear well polished in comparison with the other systems evaluated.
 
All ingestion components use Python, so a simple conda environment was created and all plugins were installed using pip, for example:
 
<code>pip install 'acryl-datahub[datahub-rest]'</code>
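A fuller sketch of the environment setup would be along these lines; the environment name and the exact list of plugin extras are illustrative rather than a record of what was run.

<code>conda create -n datahub python=3.9</code>

<code>conda activate datahub</code>

<code>pip install 'acryl-datahub[datahub-rest,hive,kafka,druid]'</code>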
 
==== Hive Ingestion ====
This method used a connection to the ''Hive Server2'' server, as opposed to the Metastore (as used by [[Data Catalog Application Evaluation Rubric/Atlas|Atlas]]) or to Hive's MySQL database (as used by [[Data Catalog Application Evaluation Rubric/Amundsen|Amundsen]]).
 
It required a working pyhive configuration and used a user's own Kerberos ticket.
 
The following recipe ingested all of our Hive tables, with the exception of one which caused an error and had to be omitted.
source:
  type: "hive"
  config:
    host_port: analytics-hive.eqiad.wmnet:10000
    options:
      connect_args:
        auth: 'KERBEROS'
        kerberos_service_name: hive
    table_pattern:
      deny:
        - 'gage.webrequest_bad_json'
sink:
  type: "datahub-rest"
  config:
    server: '<nowiki>http://localhost:8080</nowiki>'
This pipeline was then executed with the command <code>datahub ingest -c hive.yml</code>.
 
==== Kafka Ingestion ====
We ingested Kafka topic names from the kafka-jumbo cluster. No schema information was associated with the topic names, although automatic schema association might be possible to add if we make more effective use of the schema registry component. The following recipe ingested the Kafka topics.
source:
  type: "kafka"
  config:
    connection:
      bootstrap: "kafka-jumbo1001.eqiad.wmnet:9092"
      schema_registry_url: <nowiki>http://localhost:8081</nowiki>
sink:
  type: "datahub-rest"
  config:
    server: '<nowiki>http://localhost:8080</nowiki>'
 
==== Druid Ingestion ====
Druid was particularly simple to ingest, given the lack of authentication on the data sources at the moment. We ingested data from both the analytics cluster and from the public cluster.
 
The result was 27 datasets with full schema information.
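The recipe itself was not recorded on this page, but based on the Hive and Kafka recipes above it would have been along the following lines. The broker host and port are placeholders, not the endpoints actually used.
source:
  type: "druid"
  config:
    host_port: "druid-broker.example.wmnet:8082"
sink:
  type: "datahub-rest"
  config:
    server: '<nowiki>http://localhost:8080</nowiki>'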
 
=== Progress Status ===
[[File:Datahub Evaluation Progress.png|border|frameless]]
Progress with DataHub was good. It would have been nice to have been able to spend a little more time looking at the Airflow-based ingestion and the support for lineage, but we moved on to other evaluation candidates after successfully ingesting Hive, Kafka, and Druid.
 
=== Perceptions ===
DataHub seems like a well-managed project with a vibrant community and solid backing from a commercial entity (LinkedIn) which has a proven track record of open-source project support.
 
It's true that it's a pre-1.0 product and that some features such as data quality reporting and granular authorization are not yet finished, but the project's roadmap shows that they are high on the agenda.
 
The community has been responsive to all of our questions and has offered to make their engineering staff available to us in a private Slack channel in support of our MVP.
 
=== Outcome ===
DataHub has been proposed as the primary candidate to be taken forward to a full MVP phase and hopefully a subsequent production deployment.
