Revision as of 13:56, 1 October 2021 by Joal (Update details and add technical findings section.)

Apache Gobblin is Hadoop ingestion software used at WMF primarily to import data from Kafka into HDFS.

Until 2021, we used Camus for this purpose. T238400 has some information on how Gobblin was chosen as its replacement.

Gobblin jobs

Gobblin jobs are declared in puppet and their configuration is defined in refinery.

WMF's Gobblin fork

The Data Engineering team maintains a fork of Gobblin. The master branch tracks upstream, while our own gobblin-wmf module lives in the wmf branch. The gobblin-wmf module mostly contains code to ingest Kafka events into HDFS.

Releasing new Gobblin versions

Build the gobblin-wmf jar

In the Gobblin source folder, run: ./gradlew :gobblin-modules:gobblin-wmf:shadow. This creates multiple jars in the build/gobblin-wmf/libs subfolder; the one to use for execution is named like gobblin-wmf-0.16.0-wmf1-all.jar. The -all suffix indicates a fat jar, containing all the dependencies needed for execution.
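Since the libs subfolder contains several jars, it can help to select the fat jar programmatically. This is a minimal sketch, assuming the naming convention described above (a single *-all.jar per build); the helper name is hypothetical:

```python
from pathlib import Path

def pick_fat_jar(libs_dir: str) -> Path:
    """Return the '-all' (fat) jar from the Gradle build output.

    Assumes jars are named like gobblin-wmf-<version>-all.jar;
    raises if zero or several candidates are found.
    """
    candidates = list(Path(libs_dir).glob("*-all.jar"))
    if len(candidates) != 1:
        raise FileNotFoundError(
            f"expected exactly one *-all.jar in {libs_dir}, found {len(candidates)}"
        )
    return candidates[0]
```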

Upload the jar to Archiva

We upload our gobblin-wmf artifacts directly to Archiva, then add them as git-fat jar files in Analytics/Systems/Cluster/Deploy/Refinery and deploy them like our other jar artifacts with analytics/refinery.

We do not (as of 2021-07) have an automated release process for Gobblin. You must manually upload the packaged .jar artifacts to Archiva, then manually download and git add them to analytics/refinery.
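As an illustration of where the uploaded jar ends up, here is a sketch that builds the standard Maven repository layout path for an artifact. The groupId and version shown are hypothetical placeholders, and the actual upload (an authenticated request to Archiva) is left out:

```python
from typing import Optional

def maven_artifact_path(group_id: str, artifact_id: str, version: str,
                        classifier: Optional[str] = None, ext: str = "jar") -> str:
    """Build the repository-relative path a Maven artifact is stored at."""
    suffix = f"-{classifier}" if classifier else ""
    return (f"{group_id.replace('.', '/')}/{artifact_id}/{version}/"
            f"{artifact_id}-{version}{suffix}.{ext}")

# Hypothetical coordinates for the fat jar; the real groupId/version
# must match what refinery expects:
path = maven_artifact_path("org.wikimedia.gobblin", "gobblin-wmf",
                           "0.16.0-wmf1", classifier="all")
```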

Technical findings

Pulling from Kafka

We experienced regular inconsistencies in the volume of data pulled from Kafka during the first month of using Gobblin. More precisely, some pull tasks (executed in map containers) regularly pulled no data when they were expected to. This did not cause Gobblin job errors, only a delay in pulling some data. Unfortunately, because of how we trigger computation jobs after data is pulled, the problem led to data-loss alerts when it occurred just after the hour. We investigated and found why some tasks failed to pull data: a specific setting makes Gobblin time out (without error, just no data) when fetching its first batch of data from Kafka. Details can be found in that task:
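The failure mode follows from Kafka consumer semantics: if the fetch timeout elapses before the first batch is ready, the poll returns an empty record set rather than raising an error. The sketch below simulates that behavior with a hypothetical consumer; it is not Gobblin code, and the parameter names are illustrative only:

```python
def poll(first_batch_ready_after_ms: int, fetch_timeout_ms: int, batch: list) -> list:
    """Simulate a Kafka-style poll: if the first batch is not ready
    before the timeout, return no records -- no error is raised."""
    if first_batch_ready_after_ms > fetch_timeout_ms:
        return []  # silent: the pull task "succeeds" with zero records
    return batch

# With a timeout shorter than the first-fetch latency, the task
# finishes without error but also without data:
records = poll(first_batch_ready_after_ms=5000, fetch_timeout_ms=1000,
               batch=["event-1", "event-2"])
```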

Publishing metrics

Gobblin metrics are not easy to grasp. In particular, they embed a Context notion (as a tag) that indicates how deep in the Gobblin architecture the metric is generated. From our usage of Gobblin, here are some findings:

  • The Gobblin CLI process and every Gobblin map task are different metric contexts (the map tasks share the same job-id tag).
  • Metrics are generated in many contexts, making it hard to decide which ones to use (for our use case, contexts of the form metricContextName=gobblin.metrics.job_JOBNAME_JOBTS seem to be the ones summarizing all the values of interest).
  • Metrics are generated during task execution with intermediate values; the last generated metric event is flagged finalReport=true.
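Putting the last two findings together, a downstream consumer of Gobblin metric events typically keeps only the final report from the job-level context. A minimal sketch, assuming a simplified event shape (real events carry more tags):

```python
def final_job_metrics(events: list, job_name: str) -> list:
    """Keep only final-report metric events from the job-level context.

    Assumes each event is a dict with a 'tags' dict containing
    'metricContextName' and 'finalReport' entries, as observed above.
    """
    prefix = f"gobblin.metrics.job_{job_name}_"
    return [
        e for e in events
        if e["tags"].get("metricContextName", "").startswith(prefix)
        and e["tags"].get("finalReport") == "true"
    ]
```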