Apache Gobblin is Hadoop ingestion software used at WMF primarily to import data from Kafka into HDFS.
WMF's Gobblin fork
The Data Engineering team maintains a fork of Gobblin. The master branch tracks upstream, and our own gobblin-wmf module lives in the wmf branch. The gobblin-wmf module mostly contains code to ingest Kafka events into HDFS.
Releasing new Gobblin versions
Build the gobblin-wmf jar
In the Gobblin source folder, run:

./gradlew :gobblin-modules:gobblin-wmf:shadow

This creates multiple jars in the build/gobblin-wmf/libs subfolder. The one to use for execution is the fat-jar, which contains all the dependencies needed to run it.
Upload the jar to Archiva
We upload our gobblin-wmf artifacts directly to Archiva, and then add them as git-fat jar files in Analytics/Systems/Cluster/Deploy/Refinery, and deploy them like we do other jar artifacts with analytics/refinery.
We do not (as of 2021-07) have an automated release process for Gobblin. You must manually upload the packaged artifact .jars to archiva, and manually download and git add them to analytics/refinery.
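The manual upload step can be done with Maven's standard deploy:deploy-file goal, which pushes a pre-built jar to a repository such as Archiva. This is a hedged sketch: the group id, artifact version, repository id and URL below are illustrative assumptions, not the canonical WMF values, so check the actual Archiva repository settings before using it.

```shell
# Hypothetical values -- adjust version, group id, repository id and URL
# to match the actual Archiva setup before running.
VERSION=1.0.0
JAR="build/gobblin-wmf/libs/gobblin-wmf-${VERSION}.jar"

# The command is echoed first so it can be reviewed; drop the leading
# "echo" to actually upload the jar to Archiva.
echo mvn deploy:deploy-file \
  -DrepositoryId=archiva.releases \
  -Durl=https://archiva.wikimedia.org/repository/releases/ \
  -DgroupId=org.wikimedia.gobblin \
  -DartifactId=gobblin-wmf \
  -Dversion="${VERSION}" \
  -Dpackaging=jar \
  -Dfile="${JAR}"
```

After the upload, the jar still has to be pulled down and git-added to analytics/refinery as a git-fat file, as described above.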
Pulling from Kafka
We experienced regular inconsistencies in the volumes of data pulled from Kafka during our first month of using Gobblin. More precisely, some pull-tasks (executed in map containers) regularly pulled no data when they were expected to. This did not cause Gobblin job errors, only delays in pulling some data. Unfortunately, because of how we trigger computation jobs after data is pulled, the problem led to data-loss alerts when it occurred just after the hour. We investigated and found why some tasks were failing to pull data: a specific setting makes Gobblin time out (without error, just no data) when fetching its first batch of data from Kafka. Details can be found in this task: https://phabricator.wikimedia.org/T290723.
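As a mitigation sketch, the Kafka fetch timeout can be raised in the Gobblin job properties so that the first (slow) fetch of a pull-task has time to complete. The property name below is an assumption based on Gobblin's Kafka source options; confirm the exact setting and value against the Phabricator task before relying on it.

```properties
# Assumed property name and value: give a single Kafka fetch more time
# before Gobblin silently gives up and returns no data (default is low).
source.kafka.fetchTimeoutMillis=5000
```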
Gobblin metrics are not easy to grasp. In particular, they embed a Context notion (as a tag) that indicates how deep in the Gobblin architecture the metric is generated. In our usage of Gobblin, here are some findings:
- The Gobblin CLI process and every Gobblin map-task are different metric contexts (the map-tasks share the same job-id tag)
- Metrics are generated in many contexts, making it hard to decide which ones to read (for our use case, contexts of the form metricContextName=gobblin.metrics.job_JOBNAME_JOBTS seem to be the ones summarizing all the values of interest)
- Metrics are generated during the task execution with intermediate values, and the last generated metric-event is flagged as
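Given those findings, the job-level metric events can be isolated by filtering on the metricContextName tag. This is a hedged sketch: metric-events.jsonl, the field layout, and the sample values are assumptions standing in for an actual dump of Gobblin metric events, one JSON object per line.

```shell
# Hypothetical sample: two metric events, one from the job-level context
# and one from a task-level context.
printf '%s\n' \
  '{"metricContextName":"gobblin.metrics.job_webrequest_1630000000","processedRecords":42}' \
  '{"metricContextName":"gobblin.metrics.task_webrequest_0","processedRecords":7}' \
  > metric-events.jsonl

# Keep only the events emitted from the job-level context, whose name is
# of the form gobblin.metrics.job_JOBNAME_JOBTS:
grep '"metricContextName": *"gobblin\.metrics\.job_' metric-events.jsonl
```

The same pattern works for whichever transport the metric events are sent to (files, Kafka, etc.), since the context name travels with each event.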