You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org
Apache Gobblin is Hadoop ingestion software used at WMF primarily to import data from Kafka into HDFS.
The Data Engineering team maintains a gobblin-wmf repository from which we add our gobblin extensions and build our gobblin fat jar for deployment. The gobblin-wmf-core module mostly contains code to ingest Kafka events to HDFS and to send metrics to Prometheus.
Releasing new Gobblin versions
Releasing is handled by a Jenkins job. Log into Jenkins and run the analytics-gobblin-wmf-maven-release-docker job. This will create a new git tag version, build the gobblin-wmf artifacts and upload them to Archiva.
To deploy, you will need to manually add the gobblin-wmf-core-jar-with-dependencies.jar file to analytics/refinery/artifacts, deploy following Analytics/Systems/Cluster/Deploy/Refinery, and then update the gobblin jar specified in puppet (here for analytics-hadoop, and here for analytics-test-hadoop).
A Gobblin Grafana dashboard exists to visualize some of these metrics.
- If the last successful run timestamp for a gobblin job is older than 2 hours, a critical alert will be fired to the Data-Engineering team
- If a gobblin job run reports any Kafka topic partitions for which the number of actual extracted records is different than the number of expected extracted records, a warning alert will be fired to the Data-Engineering team.
Pulling from Kafka
We experienced regular inconsistencies in volumes of data pulled from Kafka in the first month of usage of Gobblin. More precisely, it happened regularly that some pull-tasks (executed in map-containers) were actually not pulling any data while they were expected to do so. This didn't lead to Gobblin job-errors, just delay in pulling some data. Unfortunately, due to how we trigger computation jobs after data is pulled, the problem led to data-loss alerts when the phenomenon occurred just after the hour. We investigated and found why some tasks were failing to pull data: there is a specific settings making Gobblin time-out (without error, just no data) when fetching it's first batch of data from kafka. Details can be found in that task: https://phabricator.wikimedia.org/T290723.
Gobblin metric details
Gobblin metrics are not easy to grasp. In particular they embed a
Context notion (as tag) that details from how deep in the gobblin-architecture the metrics is generated. In our usage of gobblin, here are some findings:
- The Gobblin CLI process and every Gobblin map-task are different metrics context (the map-tasks have the same job-id tag)
- There are many contexts in which metrics are generated, making it not easy to define which to get (for our use case, it seems that contexts of the form: metricContextName=gobblin.metrics.job_JOBNAME_JOBTS are the ones summarizing all the values of interest).
- Metrics are generated during the task execution with intermediate values, and the last generated metric-event is flagged as