Analytics/Systems/Cluster/Gobblin
Apache Gobblin is Hadoop ingestion software used at WMF primarily to import data from Kafka into HDFS.
Until 2021, we used Camus for this purpose. T238400 has some information on how Gobblin was chosen as its replacement.
Gobblin jobs
Gobblin jobs are declared in Puppet.
WMF's Gobblin fork
The Data Engineering team maintains a fork of Gobblin. We use this fork to maintain our own Gobblin module, gobblin-wmf, in the wmf branch. The gobblin-wmf module mostly contains code for interacting with Event Platform based events in Kafka. The master branch should track upstream.
Releasing new Gobblin versions
We upload our gobblin-wmf artifacts directly to Archiva, then add them as git-fat jar files in Analytics/Systems/Cluster/Deploy/Refinery, and deploy them as we do other jar artifacts with analytics/refinery.
We do not (as of 2021-07) have an automated release process for Gobblin. You must manually upload the packaged artifact jars to Archiva, then manually download them and git add them to analytics/refinery.
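The last manual step (placing a downloaded jar into an analytics/refinery checkout and running git add on it) could be scripted roughly as follows. This is only a minimal sketch under stated assumptions, not an existing WMF tool: the function names, the "artifacts" subdirectory, and the example jar file name are all hypothetical, and the upload to Archiva and any git-fat specific steps are not covered.

```python
from pathlib import Path
from typing import List
import shutil


def stage_jar(jar: Path, refinery_checkout: Path,
              artifacts_subdir: str = "artifacts") -> Path:
    """Copy a jar already downloaded from Archiva into a refinery checkout.

    ASSUMPTION: "artifacts" is a hypothetical destination directory;
    check the real layout of analytics/refinery before using this.
    """
    dest_dir = refinery_checkout / artifacts_subdir
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / jar.name
    shutil.copy2(jar, dest)
    return dest


def git_add_command(refinery_checkout: Path, staged: Path) -> List[str]:
    """Build (but do not run) the `git add` command for the staged jar."""
    relative = staged.relative_to(refinery_checkout).as_posix()
    return ["git", "-C", str(refinery_checkout), "add", relative]
```

A possible use, with a hypothetical jar name: call stage_jar(Path("gobblin-wmf-example.jar"), Path("refinery")), then pass the result of git_add_command(...) to subprocess.run once you have checked it. Building the command as a list and leaving execution to the caller keeps the sketch safe to try as a dry run.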