Analytics/Systems/Cluster/Gobblin

Revision as of 15:07, 29 July 2021 by Ottomata

Apache Gobblin is Hadoop ingestion software used at WMF primarily to import data from Kafka into HDFS.

Gobblin jobs

Gobblin jobs are declared in Puppet.

WMF's Gobblin fork

The Data Engineering team maintains a fork of Gobblin. We use this fork to maintain our own gobblin-wmf Gobblin module in the wmf branch. The gobblin-wmf module mostly contains code for interacting with Event Platform based events in Kafka. The master branch should track upstream.

Releasing new Gobblin versions

We upload our gobblin-wmf artifacts directly to Archiva, add them as git-fat jar files to analytics/refinery (see Analytics/Systems/Cluster/Deploy/Refinery), and deploy them as we do other jar artifacts in analytics/refinery.

We do not (as of 2021-07) have an automated release process for Gobblin. You must manually upload the packaged artifact jars to Archiva, then manually download them and git add them to analytics/refinery.
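
Because the download step is manual, it can help to know where a jar will sit inside a Maven-style repository such as Archiva. The sketch below only computes the relative path that the standard Maven repository layout assigns to an artifact. The function name and the coordinates used in the example (group ID, artifact ID, version) are hypothetical placeholders, not the real gobblin-wmf coordinates, and the repository base URL is deliberately left out.

```python
def maven_artifact_path(group_id, artifact_id, version,
                        classifier=None, extension="jar"):
    """Return the relative path of an artifact in the standard Maven
    repository layout: group/as/dirs/artifact/version/artifact-version.ext
    """
    # Dots in the group ID become directory separators.
    group_dir = group_id.replace(".", "/")
    # An optional classifier (for example "sources") is appended to the file name.
    suffix = f"-{classifier}" if classifier else ""
    filename = f"{artifact_id}-{version}{suffix}.{extension}"
    return f"{group_dir}/{artifact_id}/{version}/{filename}"


# Hypothetical coordinates, for illustration only.
example_path = maven_artifact_path("org.example.gobblin", "gobblin-wmf", "1.0.0")
print(example_path)
# org/example/gobblin/gobblin-wmf/1.0.0/gobblin-wmf-1.0.0.jar
```

Appending the returned path to the base URL of the repository you uploaded to gives the location to download the jar from before adding it to analytics/refinery.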