Difference between revisions of "Analytics/Systems/Cluster/Gobblin"

Latest revision as of 22:21, 3 August 2021

Apache Gobblin is Hadoop ingestion software used at WMF primarily to import data from Kafka into HDFS.

Until 2021, we used Camus for this purpose. T238400 has some information on how Gobblin was chosen as its replacement.

Gobblin jobs

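Gobblin jobs are declared in Puppet (https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/analytics/refinery/job/gobblin.pp).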

WMF's Gobblin fork

The Data Engineering team maintains a fork of Gobblin (https://gerrit.wikimedia.org/g/analytics/gobblin). We use this fork to maintain our own gobblin-wmf module in the wmf branch. The gobblin-wmf module mostly contains code for interacting with Event Platform based events in Kafka. The master branch should track upstream.

Releasing new Gobblin versions

We upload our gobblin-wmf artifacts directly to Archiva, then add them as git-fat jar files to analytics/refinery (see Analytics/Systems/Cluster/Deploy/Refinery) and deploy them as we do other jar artifacts in analytics/refinery.

As of 2021-07, we do not have an automated release process for Gobblin. You must manually upload the packaged artifact jars to Archiva, then manually download them and git add them to analytics/refinery.
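
Because the download and git add steps are manual, it can help to check that the jar you are about to commit is the same file that was uploaded. The following is a minimal sketch, not part of analytics/refinery or any existing WMF tooling. It assumes you have obtained the expected SHA-1 digest separately (Maven-style repositories commonly publish a .sha1 file next to each artifact). The function names and the jar file name in the usage note are hypothetical.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-1 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as f:
        # Read fixed-size chunks so large jars are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def jar_matches_checksum(jar_path: Path, expected_sha1: str) -> bool:
    """Compare the local jar's digest with an expected digest string.

    `expected_sha1` is assumed to be the contents of a checksum file, which
    may be either "<digest>" or "<digest>  <filename>"; only the first
    whitespace-separated token is used.
    """
    tokens = expected_sha1.split()
    expected = tokens[0].lower() if tokens else ""
    return sha1_of_file(jar_path) == expected
```

For example, with a hypothetical downloaded file gobblin-wmf.jar and the digest text stored in the variable expected, calling jar_matches_checksum(Path("gobblin-wmf.jar"), expected) returns True only if the local file's SHA-1 matches, which is a reasonable check to run before git add.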