You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org
Kafka: Difference between revisions
(Change 404'd apt link in intro to current apt URL. Not future-proof as it's pointing to bullseye specifically but it's better than nothing)
|Line 1:||Line 1:|
[https://kafka.apache.org/ Apache Kafka] is a scalable and durable distributed logging buffer. We currently run 7 Kafka clusters using [https://www.confluent.io/ Confluent's] [https://apt.wikimedia.org/wikimedia/
[https://kafka.apache.org/ Apache Kafka] is a scalable and durable distributed logging buffer. We currently run 7 Kafka clusters using [https://www.confluent.io/ Confluent's] [https://apt.wikimedia.org/wikimedia//thirdparty/confluent/ Kafka distribution].
== Administration ==
== Administration ==
Revision as of 21:36, 17 June 2022
See the Kafka Administration page for administration tips and documentation.
We use Kafka MirrorMaker to handle multi datacenter replication of Kafka topics. There are various replication topologies to choose from. As of 2021-04, only our Kafka main cluster is truly multi-DC. Here's how it works:
Producers always write to their local Kafka cluster. Producers must prefix topic names with the datacenter they are in. E.g. a producer in eqiad that is producing a stream named 'application.state-change' should produce to a Kafka topic named 'eqiad.application.state-change'.
Kafka MirrorMaker is in each datacenter is configured to pull all prefixed topics from the alternate datacenter. That is, Kafka MirrorMaker running in codfw pulls all topics in Kafka main-eqiad that are prefixed with 'eqiad.'. Conversely, Kafka MirrorMaker running in eqiad pulls all topics in Kafka main-codfw that are prefixed with 'codfw.' Non DC-prefixed topics are not replicated.
Our six clusters are,
The Kafka jumbo cluster is the replacement for the Kafka analytics cluster. The bulk of the data on this cluster is from webrequests. This cluster is also used directly for Analytics EventLogging, Discovery Analytics, statsv and EventStreams. Much of the data here is imported into Hadoop using Gobblin. Data from main clusters is mirrored to this cluster for Analytics purposes.
This cluster is intended to be used for high volume Analytics, as well as other non production critical services. If you are building a production level (e.g. paging on a holiday) service that uses Kafka, you should use the main Kafka clusters. All (almost) topics from the main clusters are mirrored here via MirrorMaker instances running on each of the broker nodes. These MirrorMaker instances are called main-eqiad_to_jumbo-eqiad, and as such they mirror topics from the main-eqiad cluster.
main (eqiad and codfw)
The 'main' Kafka clusters in eqiad and codfw are mirrors of each other. The 'main' clusters should be used for low volume critical production services. Main is currently used directly by Event_Platform/EventGate and change-propagation.
MirrorMaker instance run on each broker node, and consume from the remote data center. On main-eqiad, these instances are called main-codfw_to_main-eqiad, and on main-codfw, they are callsed main-eqiad_to_main-codfw.
Only topics prefixed by the appropriate datacenter names are mirrored between the two main clusters. I.e. only eqiad.* topics are mirrored from main-eqiad -> main-codfw, and vice versa.
logging (eqiad and codfw)
The 'logging' Kafka clusters in eqiad and codfw are a component of the ELK logging pipeline.
Hosts produce to this Kafka cluster by way of rsyslog omkafka, and logstash is the consumer. This provides a buffering layer to smooth load on the logstash collectors, and prevent lost log messages in the event that logstash crashes or is unable to cope with the load.
Multi-site for this cluster is designed so that logstash collectors in eqiad and codfw may consume from both kafka clusters, and agents may produce to only the nearest cluster. Currently the Kafka brokers for this cluster run on the same hardware as the logstash Elasticsearch instances.
Used to test kafka changes, such as rebalancing partitions and upgrading the Kafka version.
kafaktee, available on centrallog1001, is a replacement for udp2log that consumes from Kafka instead of from the udp2log firehose. It consumes, samples, and filters the webrequest to files for easy grepping and troubleshooting. Fundraising also runs an instance of Kafkatee that feeds webrequest logs into banner analysis logic.
- Kafka Dashboard
- Kafka By Topic Dashboard
- Kafka MirrorMaker Dashboard
- VarnishKafka Dashboard
- Logging Dashboard
Kafka is puppetized in order to be able to spawn up arbitrary clusters in labs. Here's how.
You'll need a running Zookeeper and Kafka broker. These instructions show how to set up a single node Zookeeper and Kafka on the same host.
Create a new Jessie labs instance. In this example we have named our new instance 'kafka1' and it is in the 'analytics' project. Thus the hostname is kafka1.analytics.eqiad1.wikimedia.cloud. Wait for the instance to spawn and finish its first puppet run. Make sure you can log in.
Edit hiera data for your project and set the following:
zookeeper_cluster_name: my-zookeeper-cluster zookeeper_clusters: my-zookeeper-cluster: hosts: kafka1.analytics.eqiad1.wikimedia.cloud: "1" kafka_clusters: my-kafka-cluster: zookeeper_cluster_name: my-zookeeper-cluster brokers: kafka1.analytics.eqiad1.wikimedia.cloud: id: "1"
Go to the configure instance page for your new instance, and check the following boxes to include needed classes:
Run puppet on your new instance. Fingers crossed and you should have a new Kafka broker running.
To verify, log into your instance and run
kafka topics --create --topic test --partitions 2 --replication-factor 1 kafka topics --describe
If this succeeds, you will have created a topic in your new single node Kafka cluster.
Kafka clients usually take a list of brokers and/or a zookeeper connect string in order to work with Kafka. In this example, those would be:
- broker list: kafka1.analytics.eqiad1.wikimedia.cloud:9092
- zookeeper connect: kafka1.analytics.eqiad1.wikimedia.cloud:2181/kafka/my-kafka-cluster
Note that the zookeeper connect URL contains a path that has the value of kafka_cluster_name in it. You should substitute this for whatever you named your cluster in your hiera config.
How do I ...
Easiest CLI to interact with Kafka is kafkacat. This can be executed in several modes (e.g. consumer, producer, etc.) See kafkacat --help for more info.
List available Kafka topics
stat1007$ kafkacat -L -b kafka-jumbo1001.eqiad.wmnet:9092 | grep topic
stat1007$ kafkacat -C -b kafka-jumbo1001.eqiad.wmnet:9092 -t test
stat1007$ cat test_message.txt Hola Mundo stat1007$ cat test_message.txt | kafkacat -P -b kafka-jumbo1001.eqiad.wmnet:9092 -t test
Consume with a Python Kafka client
Consume using Spark Streaming
Consume avro schema from kafka
NOTE: WMF no longer actively uses Avro in Kafka. (The mediawiki_CirrusSearchRequestSet does not have any data.)
from kafka import KafkaConsumer import avro.schema import avro.io import io # To consume messages consumer = KafkaConsumer('mediawiki_CirrusSearchRequestSet', group_id='my_group', metadata_broker_list=['kafka-jumbo1001:9092']) schema_path="/home/madhuvishy/avro-kafka/CirrusSearchRequestSet.avsc" schema = avro.schema.parse(open(schema_path).read()) for msg in consumer: bytes_reader = io.BytesIO(msg.value) decoder = avro.io.BinaryDecoder(bytes_reader) reader = avro.io.DatumReader(schema) data = reader.read(decoder) print data