Talk:Incidents/2019-04-02 0401KafkaJumbo

Revision as of 17:49, 8 April 2022 by imported>Krinkle (Krinkle moved page Talk:Incident documentation/2019-04-02 0401KafkaJumbo to Talk:Incidents/2019-04-02 0401KafkaJumbo)

[Luca] A couple of notes after the first pass of reading:

  • I don't think that Kafka stopped serving connections, since in my opinion the impact would have been much bigger. Some brokers were still up and running (while the others were OOMing), but of course they were not able to sustain all the traffic.
  • We need to define clear SLOs (service level objectives) with the SRE team for the Kafka Jumbo cluster. The incident report says that we were lucky to find somebody from the SRE team in PST to work on it, and it was clearly an emergency since Analytics traffic was being dropped. We (as Analytics) should have a clear definition of what level of service the Jumbo cluster should get, and have support from SRE accordingly. It is true that the Analytics team can count on two SREs in US/EU timezones, but as this incident report shows, two is sometimes not enough :)
  • As a follow-up to the item above: should a page be sent to SRE/Analytics if an event like this happens again?
  • Should we raise the heap size of the Kafka brokers a bit (currently 2G) to account for events like these? Of course, that would take a couple of gigabytes away from the page cache.