Incidents/20160128-MediaWiki-API

Latest revision as of 17:24, 10 August 2022

Summary

A Kafka broker was rebooted as part of a standard upgrade. The MediaWiki API cluster failed as a result, overwhelmed by connection attempts piling up against that particular broker. As part of the incident response the broker was depooled from mediawiki-config, and service was restored 25 minutes after it first started failing.

Timeline

  • 12:53 Luca follows the safe broker restart process and stops Kafka on kafka1012 so the host can be rebooted for a kernel upgrade
  • 13:36 Multiple mw* hosts report HHVM Rendering "Socket timeout after 10 seconds" failures; multiple opsen respond
  • 13:49 Giuseppe notices many HHVM threads stuck attempting to connect, then finds a large number of connections to kafka1012 in the SYN_SENT state
  • 13:53 Faidon pushes a patch to remove kafka1012 from mediawiki-config and syncs it to the appserver fleet
  • 13:58 File is synced across the fleet, traffic recovers
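
The 13:49 observation (many connections to one broker stuck in SYN_SENT) is the kind of signal that can be tallied from socket listings. The following is a minimal sketch, not the tooling used during the incident: it assumes text in the column layout printed by `ss -tan` (State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port), and the function name `count_syn_sent` is hypothetical.

```python
from collections import Counter


def count_syn_sent(ss_output: str) -> Counter:
    """Count sockets in the SYN-SENT state, keyed by peer address:port.

    Assumes each data line has the columns:
    State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
    """
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Header and malformed lines do not start with "SYN-SENT" and are skipped.
        if len(fields) >= 5 and fields[0] == "SYN-SENT":
            counts[fields[4]] += 1
    return counts
```

Feeding it the captured output of `ss -tan` and calling `.most_common(5)` on the result would list the peers with the most half-open connection attempts, which is one way a single unreachable broker could stand out.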

Actionables

  • MediaWiki monolog doesn't handle Kafka failures gracefully (bug T125084)
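
As an illustration of what handling a broker failure gracefully could look like, here is a minimal Python sketch of a best-effort log sender. It is not the MediaWiki or monolog implementation and is not the fix tracked in T125084. The class name `FailFastSender`, its parameters, and the default values are all hypothetical. The idea is to bound each connection attempt with a short timeout and, after repeated failures, to stop trying for a cooldown period so that request threads do not pile up waiting on an unreachable host.

```python
import socket
import time


class FailFastSender:
    """Best-effort log shipper that drops messages instead of blocking callers."""

    def __init__(self, host, port, connect_timeout=0.2, max_failures=3,
                 cooldown=30.0, connect=socket.create_connection,
                 clock=time.monotonic):
        self.addr = (host, port)
        self.connect_timeout = connect_timeout  # seconds allowed per attempt
        self.max_failures = max_failures        # consecutive failures before pausing
        self.cooldown = cooldown                # seconds to skip the network entirely
        self._connect = connect                 # injectable for testing
        self._clock = clock                     # injectable for testing
        self._failures = 0
        self._open_until = 0.0
        self.dropped = 0

    def send(self, payload: bytes) -> bool:
        """Try to ship one payload; return False if it was dropped."""
        # While the breaker is open, do not touch the network at all.
        if self._clock() < self._open_until:
            self.dropped += 1
            return False
        try:
            with self._connect(self.addr, timeout=self.connect_timeout) as conn:
                conn.sendall(payload)
        except OSError:
            # Covers refused connections and timeouts alike.
            self._failures += 1
            self.dropped += 1
            if self._failures >= self.max_failures:
                self._open_until = self._clock() + self.cooldown
                self._failures = 0
            return False
        self._failures = 0
        return True
```

The trade-off in this sketch is deliberate: log messages are lost while the destination is down, in exchange for the calling request never waiting longer than `connect_timeout` on logging.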