Incident documentation/2019-02-20 irc-outage

Revision as of 19:34, 31 March 2021 by imported>Krinkle (Krinkle moved page Incident documentation/20190220-irc-outage to Incident documentation/2019-02-20 irc-outage)

Summary

The IRC-based changes feed stopped pushing out changes for 67 minutes after systemd-journald was restarted as part of a security update.

Timeline

  • 11:01 - Moritz restarts systemd-journald on kraz.wikimedia.org as part of the rollout of a security update for systemd
  • 12:00 - User sDrewth reports on the #wikimedia-operations channel "is it known that the RC IRC feeds have stopped?"
  • 12:02 - Brian Wolff creates https://phabricator.wikimedia.org/T216607
  • 12:08 - Moritz restarts the ircecho service on kraz and changes are propagated again

Conclusions

  • The Icinga status of kraz was checked after the journald restart, but our monitoring did not raise an alert. Although it was broken, the udpmxircecho service continued to run, and our Icinga check only validates that the process is present, not that it is functional.
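
The difference between a presence check and an activity check can be sketched as follows. This is only an illustration, not the check that was deployed: the function name, the thresholds, and the idea of feeding in the timestamp of the last relayed message are all assumptions, and the source does not say how that timestamp would be obtained. The exit codes follow the usual Nagios/Icinga plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
# Hypothetical sketch of an activity-based check; names and thresholds are assumptions.

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_activity(last_message_ts, now, warn_after=300, crit_after=900):
    """Return (exit_code, message) based on seconds since the last relayed message.

    last_message_ts and now are Unix timestamps in seconds. How the last
    message time is collected is left open here (assumption).
    """
    age = now - last_message_ts
    if age >= crit_after:
        return CRITICAL, f"CRITICAL: no messages relayed for {age:.0f}s"
    if age >= warn_after:
        return WARNING, f"WARNING: no messages relayed for {age:.0f}s"
    return OK, f"OK: last message relayed {age:.0f}s ago"


if __name__ == "__main__":
    import sys
    import time

    # Example: pretend the last message was relayed 67 minutes ago.
    current = time.time()
    code, message = check_activity(current - 67 * 60, current)
    print(message)
    sys.exit(code)
```

With thresholds like these, a process that is still running but has relayed nothing for 67 minutes would be reported as CRITICAL, whereas a check that only looks for the process would stay green.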

Actionables

  • phab:T216607 - Restarting systemd-journald breaks ircecho service
  • phab:T216611 - Icinga check for ircecho should check for actual activity
  • phab:T185319 - IRC RecentChanges feed: code stewardship request (not implemented yet)