Incidents/2019-02-20 irc-outage
Revision as of 17:47, 8 April 2022 by imported>Krinkle (Krinkle moved page Incident documentation/2019-02-20 irc-outage to Incidents/2019-02-20 irc-outage)
Summary
The IRC-based changes feed stopped pushing out changes for 67 minutes (11:01 to 12:08) after systemd-journald was restarted as part of a security update.
Timeline
- 11:01 - Moritz restarts systemd-journald on kraz.wikimedia.org as part of the rollout of a security update for systemd
- 12:00 - User sDrewth asks on the #wikimedia-operations channel: "is it known that the RC IRC feeds have stopped?"
- 12:02 - Brian Wolff creates https://phabricator.wikimedia.org/T216607
- 12:08 - Moritz restarts the ircecho service on kraz and changes are propagated again
Conclusions
- The Icinga status of kraz was checked after the journald restart, but monitoring did not raise an alert. Although it was broken, the udpmxircecho service continued to run, and our Icinga check only validates that the process is present, not that it is functional.
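To illustrate the difference between a process-presence check and an activity check, here is a minimal sketch of the classification logic an Icinga/Nagios-style plugin could use. It is not the check that was eventually deployed. The function name, the thresholds, and the idea of feeding it a "last relayed message" timestamp are all assumptions made for illustration; how that timestamp would be obtained from the service is left open. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard plugin convention.

```python
from typing import Optional, Tuple

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_feed_activity(
    last_message_ts: Optional[float],
    now: float,
    warn_after: float = 300.0,   # hypothetical threshold, in seconds
    crit_after: float = 900.0,   # hypothetical threshold, in seconds
) -> Tuple[int, str]:
    """Classify feed health by the age of the last relayed message.

    Hypothetical helper: a real check would tune the thresholds to the
    expected message rate and decide where the timestamp comes from.
    """
    if last_message_ts is None:
        return UNKNOWN, "UNKNOWN: no activity timestamp available"

    age = now - last_message_ts
    if age < 0:
        # A timestamp in the future usually points at clock skew.
        return UNKNOWN, "UNKNOWN: last activity is in the future"
    if age >= crit_after:
        return CRITICAL, f"CRITICAL: no message relayed for {age:.0f}s"
    if age >= warn_after:
        return WARNING, f"WARNING: no message relayed for {age:.0f}s"
    return OK, f"OK: last message relayed {age:.0f}s ago"
```

Under these assumed thresholds, and assuming messages were flowing until shortly before the 11:01 restart, such a check would have turned critical around 11:16 instead of the outage first being noticed by a user at 12:00. A process-presence check stays green throughout, because the process never exited.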
Actionables
- phab:T216607 - Restarting systemd-journald breaks ircecho service
- phab:T216611 - Icinga check for ircecho should check for actual activity
- phab:T185319 - IRC RecentChanges feed: code stewardship request (not implemented yet)