You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org

Incident documentation/2021-02-26 sudo

From Wikitech-static
Revision as of 19:54, 5 March 2021 by imported>Krinkle (Krinkle moved page Incident documentation/2021-03-03 sudo to Incident documentation/2021-02-26 sudo: Happened Feb 26, nor Mar 3)

document status: in-review

Summary

A patch containing a syntax error in the /etc/sudoers file was merged and deployed to all hosts. As a result, sudo did not work for the period shown in the timeline below. This mostly affected Nagios check execution (alerting) and generated a flood of mail to root. As a consequence, mail delivery was also overloaded and delayed.
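
A syntax check before deployment is one way to catch this class of error. The sketch below is an assumption for illustration, not the check Wikimedia actually uses: it wraps `visudo -c -f`, which parses a sudoers file without installing it, in a hypothetical helper named `check_sudoers`. The `VISUDO` variable is also hypothetical and exists only so the command can be swapped out.

```shell
#!/bin/sh
# Hypothetical helper: refuse to deploy a sudoers file that does not parse.
# VISUDO is an assumed override hook; it defaults to the real visudo binary.
VISUDO="${VISUDO:-visudo}"

check_sudoers() {
    candidate="$1"
    # -c: check-only mode; -f: check this file instead of /etc/sudoers.
    if "$VISUDO" -c -f "$candidate" >/dev/null 2>&1; then
        echo "OK: $candidate"
    else
        echo "REJECTED: $candidate" >&2
        return 1
    fi
}
```

For example, `check_sudoers ./sudoers.new && echo "safe to deploy"` would only print the second message when the candidate file parses cleanly.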

Timeline

  • 08:50 666899 is merged, containing a syntax error in /etc/sudoers
  • 08:51 People report on IRC that they are unable to sudo on db1107 and other hosts due to a parse error (>>> /etc/sudoers: syntax error near line 6 <<<)
  • 08:52 Hundreds of emails start arriving at root@ with the subject *** SECURITY information for <hostname> *** (sudo failures)
  • 08:55 <jbond42> !log disabled puppet pending rollback of https://gerrit.wikimedia.org/r/666899
  • 08:59 klausman merges 667110, containing a fix, and runs puppet-merge soon after.
  • 09:00 Incident opened. jynus becomes IC.
  • 09:06 Puppet reenabled
  • 09:12 Start reenabling puppet fleetwide
  • 09:23 Puppet run at 10%
  • 09:37 Puppet run at 30%
  • 09:50 Puppet run at 50%
  • 10:17 Puppet run at 80%
  • ~10:20 UNKNOWN Nagios alerts are gone
  • 10:33 puppet run finished
  • 11:40: jbond takes over as IC
  • 11:45: mx2001 queue has remained static at 4344 for 20 minutes
  • 11:45: mx1001 queue is draining at 0-3 msgs/sec
  • 12:55: run `exiqgrep -i -o 7200 -y 10800 -f 'root@wikimedia.org' | xargs exim -Mrm` on mx servers
  • 12:55: queue down to ~ 2000 (891 frozen) msgs on mx1001 and 500 (434 frozen) on mx2001
  • 13:02: Still receiving 450-4.2.1 from Gmail for a number of recipients
  • 13:30: reports of flood emails slowing down
  • 13:43: message queue excluding frozen messages on mx2001 is 0 (mx1001 ~ 800)
  • 14:00 ran the following to push through the last few messages: `for i in $(sudo exiqgrep -f nagios@lists1001.wikimedia.org -i) ; do sudo exim -M ${i} ; sleep 1 ; done `
  • 14:05: unfrozen queue is still at 784; however, the queue looks normal
  • 14:05: Incident resolved
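
The two queue commands in the timeline follow the same pattern: select message IDs with `exiqgrep -i`, then act on them with `exim -Mrm` (remove) or `exim -M` (force a delivery attempt). The `-o 7200 -y 10800` pair restricts the selection to messages older than 7200 seconds and younger than 10800 seconds, that is, between two and three hours old. The sketch below is a hypothetical dry-run wrapper and was not run during the incident; `preview_then_remove`, `DRY_RUN`, `EXIQGREP` and `EXIM` are assumed names introduced for illustration.

```shell
#!/bin/sh
# Assumed override hooks so the commands can be stubbed; defaults are the real tools.
EXIQGREP="${EXIQGREP:-exiqgrep}"
EXIM="${EXIM:-exim}"
# Default to a dry run so nothing is removed by accident.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: list queued message IDs from a sender within an age
# window, report how many matched, and remove them only when DRY_RUN=0.
preview_then_remove() {
    sender="$1"; older="$2"; younger="$3"
    ids=$("$EXIQGREP" -i -o "$older" -y "$younger" -f "$sender")
    # Count non-empty lines; an empty selection yields 0.
    count=$(printf '%s\n' "$ids" | grep -c .)
    echo "matched $count message(s)"
    if [ "$DRY_RUN" = "0" ] && [ "$count" -gt 0 ]; then
        printf '%s\n' "$ids" | xargs "$EXIM" -Mrm
    fi
}
```

Calling `preview_then_remove 'root@wikimedia.org' 7200 10800` first shows how many messages fall in the window; rerunning with `DRY_RUN=0` would then remove them.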

Gmail cleanup

You can use this Gmail search filter to find all of the flood emails:

 from:(nagios@) SECURITY after:2021-02-25 

Also check your spam folder.

Actionables