Incidents/2023-03-09 mailman
document status: draft
Summary
Incident ID | 2023-03-09 mailman | Start | 2023-03-07 14:35:31 |
---|---|---|---|
Task | T331626 | End | 2023-03-09 16:02:59 |
People paged | 0 | Responder count | 6? |
Coordinators | n/a | Affected metrics/SLOs | No relevant SLOs exist |
Impact | For 2 days, Mailman did not deliver any mail, affecting public and private community, affiliate, and WMF mailing lists |
During network maintenance, the Mailman runner process responsible for delivering emails out of the queue crashed because it could not connect to the MariaDB database server, and it was not automatically restarted. As a result, Mailman continued to accept and process incoming email, but all outgoing mail was queued. The problem was first reported in T331626 as the Gerrit Reviewer Bot being broken. Investigation determined that the mediawiki-commits list was not delivering mail, which led to the discovery of a growing backlog of roughly 4,000 queued outgoing emails in Mailman. The mailman3 systemd service was restarted, restarting all of the individual runner processes, including the "out" runner, which then began delivering the backlog. It took slightly over 5 hours for the backlog to be cleared.
The network maintenance in question was T329073: eqiad row A switches upgrade. lists1001.wikimedia.org (the Mailman server) was not listed on the task, but it was affected and downtimed (see T329073#8672655). Icinga monitoring correctly detected the issue (see T331626#8680354), but the alert was not noticed by humans.
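For context, the 4k backlog mentioned above corresponds to message files accumulating in Mailman's on-disk out queue. Below is a minimal sketch of counting them; the queue path and the .pck pickle extension are assumptions based on a Debian-packaged Mailman 3 install, not details taken from this incident's tooling.

```python
#!/usr/bin/env python3
"""Rough sketch: report the number of messages waiting in Mailman's out queue.

Assumes the Debian/Ubuntu default queue directory; adjust QUEUE_DIR if your
installation keeps its queues elsewhere.
"""
from pathlib import Path

# Assumed location of the "out" queue on a Debian-packaged Mailman 3 host.
QUEUE_DIR = Path("/var/lib/mailman3/queue/out")


def queued_message_count(queue_dir: Path = QUEUE_DIR) -> int:
    # Each queued message is stored by the runners as a pickle (.pck) file.
    return sum(1 for entry in queue_dir.glob("*.pck") if entry.is_file())


if __name__ == "__main__":
    print(f"{queued_message_count()} message(s) in {QUEUE_DIR}")
```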
Timeline
All times in UTC.
- 2023-03-07
- 14:09: lists1001 and 237 other hosts downtimed for switches upgrade (T329073#8672655)
- 14:20ish: network maintenance happens
- 14:35: "out" runner crashes OUTAGE BEGINS (see stack trace)
- 14:43: <+icinga-wm> PROBLEM - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
- 2023-03-09
- 13:29 kostajh files T331626: reviewer-bot is not working
- 14:24 valhallasw says no email is coming in via the mediawiki-commits@ mailing list
- 15:15 JJMC89 files T331633: Not receiving posts or moderation messages, "The last message I received was at 7 Mar 2023 11:18:24 +0000, but I can see posts after that in list archives."
- 15:53 hashar posts about the large out queue backlog in Mailman: T331626#8680273
- 16:03 marostegui restarts the mailman3 systemd service and the out runner begins processing the queue
- 18:40 legoktm sends notification to listadmins@ mailing list
- 23:34 out queue reaches zero OUTAGE ENDS
Detection
Automated monitoring was the first to detect the issue, less than 10 minutes after the runner crashed:
14:43: <+icinga-wm> PROBLEM - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
The correct alert fired; it explicitly checks that the expected number of runner processes is actually running.
However, the alert was not investigated until a human reported the problem nearly 2 days later.
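For illustration only, the check is conceptually similar to the sketch below: count the runner processes owned by the list user whose command line matches the runner path, and go critical when the count drops below what is expected. The expected count, the use of pgrep, and the Nagios-style exit codes are assumptions; the production check is the Icinga check linked above.

```python
#!/usr/bin/env python3
"""Sketch of a runner-liveness check in the spirit of the mailman3_runners alert.

Counts processes whose full command line matches the Mailman 3 runner path and
exits with Nagios-style status codes. EXPECTED_RUNNERS is an assumption; use
whatever count your mailman.cfg actually starts.
"""
import subprocess
import sys

RUNNER_PATTERN = "/usr/lib/mailman3/bin/runner"
RUNNER_USER = "list"
EXPECTED_RUNNERS = 14  # assumption: adjust to the number configured on the host


def count_runners() -> int:
    # pgrep -c prints the number of matching processes; -f matches against the
    # full command line and -u restricts matches to the Mailman user.
    result = subprocess.run(
        ["pgrep", "-c", "-u", RUNNER_USER, "-f", RUNNER_PATTERN],
        capture_output=True,
        text=True,
    )
    # pgrep prints 0 (and exits non-zero) when nothing matches.
    return int(result.stdout.strip() or 0)


def main() -> int:
    running = count_runners()
    if running < EXPECTED_RUNNERS:
        print(f"CRITICAL: {running}/{EXPECTED_RUNNERS} mailman3 runners running")
        return 2  # Nagios CRITICAL
    print(f"OK: {running} mailman3 runners running")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```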
Conclusions
What went well?
- Automated monitoring correctly identified the issue within 10 minutes of the crash
What went poorly?
- lists1001 was not marked as part of the row A switch upgrade, so the service maintainers (term used loosely) weren't explicitly aware of the maintenance
- We typically send potential downtime notifications (example) to listadmins when we know maintenance is expected.
- No human noticed the alerting
- The web service on lists1001 is often flaky, causing alerts that automatically recover, so people may have begun to tune out alerts for this host.
- There were also a lot of alerts at the time because of multiple ongoing maintenance activities, so this one was probably lost in the noise.
- We don't actually monitor the size of the out queue (see script) because historically it would grow very large. We probably should now that things are much calmer; see the sketch after this list.
- Amir was on vacation
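As a sketch of what out-queue monitoring could look like (referenced from the list above), the following exports the queue backlog as a Prometheus-style gauge. The queue path, metric name, port, and poll interval are all assumptions for illustration, not the production setup.

```python
#!/usr/bin/env python3
"""Sketch: export the Mailman out-queue backlog as a Prometheus gauge.

Everything here (queue path, metric name, port, poll interval) is an
assumption for illustration only.
"""
import time
from pathlib import Path

from prometheus_client import Gauge, start_http_server

QUEUE_DIR = Path("/var/lib/mailman3/queue/out")  # assumed Debian default
POLL_SECONDS = 60

out_queue_size = Gauge(
    "mailman_out_queue_messages",
    "Number of queued messages in Mailman's out queue",
)


def main() -> None:
    start_http_server(9199)  # arbitrary example port
    while True:
        # Each queued message is a .pck pickle file written by the runners.
        out_queue_size.set(
            sum(1 for p in QUEUE_DIR.glob("*.pck") if p.is_file())
        )
        time.sleep(POLL_SECONDS)


if __name__ == "__main__":
    main()
```

An alerting rule on this gauge (for example, firing when the backlog stays above a threshold for some time) would have surfaced this incident without relying on anyone noticing the process-count alert.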
Where did we get lucky?
- Didn't get lucky
Links to relevant documentation
- Mailman/Monitoring runbook (linked from the mailman3_runners alert): https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
Actionables
- Identify how lists1001 got missed during the eqiad row A switch upgrade preparation
- Add monitoring of the out queue size
- Consider making the crashed-runner alert page. If a runner crashes, it definitely needs manual intervention, and crashes are now much rarer than they were during the initial Mailman 3 deployment.
- Someone should probably figure out why lists1001's web service is flaky and randomly goes down
Scorecard
| | Question | Answer (yes/no) | Notes |
|---|---|---|---|
| People | Were the people responding to this incident sufficiently different than the previous five incidents? | | |
| | Were the people who responded prepared enough to respond effectively? | yes | |
| | Were fewer than five people paged? | no | |
| | Were pages routed to the correct sub-team(s)? | no | |
| | Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | no | |
| Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | | |
| | Was a public wikimediastatus.net entry created? | no | |
| | Is there a phabricator task for the incident? | yes | task T331626 |
| | Are the documented action items assigned? | | |
| | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | yes | |
| Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | yes | |
| | Were the people responding able to communicate effectively during the incident with the existing tooling? | yes | |
| | Did existing monitoring notify the initial responders? | no | IRC alert sent but not noticed |
| | Were the engineering tools that were to be used during the incident available and in service? | yes | |
| | Were the steps taken to mitigate guided by an existing runbook? | no | |
| | Total score (count of all “yes” answers above) | | |