
Incidents/2022-05-24 Failed Apache restart


document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2022-05-24 Failed Apache restart
Start: 11:40
End: 12:16
Task:
People paged: 27
Responder count: 5
Coordinators: Manuel Arostegui
Affected metrics/SLOs:
Impact: Apache service down

Timeline

All times in UTC.

  • 11:40 https://gerrit.wikimedia.org/r/c/operations/puppet/+/798615/ gets merged
  • 11:45 INCIDENT STARTS: <+jinxer-wm> (ProbeDown) firing: Service kibana7:443 has failed probes (http_kibana7_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
  • 11:45 John realises the patch broke apaches across mw hosts
  • 11:45 puppet gets disabled on mw hosts, preventing a wider outage
  • 11:46 The patch gets reverted and the revert is merged: https://gerrit.wikimedia.org/r/c/operations/puppet/+/797222
  • 11:47 A manual puppet run is forced on eqiad
  • 11:47 Multiple alerts arrive on IRC
  • 11:49 The revert doesn't fix the already-affected hosts, so a manual cumin run is needed
  • 11:50  Incident opened.  <Manuel Arostegui> becomes IC.
  • [11:50:46] <_joe_> jbond: it seems it tries to listen on port 443
  • [11:51:16] <taavi> jbond: your patch makes apache2 listen on 443 regardless whether mod_ssl is enabled
  • 11:56: The following command is issued across the fleet (see the per-host sketch after the timeline): sudo cumin -m async 'mw1396*' 'sed -i"" "s/Listen 443//" /etc/apache2/ports.conf ' 'systemctl start apache2 '
  • 11:57: First recoveries arrive
  • 11:58 MW app servers are back up
  • 12:05 The following patch is merged to put the proper fix in place for the other, non-mw servers that rely on the existing default Debian configuration: https://gerrit.wikimedia.org/r/c/operations/puppet/+/798631/ (see the ports.conf excerpt after the timeline)
  • 12:07 <_joe_> I am running puppet on people1003 (to test patch) - <_joe_> the change does the right thing there
  • 12:16 Other non-mw Apaches confirmed fixed: < jbond> sudo cumin C:httpd 'systemctl status apache2.service &>/dev/null' is all good
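
For reference, the 11:56 cumin one-liner reduces to the following per-host steps. This is a sketch only, not the commands actually run, and the exact ports.conf layout on the mw hosts is an assumption:

  #!/bin/bash
  # Sketch: undo the unconditional "Listen 443" that the broken patch added,
  # then bring apache2 back up on a single host.
  set -euo pipefail

  PORTS_CONF=/etc/apache2/ports.conf

  # Drop the stray directive (the cumin run used an equivalent sed).
  sudo sed -i 's/^Listen 443$//' "$PORTS_CONF"

  # Validate the configuration before touching the service.
  sudo apache2ctl configtest

  # apache2 had failed to restart, so start it and confirm it is up.
  sudo systemctl start apache2
  sudo systemctl is-active apache2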

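The 12:05 follow-up lets the non-mw hosts fall back to the stock Debian behaviour: Debian's default /etc/apache2/ports.conf only listens on 443 when an SSL-capable module is loaded, which is exactly the guard the unconditional Listen 443 bypassed. For illustration (this is the Debian default, not the contents of the actual patch):

  # /etc/apache2/ports.conf as shipped by Debian (illustrative excerpt)
  Listen 80

  <IfModule ssl_module>
      Listen 443
  </IfModule>

  <IfModule mod_gnutls.c>
      Listen 443
  </IfModule>
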
Detection

Conclusions

What went well?

  • The cause was quickly identified even before the alerts appeared
  • cumin let us run a command quickly across the fleet (see the example after this list)
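
As context for the bullet above, cumin can target hosts by name globs or by Puppet class selectors such as the C:httpd query used at 12:16. A minimal sketch (the host pattern below is hypothetical, not one used during the incident):

  # Check Apache on every host carrying the httpd Puppet class.
  sudo cumin 'C:httpd' 'systemctl is-active apache2'

  # Run a command asynchronously against a name glob, as in the 11:56 mitigation.
  sudo cumin -m async 'mw13*' 'systemctl status apache2'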

What went poorly?

  • A more careful review of the patch should have been done before it was merged

Where did we get lucky?

  • The issue was quickly identified and the solution was clear.

How many people were involved in the remediation?

  • 3 SREs
  • 1 Volunteer
  • 1 IC

Links to relevant documentation

Actionables

Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.

  • To do #1 (TODO: Create task)
  • To do #2 (TODO: Create task)

TODO: Add the #Sustainability (Incident Followup) and the #SRE-OnFIRE (Pending Review & Scorecard) Phabricator tag to these tasks.

Scorecard

Incident Engagement™ ScoreCard
Question | Answer (yes/no) | Notes

People
Were the people responding to this incident sufficiently different than the previous five incidents? | Needs checking - probably not |
Were the people who responded prepared enough to respond effectively? | Yes |
Were fewer than five people paged? | No |
Were pages routed to the correct sub-team(s)? | N/A |
Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | Yes |

Process
Was the incident status section actively updated during the incident? | Yes |
Was the public status page updated? | No |
Is there a phabricator task for the incident? | No |
Are the documented action items assigned? | No |
Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | |

Tooling
To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | |
Were the people responding able to communicate effectively during the incident with the existing tooling? | Yes |
Did existing monitoring notify the initial responders? | Yes |
Were all engineering tools required available and in service? | |
Was there a runbook for all known issues present? | |

Total score (count of all “yes” answers above) | |