
Incidents/2022-05-24 Failed Apache restart

{{irdoc|status=final}}


==Summary==
{{Incident scorecard
| start = 11:40
| end = 12:16
| impact = For 35 minutes, numerous internal services that use Apache on the backend were down, including Kibana (Logstash) and Matomo (Piwik). For 20 of those minutes there was also reduced MediaWiki server capacity and a small number of HTTP 502 errors (predominantly affecting logged-in users), but no measurable impact on overall wiki traffic. Around 140 IRC alerts fired for hosts running Apache.
}}
{{TOC|align=right}}

Puppet [[gerrit:c/operations/puppet/+/798615/|change 798615]] refactored the generic "apache" class. This unexpectedly led to Apache being restarted across the fleet during the next Puppet run, on the backend hosts of every service using Apache. The change was meant to be a no-op, but in fact caused Apache to unconditionally listen on port 443 (HTTPS), regardless of whether mod_ssl is installed and regardless of whether another service is already listening on that port (e.g. Envoy, the TLS proxy on [[Application servers|appservers]]).

The change was understood by [[Puppet]] as requiring a restart of the service (not a config reload), and because Apache could not actually bind to port 443, it failed to start up again. This led to alerts like:
 connect to address 10.X.X.X and port 80: Connection refused | CRITICAL - degraded: The following units failed: apache2.service
Since Puppet runs on a rolling timer across the fleet, only a few hosts were affected at first. The affected MediaWiki hosts were automatically depooled by [[PyBal|Pybal]] based on health monitoring.
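
On an affected host the failure mode could be confirmed with standard tools. The following is an illustrative sketch only; the config lines and error text shown in comments are examples rather than captured output:

<syntaxhighlight lang="bash">
# Illustrative checks on an affected appserver (example output in comments, not captured logs).

# The Puppet-triggered restart left the unit in a failed state.
systemctl status apache2.service          # Active: failed (Result: exit-code)
sudo journalctl -u apache2.service -n 20  # e.g. "AH00072: make_sock: could not bind to address [::]:443"

# The refactored class rendered an unconditional Listen directive into ports.conf.
grep -n '^Listen' /etc/apache2/ports.conf
# 1:Listen 80
# 2:Listen 443    <- added regardless of whether mod_ssl is enabled

# On appservers, Envoy already holds port 443 as the local TLS proxy,
# so Apache cannot bind to it and exits during startup.
sudo ss -tlnp | grep ':443 '
</syntaxhighlight>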
 
To avoid a site-wide outage, Puppet was immediately disabled globally, so that Apache would not go down everywhere. After disabling Puppet, the original change was reverted ([[gerrit:c/operations/puppet/+/797222|change 797222]]) and Puppet was force-run on all eqiad hosts; however, the revert alone did not bring back the hosts where Apache had already failed.

To recover in the meantime, the following was run via Cumin:
 sudo cumin -m async 'mw1396*' 'sed -i"" "s/Listen 443//" /etc/apache2/ports.conf '  'systemctl start apache2 '
The final fix was [[gerrit:c/operations/puppet/+/798631/|change 798631]].
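
For readers unfamiliar with Cumin, the recovery one-liner breaks down as follows. This is the same command reformatted with comments; <code>mw1396*</code> is the host selector recorded in the timeline below:

<syntaxhighlight lang="bash">
# The recovery command, split out for readability.
# -m async: run the listed commands on each matching host independently.
# The sed command blanks out the stray "Listen 443" directive added by the bad change;
# systemctl then starts the failed apache2 unit again, now that the port conflict is gone.
sudo cumin -m async 'mw1396*' \
    'sed -i"" "s/Listen 443//" /etc/apache2/ports.conf' \
    'systemctl start apache2'
</syntaxhighlight>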


==Timeline==


'''All times in UTC.'''


* 11:40 [[gerrit:c/operations/puppet/+/798615/|Change 798615]] gets merged
* 11:45 '''INCIDENT STARTS''' <+jinxer-wm> (ProbeDown) firing: Service kibana7:443 has failed probes (http_kibana7_ip4) #page - [[Network monitoring#ProbeDown]] - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
* 11:45 John realises the patch broke Apache across the MediaWiki hosts
* 11:45 Puppet gets disabled on the MediaWiki hosts, preventing a wider outage
* 11:46 The patch gets reverted and merged: [[gerrit:c/operations/puppet/+/797222|change 797222]]
* 11:47 A manual Puppet run is forced on eqiad
* 11:47 Multiple alerts arrive on IRC
* 11:49 The revert does not fix the already-broken hosts and a manual Cumin run is needed
* 11:50 Incident opened. Manuel Arostegui becomes IC.
* [11:50:46] <_joe_> jbond: it seems it tries to listen on port 443
* [11:51:16] <taavi> jbond: your patch makes apache2 listen on 443 regardless whether mod_ssl is enabled
* 11:56 The following command is issued across the fleet: <code>sudo cumin -m async 'mw1396*' 'sed -i"" "s/Listen 443//" /etc/apache2/ports.conf '  'systemctl start apache2 '</code>
* 11:57 First recoveries arrive
* 11:58 MediaWiki app servers are fixed at this point
* 12:05 [[gerrit:c/operations/puppet/+/798631/|Change 798631]] is merged to get the proper fix in place for the non-MediaWiki servers that rely on the default Debian configuration
* 12:07 <_joe_> I am running puppet on people1003 (to test patch) - <_joe_> the change does the right thing there
* 12:16 Other non-MediaWiki Apaches are fixed: <jbond> <code>sudo cumin C:httpd 'systemctl status apache2.service &>/dev/null'</code> is all good (a sketch of this fleet-wide check is below)
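
The 12:16 verification generalises to a quick fleet-wide health check. The sketch below reuses the Cumin selector from that entry (<code>C:httpd</code>, i.e. Puppet-managed hosts that include the httpd class); the second command is an optional extra and was not part of the recorded response:

<syntaxhighlight lang="bash">
# Exit status of "systemctl status" is non-zero for failed/inactive units,
# so Cumin reporting 100% success means apache2 is running on every matching host.
sudo cumin C:httpd 'systemctl status apache2.service &>/dev/null'

# Optionally, list any remaining failed units on those hosts.
sudo cumin C:httpd 'systemctl --failed --no-legend'
</syntaxhighlight>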


==Detection==


==Conclusions==
During the Incident Review ritual it was pointed out that a way to deploy such changes to a controlled environment (e.g. a canary host) first could have prevented this incident. It was also noted that PCC did not catch the problem: Puppet compiled the change just fine; it was the resulting Apache configuration that unconditionally listened on port 443.
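
As a rough illustration of that canary approach (and of what was done ad hoc on people1003 at 12:07), a change like this could be exercised on a single host before re-enabling Puppet fleet-wide. This is a hypothetical sketch, not a description of existing tooling:

<syntaxhighlight lang="bash">
# Hypothetical manual canary run on one host, with Puppet still disabled elsewhere.

# 1. Dry-run the agent and review the resources it reports it would change.
sudo puppet agent --test --noop

# 2. If that looks sane, apply for real on this one host only.
sudo puppet agent --test

# 3. Confirm Apache still starts and listens only where expected.
sudo apache2ctl configtest
grep -n '^Listen' /etc/apache2/ports.conf
systemctl is-active apache2.service
</syntaxhighlight>
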
===What went well?===
*The cause was quickly identified, even before the alerts appeared
*Cumin let us run a command quickly across the fleet

===What went poorly?===
*The patch should have been reviewed more carefully
*We don't have an environment in which Puppet changes can be deployed in a controlled way

===Where did we get lucky?===
*The issue was identified quickly and the solution was clear

===How many people were involved in the remediation?===
*3 SREs
*1 Volunteer
*1 IC

===Links to relevant documentation===
*[[Network monitoring#ProbeDown]]

==Actionables==


==Scorecard==
{| class="wikitable"
|+Incident Engagement™ ScoreCard
!
!Question
!Answer<br>(yes/no)
!Notes
|-
! rowspan="5" |People
|Were the people responding to this incident sufficiently different than the previous five incidents?
|No
|At least 4 out of 5 are usual responders (Marostegui, jbond, _joe_, jynus)
|-
|Were the people who responded prepared enough to respond effectively?
|Yes
|
|-
|Were fewer than five people paged?
|No
|27 people were paged
|-
|Were pages routed to the correct sub-team(s)?
|No
|N/A
|-
|Were pages routed to online (business hours) engineers? ''Answer “no” if engineers were paged after business hours.''
|Yes
|
|-
! rowspan="5" |Process
|Was the incident status section actively updated during the incident?
|Yes
|
|-
|Was the public status page updated?
|No
|
|-
|Is there a phabricator task for the incident?
|No
|
|-
|Are the documented action items assigned?
|No
|
|-
|Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence?
|No
|
|-
! rowspan="5" |Tooling
|To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? ''Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented.''
|No
|
|-
|Were the people responding able to communicate effectively during the incident with the existing tooling?
|Yes
|
|-
|Did existing monitoring notify the initial responders?
|Yes
|
|-
|Were all engineering tools required available and in service?
|No
|At least Kibana and Piwik/Matomo were down
|-
|Was there a runbook for all known issues present?
|No
|
|-
! colspan="2" align="right" |Total score ('''count of all “yes” answers above''')
|5
|
|}
