Incidents/2022-05-09 confctl
{{irdoc|status=review}}


==Summary==
<!-- Reminder: No private information on this page! -->


For approximately six minutes, confctl-managed services were set as inactive for most of codfw. As a result, all end-user traffic routed to codfw at the time (Central US and South America, during a low-traffic period) received errors. Although the codfw appservers were "passive" at that moment (not receiving end-user traffic), other active services were affected (edge cache, swift, elastic, wdqs…). The most visible effect for the duration of the incident was that approximately 1,400 HTTP requests per second failed to be served from the text edges and 800 HTTP requests per second failed to be served from the upload edges. The trigger for the issue was a gap in tooling that allowed running a command with invalid input.


{{TOC|align=right}}


==Timeline==
'''All times in UTC.'''


*07:44 A confctl command with invalid parameters is executed. '''OUTAGE BEGINS'''
*07:44 The engineer executing the change realizes it is running against more servers than expected and cancels the execution midway.
*07:46 The monitoring system detects the app servers' unavailability; 15 pages are sent.
*07:46 The engineer executing the change notifies others via IRC.
*07:50 A confctl command to repool all codfw servers is executed. '''OUTAGE ENDS'''
[[File:2022-05-09 confctl error graph.png]]
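The failure mode in the timeline above — a command with invalid parameters acting on far more servers than intended — is typical of selection logic that fails open: an unparseable filter silently degrades to "match everything". A minimal, hypothetical Python sketch of the pattern (this is not conftool's actual code):

```python
def parse_selector(selector: str) -> dict:
    """Parse 'key=value,key=value' into a dict of criteria.

    Malformed tokens are silently skipped - this is the bug.
    """
    criteria = {}
    for part in selector.split(","):
        if "=" in part:
            key, value = part.split("=", 1)
            criteria[key.strip()] = value.strip()
    return criteria


def select_hosts(hosts: list, selector: str) -> list:
    """Return the hosts matching every criterion in the selector."""
    criteria = parse_selector(selector)
    # An empty criteria dict matches every host, because all() over
    # zero predicates is vacuously True.
    return [h for h in hosts
            if all(h.get(k) == v for k, v in criteria.items())]


hosts = [{"name": "mw1001", "dc": "eqiad"},
         {"name": "mw2001", "dc": "codfw"}]

# A valid selector narrows as expected:
assert select_hosts(hosts, "dc=codfw") == [{"name": "mw2001", "dc": "codfw"}]

# A typo ("dc:codfw" instead of "dc=codfw") parses to an empty filter
# and silently matches ALL hosts instead of failing:
assert select_hosts(hosts, "dc:codfw") == hosts
```

The names and structure here are illustrative only; the point is how an invalid selector can collapse into an empty filter that matches every object.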


[[File:2022-05-09 confctl 5xx.png]]


==Detection==
The issue was detected both by monitoring, with the expected alerts firing, and by the engineer executing the change.


Example alerts:


07:46:18: <jinxer-wm> (ProbeDown) firing: (27) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - <nowiki>https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown</nowiki> - <nowiki>https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http</nowiki> - <nowiki>https://alerts.wikimedia.org/?q=alertname%3DProbeDown</nowiki>


07:46:19: <jinxer-wm> (ProbeDown) firing: (29) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - <nowiki>https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown</nowiki> - <nowiki>https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http</nowiki> - <nowiki>https://alerts.wikimedia.org/?q=alertname%3DProbeDown</nowiki>


==Conclusions==
When provided with invalid input, confctl executes the command against all hosts; it should fail instead.
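One way to close such a gap — shown here as a hedged Python sketch, not as conftool's actual implementation — is to validate the selector strictly and fail closed, aborting the command rather than defaulting to every host:

```python
class SelectorError(ValueError):
    """Raised when a host selector cannot be parsed."""


def parse_selector_strict(selector: str) -> dict:
    """Parse 'key=value,key=value' into criteria.

    Any malformed token raises SelectorError, so an invalid selector
    can never silently widen into "all hosts".
    """
    criteria = {}
    for part in selector.split(","):
        if "=" not in part:
            raise SelectorError(f"invalid selector token: {part!r}")
        key, value = part.split("=", 1)
        if not key.strip() or not value.strip():
            raise SelectorError(f"empty key or value in token: {part!r}")
        criteria[key.strip()] = value.strip()
    return criteria


# A well-formed selector parses normally:
assert parse_selector_strict("dc=codfw,cluster=appserver") == {
    "dc": "codfw", "cluster": "appserver"}
```

With this shape, a typo like the one that triggered the incident would abort with an error before any host state is touched, instead of depooling an entire data center.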


===What went well?===
*Monitoring detected the issue
*Rollback was performed quickly


===What went poorly?===
*Tooling allowed executing a command with bad input


===Where did we get lucky?===
*The engineer executing the change realized what was going on and stopped the command from completing


===How many people were involved in the remediation?===
*6 SREs


==Links to relevant documentation==
[[Conftool#The tools]]


==Actionables==
==Scorecard==
{| class="wikitable"
|+Incident Engagement™ ScoreCard
!
!Question
!Answer (yes/no)
!Notes
|-
! rowspan="5" |People
|Were the people responding to this incident sufficiently different than the previous five incidents?
|no
|
|-
|Were the people who responded prepared enough to respond effectively?
|yes
|
|-
|Were fewer than five people paged?
|no
|
|-
|Were pages routed to the correct sub-team(s)?
|no
|
|-
|Were pages routed to online (business hours) engineers? ''Answer “no” if engineers were paged after business hours.''
|yes
|
|-
! rowspan="5" |Process
|Was the incident status section actively updated during the incident?
|no
|
|-
|Was the public status page updated?
|no
|
|-
|Is there a phabricator task for the incident?
|yes
|
|-
|Are the documented action items assigned?
|yes
|
|-
|Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence?
|yes
|
|-
! rowspan="5" |Tooling
|To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? ''Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented.''
|yes
|
|-
|Were the people responding able to communicate effectively during the incident with the existing tooling?
|yes
|
|-
|Did existing monitoring notify the initial responders?
|yes
|
|-
|Were all engineering tools required available and in service?
|yes
|
|-
|Was there a runbook for all known issues present?
|no
|
|-
! colspan="2" align="right" |Total score (count of all “yes” answers above)
|9
|
|}

Revision as of 12:40, 3 June 2022


{| class="wikitable"
|+Incident metadata (see Incident Scorecard)
!Incident ID
|2022-05-09 confctl
!Start
|2022-05-09 07:44:00
|-
!Task
|T309691
!End
|2022-05-09 07:51:00
|-
!People paged
|26
!Responder count
|6
|-
!Coordinators
|
!Affected metrics/SLOs
|
|-
!Impact
| colspan="3" |All end-user traffic routed to codfw (Central US, South America; at a low-traffic moment) received an error response.
|}

