You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org
document status: in-review
|Incident ID||2022-05-09 confctl||Start||2022-05-09 07:44:00|
|People paged||26||Responder count||6|
|Impact||For 5 minutes, all web traffic routed to Codfw received error responses. This affected central USA and South America (local time after midnight).|
confctl command to depool a server was accidentally run with an invalid selection parameter (
host=mw1415 instead of
name=mw1415, details at T308100). There exists no "host" parameter, and Confctl did not validate it, but silently ignore it. The result was that the depool command was interpreted as applying to all hosts, of all services, in all data centers. The command was cancelled partway through the first DC it iterated on (Codfw).
Confctl-managed services were set as inactive for most of the Codfw data center. This caused all end-user traffic that was at the time being routed to codfw (Central US, South America - at a low traffic moment) to respond with errors. While appservers in codfw were at the moment "passive" (not receiving end-user traffic), other services that are active were affected (CDN edge cache, Swift media files, Elasticsearch, WDQS…).
The most visible effect, during the duration of the incident, was approximately 1.4k HTTP requests per second to not be served to text edges and 800 HTTP requests per second to fail to be served from upload edges. The trigger for the issue was a gap in tooling that allowed running a command with invalid input.
All times in UTC.
- 07:44 confctl command with invalid parameters is executed OUTAGE BEGINS
- 07:44 Engineer executing the change realizes the change is running against more servers than expected and cancels the execution mid-way
- 07:46 Monitoring system detects the app servers unavailability, 15 pages are sent
- 07:46 Engineer executing the change notifies others via IRC
- 07:50 confctl command to repool all codfw servers is executed OUTAGE ENDS
The issue was detected by both the monitoring, with expected alerts firing, and the engineer executing the change.
07:46:18: <jinxer-wm> (ProbeDown) firing: (27) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
07:46:19: <jinxer-wm> (ProbeDown) firing: (29) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
When provided with invalid input, confctl executes the command against all hosts, it should fail instead.
What went well?
- Monitoring detected the issue
- Rollback was performed quickly
What went poorly?
- Tooling allowed executing a command with bad input
Where did we get lucky?
- The engineer executing the change realized what was going on and stopped the command from completing
How many people were involved in the remediation?
- 6 SREs
Links to relevant documentation
|People||Were the people responding to this incident sufficiently different than the previous five incidents?||no|
|Were the people who responded prepared enough to respond effectively||yes|
|Were fewer than five people paged?||no|
|Were pages routed to the correct sub-team(s)?||no|
|Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours.||yes|
|Process||Was the incident status section actively updated during the incident?||no|
|Was the public status page updated?||no|
|Is there a phabricator task for the incident?||yes|
|Are the documented action items assigned?||yes|
|Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence?||yes|
|Tooling||To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are
open tasks that would prevent this incident or make mitigation easier if implemented.
|Were the people responding able to communicate effectively during the incident with the existing tooling?||yes|
|Did existing monitoring notify the initial responders?||yes|
|Were all engineering tools required available and in service?||yes|
|Was there a runbook for all known issues present?||no|
|Total score (count of all “yes” answers above)||9|