Incidents/2022-05-01 etcd

document status: draft


Incident metadata (see Incident Scorecard)
Incident ID: 2022-05-01 etcd
Task: T307382
Start: 04:48
End: 06:38
People paged: 15
Responder count: 2
Coordinators: dzahn, cwhite
Affected metrics/SLOs:
Impact: conftool-data could not be synced between data centers, puppet-merge showed sync errors, wikitech (labweb) hosts showed failed systemd timers

The TLS certificate for etcd.eqiad.wmnet expired. The nginx servers on the conf* machines use this certificate, so conftool-data could no longer be synced between conf hosts. As a result, puppet-merge showed sync errors and labweb (wikitech) hosts alerted because of failed timers/jobs. We were paged by the "Etcd replication lag" monitoring. Renewing the certificate was not a simple renewal: some certificates had already been converted to a new way of creating and managing them while others had not, and the two main data centers were in different states. Only eqiad was affected. Once we had figured this out, we created a new certificate for etcd.eqiad using cergen, copied the private key and certs into place, and reconfigured the servers in eqiad to use it. After this, all alerts recovered.



Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.

  • T307382 (Modernize etcd tlsproxy certificate management)
  • T307383 (Certificate expiration monitoring)
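
As an illustration of what the certificate expiration monitoring item could look like, here is a minimal sketch in Python. It is not the check that was deployed. The function names, the 30-day warning threshold, and the default port are assumptions made for this example. It also assumes that the checking host trusts the CA that signed the certificate; with an internal CA, the context would need to be pointed at that CA file.

```python
import socket
import ssl
from datetime import datetime, timezone

# Hypothetical threshold: warn when fewer than this many days remain.
WARN_DAYS = 30


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the days between `now` and a certificate's notAfter field.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    e.g. "Jun  1 00:00:00 2030 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def classify(days_left: float, warn_days: int = WARN_DAYS) -> str:
    """Map the remaining lifetime to a simple monitoring state."""
    if days_left < 0:
        return "EXPIRED"
    if days_left <= warn_days:
        return "WARNING"
    return "OK"


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Fetch the notAfter field of the certificate served at host:port.

    Note: with default verification, an already expired certificate makes
    the handshake raise ssl.SSLCertVerificationError instead of returning
    a date. A real check would treat that error as a critical state.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check(host: str, port: int = 443) -> str:
    """Combine the steps above into a single status line."""
    days_left = days_until_expiry(
        fetch_not_after(host, port), datetime.now(timezone.utc)
    )
    return f"{classify(days_left)}: {host}:{port} expires in {days_left:.1f} days"
```

A check like this would be run periodically against each TLS endpoint, for example `check("etcd.eqiad.wmnet", port)` with the port the proxy actually listens on, so that an approaching expiry raises a warning well before replication breaks.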

