Incidents/2022-05-01 etcd: Difference between revisions

Revision as of 23:07, 12 May 2022

document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2022-05-01 etcd
Task: T307382
Start: 04:48
End: 06:38
People paged: 15
Responder count: 2
Coordinators: dzahn, cwhite
Affected metrics/SLOs:
Impact: conftool-data could not be synced between data centers, puppet-merge showed sync errors, wikitech (labweb) hosts showed failed systemd timers

The TLS certificate for etcd.eqiad.wmnet expired. The nginx servers on the conf* machines use this certificate, so conftool-data could no longer be synced between conf hosts. puppet-merge showed sync errors, and labweb (wikitech) hosts alerted because of failed timers/jobs. We were paged by the "Etcd replication lag" monitoring. Replacing the certificate was not a simple renewal: some certificates had already been converted to a new way of creating and managing them while others had not, so the two main data centers were in different states. Only eqiad was affected. Once we had figured this out, we created a new certificate for etcd.eqiad using cergen, copied the private key and certificates into place, and reconfigured the servers in eqiad to use it. After this, all alerts recovered.

Documentation:

Actionables

Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.

  • T307382 (Modernize etcd tlsproxy certificate management)
  • T307383 (Certificate expiration monitoring)
  • check if any other services need monitoring as well
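
The certificate expiration monitoring item could take roughly the following shape. This is only a minimal sketch, not the check that was or will be deployed: the function names, the 30-day and 7-day thresholds, and the default port are all assumptions made for illustration. It reads the `notAfter` field of the certificate a server presents and classifies the remaining lifetime, so that an alert fires before the certificate expires rather than after.

```python
import socket
import ssl
from datetime import datetime, timezone

# Hypothetical thresholds; pick values that match the renewal process.
WARN_DAYS = 30
CRIT_DAYS = 7


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format returned by ssl.getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch (UTC)
    return (expires - now.timestamp()) / 86400


def classify(days_left: float) -> str:
    """Map the remaining lifetime to an alert level."""
    if days_left <= 0:
        return "EXPIRED"
    if days_left <= CRIT_DAYS:
        return "CRITICAL"
    if days_left <= WARN_DAYS:
        return "WARNING"
    return "OK"


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate served at host:port.

    Uses the system trust store; a service signed by an internal CA would
    need that CA loaded into the context. An already expired certificate
    fails the handshake, which the caller should treat as critical.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A periodic job could call `classify(days_until_expiry(fetch_not_after(host, port), datetime.now(timezone.utc)))` for each TLS endpoint and alert on anything other than `OK`. Keeping the date arithmetic separate from the network call makes the thresholds easy to test without a live server.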

TODO: Add the #Sustainability (Incident Followup) and the #SRE-OnFIRE (Pending Review & Scorecard) Phabricator tag to these tasks.

Scorecard

Incident Engagement™ ScoreCard
Category | Question | Score | Notes
People | Were the people responding to this incident sufficiently different from those in the previous five incidents? (score 1 for yes, 0 for no) | 1 | Probably? Do we actually go through the last 5 incidents? Which list to use?
People | Were the people who responded prepared enough to respond effectively? (score 1 for yes, 0 for no) | 1 | The combined knowledge of both responders did it.
People | Were more than 5 people paged? (score 0 for yes, 1 for no) | 1 | 15 paged, 2 responded
People | Were pages routed to the correct sub-team(s)? (score 1 for yes, 0 for no) | 0 | Are any pages routed to sub-teams yet?
People | Were pages routed to online (business hours) engineers? (score 1 for yes, 0 if people were paged after business hours) | 0 | Weekend and late
Process | Was the incident status section actively updated during the incident? (score 1 for yes, 0 for no) | 0 | No public impact that would have made it useful
Process | Was the public status page updated? (score 1 for yes, 0 for no) | 0 | No public impact that would have made it useful
Process | Is there a Phabricator task for the incident? (score 1 for yes, 0 for no) | 1 | https://phabricator.wikimedia.org/T302153 was reused, as well as follow-up task https://phabricator.wikimedia.org/T307382
Process | Are the documented action items assigned? (score 1 for yes, 0 for no) | 1 |
Process | Is this a repeat of an earlier incident? (score 0 for yes, 1 for no) | 0 | Unsure; it may have happened before, but that would have been before we wrote reports for such incidents.
Tooling | Were there, before the incident occurred, open tasks that would prevent this incident or make mitigation easier if implemented? (score 0 for yes, 1 for no) | 0 | Could have had one to migrate eqiad certs to cergen
Tooling | Were the people responding able to communicate effectively during the incident with the existing tooling? (score 1 for yes, 0 for no) | 1 | IRC
Tooling | Did existing monitoring notify the initial responders? (score 1 for yes, 0 for no) | 1 | Yes, but only once the cert had already expired. Should have had alerting before that.
Tooling | Were all engineering tools required available and in service? (score 1 for yes, 0 for no) | 1 |
Tooling | Was there a runbook for all known issues present? (score 1 for yes, 0 for no) | 0 |
Total score