Incident documentation/2021-10-29 graphite: Difference between revisions

imported>Lucas Werkmeister (WMDE)
{{irdoc|status=review}} <!--
#REDIRECT [[Incidents/2021-10-29 graphite]]
The status field should be one of:
* {{irdoc|status=draft}} - Initial status. When you're happy with the state of your draft, change it to status=review.
* {{irdoc|status=review}} - The incident review working group will contact you then to finalise the report. See also the steps on [[Incident documentation]].
* {{irdoc|status=final}}
-->
== Summary ==
'''Impact''': The backfill process for Graphite metrics failed silently during the Bullseye migration. A subset of metrics lost data points dated before 11 October 2021.
The process of reimaging a Graphite host is as follows:
# reimage host
# let metrics flow for a few days to validate the host is working
# backfill the rest of the data (online, no downtime) from the other Graphite host following
During the Bullseye migration the backfill process failed, undetected, for a subset of metrics. This led to metric data loss once the migration was complete, i.e. once graphite2003 first and then graphite1004 had been reimaged and put back in service.
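The backfill step can be pictured as filling the empty slots of the freshly reimaged host's series with values from the other host. The following is a minimal sketch under stated assumptions, not the actual <tt>whisper-sync</tt> implementation: it assumes both series are already aligned on the same timestamps, uses <code>None</code> for an empty slot, and the function names <code>backfill</code> and <code>unfilled_gaps</code> are hypothetical.

```python
from typing import List, Optional, Sequence

Value = Optional[float]


def backfill(dest: Sequence[Value], src: Sequence[Value]) -> List[Value]:
    """Fill empty slots (None) in dest with the value from src at the same slot.

    Assumption: dest and src cover the same time range at the same resolution.
    Existing values in dest are never overwritten.
    """
    if len(dest) != len(src):
        raise ValueError("series must be aligned and of equal length")
    return [d if d is not None else s for d, s in zip(dest, src)]


def unfilled_gaps(series: Sequence[Value]) -> int:
    """Count the slots that are still empty after a backfill."""
    return sum(1 for v in series if v is None)


# New host has only recent data; old host has the history.
new_host = [None, None, 3.0, 4.0]
old_host = [1.0, 2.0, 3.0, None]
merged = backfill(new_host, old_host)
```

If a backfill of this kind fails silently for a file, the merged series keeps its leading gaps, so counting the remaining empty slots afterwards is one cheap way to notice the failure.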
== Timeline ==
'''All times in UTC.'''
* Oct 11 11:45 reimage of graphite2003
* Oct 18 11:30 backfill of graphite2003
* Oct 19 failover from graphite1004 to graphite2003
* Oct 21 10:24 reimage of graphite1004 and data backfill from graphite2003
* Oct 25 failover from graphite2003 to graphite1004
* Oct 26 16:50 report of missing metric data in
== Detection ==
Some Grafana dashboards backed by Graphite showed partial data (starting Oct 11 or Oct 21) for a subset of metrics, as reported by Lucas Werkmeister in
== Conclusions ==
The <tt>whisper-sync</tt> backfill process is not as reliable as previously thought: it failed for some metric files without any visible error being logged or detected.
=== What went well? ===
* Only a subset of metric files experienced data loss
=== What went poorly? ===
* The data loss was not detected by automated means or during spot-check validation
* The data loss was only detected after both hosts had been reimaged, at which point lost data could no longer be recovered
=== Where did we get lucky? ===
* Only a subset of metric files experienced data loss
=== How many people were involved in the remediation? ===
* 1 SRE (Filippo Giunchedi)
== Links to relevant documentation ==
== Actionables ==
* Assess the feasibility of (and need for) backing up a small subset of important metrics
* Revise the backfill procedure to be more robust in the face of similar failures in the future (e.g. run a full rsync first, then backfill only the gap)
* After each sync or backfill, validate that the number of datapoints in every metric file roughly matches between hosts
* Continue (and speed up) the Graphite retirement plan
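The datapoint validation mentioned in the actionables could look roughly like the sketch below. It is illustrative only: it assumes the per-metric values have already been read from each host into plain dictionaries (for example with the Whisper Python library, which is not shown here), and the names <code>count_datapoints</code>, <code>find_mismatches</code> and the 5% tolerance are hypothetical choices, not part of any existing tooling.

```python
from typing import Dict, Optional, Sequence, Tuple

Series = Sequence[Optional[float]]


def count_datapoints(values: Series) -> int:
    """Count the non-empty datapoints in one metric series."""
    return sum(1 for v in values if v is not None)


def find_mismatches(
    host_a: Dict[str, Series],
    host_b: Dict[str, Series],
    tolerance: float = 0.05,  # hypothetical threshold: 5% relative difference
) -> Dict[str, Tuple[int, int]]:
    """Return metrics whose datapoint counts differ by more than tolerance.

    A metric present on only one host is treated as having zero datapoints
    on the other, so it is reported as well.
    """
    mismatches = {}
    for metric in sorted(set(host_a) | set(host_b)):
        a = count_datapoints(host_a.get(metric, []))
        b = count_datapoints(host_b.get(metric, []))
        larger = max(a, b)
        if larger and abs(a - b) / larger > tolerance:
            mismatches[metric] = (a, b)
    return mismatches


# Example with made-up metric names and values.
old_host = {"m1": [1.0, 2.0, 3.0, 4.0], "m2": [1.0, 2.0, 3.0, 4.0]}
new_host = {"m1": [1.0, 2.0, 3.0, 4.0], "m2": [None, None, None, 4.0]}
report = find_mismatches(new_host, old_host)
```

Running such a check across all metric files before reimaging the second host would flag a partially backfilled file while the original data is still available.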

Latest revision as of 17:49, 8 April 2022