
Incident documentation/2021-10-29 graphite

imported>Lucas Werkmeister (WMDE)
imported>Krinkle
 
{{irdoc|status=review}} <!--
The status field should be one of:
* {{irdoc|status=draft}} - Initial status. When you're happy with the state of your draft, change it to status=review.
* {{irdoc|status=review}} - The incident review working group will contact you then to finalise the report. See also the steps on [[Incident documentation]].
* {{irdoc|status=final}}
-->
 
== Summary ==
 
'''Impact''': The backfill process for Graphite metrics failed silently during the Bullseye migration. A subset of metrics lost data points dated before October 11th 2021.
 
The process of reimaging a Graphite host is as follows:
 
# reimage host
# let metrics flow for a few days to validate the host is working
# backfill the rest of the data (online, no downtime) from the other Graphite host following https://wikitech.wikimedia.org/wiki/Graphite#Merge_and_sync_metrics
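
Conceptually, the backfill in step 3 copies into the freshly reimaged host only the data points it is missing, without overwriting what it has collected since the reimage. The sketch below illustrates that idea on plain Python lists, with <tt>None</tt> marking a missing point. The function name <tt>backfill_gaps</tt> and the list representation are assumptions made for illustration; this is not how <tt>whisper-sync</tt> is implemented.

```python
from typing import List, Optional

Series = List[Optional[float]]


def backfill_gaps(new_host: Series, old_host: Series) -> Series:
    """Return new_host with missing points (None) filled from old_host.

    Both series are assumed to cover the same time slots at the same
    resolution. Values already present on the new host are never overwritten.
    """
    if len(new_host) != len(old_host):
        raise ValueError("series must cover the same time slots")
    return [
        old if new is None else new
        for new, old in zip(new_host, old_host)
    ]


# Hypothetical data: the new host only has points since the reimage,
# while the old host still holds the history.
old = [1.0, 2.0, 3.0, 4.0, None]
new = [None, None, None, 4.5, 5.0]
merged = backfill_gaps(new, old)
print(merged)  # [1.0, 2.0, 3.0, 4.5, 5.0]
```

A slot that is still <tt>None</tt> after the merge is missing on both hosts, which is the kind of gap a post-backfill check would need to surface.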
 
During the Bullseye migration the backfill process failed, undetected, for a subset of metrics. This led to metric data loss once the migration was complete, i.e. once graphite2003 and then graphite1004 had been reimaged and put back in service.
 
{{TOC|align=right}}
 
== Timeline ==
 
'''All times in UTC.'''
 
* Oct 11 11:45 reimage of graphite2003 https://phabricator.wikimedia.org/T247963#7416306
* Oct 18 11:30 backfill of graphite2003 https://phabricator.wikimedia.org/T247963#7435382
* Oct 19 failover from graphite1004 to graphite2003
* Oct 21 10:24 reimage of graphite1004 and data backfill from graphite2003 https://phabricator.wikimedia.org/T247963#7447084
* Oct 25 failover from graphite2003 to graphite1004 https://phabricator.wikimedia.org/T247963#7455052
* Oct 26 16:50 report of missing metric data in https://phabricator.wikimedia.org/T294355
 
== Detection ==
 
Some Grafana dashboards backed by Graphite showed only partial data (starting Oct 11 or Oct 21) for a subset of metrics, as reported by Lucas Werkmeister in https://phabricator.wikimedia.org/T294355
 
== Conclusions ==
 
The <tt>whisper-sync</tt> backfill process is not as reliable as previously thought: it failed for a subset of metrics without any visible error being logged or detected.
 
=== What went well? ===
* Only a subset of metric files experienced data loss
 
=== What went poorly? ===
* The data loss was not detected by automated means or during spot-check validation
* The data loss was only detected after both hosts had been reimaged, at which point lost data could no longer be recovered
 
=== Where did we get lucky? ===
* Only a subset of metric files experienced data loss
 
=== How many people were involved in the remediation? ===
* 1 SRE (Filippo Giunchedi)
 
== Links to relevant documentation ==
* https://wikitech.wikimedia.org/wiki/Graphite#Merge_and_sync_metrics
 
== Actionables ==
 
* Assess the feasibility of (and need for) backing up a small subset of important metrics https://phabricator.wikimedia.org/T294355#7464552
* Revise the backfill procedure to be more robust in the face of similar failures in the future (e.g. run a full rsync first, then backfill only the gap) https://phabricator.wikimedia.org/T296295
* Perform validation post-sync / post-backfill to check that the number of datapoints across all metric files is roughly in sync between hosts https://phabricator.wikimedia.org/T296295
* Continue (and speed up) the Graphite retirement plan https://phabricator.wikimedia.org/T228380
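
The post-sync validation actionable above could be approached along the following lines. This is a minimal sketch, assuming the number of non-null datapoints per metric has already been collected on each host (how those counts are gathered from the Whisper files is left out). The function name <tt>find_divergent_metrics</tt>, the 5% tolerance and the metric names are hypothetical.

```python
from typing import Dict, List


def find_divergent_metrics(
    counts_a: Dict[str, int],
    counts_b: Dict[str, int],
    tolerance: float = 0.05,
) -> List[str]:
    """Return metrics whose datapoint counts differ by more than `tolerance`.

    counts_a / counts_b map a metric name to the number of non-null
    datapoints found on each host. A metric absent from one host counts as 0.
    The difference is measured relative to the larger of the two counts.
    """
    divergent = []
    for metric in sorted(set(counts_a) | set(counts_b)):
        a = counts_a.get(metric, 0)
        b = counts_b.get(metric, 0)
        largest = max(a, b)
        if largest == 0:
            continue  # empty on both hosts: nothing to compare
        if abs(a - b) / largest > tolerance:
            divergent.append(metric)
    return divergent


# Hypothetical counts: one metric lost most of its history on host B.
host_a = {"servers.foo.cpu": 1000, "servers.foo.mem": 1000, "servers.bar.cpu": 990}
host_b = {"servers.foo.cpu": 1000, "servers.foo.mem": 120, "servers.bar.cpu": 1000}
print(find_divergent_metrics(host_a, host_b))  # ['servers.foo.mem']
```

A relative tolerance is used here because both hosts keep receiving live data during the check, so exact equality between counts would be too strict.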

Latest revision as of 17:49, 8 April 2022