Incident documentation/20190716-docker-registry

'''document status''': {{irdoc-review}}
 
== Summary ==
 
Some swift containers were deleted intentionally, and that had unexpected consequences for the rest of the docker-registry. As a result, some layers went missing from the registry; at the very least, the ones listed in https://phabricator.wikimedia.org/T228196.
 
The root cause was the intentional deletion of the docker_registry_eqiad swift container in eqiad. This container was configured to synchronize its content with the docker_registry_codfw container in codfw. When it was deleted, the container-to-container synchronization triggered a spike of DELETEs.
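
For context, swift container-to-container synchronization is configured per container with the swift client's <code>post</code> command; the sketch below shows the general shape under assumed realm, cluster and key names (the actual WMF values are not reproduced here).

<syntaxhighlight lang="bash">
# Hypothetical sketch of how container-to-container sync is set up in swift.
# The sync realm/cluster names and the key are placeholders, not WMF config.

# Point the eqiad container at its codfw counterpart...
swift post -t '//realm_name/codfw/AUTH_docker/docker_registry_codfw' \
           -k 'shared_sync_key' docker_registry_eqiad

# ...and the codfw container back at eqiad, so changes flow both ways.
swift post -t '//realm_name/eqiad/AUTH_docker/docker_registry_eqiad' \
           -k 'shared_sync_key' docker_registry_codfw
</syntaxhighlight>

Because sync replays object operations on the remote side, object deletions in one container are propagated to the other, which is why removing the eqiad container produced the spike of DELETEs described above.
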
To illustrate the issue, let's follow a concrete missing layer: 0d59f51330931db19885c3133b21f3e5df09d6c347b10e38d2ccc9a18db1fab2.
 
If we download all the swift logs from 07/16 to 07/18 we can track what happened. First, the number of actions recorded in the logs that involve that layer:
 
<blockquote>swift_activity_0717:45

swift_activity_0718:7

swift_activity_0719:0</blockquote>

Number of DELETEs:

<blockquote>swift_activity_0717:38

swift_activity_0718:0

swift_activity_0719:0</blockquote>

Number of PUTs:

<blockquote>swift_activity_0717:7

swift_activity_0718:7

swift_activity_0719:0</blockquote>

Included in the number of DELETEs there are two timeframes: one that started when the swift container was deleted, and another issued by swift several hours later:
   
<code>grep '.*DELETE.*0d59f51330931db19885c3133b21f3e5df09d6c347b10e38d2ccc9a18db1fab2.*' swift_activity_071* | cut -f3 -d' ' | sort -u</code>

<blockquote>14:33:07
 
21:01:21
 
22:28:20
 
22:28:21
 
22:28:24
 
22:28:28
 
22:28:33
 
22:28:38</blockquote>

The different number of PUTs corresponds to the several attempts to recover layers from backup.
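
For reference, a minimal shell sketch of how the per-day counts and the DELETE timestamps above can be reproduced, assuming the logs were downloaded locally as <code>swift_activity_07*</code> files with the same layout the grep above relies on.

<syntaxhighlight lang="bash">
# Sketch only: assumes the swift access logs were saved locally as
# swift_activity_0717, swift_activity_0718, ... with the HTTP verb and
# timestamp in the positions used by the grep/cut pipeline above.
LAYER=0d59f51330931db19885c3133b21f3e5df09d6c347b10e38d2ccc9a18db1fab2

# Total actions touching the layer, per log file.
grep -c "$LAYER" swift_activity_071*

# DELETEs and PUTs per log file.
grep -c "DELETE.*$LAYER" swift_activity_071*
grep -c "PUT.*$LAYER" swift_activity_071*

# Unique timestamps of the DELETEs (field 3 is the time of day here).
grep "DELETE.*$LAYER" swift_activity_071* | cut -f3 -d' ' | sort -u
</syntaxhighlight>
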
 
=== Impact ===
 
Some CI jobs failed; other than that, no known impact. We were lucky that no deploys were done in k8s during this period, otherwise production services would have been affected.
=== Detection ===
 
A human was our detection method in this one; it is not yet clear how we could have caught this automatically.
 
== Timeline ==
 
'''All times in UTC. Date format is DD/MM/YYYY.'''

Before the detection in #wikimedia-operations:
* 16/07/2019 14:26: the docker_registry_eqiad container is deleted on the eqiad swift cluster, and the docker_registry and docker_registry_codfw containers on codfw.
* 16/07/2019 ~16:40: report from tarrow about "filesystem layer verification failed for digest" for many images from docker-registry.wikimedia.org.
* 16/07/2019 20:00: releng triggers a republish of releng images.
* 16/07/2019 22:45: a backup is found on ms-fe2005; uploading only the blobs should regenerate the old images (see the sketch after this timeline).
* 16/07/2019 23:00: swift upload ended.
* 17/07/2019 00:04: rebuild of releng images completed.
* 17/07/2019 07:24: reports of images not working again.
* 17/07/2019 09:00: re-uploaded layers from backup.
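
A rough sketch of the 22:45 recovery step, re-uploading the backed-up blobs with <code>swift upload</code>; the local backup path, directory layout and target container name are assumptions for illustration, not the exact ones used during the incident.

<syntaxhighlight lang="bash">
# Hypothetical sketch of the restore performed from the backup found on
# ms-fe2005; the backup directory, its layout and the container name are
# assumptions, not the exact paths used during the incident.
cd /srv/registry-backup            # assumed local copy of the container

# Re-upload only the blob objects so existing manifests resolve again.
# "blobs/" stands in for wherever the registry's layer blobs live in the dump.
swift upload docker_registry_codfw blobs/
</syntaxhighlight>
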
 
 
== Conclusions ==
 
When manipulating swift containers that use container-to-container synchronization we should be extremely cautious, as the consequences can last for hours if not days.
 
List of improvements:
 
* We need better monitoring of container-to-container synchronization in swift; it would be useful to have a metric around failures of the synchronization process for operations done on the docker-registry swift container (see the sketch after this list).
* We need to improve our docker rebuild process for disaster recovery; the image rebuild took several hours.
* We need to improve the docker registry documentation to include more runbooks and procedures for better diagnostics.
* We need to rethink our golden-images approach: the moment one golden image is truncated, almost all images are affected.
* Keep a backup of the swift container in our backup system.
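
As a minimal sketch of the monitoring idea in the first bullet, a check could compare object counts of the two synced containers and alert on divergence. This is an illustrative proposal, not an existing check; it simplistically assumes both containers are reachable with the swift credentials present in the environment.

<syntaxhighlight lang="bash">
#!/bin/bash
# Illustrative sketch only: compare object counts of the two synced
# containers and exit non-zero if they diverge by more than a threshold.
# Assumes both containers can be queried with the credentials in the
# environment; in practice each cluster needs its own auth endpoint.
set -euo pipefail

THRESHOLD=${THRESHOLD:-100}

count_objects() {
    # "swift stat <container>" prints an "Objects: N" line.
    swift stat "$1" | awk '/Objects:/ {print $2}'
}

eqiad_count=$(count_objects docker_registry_eqiad)
codfw_count=$(count_objects docker_registry_codfw)

diff=$(( eqiad_count > codfw_count ? eqiad_count - codfw_count : codfw_count - eqiad_count ))

if (( diff > THRESHOLD )); then
    echo "CRITICAL: container object counts diverge by ${diff}"
    exit 2
fi
echo "OK: containers within ${THRESHOLD} objects of each other (diff=${diff})"
</syntaxhighlight>
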
 
=== What went well? ===
* Cached images on CI and kubernetes nodes helped avoid impact for end users.
* Incident response?
 
=== What went poorly? ===
* Lack of monitoring in swift container-to-container synchronization.
* When rebuilding releng docker images there was a fear of inadvertently upgrading software (so it is not a rebuild anymore, it is a new image).
* The rebuilding process is slow.
* No page was triggered, as monitoring checks the manifest but does not pull an image (see the sketch below).
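
A minimal sketch of the kind of end-to-end probe the last bullet suggests: actually pull a small image rather than only fetching the manifest, so missing or truncated layers fail the check. The image name is a placeholder and this is an illustrative idea, not an existing alert.

<syntaxhighlight lang="bash">
#!/bin/bash
# Illustrative sketch only: a monitoring probe that pulls an image end to
# end, so missing or truncated layers fail the check even when the
# manifest itself is still served correctly.
# "wikimedia-stretch:latest" is a placeholder image name.
set -euo pipefail

REGISTRY=docker-registry.wikimedia.org
IMAGE=${IMAGE:-wikimedia-stretch:latest}

# Drop any cached copy first, otherwise a local layer cache can mask
# exactly the kind of problem this incident produced.
docker rmi -f "${REGISTRY}/${IMAGE}" >/dev/null 2>&1 || true

if docker pull "${REGISTRY}/${IMAGE}" >/dev/null; then
    echo "OK: pulled ${REGISTRY}/${IMAGE}"
else
    echo "CRITICAL: failed to pull ${REGISTRY}/${IMAGE}"
    exit 2
fi
</syntaxhighlight>
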
 
=== Where did we get lucky? ===
* Having a backup of the docker registry container on a swift frontend greatly mitigated the incident, as we were able to re-upload missing layers and fix truncated images.
 
== Links to relevant documentation ==
 
* https://phabricator.wikimedia.org/T227570 (root cause triggered when deleting docker_registry_eqiad on eqiad)
* https://phabricator.wikimedia.org/T228196 incident phab task
* https://wikitech.wikimedia.org/wiki/Docker-registry-runbook
== Actionables ==
 
* Sync boron's state of /srv/production-images with the repo. [done]
* File some bugs against docker-pkg.
* Educate about pinning packages in docker-pkg templates; this will help a lot when rebuilding templates.
* Make a bacula recipe for backing up the docker_registry_codfw swift container (see the sketch below). [create phab task]
* Get metrics about swift replication. [pending, create phab task]
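
Pending the bacula recipe, a manual backup of the container could look roughly like the sketch below; the destination path is an assumed example and the real recipe would replace this entirely.

<syntaxhighlight lang="bash">
# Illustrative sketch of a manual container backup, pending the bacula
# recipe from the actionable above. The destination directory is an
# assumed example path.
BACKUP_DIR=/srv/backups/docker_registry_codfw/$(date +%Y%m%d)
mkdir -p "$BACKUP_DIR"

# Download every object in the container, preserving object names as
# relative paths under the backup directory.
swift download docker_registry_codfw --output-dir "$BACKUP_DIR"
</syntaxhighlight>
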
