Incident documentation/20190716-docker-registry
'''document status''': {{irdoc-review}}
== Summary ==
Some swift containers were deleted intentionally, and that had unexpected consequences for the rest of the docker-registry. As a result, some layers appeared to be missing from the registry; in particular, and at the very least, the ones listed in
The root cause is the intentional deletion of the docker_registry_eqiad swift container in eqiad. This container was configured to synchronize its content with docker_registry_codfw in codfw. When it was deleted, the container-to-container synchronization triggered a spike of DELETEs.
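For context, swift's container-to-container synchronization is driven by the X-Container-Sync-To and X-Container-Sync-Key headers on the source container; with the standard swift CLI the setup looks roughly like the fragment below. The realm, account and key shown are made-up placeholders, not our actual configuration.

```shell
# Hypothetical sketch: point docker_registry_eqiad at its codfw peer.
# '//realm/codfw/AUTH_example/docker_registry_codfw' and the key are placeholders.
swift post -t '//realm/codfw/AUTH_example/docker_registry_codfw' \
           -k 'shared-sync-key' docker_registry_eqiad
```

Deleting the source container does not stop the sync daemon from acting on the pending object listing, which is how the deletes propagated.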
To illustrate the issue, let's follow one concrete missing layer, 0d59f51330931db19885c3133b21f3e5df09d6c347b10e38d2ccc9a18db1fab2.
If we download all the swift logs from 07/16 to 07/18 we can track what happened. First, the number of actions recorded in the logs that involve that layer:
<blockquote>swift_activity_0717:45
swift_activity_0719:0</blockquote>Number of DELETEs:<blockquote>swift_activity_0717:38
swift_activity_0719:0</blockquote>Number of PUTs:<blockquote>swift_activity_0717:7
swift_activity_0719:0</blockquote>The DELETEs fall into two timeframes: one that started when the swift container was deleted, and another made by swift several hours later:
<code>grep '.*DELETE.*0d59f51330931db19885c3133b21f3e5df09d6c347b10e38d2ccc9a18db1fab2.*' swift_activity_071* | cut -f3 -d' ' | sort -u</code><blockquote>14:33:07
22:28:38</blockquote>The PUTs map to the several attempts to recover layers from backup.
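The counting above can be reproduced on any set of swift logs; a minimal self-contained sketch follows, using fabricated sample lines (the real log format may differ slightly, but the field positions match the `cut -f3` used above).

```shell
LAYER=0d59f51330931db19885c3133b21f3e5df09d6c347b10e38d2ccc9a18db1fab2

# Fabricated sample lines standing in for the real swift access logs
cat > swift_activity_sample <<EOF
Jul 16 14:33:07 proxy DELETE /v1/AUTH_example/docker_registry_eqiad/$LAYER
Jul 16 22:28:38 proxy DELETE /v1/AUTH_example/docker_registry_eqiad/$LAYER
Jul 16 23:01:12 proxy PUT /v1/AUTH_example/docker_registry_eqiad/$LAYER
EOF

# Per-verb counts for the layer, as in the investigation above
echo "DELETEs: $(grep -c "DELETE.*$LAYER" swift_activity_sample)"
echo "PUTs: $(grep -c "PUT.*$LAYER" swift_activity_sample)"

# Distinct timestamps of the DELETEs (field 3 is the time-of-day)
grep "DELETE.*$LAYER" swift_activity_sample | cut -f3 -d' ' | sort -u
```

On the sample data this prints "DELETEs: 2", "PUTs: 1" and the two distinct timestamps.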
=== Impact ===
Some CI jobs failed; other than that, there was no known impact. We were lucky that no deploys to k8s happened during the period, otherwise production services would have been affected.
=== Detection ===
A human report was our detection method in this case; it is not yet clear how we could have caught this automatically.
== Timeline ==
'''All times in UTC. Date format is DD/MM/YYYY'''
Before the detection in #wikimedia-operations:
* 16/07/2019 14:26: the docker_registry_eqiad container is deleted on the eqiad swift cluster, along with docker_registry and docker_registry_codfw on codfw
* 16/07/2019 ~16:40: report from tarrow about "filesystem layer verification failed for digest" for many images from
* 16/07/2019 20:00: releng triggers a republish of the releng images
* 16/07/2019 22:45: a backup is found on ms-fe2005; uploading only the blobs should regenerate the old images
* 16/07/2019 23:00: swift upload ended.
* 17/07/2019 00:04: rebuild of the releng images completed
* 17/07/2019 07:24: reports of images not working again.
* 17/07/2019 09:00: re-uploaded layers from backup.
== Conclusions ==
When manipulating swift containers with container-to-container synchronization enabled we should be extremely cautious, as the consequences can last for hours if not days.
List of improvements:
* We need better monitoring of container-to-container synchronization in swift; it would be useful to have a metric for failures of the synchronization process for operations done on the docker-registry swift container.
* We need to improve our docker image rebuild process for disaster recovery; the image rebuild took several hours.
* We need to improve the docker registry documentation to include more runbooks and procedures for better diagnostics.
* We need to rethink our golden images approach; the moment one golden image is truncated, almost all images are affected.
* Keep a backup of the swift container in our backup system.
=== What went well? ===
* Cached images on the CI and kubernetes nodes helped avoid impact for end users.
* Incident response?
=== What went poorly? ===
* Lack of monitoring in swift container-to-container synchronization.
* When rebuilding the releng docker images there was a fear of inadvertently upgrading software (so it is no longer a rebuild but a new image).
* The rebuilding process is slow.
* No page was triggered, as monitoring checks the manifest but does not pull an image.
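A deeper check would verify the layer blobs behind the manifest rather than the manifest alone. A minimal sketch of the extraction step is below, using a fabricated sample manifest; in production each extracted digest would then be HEADed against the registry's <code>/v2/&lt;name&gt;/blobs/&lt;digest&gt;</code> endpoint (registry and image names would be ours, the ones here are placeholders).

```shell
# Fabricated sample of a docker v2 manifest, standing in for a real one
cat > manifest.json <<'EOF'
{
  "schemaVersion": 2,
  "layers": [
    {"mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
     "digest": "sha256:0d59f51330931db19885c3133b21f3e5df09d6c347b10e38d2ccc9a18db1fab2"}
  ]
}
EOF

# Extract the layer digests; in production each one would be verified with
# something like: curl -fsI "$REGISTRY/v2/$IMAGE/blobs/$digest" (hypothetical names)
grep -o 'sha256:[0-9a-f]\{64\}' manifest.json
```

A check like this would have paged on the missing blobs even though the manifests were still intact.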
=== Where did we get lucky? ===
* Having a backup of the docker registry container on a swift frontend greatly mitigated the incident, as we were able to re-upload missing layers and fix truncated images.
== Links to relevant documentation ==
* (root cause triggered when deleting docker_registry_eqiad on eqiad)
* incident phab task
== Actionables ==
* Sync boron's state of /srv/production-images with the repo. [done]
* File some bugs against docker-pkg.
* Educate about pinning packages in docker-pkg templates; this will help a lot when rebuilding templates.
* Make a bacula recipe for backing up the docker_registry_codfw swift container. [create phab task]
* Get metrics about swift replication. [pending, create phab task]

Latest revision as of 18:17, 8 April 2022