You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Portal:Cloud VPS/Admin/Runbooks/Check for snapshots leaked by cinder backup agent: Difference between revisions

imported>David Caro



Revision as of 17:20, 24 May 2022

The procedures in this runbook require admin permissions to complete.

Error / Incident

Usually an email/alertmanager/icinga alert with the subject ** PROBLEM alert - <hostname>/Check for snapshots leaked by cinder backup agent test is CRITICAL **

This happens when something goes wrong with the periodic cinder backups. Common causes:

  • There's a backup that times out.
  • The cinder-volume service is down.

Debugging

Quick check

Verify leaked snapshots:

user@cloudcontrol1005:~ $ sudo wmcs-openstack volume snapshot list
+--------------------------------------+-----------------------------------------------------+-------------+-----------+------+
| ID                                   | Name                                                | Description | Status    | Size |
+--------------------------------------+-----------------------------------------------------+-------------+-----------+------+
| d4aad7fb-97ed-4fa5-a06b-ae7f4b76feab | wmde-templates-alpha-nfs-2022-02-23T10:34:32.423757 | None        | available |   10 |
| 4406f4ce-ca22-4f57-a8e5-8dff8cf32270 | wikilink-nfs-2022-02-23T10:34:01.855598             | None        | available |   10 |
| e5c9d3ef-3d8a-40f5-90f0-900f1e87297a | wikidumpparse-nfs-2022-02-23T10:32:36.696177        | None        | available |  260 |
| 9d9aba32-9795-4d60-9d00-1005f5a19483 | proxy-03-backup-2022-02-23T10:32:08.152936          | None        | available |   10 |
| a4acc0c9-2a56-4bb4-bace-644a838a4922 | proxy-04-backup-2022-02-23T10:32:02.187232          | None        | available |   10 |
| 26ce6bea-6174-4960-9951-3ac8786cef96 | dumps-nfs-2022-02-23T10:31:14.228836                | None        | available |   80 |
| b33fde43-703d-4fea-a27b-90a77b6fc049 | twl-nfs-2022-02-23T09:30:51.449991                  | None        | available |  100 |
| 77e4b1dd-7115-44d9-8dc5-d10999fb1003 | testlabs-nfs-2022-02-23T09:30:42.998448             | None        | available |   40 |
| 0b02c50c-53f2-478e-8e2f-dc110b9972fb | quarry-nfs-2022-02-23T09:28:07.622987               | None        | available |  400 |
| 4716e085-6ebd-4da9-974d-0b891fab6d92 | proxy-04-backup-2022-02-23T09:27:52.369365          | None        | available |   10 |
| 2b347ed5-0dca-4495-8be7-8cd24efdea59 | huggle-nfs-2022-02-23T09:27:33.000022               | None        | available |   40 |
| 405b056c-530f-479c-9e2c-630248ae5c20 | dumps-nfs-2022-02-23T09:27:23.461385                | None        | available |   80 |
| 7f7676a4-c7b0-4dc2-8146-d76764afd6a8 | cvn-nfs-2022-02-23T09:27:14.921842                  | None        | available |    8 |
| f4d18036-f2f9-4c3b-8dd8-39cff9081925 | scratch-2022-02-23T09:25:37.183037                  | None        | available | 3072 |
| e6bf9c4c-a262-40e3-8beb-9c19545924e9 | utrs-nfs-2022-02-21T17:28:35.599328                 | None        | deleting  |   10 |
| 3d215281-4e22-40ce-852b-9555b7727f35 | quarry-nfs-2022-02-21T16:35:24.291820               | None        | available |  400 |
+--------------------------------------+-----------------------------------------------------+-------------+-----------+------+

This list should be empty, because the backup_cinder_volumes service cleans up its snapshots after running the backup. A non-empty list indicates that something is not working as expected.
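When the list is long, a per-status summary can be easier to read than the full table. The following is a minimal sketch, not part of the original runbook: the helper name count_snapshots_by_status is hypothetical, and it assumes the two-column "ID Status" output that the -f value -c ID -c Status options used later on this page produce.

```shell
# Hypothetical helper: summarise snapshots per status.
# Reads "ID Status" pairs on stdin and prints "<status> <count>" lines, sorted.
count_snapshots_by_status() {
    awk '{ count[$2]++ } END { for (s in count) print s, count[s] }' | sort
}

# Intended use on a cloudcontrol node (not run here):
#   sudo wmcs-openstack volume snapshot list -f value -c ID -c Status | count_snapshots_by_status
```

Any output at all means there are leaked snapshots. Statuses other than available (for example deleting) may point to an operation that is still in progress or stuck.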

Check the service status:

user@cloudcontrol1005:~$ sudo systemctl status backup_cinder_volumes.service

Check the service logs:

user@cloudcontrol1005:~$ sudo journalctl -u backup_cinder_volumes.service
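The full journal can be long, so it may help to narrow it down to lines that look like failures. This is only a sketch: the helper name failure_lines is hypothetical, and the keyword list is an assumption, not something taken from the actual log format of the service.

```shell
# Hypothetical helper: keep only log lines that look like failures.
# The keywords below are a guess; adjust them to what the service really logs.
failure_lines() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -iE 'error|timeout|timed out|traceback' || true
}

# Intended use on a cloudcontrol node (not run here):
#   sudo journalctl -u backup_cinder_volumes.service --since yesterday --no-pager | failure_lines
```

If this prints nothing, fall back to reading the unfiltered journal, since the real failure message may not contain any of the assumed keywords.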

Common remediation operations

Verify that the cinder API is up and running, and start it if not

Most of the time, the root cause is that the cinder API is down. To verify that it is up and running, run the following on each cloudcontrol node:

user@cloudcontrol1005:~ $ sudo systemctl status 'cinder*'

There should be three services up and running: cinder-api, cinder-volume and cinder-scheduler.
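To check the three services one by one and start whichever is down, something like the following could be used. It is a sketch under assumptions: the systemd unit names are assumed to match the service names above, and the helper name inactive_services and the variable CINDER_SERVICES are hypothetical.

```shell
# Services named in this runbook; the unit names are assumed to be identical.
CINDER_SERVICES="cinder-api cinder-volume cinder-scheduler"

# Hypothetical helper: print every service for which the check command fails.
# The check defaults to "systemctl is-active --quiet"; another command can be
# passed as the first argument, for example to try the logic without systemd.
inactive_services() {
    check="${1:-systemctl is-active --quiet}"
    for svc in $CINDER_SERVICES; do
        $check "$svc" || echo "$svc"
    done
}

# Intended use on each cloudcontrol node (not run here):
#   for svc in $(inactive_services); do sudo systemctl start "$svc"; done
```

Starting a service only helps if it then stays up, so check its status and logs again afterwards.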

Cleanup of corrupted backups and old volume snapshots

The backup_cinder_volumes service uses the admin project to store temporary volume snapshots before backing them up.

If you are sure they are not in use, you can clean them up. First, check whether there are any backups that are not in the available status:

user@cloudcontrol1005:~ $ sudo wmcs-openstack volume backup list | grep -v available

If there are any, you can delete them with:

user@cloudcontrol1005:~ $ for backup_id in $(sudo wmcs-openstack volume backup list -f value -c ID -c Status | grep -v available | awk '{print $1}'); do sudo wmcs-openstack volume backup delete --force "$backup_id"; done

Then you can proceed to remove the volume snapshots that are not being used (status available):

user@cloudcontrol1005:~ $ for i in $(sudo wmcs-openstack volume snapshot list -f value -c ID -c Status | grep available | awk '{print $1}') ; do sudo wmcs-openstack volume snapshot delete $i ; done

If you want a more aggressive approach, you can force the operation with:

user@cloudcontrol1005:~ $ for i in $(sudo wmcs-openstack volume snapshot list -f value -c ID -c Status | grep available | awk '{print $1}') ; do sudo wmcs-openstack volume snapshot delete $i --force ; done

Of course this doesn't solve the root of the problem, just the symptom.
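Before running any of the delete loops above, it can be worth previewing exactly what would be deleted. The sketch below prints the delete commands instead of executing them. The helper name preview_snapshot_deletes is hypothetical, and the sketch matches the Status column exactly rather than grepping the whole line.

```shell
# Hypothetical helper: dry run for the snapshot cleanup.
# Reads "ID Status" pairs on stdin and prints the delete command for each
# snapshot whose status is exactly "available"; nothing is deleted.
preview_snapshot_deletes() {
    awk '$2 == "available" { print "sudo wmcs-openstack volume snapshot delete " $1 }'
}

# Intended use on a cloudcontrol node (not run here):
#   sudo wmcs-openstack volume snapshot list -f value -c ID -c Status | preview_snapshot_deletes
```

Once the printed commands look right, they can be run as they are, or the loops above can be used instead.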

See also

There is no service page yet, so for now there's just the proposal:

Old occurrences