Portal:Cloud VPS/Admin/Runbooks/Check for snapshots leaked by cinder backup agent
Error / Incident
Usually an email/Alertmanager/Icinga alert with the subject ** PROBLEM alert - <hostname>/Check for snapshots leaked by cinder backup agent test is CRITICAL **
This happens when something is going wrong with the periodic cinder backups. Common causes:
- A backup times out.
- The cinder-volume service is down.
Debugging
Quick check
Check whether there are any leaked snapshots:
user@cloudcontrol1005:~ $ sudo wmcs-openstack volume snapshot list
+--------------------------------------+-----------------------------------------------------+-------------+-----------+------+
| ID | Name | Description | Status | Size |
+--------------------------------------+-----------------------------------------------------+-------------+-----------+------+
| d4aad7fb-97ed-4fa5-a06b-ae7f4b76feab | wmde-templates-alpha-nfs-2022-02-23T10:34:32.423757 | None | available | 10 |
| 4406f4ce-ca22-4f57-a8e5-8dff8cf32270 | wikilink-nfs-2022-02-23T10:34:01.855598 | None | available | 10 |
| e5c9d3ef-3d8a-40f5-90f0-900f1e87297a | wikidumpparse-nfs-2022-02-23T10:32:36.696177 | None | available | 260 |
| 9d9aba32-9795-4d60-9d00-1005f5a19483 | proxy-03-backup-2022-02-23T10:32:08.152936 | None | available | 10 |
| a4acc0c9-2a56-4bb4-bace-644a838a4922 | proxy-04-backup-2022-02-23T10:32:02.187232 | None | available | 10 |
| 26ce6bea-6174-4960-9951-3ac8786cef96 | dumps-nfs-2022-02-23T10:31:14.228836 | None | available | 80 |
| b33fde43-703d-4fea-a27b-90a77b6fc049 | twl-nfs-2022-02-23T09:30:51.449991 | None | available | 100 |
| 77e4b1dd-7115-44d9-8dc5-d10999fb1003 | testlabs-nfs-2022-02-23T09:30:42.998448 | None | available | 40 |
| 0b02c50c-53f2-478e-8e2f-dc110b9972fb | quarry-nfs-2022-02-23T09:28:07.622987 | None | available | 400 |
| 4716e085-6ebd-4da9-974d-0b891fab6d92 | proxy-04-backup-2022-02-23T09:27:52.369365 | None | available | 10 |
| 2b347ed5-0dca-4495-8be7-8cd24efdea59 | huggle-nfs-2022-02-23T09:27:33.000022 | None | available | 40 |
| 405b056c-530f-479c-9e2c-630248ae5c20 | dumps-nfs-2022-02-23T09:27:23.461385 | None | available | 80 |
| 7f7676a4-c7b0-4dc2-8146-d76764afd6a8 | cvn-nfs-2022-02-23T09:27:14.921842 | None | available | 8 |
| f4d18036-f2f9-4c3b-8dd8-39cff9081925 | scratch-2022-02-23T09:25:37.183037 | None | available | 3072 |
| e6bf9c4c-a262-40e3-8beb-9c19545924e9 | utrs-nfs-2022-02-21T17:28:35.599328 | None | deleting | 10 |
| 3d215281-4e22-40ce-852b-9555b7727f35 | quarry-nfs-2022-02-21T16:35:24.291820 | None | available | 400 |
+--------------------------------------+-----------------------------------------------------+-------------+-----------+------+
This list should be empty, because the backup_cinder_volumes service cleans up the snapshots after each backup run. If the list is not empty, that is an indication that something is not working as expected.
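As a quick one-liner, you can count the leftover snapshots directly (a minimal sketch reusing the same openstack client flags as the cleanup commands further below; the output should be 0):
user@cloudcontrol1005:~$ sudo wmcs-openstack volume snapshot list -f value -c ID | wc -l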
Check the service status:
user@cloudcontrol1005:~$ sudo systemctl status backup_cinder_volumes.service
Check the service logs:
user@cloudcontrol1005:~$ sudo journalctl -u backup_cinder_volumes.service
Check cinder logs:
user@cloudcontrol1005:~$ sudo journalctl -u cinder-volume.service
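If the logs are noisy, narrowing them down to recent errors can help (a sketch; the time window and the grep pattern are arbitrary choices):
user@cloudcontrol1005:~$ sudo journalctl -u backup_cinder_volumes.service --since "1 day ago" | grep -iE 'error|timeout'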
Common remediation operations
Verify that the cinder API is up and running, and start it if not
Most of the time, the cinder API being down is the root of the problem. To verify that it's up and running, run on each cloudcontrol node:
user@cloudcontrol1005:~$ sudo wmcs-openstack volume service list
+------------------+----------------------+------+---------+-------+----------------------------+
| Binary | Host | Zone | Status | State | Updated At |
+------------------+----------------------+------+---------+-------+----------------------------+
| cinder-scheduler | cloudcontrol1004 | nova | enabled | up | 2022-06-06T14:52:24.000000 |
| cinder-scheduler | cloudcontrol1003 | nova | enabled | up | 2022-06-06T14:52:28.000000 |
| cinder-volume | cloudcontrol1004@rbd | nova | enabled | up | 2022-06-06T14:52:29.000000 |
| cinder-volume | cloudcontrol1005@rbd | nova | enabled | up | 2022-06-06T14:52:23.000000 |
| cinder-volume | cloudcontrol1003@rbd | nova | enabled | up | 2022-06-06T14:52:27.000000 |
| cinder-scheduler | cloudcontrol1005 | nova | enabled | up | 2022-06-06T14:52:28.000000 |
| cinder-backup | cloudbackup2002 | nova | enabled | up | 2022-06-06T14:52:22.000000 |
+------------------+----------------------+------+---------+-------+----------------------------+
You can also check the cinder systemd units on each cloudcontrol node:
user@cloudcontrol1005:~$ sudo systemctl status cinder* -l
There should be 3 services up and running: cinder-api, cinder-volume and cinder-scheduler.
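If one of them is down, restarting the corresponding unit is usually enough (a sketch using cinder-volume as an example; substitute the failing unit):
user@cloudcontrol1005:~$ sudo systemctl restart cinder-volume.service
user@cloudcontrol1005:~$ sudo systemctl status cinder-volume.service -l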
Examine leftover snapshots
user@cloudcontrol1005:~$ sudo wmcs-openstack volume snapshot show b56c4fea-5c77-4e35-bc6b-6ace1e1dd996
+--------------------------------------------+--------------------------------------+
| Field | Value |
+--------------------------------------------+--------------------------------------+
| created_at | 2022-06-06T10:30:02.000000 |
| description | None |
| id | b56c4fea-5c77-4e35-bc6b-6ace1e1dd996 |
| name | scratch-2022-06-06T10:30:02.003496 |
| os-extended-snapshot-attributes:progress | 100% |
| os-extended-snapshot-attributes:project_id | admin |
| properties | |
| size | 3072 |
| status | available |
| updated_at | 2022-06-06T14:06:57.000000 |
| volume_id | d1478efd-9fa6-4293-8389-e72459b794c0 |
+--------------------------------------------+--------------------------------------+
user@cloudcontrol1005:~$ sudo wmcs-openstack volume show d1478efd-9fa6-4293-8389-e72459b794c0
+--------------------------------+-----------------------------------------------------------------------------------------------------------+
| Field | Value |
+--------------------------------+-----------------------------------------------------------------------------------------------------------+
| attachments | [{'id': 'd1478efd-9fa6-4293-8389-e72459b794c0', 'attachment_id': '957e9c36-04c7-4234-998f-7bab32174d93', |
| | 'volume_id': 'd1478efd-9fa6-4293-8389-e72459b794c0', 'server_id': '2fd8eb82-33ec-4060-91c6-cc0a90de8994', |
| | 'host_name': 'cloudvirt1046', 'device': '/dev/sdb', 'attached_at': '2022-05-13T04:31:46.000000'}] |
| availability_zone | nova |
| bootable | false |
| consistencygroup_id | None |
| created_at | 2022-01-14T22:28:57.000000 |
| description | None |
| encrypted | False |
| id | d1478efd-9fa6-4293-8389-e72459b794c0 |
| migration_status | None |
| multiattach | False |
| name | scratch |
| os-vol-host-attr:host | cloudcontrol1004@rbd#RBD |
| os-vol-mig-status-attr:migstat | None |
| os-vol-mig-status-attr:name_id | None |
| os-vol-tenant-attr:tenant_id | cloudinfra-nfs |
| properties | |
| replication_status | None |
| size | 3072 |
| snapshot_id | None |
| source_volid | None |
| status | in-use |
| type | standard |
| updated_at | 2022-05-13T04:33:39.000000 |
| user_id | novaadmin |
+--------------------------------+-----------------------------------------------------------------------------------------------------------+
Cleanup of corrupted backups and old volume snapshots
The backup_cinder_volumes service uses the admin project to store temporary volume snapshots before backing them up.
If you are sure they are not in use, you can just clean them up. To do that, first check whether there are any backups that are not in the available state:
user@cloudcontrol1005:~ $ sudo wmcs-openstack volume backup list | grep -v available
If there are any, you can delete them with:
user@cloudcontrol1005:~ $ for backup_id in $(sudo wmcs-openstack volume backup list -f value -c ID -c Status | grep -v available | awk '{print $1}'); do sudo wmcs-openstack volume backup delete --force "$backup_id"; done
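Afterwards, re-run the backup list to confirm that no stuck backups remain (the output should be empty):
user@cloudcontrol1005:~ $ sudo wmcs-openstack volume backup list | grep -v available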
Then you can proceed to remove the volume snapshots that are not being used (status available):
user@cloudcontrol1005:~ $ for i in $(sudo wmcs-openstack volume snapshot list -f value -c ID -c Status | grep available | awk '{print $1}') ; do sudo wmcs-openstack volume snapshot delete $i ; done
If you want a more aggressive approach, you can force the operation with:
user@cloudcontrol1005:~ $ for i in $(sudo wmcs-openstack volume snapshot list -f value -c ID -c Status | grep available | awk '{print $1}') ; do sudo wmcs-openstack volume snapshot delete $i --force ; done
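Either way, re-run the snapshot list to confirm that the cleanup worked (the list should now be empty):
user@cloudcontrol1005:~ $ sudo wmcs-openstack volume snapshot list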
Of course, this only addresses the symptom, not the root cause of the problem.
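To verify that backups are healthy again, you can trigger a new run and follow the logs (a sketch, assuming the backup_cinder_volumes service can be started manually):
user@cloudcontrol1005:~ $ sudo systemctl start backup_cinder_volumes.service
user@cloudcontrol1005:~ $ sudo journalctl -u backup_cinder_volumes.service -f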
See also
There is no service page yet, so for now there's just the proposal:
Old occurrences
- Phabricator T302382 - icinga alert: Check for snapshots leaked by cinder backup agent -- example of a real-life alert
- phab:T302720