Portal:Cloud VPS/Admin/Runbooks/Check for snapshots leaked by cinder backup agent

Revision as of 16:00, 1 July 2022

The procedures in this runbook require admin permissions to complete.

Error / Incident

Usually an email, Alertmanager, or Icinga alert with the subject ** PROBLEM alert - <hostname>/Check for snapshots leaked by cinder backup agent test is CRITICAL **

This happens when something goes wrong with the periodic cinder backups. Common causes:

  • There's a backup that times out.
  • Cinder-volume service is down.

Debugging

Quick check

Verify leaked snapshots:

user@cloudcontrol1005:~ $ sudo wmcs-openstack volume snapshot list
+--------------------------------------+-----------------------------------------------------+-------------+-----------+------+
| ID                                   | Name                                                | Description | Status    | Size |
+--------------------------------------+-----------------------------------------------------+-------------+-----------+------+
| d4aad7fb-97ed-4fa5-a06b-ae7f4b76feab | wmde-templates-alpha-nfs-2022-02-23T10:34:32.423757 | None        | available |   10 |
| 4406f4ce-ca22-4f57-a8e5-8dff8cf32270 | wikilink-nfs-2022-02-23T10:34:01.855598             | None        | available |   10 |
| e5c9d3ef-3d8a-40f5-90f0-900f1e87297a | wikidumpparse-nfs-2022-02-23T10:32:36.696177        | None        | available |  260 |
| 9d9aba32-9795-4d60-9d00-1005f5a19483 | proxy-03-backup-2022-02-23T10:32:08.152936          | None        | available |   10 |
| a4acc0c9-2a56-4bb4-bace-644a838a4922 | proxy-04-backup-2022-02-23T10:32:02.187232          | None        | available |   10 |
| 26ce6bea-6174-4960-9951-3ac8786cef96 | dumps-nfs-2022-02-23T10:31:14.228836                | None        | available |   80 |
| b33fde43-703d-4fea-a27b-90a77b6fc049 | twl-nfs-2022-02-23T09:30:51.449991                  | None        | available |  100 |
| 77e4b1dd-7115-44d9-8dc5-d10999fb1003 | testlabs-nfs-2022-02-23T09:30:42.998448             | None        | available |   40 |
| 0b02c50c-53f2-478e-8e2f-dc110b9972fb | quarry-nfs-2022-02-23T09:28:07.622987               | None        | available |  400 |
| 4716e085-6ebd-4da9-974d-0b891fab6d92 | proxy-04-backup-2022-02-23T09:27:52.369365          | None        | available |   10 |
| 2b347ed5-0dca-4495-8be7-8cd24efdea59 | huggle-nfs-2022-02-23T09:27:33.000022               | None        | available |   40 |
| 405b056c-530f-479c-9e2c-630248ae5c20 | dumps-nfs-2022-02-23T09:27:23.461385                | None        | available |   80 |
| 7f7676a4-c7b0-4dc2-8146-d76764afd6a8 | cvn-nfs-2022-02-23T09:27:14.921842                  | None        | available |    8 |
| f4d18036-f2f9-4c3b-8dd8-39cff9081925 | scratch-2022-02-23T09:25:37.183037                  | None        | available | 3072 |
| e6bf9c4c-a262-40e3-8beb-9c19545924e9 | utrs-nfs-2022-02-21T17:28:35.599328                 | None        | deleting  |   10 |
| 3d215281-4e22-40ce-852b-9555b7727f35 | quarry-nfs-2022-02-21T16:35:24.291820               | None        | available |  400 |
+--------------------------------------+-----------------------------------------------------+-------------+-----------+------+

This list should be empty, because the backup_cinder_volumes service cleans up snapshots after running the backup. If the list is not empty, that is an indication that something is not working as expected.
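Leaked snapshot names embed the time they were created (the `<volume>-<ISO 8601 timestamp>` pattern visible in the listing above), which is a quick way to judge how stale a leak is. A minimal sketch of pulling that timestamp out of a name; the sample name is taken from the listing, and the `sed` expression is an illustration, not part of any existing tooling:

```shell
# Extract the ISO 8601 timestamp embedded in a leaked snapshot name.
# Snapshot names follow the pattern <volume>-<timestamp>, e.g. the
# scratch snapshot from the listing above.
name='scratch-2022-02-23T09:25:37.183037'
echo "$name" | sed -E 's/.*-([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:.]+)$/\1/'
# → 2022-02-23T09:25:37.183037
```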

Check the service status:

user@cloudcontrol1005:~$ sudo systemctl status backup_cinder_volumes.service

Check the service logs:

user@cloudcontrol1005:~$ sudo journalctl -u backup_cinder_volumes.service

Check cinder logs:

user@cloudcontrol1005:~$ sudo journalctl -u cinder-volume.service
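The journals are verbose, so filtering for error lines narrows things down. A hedged sketch of that filtering step, run here on fabricated sample lines (on a cloudcontrol host you would pipe the real `journalctl` output instead):

```shell
# Filter log lines for errors/timeouts, as in:
#   sudo journalctl -u cinder-volume.service --since '1 day ago' | grep -iE 'error|timed out'
# The sample lines below are made up for illustration.
printf '%s\n' \
  'INFO cinder.volume.manager Volume snapshot created' \
  'ERROR cinder.backup.manager Backup timed out' |
  grep -iE 'error|timed out'
# → ERROR cinder.backup.manager Backup timed out
```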

Common remediation operations

Verify that the cinder API is up and running, and start it if not

Most of the time, the cinder API being down is the root of the problem. To verify that it's up and running, on each cloudcontrol node:

user@cloudcontrol1005:~$ sudo wmcs-openstack volume service list
+------------------+----------------------+------+---------+-------+----------------------------+
| Binary           | Host                 | Zone | Status  | State | Updated At                 |
+------------------+----------------------+------+---------+-------+----------------------------+
| cinder-scheduler | cloudcontrol1004     | nova | enabled | up    | 2022-06-06T14:52:24.000000 |
| cinder-scheduler | cloudcontrol1003     | nova | enabled | up    | 2022-06-06T14:52:28.000000 |
| cinder-volume    | cloudcontrol1004@rbd | nova | enabled | up    | 2022-06-06T14:52:29.000000 |
| cinder-volume    | cloudcontrol1005@rbd | nova | enabled | up    | 2022-06-06T14:52:23.000000 |
| cinder-volume    | cloudcontrol1003@rbd | nova | enabled | up    | 2022-06-06T14:52:27.000000 |
| cinder-scheduler | cloudcontrol1005     | nova | enabled | up    | 2022-06-06T14:52:28.000000 |
| cinder-backup    | cloudbackup2002      | nova | enabled | up    | 2022-06-06T14:52:22.000000 |
+------------------+----------------------+------+---------+-------+----------------------------+
user@cloudcontrol1005:~# sudo systemctl status cinder* -l

There should be three services up and running: cinder-api, cinder-volume, and cinder-scheduler.
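If any of those three units is down, restarting it is the usual first step. A sketch of checking and restarting them in one pass; the `echo` only prints each command so the loop is safe to run anywhere, and you would drop it to actually execute them:

```shell
# Print the check-and-restart command for each cinder unit; remove the
# `echo` to actually run them on a cloudcontrol host.
for unit in cinder-api cinder-scheduler cinder-volume; do
  echo "sudo systemctl is-active --quiet $unit.service || sudo systemctl restart $unit.service"
done
```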

Examine leftover snapshots

user@cloudcontrol1005:~$ sudo wmcs-openstack volume snapshot show b56c4fea-5c77-4e35-bc6b-6ace1e1dd996
+--------------------------------------------+--------------------------------------+
| Field                                      | Value                                |
+--------------------------------------------+--------------------------------------+
| created_at                                 | 2022-06-06T10:30:02.000000           |
| description                                | None                                 |
| id                                         | b56c4fea-5c77-4e35-bc6b-6ace1e1dd996 |
| name                                       | scratch-2022-06-06T10:30:02.003496   |
| os-extended-snapshot-attributes:progress   | 100%                                 |
| os-extended-snapshot-attributes:project_id | admin                                |
| properties                                 |                                      |
| size                                       | 3072                                 |
| status                                     | available                            |
| updated_at                                 | 2022-06-06T14:06:57.000000           |
| volume_id                                  | d1478efd-9fa6-4293-8389-e72459b794c0 |
+--------------------------------------------+--------------------------------------+
user@cloudcontrol1005:~$ sudo wmcs-openstack volume show d1478efd-9fa6-4293-8389-e72459b794c0
+--------------------------------+-----------------------------------------------------------------------------------------------------------+
| Field                          | Value                                                                                                     |
+--------------------------------+-----------------------------------------------------------------------------------------------------------+
| attachments                    | [{'id': 'd1478efd-9fa6-4293-8389-e72459b794c0', 'attachment_id': '957e9c36-04c7-4234-998f-7bab32174d93',  |
|                                | 'volume_id': 'd1478efd-9fa6-4293-8389-e72459b794c0', 'server_id': '2fd8eb82-33ec-4060-91c6-cc0a90de8994', |
|                                | 'host_name': 'cloudvirt1046', 'device': '/dev/sdb', 'attached_at': '2022-05-13T04:31:46.000000'}]         |
| availability_zone              | nova                                                                                                      |
| bootable                       | false                                                                                                     |
| consistencygroup_id            | None                                                                                                      |
| created_at                     | 2022-01-14T22:28:57.000000                                                                                |
| description                    | None                                                                                                      |
| encrypted                      | False                                                                                                     |
| id                             | d1478efd-9fa6-4293-8389-e72459b794c0                                                                      |
| migration_status               | None                                                                                                      |
| multiattach                    | False                                                                                                     |
| name                           | scratch                                                                                                   |
| os-vol-host-attr:host          | cloudcontrol1004@rbd#RBD                                                                                  |
| os-vol-mig-status-attr:migstat | None                                                                                                      |
| os-vol-mig-status-attr:name_id | None                                                                                                      |
| os-vol-tenant-attr:tenant_id   | cloudinfra-nfs                                                                                            |
| properties                     |                                                                                                           |
| replication_status             | None                                                                                                      |
| size                           | 3072                                                                                                      |
| snapshot_id                    | None                                                                                                      |
| source_volid                   | None                                                                                                      |
| status                         | in-use                                                                                                    |
| type                           | standard                                                                                                  |
| updated_at                     | 2022-05-13T04:33:39.000000                                                                                |
| user_id                        | novaadmin                                                                                                 |
+--------------------------------+-----------------------------------------------------------------------------------------------------------+
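To examine every leaked snapshot rather than one at a time, the `show` command can be looped over the snapshot IDs. A sketch using two IDs from the earlier listing; the `echo` prints each command instead of running it, since this is purely illustrative (in practice the ID list would come from `sudo wmcs-openstack volume snapshot list -f value -c ID`):

```shell
# Loop `volume snapshot show` over snapshot IDs taken from the listing
# above; remove the `echo` to actually run the commands.
ids='d4aad7fb-97ed-4fa5-a06b-ae7f4b76feab
4406f4ce-ca22-4f57-a8e5-8dff8cf32270'
for id in $ids; do
  echo "sudo wmcs-openstack volume snapshot show $id"
done
```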


Cleanup of corrupted backups and old volume snapshots

The backup_cinder_volumes service uses the admin project to store temporary volume snapshots before backing them up.

If you are sure they are not in use, you can simply clean them up. First, check whether there are any backups in a non-available state:

user@cloudcontrol1005:~ $ sudo wmcs-openstack volume backup list | grep -v available

If there are any, you can delete them with:

user@cloudcontrol1005:~ $ for backup_id in $(sudo wmcs-openstack volume backup list -f value -c ID -c Status | grep -v available | awk '{print $1}'); do sudo wmcs-openstack volume backup delete --force "$backup_id"; done
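The filtering in that one-liner is worth unpacking: `-f value -c ID -c Status` prints bare `<ID> <Status>` pairs, `grep -v available` keeps only the non-available entries, and `awk '{print $1}'` extracts the ID column. A sketch of just that pipeline on fabricated sample output:

```shell
# The sample below mimics `volume backup list -f value -c ID -c Status`
# output; only the non-available IDs should survive the pipeline.
printf '%s\n' \
  'aaa11111 available' \
  'bbb22222 error' \
  'ccc33333 creating' |
  grep -v available | awk '{print $1}'
# → bbb22222
# → ccc33333
```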

Then you can proceed to remove the volume snapshots that are no longer in use (status available):

user@cloudcontrol1005:~ $ for i in $(sudo wmcs-openstack volume snapshot list -f value -c ID -c Status | grep available | awk '{print $1}') ; do sudo wmcs-openstack volume snapshot delete $i ; done

If you want a more aggressive approach, you can force the operation with:

user@cloudcontrol1005:~ $ for i in $(sudo wmcs-openstack volume snapshot list -f value -c ID -c Status | grep available | awk '{print $1}') ; do sudo wmcs-openstack volume snapshot delete $i --force ; done

Note that this only addresses the symptom, not the root cause of the problem.

See also

There is no service page yet, so for now there's just the proposal:

Old occurrences