Portal:Cloud VPS/Admin/Runbooks/Check unit status of backup cinder volumes


Overview

The procedures in this runbook require admin permissions to complete.

This is the systemd timer that triggers the cinder volume backups.

Error / Incident

The systemd timer failed.

Debugging

First of all, ssh to the machine (e.g. cloudcontrol1005) and check the timer status; you might get some logs from there:

ssh cloudcontrol1005.wikimedia.org 
dcaro@cloudcontrol1005:~$ sudo systemctl status backup_cinder_volumes.timer
● backup_cinder_volumes.timer - Periodic execution of backup_cinder_volumes.service
    Loaded: loaded (/lib/systemd/system/backup_cinder_volumes.timer; enabled; vendor preset: enabled)
    Active: active (waiting) since Tue 2022-02-22 02:50:52 UTC; 1 weeks 1 days ago
   Trigger: Wed 2022-03-02 10:30:00 UTC; 1h 19min left
  Triggers: ● backup_cinder_volumes.service

You can't see it above, but the dot next to the Triggers line is colored red, which means that the service failed, so check its status:

dcaro@cloudcontrol1005:~$ sudo systemctl status backup_cinder_volumes.service
...
Mar 01 16:59:25 cloudcontrol1005 wmcs-cinder-backup-manager[2746503]: TimeoutError
Mar 01 16:59:25 cloudcontrol1005 wmcs-cinder-backup-manager[1800627]: wmcs-cinder-backup-manager: 2022-03-01 16:59:25,960:  WARNING: Failed to backup volume 8d687b46-03b8-4308-9b71-13704a664290
...
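
The status output is truncated; to see the full log of the failed run, pull it from the journal (plain journalctl, adjust the time window to cover the last run of the timer):

dcaro@cloudcontrol1005:~$ sudo journalctl -u backup_cinder_volumes.service --since "2 days ago"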

Common issues

Timeout when doing a backup

Some backups (currently maps) are really big and time out before finishing. This is currently (2022-03-02) a common cause of this type of failure, and the main source of leaked snapshots.
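
To get an idea of which volumes are big enough to be at risk of timing out, you can list them together with their sizes (plain OpenStack CLI, run as admin):

root@cloudcontrol1005:~# openstack volume list --all-projects --long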

Check the current download speed on the backup machine

Currently the machine that actually does the backups is cloudbackup2002, so check whether its network was fully saturated or whether there was any other issue with it.

Go to the host's Grafana board; if the network throughput is sustained at around 70-100 MB/s, then that is the current maximum speed, and the only alternative is to increase the timeout.
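
If you want to double-check the throughput from the host itself rather than from Grafana, any interface monitor will do; for example, assuming the sysstat package is available on the backup host:

dcaro@cloudbackup2002:~$ sar -n DEV 1 5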

Increasing the timeout of the individual backups

The timeout is currently hardcoded in the Python script triggered by the systemd service (there are two scripts: one that backs up all the volumes and is the entry point for the systemd service, and one that backs up a single volume and is called by the former).

You can find the code here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/openstack/files/victoria/admin_scripts/wmcs-cinder-backup-manager.py
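
To locate the hardcoded value on the deployed copy of the script, a quick grep is usually enough (the find is only there because the install path isn't documented here, so treat the grep path as a placeholder):

root@cloudcontrol1005:~# find / -name 'wmcs-cinder-backup*' -not -path '/proc/*' 2>/dev/null
root@cloudcontrol1005:~# grep -n -i timeout <path_printed_by_the_find_above>

Since the script lives in operations/puppet, the actual change needs to go through a puppet patch rather than a live edit on the host.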

Snapshot stuck in 'deleting' state

If you get an error like:

cinderclient.exceptions.OverLimit: SnapshotLimitExceeded: Maximum number of snapshots allowed (16) exceeded (HTTP 413) (Request-ID: req-7a6d86a5-79e3-447f-8125-1e969ef504a7)

It might be that snapshots are getting stuck in 'deleting' status (due to some underlying issue; look into that too). To check, run:

root@cloudcontrol1005:~# cinder snapshot-list --volume-id 7b037262-7214-4cef-a876-a55e26bc43be
+--------------------------------------+--------------------------------------+-----------+----------------------------------------------+------+-----------+
| ID                                   | Volume ID                            | Status    | Name                                         | Size | User ID   |
+--------------------------------------+--------------------------------------+-----------+----------------------------------------------+------+-----------+
| 784b0a3d-d93f-47fa-97ac-fbbe19b8174e | 7b037262-7214-4cef-a876-a55e26bc43be | available | wikidumpparse-nfs-2022-04-13T20:00:14.507152 | 260  | novaadmin |
| 93ba6b09-879f-441b-b9d4-4767c8e53b41 | 7b037262-7214-4cef-a876-a55e26bc43be | deleting  | wikidumpparse-nfs-2022-05-11T10:32:42.692626 | 260  | novaadmin |
+--------------------------------------+--------------------------------------+-----------+----------------------------------------------+------+-----------+

That shows all the snapshots for the volume. As you can see, there's one in 'deleting' state, and it has been there for a while (at the time of writing it's 2022-05-24). Check that there are no rbd snapshots with that ID on ceph:

root@cloudcontrol1005:~# rbd list -l  --pool eqiad1-cinder | grep 7b037262-7214-4cef-a876-a55e26bc43be
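
If you prefer to look at the rbd image for that volume directly, you can also list its snapshots explicitly (this assumes the usual cinder naming of volume-<volume-id> for the image):

root@cloudcontrol1005:~# rbd snap ls eqiad1-cinder/volume-7b037262-7214-4cef-a876-a55e26bc43be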

And if there are none, you can reset the snapshot's state to error and then delete it:

root@cloudcontrol1005:~# cinder snapshot-reset-state --state error 93ba6b09-879f-441b-b9d4-4767c8e53b41
root@cloudcontrol1005:~# cinder snapshot-delete 93ba6b09-879f-441b-b9d4-4767c8e53b41

A scripted loop that does the same for all stuck snapshots:

root@cloudcontrol1005:~# for stuck_snapshot in $(openstack volume snapshot list | grep deleting | awk '{print $2}'); do
    echo "Deleting $stuck_snapshot"
    if rbd list -l eqiad1-cinder | grep -q "$stuck_snapshot"; then
        echo "... There's some rbd leftovers, check manually"
    else
        cinder snapshot-reset-state --state error "$stuck_snapshot" && cinder snapshot-delete "$stuck_snapshot"
        echo ".... removed"
    fi
done
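
If you want to preview what the loop would touch before deleting anything, a dry-run variant of the same loop (it only prints, it doesn't reset or delete anything) is a safe first step:

root@cloudcontrol1005:~# for stuck_snapshot in $(openstack volume snapshot list | grep deleting | awk '{print $2}'); do
    if rbd list -l eqiad1-cinder | grep -q "$stuck_snapshot"; then
        echo "$stuck_snapshot: has rbd leftovers, check manually"
    else
        echo "$stuck_snapshot: would be reset to error and deleted"
    fi
done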

You should still try to find the underlying issue, but if there was some temporary instability in the system, cleaning up the snapshots might be enough.
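
To look for that underlying issue, the cinder logs on the host are usually the place to start; something along these lines works (the exact unit names are an assumption, list them first and adjust):

root@cloudcontrol1005:~# systemctl list-units 'cinder*'
root@cloudcontrol1005:~# journalctl -u cinder-volume.service --since "2 days ago" | grep -i snapshot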

Related information

Some generic info on systemd timers: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
Official cinder backup docs: https://docs.openstack.org/cinder/latest/admin/volume-backups.html

Old occurrences

Support contacts

Communication and support

Support and administration of the WMCS resources is provided by the Wikimedia Foundation Cloud Services team and Wikimedia movement volunteers. Please reach out with questions and join the conversation:

Discuss and receive general support
Receive mail announcements about critical changes: subscribe to the cloud-announce@ mailing list (all messages are also mirrored to the cloud@ list)
Track work tasks and report bugs: use the Phabricator workboard #Cloud-Services for bug reports and feature requests about the Cloud VPS infrastructure itself
Learn about major near-term plans: read the News wiki page
Read news and stories about Wikimedia Cloud Services: read the Cloud Services Blog (for the broader Wikimedia movement, see the Wikimedia Technical Blog)