Portal:Cloud VPS/Admin/Runbooks/Check unit status of backup cinder volumes

From Wikitech-static
Jump to navigation Jump to search
imported>David Caro
imported>Nskaggs
Line 14: Line 14:
First of all, ssh to the machine (ex. cloudcontrol1005) and check the timer status, might get some logs from there:
First of all, ssh to the machine (ex. cloudcontrol1005) and check the timer status, might get some logs from there:


ssh cloudcontrol1005.wikimedia.org  
<syntaxhighlight lang="shell-session">
dcaro@cloudcontrol1005:~$ sudo systemctl status backup_cinder_volumes.timer
$ ssh cloudcontrol1005.wikimedia.org  
</syntaxhighlight>
 
<syntaxhighlight lang="shell-session">
user@cloudcontrol1005:~$ sudo systemctl status backup_cinder_volumes.timer
  ● backup_cinder_volumes.timer - Periodic execution of backup_cinder_volumes.service
  ● backup_cinder_volumes.timer - Periodic execution of backup_cinder_volumes.service
     Loaded: loaded (/lib/systemd/system/backup_cinder_volumes.timer; enabled; vendor preset: enabled)
     Loaded: loaded (/lib/systemd/system/backup_cinder_volumes.timer; enabled; vendor preset: enabled)
Line 21: Line 25:
     Trigger: Wed 2022-03-02 10:30:00 UTC; 1h 19min left
     Trigger: Wed 2022-03-02 10:30:00 UTC; 1h 19min left
   Triggers: ● backup_cinder_volumes.service
   Triggers: ● backup_cinder_volumes.service
</syntaxhighlight>
You don't see it above, but the dot <code>●</code> near the <code>Triggers</code> section is colored red, that means that the service failed.
Check the service status:
<syntaxhighlight lang="shell-session">
user@cloudcontrol1005:~$ sudo systemctl status backup_cinder_volumes.service
</syntaxhighlight>
Check the service logs:
<syntaxhighlight lang="shell-session">
user@cloudcontrol1005:~$ sudo journalctl -u backup_cinder_volumes.service
</syntaxhighlight>
Check cinder logs:
<syntaxhighlight lang="shell-session">
user@cloudcontrol1005:~$ sudo journalctl -u cinder-volume.service
</syntaxhighlight>


You don't see it above, but the dot <code></code> near the <code>Triggers</code> section is colored red, that means that the service failed, so check that status:
Check all services, including logs
<syntaxhighlight lang="shell-session">
user@cloudcontrol1005:~# sudo systemctl status cinder* -1
</syntaxhighlight>


dcaro@cloudcontrol1005:~$ sudo systemctl status backup_cinder_volumes.service
There should be 3 services up and running, <code>cinder-api</code>, <code>cinder-volume</code> and <code>cinder-scheduler</code>.
...
Mar 01 16:59:25 cloudcontrol1005 wmcs-cinder-backup-manager[2746503]: TimeoutError
Mar 01 16:59:25 cloudcontrol1005 wmcs-cinder-backup-manager[1800627]: wmcs-cinder-backup-manager: 2022-03-01 16:59:25,960:  WARNING: Failed to backup volume 8d687b46-03b8-4308-9b71-13704a664290
...


= Common issues =
= Common issues =
Line 42: Line 66:
That is currently hardcoded in the python script triggered by the systemd service (there's two scripts, one that does the backups of all volumes, and it's the entry point for the systemctl service, and one that does the backup of a single volume that is used by the former).
That is currently hardcoded in the python script triggered by the systemd service (there's two scripts, one that does the backups of all volumes, and it's the entry point for the systemctl service, and one that does the backup of a single volume that is used by the former).


You can find the code [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/openstack/files/victoria/admin_scripts/wmcs-cinder-backup-manager.py here].
You can find the code [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/openstack/files/wallaby/admin_scripts/wmcs-cinder-backup-manager.py here].


== Snapshot stuck in 'deleting' state ==
== Snapshot stuck in 'deleting' state ==
Line 78: Line 102:
= Old occurences =
= Old occurences =
* [[phab:T302855]]
* [[phab:T302855]]
* [[phab:T310103]]


= Support contacts =
= Support contacts =
{{:Help:Cloud Services communication}}
{{:Help:Cloud Services communication}}

Revision as of 21:47, 7 June 2022

Overview

The procedures in this runbook require admin permissions to complete.

This is the systemd timer that triggers the Cinder volume backups.

Error / Incident

The systemd timer failed.
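
If it's not immediately obvious which unit the alert refers to, listing every failed unit on the host gives a quick overview (standard systemctl, nothing host-specific assumed):

user@cloudcontrol1005:~$ sudo systemctl --failed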

Debugging

First of all, ssh to the machine (e.g. cloudcontrol1005) and check the timer status; you might get some logs from there:

$ ssh cloudcontrol1005.wikimedia.org
user@cloudcontrol1005:~$ sudo systemctl status backup_cinder_volumes.timer
 ● backup_cinder_volumes.timer - Periodic execution of backup_cinder_volumes.service
     Loaded: loaded (/lib/systemd/system/backup_cinder_volumes.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Tue 2022-02-22 02:50:52 UTC; 1 weeks 1 days ago
    Trigger: Wed 2022-03-02 10:30:00 UTC; 1h 19min left
   Triggers: ● backup_cinder_volumes.service

You don't see it above, but the dot (●) next to the Triggers line is colored red, which means that the service failed.

Check the service status:

user@cloudcontrol1005:~$ sudo systemctl status backup_cinder_volumes.service
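
In a past occurrence, for example, the status output showed the backup of one volume timing out:

...
Mar 01 16:59:25 cloudcontrol1005 wmcs-cinder-backup-manager[2746503]: TimeoutError
Mar 01 16:59:25 cloudcontrol1005 wmcs-cinder-backup-manager[1800627]: wmcs-cinder-backup-manager: 2022-03-01 16:59:25,960:  WARNING: Failed to backup volume 8d687b46-03b8-4308-9b71-13704a664290
...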

Check the service logs:

user@cloudcontrol1005:~$ sudo journalctl -u backup_cinder_volumes.service
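
The run history can be long; journalctl's standard filters help narrow it down to the run you care about (the date below is just a placeholder, adjust it and the priority to taste):

user@cloudcontrol1005:~$ sudo journalctl -u backup_cinder_volumes.service --since "2022-03-01" -p warning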

Check cinder logs:

user@cloudcontrol1005:~$ sudo journalctl -u cinder-volume.service

Check all the services, including their logs:

user@cloudcontrol1005:~$ sudo systemctl status cinder* -l

There should be three services up and running: cinder-api, cinder-volume and cinder-scheduler.
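
To quickly confirm the active state of those three (standard systemctl; the output below is what a healthy host should show):

user@cloudcontrol1005:~$ for svc in cinder-api cinder-volume cinder-scheduler; do echo -n "$svc: "; sudo systemctl is-active "$svc"; done
cinder-api: active
cinder-volume: active
cinder-scheduler: active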

Common issues

Timeout when doing a backup

Some backups (currently maps) are really big and time out before finishing. This is currently (2022-03-02) a common cause of this type of failure, and the main source of leaked snapshots.
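
To spot which volumes are the big ones, the volume listing on the cloudcontrol host shows a Size column (in GiB); this assumes the usual admin credentials are already loaded in the shell:

root@cloudcontrol1005:~# openstack volume list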

Check the current speed of download in the backup machine

Currently the machine that actually does the backups is cloudbackup2002, so we can check whether the network was fully saturated or there were any other issues on it.

Go to the host's Grafana board; if the network is sustained at around 70-100 MB/s, then that's the current maximum speed, and the only alternative is to increase the timeout.
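
If you'd rather sample the NIC throughput from the shell than from Grafana, sysstat's sar works too (this assumes the sysstat package is installed on the host):

user@cloudbackup2002:~$ sar -n DEV 5 3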

Increasing the timeout of the individual backups

That is currently hardcoded in the Python script triggered by the systemd service (there are two scripts: one that backs up all the volumes and is the entry point for the systemd service, and one that backs up a single volume and is used by the former).

You can find the code at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/openstack/files/wallaby/admin_scripts/wmcs-cinder-backup-manager.py
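
To find the value to bump, grepping the deployed copy of the script is usually enough; the path below is an assumption, so locate the script first if it lives elsewhere on the cloudcontrol hosts:

user@cloudcontrol1005:~$ grep -n -i timeout /usr/local/sbin/wmcs-cinder-backup-manager.py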

Snapshot stuck in 'deleting' state

If you get an error like:

cinderclient.exceptions.OverLimit: SnapshotLimitExceeded: Maximum number of snapshots allowed (16) exceeded (HTTP 413) (Request-ID: req-7a6d86a5-79e3-447f-8125-1e969ef504a7)

It might be that the snapshots are getting stuck in 'deleting' status (due to some underlying issue, look into that too). To check, run:

root@cloudcontrol1005:~# cinder snapshot-list --volume-id 7b037262-7214-4cef-a876-a55e26bc43be
+--------------------------------------+--------------------------------------+-----------+----------------------------------------------+------+-----------+
| ID                                   | Volume ID                            | Status    | Name                                         | Size | User ID   |
+--------------------------------------+--------------------------------------+-----------+----------------------------------------------+------+-----------+
| 784b0a3d-d93f-47fa-97ac-fbbe19b8174e | 7b037262-7214-4cef-a876-a55e26bc43be | available | wikidumpparse-nfs-2022-04-13T20:00:14.507152 | 260  | novaadmin |
| 93ba6b09-879f-441b-b9d4-4767c8e53b41 | 7b037262-7214-4cef-a876-a55e26bc43be | deleting  | wikidumpparse-nfs-2022-05-11T10:32:42.692626 | 260  | novaadmin |
+--------------------------------------+--------------------------------------+-----------+----------------------------------------------+------+-----------+

That shows all the snapshots for the volume. As you can see, there's one in 'deleting' state, and it has been there for a while (at the time of writing it's 2022-05-24). Check that there are no rbd snapshots with that ID on Ceph:

root@cloudcontrol1005:~# rbd list -l  --pool eqiad1-cinder | grep 7b037262-7214-4cef-a876-a55e26bc43be
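
You can also list the snapshots of that volume's image directly; the volume-&lt;uuid&gt; image name follows the Cinder RBD driver's usual convention, so treat it as an assumption and confirm the image name from the listing above first:

root@cloudcontrol1005:~# rbd snap ls eqiad1-cinder/volume-7b037262-7214-4cef-a876-a55e26bc43be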

And if there are none, you can delete the snapshot by setting its state to error:

root@cloudcontrol1005:~# cinder snapshot-reset-state --state error 93ba6b09-879f-441b-b9d4-4767c8e53b41
root@cloudcontrol1005:~# cinder snapshot-delete 93ba6b09-879f-441b-b9d4-4767c8e53b41

The same steps as a scripted loop:

root@cloudcontrol1005:~# for stuck_snapshot in $(openstack volume snapshot list | grep deleting | awk '{print $2}'); do
>     echo "Deleting $stuck_snapshot"
>     if rbd list -l eqiad1-cinder | grep -q "$stuck_snapshot"; then
>         echo "... There's some rbd leftovers, check manually"
>     else
>         cinder snapshot-reset-state --state error "$stuck_snapshot" \
>             && cinder snapshot-delete "$stuck_snapshot"
>         echo ".... removed"
>     fi
> done

You should still try to find the underlying issue, but if there was some instability in the system, cleaning up the snapshots might be enough.

Related information

Old occurrences

T302855
T310103

Support contacts

Communication and support

Support and administration of the WMCS resources is provided by the Wikimedia Foundation Cloud Services team and Wikimedia movement volunteers. Please reach out with questions and join the conversation:

Discuss and receive general support
Receive mail announcements about critical changes: subscribe to the cloud-announce@ mailing list (all messages are also mirrored to the cloud@ list)
Track work tasks and report bugs: use the Phabricator workboard #Cloud-Services for bug reports and feature requests about the Cloud VPS infrastructure itself
Learn about major near-term plans: read the News wiki page
Read news and stories about Wikimedia Cloud Services: read the Cloud Services Blog (for the broader Wikimedia movement, see the Wikimedia Technical Blog)