You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Portal:Cloud VPS/Admin/Runbooks/Check unit status of backup vms

From Wikitech-static
< Portal:Cloud VPS‎ | Admin‎ | Runbooks
Revision as of 14:43, 22 March 2022 by imported>David Caro (→‎Old incidents)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search
The procedures in this runbook require admin permissions to complete.

Error / Incident

The backup_vms unit is failing, that means that the backups for virtual machines are not working as expected.

Debugging

To gather logs, just ssh to the host having the issue and check the backups_vm.service unit status:

root@cloudvirt1021:~# systemctl status backup_vms
● backup_vms.service - Backup vms assigned to this host
   Loaded: loaded (/lib/systemd/system/backup_vms.service; static)
   Active: failed (Result: exit-code) since Mon 2022-03-21 20:01:53 UTC; 17h ago
TriggeredBy: ● backup_vms.timer
   Process: 4099425 ExecStart=/usr/local/sbin/wmcs-backup instances backup-assigned-vms (code=exited, status=1/FAILURE)
  Main PID: 4099425 (code=exited, status=1/FAILURE)

Mar 21 20:01:52 cloudvirt1021 wmcs-backup[4099425]:     backup_entries = get_backups()
Mar 21 20:01:52 cloudvirt1021 wmcs-backup[4099425]:   File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 600, in get_backups
Mar 21 20:01:52 cloudvirt1021 wmcs-backup[4099425]:     backup = subprocess.check_output([BACKY, "-ms", "ls"])
Mar 21 20:01:52 cloudvirt1021 wmcs-backup[4099425]:   File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
Mar 21 20:01:52 cloudvirt1021 wmcs-backup[4099425]:     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
Mar 21 20:01:52 cloudvirt1021 wmcs-backup[4099425]:   File "/usr/lib/python3.9/subprocess.py", line 528, in run
Mar 21 20:01:52 cloudvirt1021 wmcs-backup[4099425]:     raise CalledProcessError(retcode, process.args,
Mar 21 20:01:52 cloudvirt1021 wmcs-backup[4099425]: subprocess.CalledProcessError: Command '['/usr/bin/backy2', '-ms', 'ls']' returned non-zero exit status 100.
Mar 21 20:01:53 cloudvirt1021 systemd[1]: backup_vms.service: Main process exited, code=exited, status=1/FAILURE
Mar 21 20:01:53 cloudvirt1021 systemd[1]: backup_vms.service: Failed with result 'exit-code'.


You can also check the logs for a longer view:

root@cloudvirt1021:~# journalctl -u backup_vms


Common issues

Trying to backup a VM that is not there anymore

Currently the backups gather the list of VMs to backup at the start, but it might happen that one of them gets removed while the script is backing up other VMs, in that case the next run will update the VM list and fix the issue.

There's currently a task to do something more elegant: task T276892.

There's a new VM/project that should not be backed up

This is the projects/VMs lists that control which one get backups or does not, just add/remove the entry there.

Disk full

If this happens, there's a simple workaround that might work, doing a full cleanup, first make sure that backy will not run during the cleanup:

root@cloudvirt1021:~# systemctl stop backup_vms.timer

root@cloudvirt1021:~# systemctl status backup_vms.service
# wait for this to finish if it's running, probably not as it broke

Then run:

root@cloudvirt1021:~# backy2 cleanup --full
   INFO: [backy2.logging] $ /usr/bin/backy2 cleanup --full
   INFO: [backy2.logging] Cleanup: Removed 500709 blobs
   INFO: [backy2.logging] Backy complete.

Start the timer again:

root@cloudvirt1021:~# systemctl stop backup_vms.timer

Corrupted database

This one usually presents as the following errors in the logs:

Mar 19 20:03:45 cloudvirt1021 wmcs-backup[2952831]: INFO:[2022-03-19 20:03:45,731] Creating full backup of pool:eqiad1-compute, image_name:fff879b9-8300-4a11-9cf6-3d424be9ffa3_disk, snapshot_name:2022-03-19T20:03:45_cloudvirt1021
Mar 19 20:04:10 cloudvirt1021 wmcs-backup[2954847]:    ERROR: [backy2.logging] Couldn't parse datetime string: 'b0094916-a7bf-11ec-95b0-dd68c0a50dd7'
Mar 19 20:04:10 cloudvirt1021 wmcs-backup[2954847]:    ERROR: [backy2.logging] Backy failed.

You can verify by checking the db integrity (any output other than ok is bad):

root@cloudvirt1021:~# sqlite3 /srv/backy/backy.sqlite
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> pragma integrity_check;
*** in database main ***
On tree page 71625 cell 219: Rowid 861598 out of order
On tree page 4 cell 0: 2nd reference to page 71847
row 304958 missing from index ix_blocks_checksum
...

An to fix, dump the db and reload it, first make sure backy is not running, as doing this mid-db update might break things:

root@cloudvirt1021:~# systemctl stop backup_vms.timer

root@cloudvirt1021:~# systemctl status backup_vms.service
# wait for this to finish if it's running, probably not as it broke

Then dump the DB:

root@cloudvirt1021:~# sqlite3 /srv/backy/backy.sqlite
sqlite> .output backup.db
sqlite> .dump
sqlite> .quit 

# the backy.fixed.sqlite file did not exist yet, sqlite3 creates it
root@cloudvirt1021:~# sqlite3 backy.fixed.sqlite
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> .read backup.db
sqlite> .quit

# restore the DB
root@cloudvirt1021:~# cp backy.fixed.sqlite /srv/backy2/backy.sqlite

Finally, as some entries might have been lost, you can do a full cleanup of the backups, this might take some time:

root@cloudvirt1021:~# backy2 cleanup --full
   INFO: [backy2.logging] $ /usr/bin/backy2 cleanup --full
   INFO: [backy2.logging] Cleanup: Removed 500709 blobs
   INFO: [backy2.logging] Backy complete.

Start the timer again:

root@cloudvirt1021:~# systemctl stop backup_vms.timer

Related information

Old incidents

  • task T304408 cloudvirt1021 - CRITICAL: Status of the systemd unit backup_vms
  • task T303870 cloudvirt1022/Check unit status of backup_vms is CRITICAL

Contacts

Communication and support

Support and administration of the WMCS resources is provided by the Wikimedia Foundation Cloud Services team and Wikimedia movement volunteers. Please reach out with questions and join the conversation:

Discuss and receive general support
Receive mail announcements about critical changes
Subscribe to the cloud-announce@ mailing list (all messages are also mirrored to the cloud@ list)
Track work tasks and report bugs
Use the Phabricator workboard #Cloud-Services for bug reports and feature requests about the Cloud VPS infrastructure itself
Learn about major near-term plans
Read the News wiki page
Read news and stories about Wikimedia Cloud Services
Read the Cloud Services Blog (for the broader Wikimedia movement, see the Wikimedia Technical Blog)