Bacula
Bacula is a backup system that Wikimedia has been using since 2013.
Objective
To have an in-house, reliable system to store backed-up versions of selected data (that is, NOT everything) for a period of time that is in accordance with our privacy policy. The data should be stored in a secure way to avoid leaking information via, e.g., hard disks.
History
The switch to Bacula bypassed some of the problems with the previous Amanda setup, including disk space problems on its host (tridge). In the 2013-06-10 ops meeting it was proposed that the NFS/iSCSI shares on the NetApps could be used to solve the disk space problem, but it was quickly pointed out that both NFS and iSCSI communications are unencrypted. At the same time there were concerns about the backups being unencrypted on the end disks as well. We could use encrypting file systems, either at block level (iSCSI) or filesystem level (eCryptfs), to solve those problems. However, that would introduce problems of its own: encryption key handling, leaking of information (filesystem names in the eCryptfs case) and the possible loss of all encrypted data because the backup server is a SPOF, all of which could be avoided for the specific problem at hand. Given all that, it was proposed that we use Bacula, which has built-in encryption both for communications and storage, no information leaking, and support for a master key allowing decryption of encrypted data.
General Bacula information
The information below is not WMF-specific.
Bacula Architecture
The following diagram probably illustrates the Bacula architecture better than words:
https://www.bacula.org/9.2.x-manuals/en/images/bacula-applications.png
A couple of notes:
- There is one Director Daemon only.
- There may be multiple Storage Daemons (or SD for short) (for example one per datacenter)
- There is going to be one File Daemon (or FD for short) per machine to be backed up.
- All communications (indicated by arrows in the PNG) can be encrypted.
- There are passwords that authenticate each party to all the others. TLS/SSL can be used in addition.
- The data store can be Tapes, Files, DVDs, Diskettes. All are called Volumes. The specifics of each medium are abstracted by Bacula in day-to-day operations.
- The SQL server stores the catalog. It is the first place where information should be sought when needed; however, it is not the primary source of information. That resides, depending on the case, in the Volumes, configuration files and bootstrap files [1].
Some more information on the various concepts of Bacula follows.
Jobs
Jobs are the essential unit of activity in Bacula. Whatever Bacula does is a Job: whether it backs up, restores, verifies a backup or just moves things around in its volumes/pools, it is defined as a Job. Jobs are quite flexible, allowing arbitrary commands to be run before and after a backup, and supporting file-level de-duplication, verification of backups, multiple storage destinations and pools.
Jobdefs
Since jobs have way too many attributes that can be defined, jobdefs (short for job defaults) work as a way of storing all the standard attributes that don't change between jobs, keeping job definitions short. Think of it as an abstract definition that is inherited by the more concrete definitions of jobs.
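As a minimal sketch (the resource names, schedule and client below are made up for illustration, not taken from our configuration), a JobDefs resource and a Job inheriting from it in bacula-dir.conf could look like this:
JobDefs {
  Name = "DefaultBackup"            # shared defaults, inherited by Jobs that reference it
  Type = Backup
  Level = Incremental
  Schedule = "mysched"
  Storage = "File"
  Pool = "production"
  Messages = Standard
  Priority = 10
}
Job {
  Name = "myhost-myfileset"         # the concrete Job only lists what differs from the JobDefs
  JobDefs = "DefaultBackup"
  Client = "myhost-fd"
  FileSet = "myfileset"
}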
Levels
Backup levels are:
- Full (backup everything specified)
- Differential (backup the changes from the previous Full)
- Incremental (backup the changes from the previous Full, Differential or Incremental)
Schedules
A schedule defines when a job will take place. It supports various formats for defining that "when". This is heavily used for defining the levels easily and in an understandable way. For example:
Schedule {
  Name = "mysched"
  Run = Level=Full 1st Sat at 06:00
  Run = Level=Differential 3rd Sat at 06:00
  Run = Level=Incremental sun-fri at 07:00
}
Filesets
These define what should be backed up and what should not. They work by including a directory (or file) and recursively backing up everything under it. Exclusions are possible, filtering out by name, wildcard, regex, etc. By default, filesets do not span file-systems, in order to avoid backing up file-systems like sysfs or procfs, but this can be turned off (provided you know what you are doing). Sparse files are supported, as are whole block devices.
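A minimal FileSet sketch (the paths and options are illustrative assumptions, not our actual configuration):
FileSet {
  Name = "myfileset"
  Include {
    Options {
      signature = MD5     # checksum each file, used for verification
      sparse = yes        # handle sparse files efficiently
      onefs = yes         # do not cross file-system boundaries (the default)
    }
    File = /srv/important_dir
  }
  Exclude {
    File = /srv/important_dir/cache
  }
}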
Volumes
Volumes are what the data gets stored in. Mostly an abstraction layer for hiding device-specific behavior from the other components of Bacula. Examples of volumes are:
- tapes
- files
- DVDs
- diskettes (party like it's 1995!!!)
- FIFOs.
Volumes have unique IDs called labels. A volume can be labelled either manually or, preferably, automatically, either through an autochanger (a component of a hardware tape library) or internally by Bacula.
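For reference, a file volume can be labelled manually from bconsole roughly like this (the storage, volume and pool names here are illustrative assumptions):
echo 'label storage=File volume=production0042 pool=production' | bconsole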
Pools
Pools are just aggregates of volumes. They exist mostly so that jobs can span more than one volume (a very useful feature). They are the destination point for backups, hiding the volume specifics from the rest of the configuration. It is a requirement that all volumes in a Pool are of the same type (e.g. DVDs). A job can not span more than one Pool.
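A Pool resource sketch (the values are illustrative assumptions; the real pools are listed further down in the retention section):
Pool {
  Name = "production"
  Pool Type = Backup
  Label Format = "production"   # auto-labelled volumes become production0001, production0002, ...
  Maximum Volumes = 60
  Volume Retention = 90 days    # how long records are kept before a volume may be recycled
  Recycle = yes
  AutoPrune = yes
}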
Encryption
There are a number of communication channels in a standard Bacula setup, as shown in the architecture diagram. All of them can be configured to be encrypted independently of the others. Please note that we are talking about communications here and not storage, so this is about encryption of the TCP connection (yes, that means SSL/TLS). These are:
- Control channels. All paths in the architecture diagram starting from the Director or going to the Director are control channels. The main reason these should be encrypted is to avoid leaking the username/password used by the Director to authenticate itself to the other daemons, since if these leak, impersonation of the Director becomes possible (and relatively easy). Control channels also carry the client's (the backed-up server's) file metadata, which should be protected as well.
- Data channels. Paths for the communication between the Storage Daemon and the File Daemon. These contain the actual data; no need to explain why they should be encrypted (there is, however, a reasoning behind not encrypting them, see below).
Note that the above is all about the network part of things. However, encryption at rest might also be a requirement. The File Daemon can be configured to send its data encrypted to the Storage Daemon. In that case the actual data never leaves the client unencrypted and is stored encrypted on the end medium (tape, disk, DVD or diskette). In this case the data path could already be considered encrypted, so another layer of encryption at the communications layer is quite possibly unnecessary (TODO: confirm this). The data is encrypted using the private key of an SSL certificate and can only be decrypted with that key or a Master key.
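The relevant File Daemon directives in bacula-fd.conf look roughly like this (the keypair path is an assumption; the master key path matches what is shown in the restore section further down):
FileDaemon {
  Name = myhost-fd
  PKI Signatures = Yes                                  # sign the backed-up data
  PKI Encryption = Yes                                  # encrypt data on the client before sending it
  PKI Keypair = "/etc/bacula/ssl/myhost.pem"            # host key and certificate used to encrypt/decrypt
  PKI Master Key = "/var/lib/puppet/ssl/certs/ca.pem"   # master public key that can also decrypt the data
}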
SPOFs
The following documents the various places where problems might occur:
- The director. Indeed a SPOF. Multiple directors are not allowed at this point, and the hostname is the username in control channels. Failure of the director will make all backups and restores impossible. Reinstalling a new director is, however, relatively easy.
- The catalog. A standard MySQL server. We could have a hot-standby slave to avoid a SPOF. Backups running during failover will fail.
- The storage daemon. Multiple storage daemons can exist, although they do different jobs. The failure of a storage daemon will cause all backups and restores associated with that daemon to fail. The same problem as with the director regarding the hostname/password scheme exists. Reinstalling a new storage daemon is, however, relatively easy.
- The data store. NAS, Tape Library, DVD/CD burner etc. A major SPOF from a hardware perspective. Bacula can not do much about it, aside from having the backups mirrored in >1 locations.
WMF specifics
Architecture
A server in eqiad is used as the director and a storage daemon. Another server in codfw is used as a storage daemon for redundancy purposes. The data is stored encrypted with the file daemon's Puppet certificate key. Only the file daemon and the Puppet CA master key can decrypt the data. The clients use encrypted control channels for communication with the director daemon and the storage daemon.
Off-site backups
The codfw storage daemon is used as a mirror.
What to backup
This is decided on a case-by-case basis, configured in puppet.
DB Backups
While DB backups used to be integrated with Bacula, this created performance and concurrency issues as the sizes grew. Since 2018, a separate system is used that still eventually relies on Bacula for long-term backups, but is otherwise separate for backup generation, provisioning and snapshotting: MariaDB/Backups.
The old bacula integration is still available if needed, and at least analytics is using it for internal dbs.
Configuration Management
Everything must be done via puppet. There is a puppet module for this and role classes for director and storage daemon.
Adding a new client
In the director (if needed)
Edit the modules/profile/manifests/backup/filesets.pp profile and add:
bacula::director::fileset { 'myfileset':
    includes => [ '/srv/important_dir', '/srv/another_important_file.json', ],
}
If the path is a common one ('/', '/srv', ...) it may already be there because another server is backing up the same fileset. You can create a separate fileset with the same content but a different name if you believe that it may change in the future. However, if servers with identical or equivalent roles require the same kind of backups, please use the same fileset so they are modified at the same time (filesets are just logical identifiers for paths/lists of paths; use your best judgement when reusing them). You can also create multiple jobs for the same host if it makes sense for them to be logically separate, and/or they need separate backup policies (e.g. a server with 2 profiles).
Note the fileset name (myfileset here), because it will be used below. It should not contain forward or backward slashes. Keep it short, but meaningful.
In the client
- Add the ::profile::backup::host profile to the host's role (this will enable, but not activate, backups)
- Add backup::set { 'myfileset': } to the relevant profile of the host (see the sketch below)
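Putting the client-side pieces together, a hypothetical role and profile could look like this (the class names are made up for illustration):
# In the host's role:
class role::myservice {
    include ::profile::backup::host
    include ::profile::myservice
}
# In the relevant profile:
class profile::myservice {
    backup::set { 'myfileset': }
}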
Backup Strategy
Two auto-created, auto-labelled, file-backed volume pools, with all levels stored in the first one (production). An archival pool for historical purposes exists as well.
The following 3 types of backup schedules are standardized for WMF servers:
- Hourly: Full weekly, incremental hourly
- Weekly: Only fulls, weekly
- Monthly: Fulls monthly, diffs every other fortnight, incr. daily
They are scheduled on different days of the week to distribute the load; a sketch of how such a schedule might be expressed follows.
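As a sketch only (the run times and exact schedule definition are assumptions, not copied from our puppet config), a "Monthly" schedule along these lines could be expressed as:
Schedule {
  Name = "Monthly-1st-Fri"
  Run = Level=Full 1st fri at 02:05
  Run = Level=Differential 3rd fri at 02:05
  Run = Level=Incremental sat-thu at 02:05
}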
Operations
Handy cheatsheet: https://workaround.org/bacula-cheatsheet
Monitoring
Please note that backup alerts do not page, as they are rarely time-sensitive; however, that doesn't mean they should not be taken seriously.
On director/storage systems there is an alert that checks that each daemon is up and running. This is just a basic check that catches a bad or missing configuration being merged, as well as other kinds of host-related fatal errors. This should be a relatively rare event; if it happens you should check (a command sketch follows this list):
- Check journalctl/syslog to understand the state of the daemon
- Try to start it and monitor the above logs to get an error message
- Check previously merged changes to backup profile or bacula module
- Check the general state of the host, including its disk arrays
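A minimal command sketch for those checks (the unit names assume the standard Debian packaging and may differ):
systemctl status bacula-director          # on the director host
journalctl -u bacula-sd --since today     # storage daemon logs on a storage host
systemctl restart bacula-fd               # try to (re)start a file daemon and watch the logs above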
The more interesting check is the Bacula freshness alert, present only on the director. This will be green if and only if it finds at least one full and one incremental or differential backup within the configured freshness (with some threshold) for each puppet-configured job. For example, if a backup of MyHost has been configured as "Monthly on Thursdays", it should generate an incremental or differential each week, and a monthly full backup. The check will be green only if it finds a correct full backup in the last 31(+) days, and the latest backup of any kind in the last 7(+) days. The alert will summarize the global status by classifying backups in the following way, for easier triaging (from better to worse):
- Fresh: Latest backups were successful (defined as: they terminated successfully and, as an additional condition for fulls, they are non-0-byte backups)
- No backups: there were no backups or backup attempts recorded (successful or not). This could just be a new host, recently configured to be backed up, or something weird happening with the scheduling/configuration
- Stale-full only: there are backups, but the fulls are stale (meaning older than the expected date). This could mean that the latest full backup failed, that it is taking longer than usual, or that something is wrong with newly generated backups
- Stale: there are backups (including at least a full one), but they are all older than expected, both the fulls and the incrementals/differentials
- All failures: all backup attempts for this host failed, for as long as there are records
At the moment, the backup schedule is assumed from the name of the job, with the following hardcoded time thresholds to complete the backups:
- Hourly-*: Full weekly, incremental hourly; +1 day, +3 hours of buffer
- Weekly-*: Only fulls, weekly; + 1 day of buffer
- Monthly-* (default): Fulls monthly, diffs every other fortnight, incr. daily; +1 day, +1 day of buffer
The alert will provide the hostname of the alphabetically first failure occurrence for easier debugging.
How to proceed will depend on the particular case. For a more verbose output of all jobs that failed or were successful, run:
check_bacula.py --verbose
Please note that several independent jobs (one per configured fileset) can be configured on the same host. One can manually check the alert's original data by running:
check_bacula.py <name_of_the_job>
(substituting the appropriate job name), which is just a wrapper around running the following on the backup director shell:
echo 'llist jobname=grafana1002.eqiad.wmnet-Monthly-1st-Fri-production-var-lib-grafana' | bconsole
To understand the origin of the error, the Bacula log, present at /var/lib/bacula/log, is very helpful for identifying whether there was a recent attempt to back up the specific host, and what the causes of the error were. Typical cases include:
- Breakage of the director (has it been down or misconfigured? Are director, client and storages the right version [FD < (DIR == SD)]?)
- Breakage of the storage daemon (are the RAIDs healthy?)
- Problem at the client (has the client been down for a long time? Is the port open and available?)
- Network connectivity issues between client and director/storage
- Has the backup set gone away and/or is it misconfigured?
Monitoring of storage/pools/volumes/devices: TBD
Day to day
Generally nothing. Occasionally we've seen the following problem: the size of the backups increases enough to throw the schedule out of plan, which means no immediately writeable volumes are around, and all backups get paused while waiting for a volume to be allowed to be recycled. Judging whether this is a one-time incident or whether a change in the schedule is required is a bit difficult; it requires knowing the history a bit. In any case, the issue can be fixed temporarily by purging the oldest volume around.
echo list media | sudo bconsole
should return the list of volumes; find the oldest one (LastWritten is your friend) and purge it:
echo "purge volume=productionXXXX" | sudo bconsole
and backups should resume.
Statistics
To be created
Restore (aka Panic mode)
ssh to the backup director (e.g. helium/backup1001) and follow the steps below (an example session is sketched after them):
- bconsole
- restore
- select from the menu the desired case (Most often 5: Most recent backup for a client)
- Select the server
- Choose the FileSet to be restored
- Use the new prompt to browse the bvfs (bacula virtual filesystem) if file metadata has not been expired from the database. Standard ls and cd commands apply. Mark the files/dirs you want restored. If you specified a date old enough, you will not be able to browse and you will have to restore the entire fileset
- use the mark command to mark files you want to be restored. wildcards work, there is also unmark
- enter done
- modify the job if needed (for example change the destination directory)
- wait :-) (you can use the messages command to see the status of the restore job)
- fetch your backups from /var/tmp/bacula-restores (on the client)
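A condensed example of what such a session can look like in bconsole (the client and paths are illustrative assumptions):
* restore
    (choose "5: Most recent backup for a client", then select the client and fileset)
cwd is: /
$ cd /srv/important_dir
$ mark *
$ done
    (review/modify the job parameters, e.g. Where=/var/tmp/bacula-restores, then confirm)
* messages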
Restore from a non-existent host (missing private key)
If you try to restore from a host that has already been decommissioned, you can still select it as a source for the restore, but you will have to select a different host as the target. Doing that, you will see on the target host that the file structure is restored but all files are empty.
On bconsole, using the "messages" command you can see what the issue was; expect a message like "Error: Missing private key required to decrypt encrypted backup data.".
Luckily, Bacula encrypts all files with 2 keys: the host key and a global master key, which also happens to be the Puppet CA public key. You can see this in /etc/bacula/bacula-fd.conf on any host as PKI Master Key = "/var/lib/puppet/ssl/certs/ca.pem".
To restore files in this case (a condensed command sketch follows the steps):
- ssh to the puppetmaster (e.g. puppetmaster1001.eqiad.wmnet) and cd to /var/lib/puppet/server/ssl/
- concatenate the Puppet CA key and CA cert: cat ca/ca_key.pem ca/ca_crt.pem and copy the result into your clipboard
- ssh to the host you want to restore to and paste the data into a new file, "temp-restore.pem" (or any name).
- disable puppet with a reason, stop the service "bacula-fd"
- edit /etc/bacula/bacula-fd.conf, point the config to your temp key. PKI Keypair = "/etc/bacula/ssl/temp-restore.pem"
- go back to the Bacula director (e.g. helium/backup1001) and follow the normal restore steps above
- on the host you are restoring to, check /var/tmp/bacula-restores/ and verify files are not empty
- remove (shred ?) temp-restore.pem and revert your config change
- enable puppet again / let it start bacula-fd or start it yourself
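A condensed sketch of the steps above on the target host (the temporary file name follows the example above; puppet is disabled here with the stock agent command, the local wrapper may differ):
sudo tee /etc/bacula/ssl/temp-restore.pem > /dev/null     # paste the concatenated CA key+cert, then Ctrl-D
sudo puppet agent --disable "restoring encrypted backup with the master key"
sudo systemctl stop bacula-fd
# edit /etc/bacula/bacula-fd.conf and point the keypair at the temporary key:
#   PKI Keypair = "/etc/bacula/ssl/temp-restore.pem"
sudo systemctl start bacula-fd
# ... run the restore from the director, then verify /var/tmp/bacula-restores/ ...
sudo shred -u /etc/bacula/ssl/temp-restore.pem            # afterwards, revert the config and re-enable puppet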
Modify a pool's retention (or other similar properties)
- Modify the puppet code necessary to update the pool definition config file. For example: https://gerrit.wikimedia.org/r/c/operations/puppet/+/554485/2/modules/profile/manifests/backup/director.pp (Merged as 9695ec8 )
- Deploy the puppet change on the puppet master, as usual
- Run puppet so it updates the file on the director, as well as refresh the director daemon
- run bconsole and reload the pool properties with the update command:
$ bconsole
Connecting to Director backup1001.eqiad.wmnet:9101
1000 OK: 103 backup1001.eqiad.wmnet Version: 9.4.2 (04 February 2019)
Enter a period to cancel a command.
*update
Automatically selected Catalog: production
Using Catalog "production"
Update choice:
     1: Volume parameters
     2: Pool from resource
     3: Slots from autochanger
     4: Long term statistics
     5: Snapshot parameters
Choose catalog item to update (1-5): 2
The defined Pool resources are:
     1: Archive
     2: Databases
     3: Default
     4: offsite
     5: production
Select Pool resource (1-5): 2
+--------+-----------+---------+---------+---------+------------+-----------------+--------------+ ...
| PoolId | Name      | NumVols | MaxVols | UseOnce | UseCatalog | AcceptAnyVolume | VolRetention | ...
+--------+-----------+---------+---------+---------+------------+-----------------+--------------+ ...
|      5 | Databases |       5 |      60 |       0 |          1 |               0 |    7,776,000 | ...
+--------+-----------+---------+---------+---------+------------+-----------------+--------------+ ...
Pool DB record updated from resource.
You have messages.
You can check that the update was successful by listing the pools with the command list pools. Retention time is in seconds.
Bare metal recovery
There is a paid plugin by Bacula Systems that allows bare metal recovery. However, doing it manually is also relatively easy and quite straightforward as a procedure. It is roughly described below:
- Boot with your Rescue Live CDROM.
- Start the Network.
- Re-partition your hard disk(s) as they were before (e.g. from an sfdisk dump; see the sketch after this list)
- Re-format your partitions
- Install bacula-fd
- Perform a Bacula restore of all your files
- Re-install your boot loader
- Reboot
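For the partitioning step, a hedged sketch with sfdisk (device names are illustrative; this assumes a partition table dump was taken, and backed up, while the host was healthy):
# On the healthy host, beforehand (store the dump somewhere that gets backed up):
sfdisk -d /dev/sda > sda.partitions
# On the rescue system, after the disk has been replaced:
sfdisk /dev/sda < sda.partitions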