
Bacula is a backup system that Wikimedia has been using since 2013.


The switch to Bacula bypassed some of the problems with the previous Amanda setup, including disk space problems on its host (tridge). In the 2013-06-10 ops meeting it was proposed that the NFS/iSCSI shares on the Netapps could be used to solve the problem stated above, but it was quickly pointed out that both NFS and iSCSI communications are unencrypted. At the same time, there are possible concerns with the state of the backups being unencrypted on the end disks as well. We could use encrypting file systems, either at the block level (iSCSI) or the filesystem level (eCryptFS), to solve these problems. However, that would cause problems of its own, like encryption key handling, leaking of information (filesystem names, in the eCryptFS case) and the possible loss of all encrypted data due to the SPOF that the backup server is, all of which, given the specific problem at hand, could be avoided. Given all that, I proposed that we use Bacula, which has inherent encryption both for communications and storage, leaks no information, and has the capability for a master key allowing decryption of encrypted data.


Bacula Architecture

The following diagram probably illustrates the Bacula architecture better than words.

A couple of notes:

  • There is only one Director Daemon.
  • There may be multiple Storage Daemons (or SD for short), for example one per datacenter.
  • There is one File Daemon (or FD for short) per machine to be backed up.
  • All communications (indicated by arrows in the diagram) can be encrypted.
  • There are passwords that authenticate each party to all the others. TLS/SSL can be used in addition.
  • The data store can be tapes, files, DVDs or diskettes. All are called Volumes. The specifics of each medium are abstracted by Bacula in day-to-day operations.
  • The SQL server stores the catalog. It is the first place where information should be sought when needed. However, it is not the primary source of information; that resides, depending on the case, in the Volumes, configuration files and bootstrap files [1].

Below I try to explain the various concepts of Bacula very briefly.


Jobs

Jobs are the essential unit of activity in Bacula. Whatever Bacula does is a Job: whether it backs up, restores, verifies a backup or just moves things around in its volumes/pools, it is defined as a Job. Jobs are quite flexible, allowing arbitrary commands to be run before and after a backup, as well as supporting file-level deduplication, verification of backups, multiple storage destinations and pools.


JobDefs

Since Jobs have far too many attributes that can be defined, JobDefs (short for Job defaults) work as a way of storing all the standard attributes that don't change between Jobs, keeping Job definitions short.
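As a sketch, a JobDefs resource and a Job using it might look like the following in bacula-dir.conf (all names here are illustrative, not our actual configuration):

   JobDefs {
      Name = "DefaultBackup"         # illustrative name
      Type = Backup
      Level = Incremental
      Schedule = "WeeklyCycle"
      Storage = "FileStorage"
      Pool = "production"
      Priority = 10
   }
   Job {
      Name = "backup-myhost"
      JobDefs = "DefaultBackup"      # pull in all the defaults above
      Client = "myhost-fd"
      FileSet = "myfs"
   }

The Job only needs to state what differs from the defaults (here the client and fileset).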


Levels

Backup levels are:

  • Full (back up everything specified)
  • Differential (back up the changes since the previous Full)
  • Incremental (back up the changes since the previous Full, Differential or Incremental)


Schedules

A schedule defines when a Job will take place. It supports various formats for defining the "when". It also has the ability to override some of a Job's defined attributes. This is heavily used for defining the levels easily and in an understandable way. For example:

   Schedule {
      Name = "WeeklyCycle"
      Run = Level=Full 1st Sat at 06:00
      Run = Level=Differential 3rd Sat at 06:00
      Run = Level=Incremental sun-fri at 07:00
   }


FileSets

These define what should be backed up and what should not. They work by including a directory (or file) and recursing under it, backing up everything. The possibility of exclusions does exist, either by filtering out by name or by wildcard, regex, etc. Generally, filesets do not span filesystems, in order to avoid backing up by default pseudo-filesystems like sysfs or proc, but this can be turned off (provided you know what you are doing). Sparse file support exists, as does whole block device support.
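A minimal FileSet sketch (the name and paths are illustrative):

   FileSet {
      Name = "myfs"
      Include {
         Options {
            signature = MD5
            sparse = yes     # store sparse files efficiently
            onefs = yes      # stay on one filesystem (the default)
         }
         File = /a/backup
      }
      Exclude {
         File = /a/backup/tmp
      }
   }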


Volumes

Volumes are what the data gets stored in. They are mostly an abstraction layer for hiding device-specific behaviour from the other components. A volume can be a tape or a file; it can also be a DVD, a diskette or even a FIFO. Volumes have unique IDs called labels. A volume can be labelled either manually or, preferably, automatically, either through an autochanger (in the case of tape libraries) or internally by Bacula.


Pools

Pools are just aggregates of volumes. They exist mostly so that jobs can span more than one volume (a very useful feature). They are the destination point for backups, hiding the volume specifics from the rest of the configuration. Pools require that all of their volumes be of the same type.
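For example, a file-backed pool with automatic labelling might be defined like this (a sketch; the retention and volume limits are assumptions):

   Pool {
      Name = "production"
      Pool Type = Backup
      Label Format = "production"   # auto-label volumes production0001, production0002, ...
      Maximum Volumes = 60          # assumption: cap the number of volumes
      Recycle = yes                 # reuse volumes once they are purged
      AutoPrune = yes               # prune expired jobs/files automatically
      Volume Retention = 90 days
   }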


Communications

There are a number of communication channels in a standard Bacula setup, as shown in the architecture diagram. Each of them can be configured to be encrypted independently of the others. Please do note that we are talking about communications here and not storage, so this is about encryption of the TCP connection (yes, that means SSL/TLS). These are:

  • Control channels. All paths in the architecture diagram starting from or going to the Director are control channels. The main reason these should be encrypted is to avoid leaking the username/password used by the Director to authenticate itself to the other daemons, since if these leak, impersonation of the Director becomes possible (and relatively easy). Control channels also carry the client's (backed-up server's) file metadata, and that should be protected as well.
  • Data channels. Paths for the communication between the Storage Daemon and the File Daemon. These carry the actual data; no explanation is needed as to why they should be encrypted (there is, however, a reasoning behind not encrypting them; see below).

Furthermore, File Daemons can be configured to send their data encrypted to the Storage Daemon. In that case the actual data never leaves the client unencrypted and is stored encrypted on the end medium (tape, disk, DVD or diskette). In this case the data path could already be considered encrypted, so another layer of encryption at the communications layer is quite possibly unnecessary (TODO: confirm this). The data is encrypted using the private key of an SSL certificate and can only be decrypted with that key or a Master key.
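On the client side this is driven by the PKI directives in bacula-fd.conf; a sketch follows (the keypair path and daemon name are illustrative; the master key path is the one used on our hosts, as noted further down this page):

   FileDaemon {
      Name = myhost-fd
      PKI Signatures = Yes                                  # sign the backup data
      PKI Encryption = Yes                                  # encrypt the backup data
      PKI Keypair = "/etc/bacula/ssl/myhost.pem"            # host certificate and private key
      PKI Master Key = "/var/lib/puppet/ssl/certs/ca.pem"   # master certificate; its private key can also decrypt
   }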


SPOFs

The following is documentation of the various places where problems might occur:

  • The Director. Indeed a SPOF. Multiple directors are not allowed at this point, and the hostname is the username in control channels. Failure of the Director will make all backups and restores impossible. Reinstalling a new Director is, however, relatively easy.
  • The catalog. A standard MySQL server. We could have a hot-standby slave to avoid a SPOF. Backups running during a failover will fail.
  • The Storage Daemon. Multiple storage daemons can exist, albeit doing different jobs. The failure of a storage daemon will cause all backups and restores associated with that daemon to fail. The same problem as with the Director regarding the hostname/password scheme exists. Reinstalling a new storage daemon is, however, relatively easy.
  • The data store. NAS, tape library, DVD/CD burner, etc. A major SPOF from a hardware perspective. Bacula can not do anything about it, but since we will rely on the Netapps for the data store, we will use their HA.

WMF specifics

This is a WIP as of 2013-06-27.

Proposed Architecture

A proposed solution is to use a server in EQIAD as both director and storage daemon. We then allocate and NFS-export one or more volumes from the Netapps for the data backend. The fact that the data will already be encrypted before even reaching the storage daemon means that we should have no problem with the unencrypted NFS channel. Plus, we won't ever need to wipe at least those specific disks in the Netapp. The clients should also use encrypted control channels for communication with the director daemon and the storage daemon. Since everything on the data channel will already be encrypted, we should avoid double-encrypting it.

Off-site backups

Off-site backups are created by using Netapp's snapmirror for sending data to the other DC easily. We already have the snapmirror license and this solution works. Filesystems at the backup Netapp are read-only.

What to backup

For now, just mirror the backups already in place. Revisit the issue later, probably on a case-by-case basis?

DB Backups

After a lot of talks with Asher and Sean, we have ended up with a scheme using Percona's xtrabackup together with pigz to dump the entire InnoDB tablespace, compress it and pipe it to Bacula. Restoration is going to be more difficult, since the backup needs to be prepared, in xtrabackup parlance, and the service restarted.
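One plausible way to wire this into Bacula (a sketch only; the script name, FIFO path and job/client names are assumptions, not our actual setup) is a FIFO-backed fileset whose writer is started by a pre-job command:

   FileSet {
      Name = "mysql-fifo"
      Include {
         Options {
            signature = MD5
            readfifo = yes   # read the backup data from the FIFO instead of walking the filesystem
         }
         File = /var/run/mysql-backup.fifo
      }
   }
   Job {
      Name = "backup-mysql"
      FileSet = "mysql-fifo"
      Client = "db-host-fd"
      # Hypothetical wrapper that roughly runs, in the background:
      #   innobackupex --stream=tar /tmp | pigz > /var/run/mysql-backup.fifo
      ClientRunBeforeJob = "/usr/local/bin/mysql-to-fifo.sh"
   }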

Configuration Management

Everything must be done via puppet. There is a puppet module for this and role classes for director and storage daemon.

Adding a new client

In the director (if needed)

Edit the role::backup::director class and add:

bacula::director::fileset { 'myfs':
   includes => [ '/a/backup', ],
}

The above may very well already be there because another server has the same fileset. Note the myfs name, though, because it will be used below. myfs should not contain forward or backward slashes.

In the client

class { 'backup::host':
   sets => ['myfs',],
}

Backup Strategy

Two file-backed pools with auto-created, auto-labelled volumes: the first one (production) stores all levels, and an archival one exists as well for historical purposes.


Handy cheatsheet:

Day to day

Generally nothing. Occasionally we've seen the following problem: the size of the backups increases enough to throw the schedule out of plan, which means no immediately writeable volumes are around, and all backups get paused while waiting for a volume to be allowed to be recycled. Judging whether this is a one-time incident or a change in the schedule is required is a bit difficult; it requires knowing the history a bit. In any case, the issue can be fixed temporarily by purging the oldest volume around.

 echo list media | sudo bconsole

should return the list of volumes; find the oldest one (LastWritten is your friend) and purge it:

 echo "purge volume=productionXXXX" | sudo bconsole

and backups should resume.


To be created

Restore (aka Panic mode)

ssh to helium and:

  1. bconsole
  2. restore
  3. select from the menu the desired case (Most often 5: Most recent backup for a client)
  4. Select the server
  5. Choose the FileSet to be restored
  6. Use the new prompt to browse the bvfs (bacula virtual filesystem), if file metadata has not yet expired from the database. Standard ls and cd commands apply. If you specified a date old enough, you will not be able to browse and you will have to restore the entire fileset.
  7. Use the "mark" command to mark the files/dirs you want restored. Wildcards work; there is also "unmark".
  8. enter done
  9. modify the job if needed (for example change the destination directory)
  10. wait :-)
  11. fetch your backups from /var/tmp/bacula-restores (on the client)

Restore from a non-existent host (missing private key)

If you try to restore from a host that has already been decommissioned, you can still select it as the source for the restore, but you will have to select a different host as the target. Doing that, you will see on the target host that the file structure is restored but all files are empty.

In bconsole, using the "messages" command you can see what the issue was; expect the message "Error: Missing private key required to decrypt encrypted backup data.".

Luckily, Bacula encrypts all files with 2 keys, the host key and a global master key, which also happens to be the Puppet CA key. You can see this in the /etc/bacula/bacula-fd.conf on any host as PKI Master Key = "/var/lib/puppet/ssl/certs/ca.pem".

The work-around is:

  • ssh to the puppetmaster (e.g. puppetmaster1001.eqiad.wmnet) and cd to /var/lib/puppet/server/ssl/
  • concatenate the Puppet CA key and CA cert: cat ca/ca_key.pem ca/ca_crt.pem and copy the result into your clipboard
  • ssh to the host you want to restore to and paste the data into a new file, "temp-restore.pem" (or any name).
  • disable puppet with a reason, stop the service "bacula-fd"
  • edit /etc/bacula/bacula-fd.conf and point the config to your temp key: PKI Keypair = "/etc/bacula/ssl/temp-restore.pem"
  • go back to the Bacula server (e.g. helium) and follow the normal restore steps above
  • on the host you are restoring to, check /var/tmp/bacula-restores/ and verify files are not empty
  • remove (shred ?) temp-restore.pem and revert your config change
  • enable puppet again / let it start bacula-fd or start it yourself

Bare metal recovery

There is a paid plugin by Bacula Systems that allows bare metal recovery. However, doing it manually is also relatively easy, and quite straightforward as a procedure. It is roughly described below:

  1. Boot with your Rescue Live CDROM.
  2. Start the Network.
  3. Re-partition your hard disk(s) as they were before (we are going to be dumping the partition tables via sfdisk, maybe?)
  4. Re-format your partitions
  5. Install bacula-fd
  6. Perform a Bacula restore of all your files
  7. Re-install your boot loader
  8. Reboot

See also