Bacula is an open-source, enterprise-level computer backup system for heterogeneous networks. It is the software of choice for the centralized backup solution at WMF.
Before 2013, there was no holistic approach to infrastructure backups. Back then, the recovery strategy was based on wiki XML dumps, local filesystem (LVM) snapshots, geographical replication, git, and a few other per-service strategies. Amanda and NFS with NetApp servers were also used, which apparently caused a lot of frustration among engineers.
In 2013, Alex led an effort to rearchitect WMF backups around Bacula. Unlike previous methods (e.g. NFS), the new design would use strong encryption for both storage and transmission of data, as well as remote storage and a simplified, common workflow.
Databases adapted to the new Bacula workflow, but by 2016 the generation of database backups was error-prone and unmonitored, blocked Bacula recovery for several days a week, had no clear recovery workflow, and ran on failing hardware. In 2018, Jaime and Manuel led a rearchitecture of database backup generation, still using Bacula as the backend, but focusing on a fast, automated recovery strategy (which doubles as database provisioning) with modern tooling, as well as 100% coverage: external store databases (which host wiki content) were not previously backed up.
At the end of 2019, Jaime took ownership of the general backup workflow from Alex. Several improvements followed and are still in progress: hardware renewal and expansion, Bacula monitoring, and a cross-datacenter redundancy rearchitecture.
The Bacula model is simple but effective. There are 3 kinds of nodes/services:
- Bacula file daemon (FD): It is installed on client hosts, local to the data that has to be backed up.
- Bacula director (DIR): It is the orchestrator and metadata manager, but no data passes through it.
- Bacula storage daemon (SD): It handles the storage in different pools.
When a backup or restore is requested (because it is scheduled in the director's configuration, or because it is run manually), the director checks its metadata database and instructs the FD and SD to talk to each other directly (no data passes through the director). TLS is used for all communication, and data is encrypted by the file daemon, so neither the director nor the storage daemon ever sees plain text (with the exception of file metadata).
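The division of roles described above can be illustrated with a toy model. This is only a sketch: the class and method names are hypothetical, the "encryption" is a placeholder XOR that is not real cryptography, and nothing here reflects Bacula's actual wire protocol. It only shows that the director records metadata while the payload goes from the FD to the SD already encrypted.

```python
from dataclasses import dataclass, field
from itertools import cycle


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """Placeholder XOR cipher (NOT real cryptography); applying it twice decrypts."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


@dataclass
class StorageDaemon:
    """Hypothetical SD: stores whatever bytes it receives, keyed by job id."""
    volumes: dict = field(default_factory=dict)

    def store(self, job_id: str, payload: bytes) -> None:
        self.volumes[job_id] = payload

    def fetch(self, job_id: str) -> bytes:
        return self.volumes[job_id]


@dataclass
class FileDaemon:
    """Hypothetical FD: lives next to the data and is the only holder of the key."""
    key: bytes
    files: dict

    def backup(self, job_id: str, path: str, sd: StorageDaemon) -> int:
        ciphertext = toy_encrypt(self.files[path], self.key)
        sd.store(job_id, ciphertext)  # data goes straight from FD to SD
        return len(ciphertext)

    def restore(self, job_id: str, path: str, sd: StorageDaemon) -> None:
        self.files[path] = toy_encrypt(sd.fetch(job_id), self.key)


@dataclass
class Director:
    """Hypothetical DIR: coordinates jobs and keeps only metadata in its catalog."""
    catalog: dict = field(default_factory=dict)

    def run_backup(self, job_id: str, fd: FileDaemon, sd: StorageDaemon, path: str) -> None:
        size = fd.backup(job_id, path, sd)
        self.catalog[job_id] = {"path": path, "bytes": size}
```

After a run, the director's catalog holds only the path and size, and the storage daemon holds only ciphertext; restoring requires going back through the file daemon that owns the key.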
Storage daemons have their own storage format, which consolidates files from pools into volumes of (in the current WMF configuration) 50 GB each.
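To get a feel for how jobs map onto fixed-size volumes, the following sketch computes which volumes each job would touch. It is a simplified model under assumptions not stated in this page: jobs are appended one after another to a single pool, sizes are whole gigabytes, and format overhead and volume recycling are ignored. The function name is hypothetical.

```python
VOLUME_SIZE_GB = 50  # volume size in the current WMF configuration


def volumes_spanned(job_sizes_gb, volume_size_gb=VOLUME_SIZE_GB):
    """Return (first, last) volume indexes for each job, appended sequentially.

    Simplified model: integer sizes in GB, no overhead, no recycling.
    """
    spans = []
    offset = 0  # total GB written so far
    for size in job_sizes_gb:
        first = offset // volume_size_gb
        # An empty job touches no new space; report the current volume.
        last = (offset + size - 1) // volume_size_gb if size > 0 else first
        spans.append((first, last))
        offset += size
    return spans


# Three jobs of 30, 40 and 120 GB:
print(volumes_spanned([30, 40, 120]))  # [(0, 0), (0, 1), (1, 3)]
```

In this model a large job spans several volumes, while small jobs share one, so restoring a single job may require reading more than one volume.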
Most backup jobs, although not all, follow a backup policy of performing a full backup at the beginning of the month and an incremental backup every day. The retention is approximately 90 days.
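The policy above can be sketched in a few lines. This is an illustration only, assuming that "beginning of the month" means the 1st, that each incremental captures changes since the previous backup of any level, and that retention is exactly 90 days; the function names are hypothetical and the real schedule lives in the director's configuration.

```python
from datetime import date, timedelta

RETENTION_DAYS = 90  # approximate retention from the policy


def backup_level(day: date) -> str:
    """Full on the first day of the month (assumption), incremental otherwise."""
    return "Full" if day.day == 1 else "Incremental"


def restore_chain(target: date) -> list:
    """Backups needed to restore the state of `target`: the month's full
    backup plus every daily incremental up to and including `target`."""
    full_day = target.replace(day=1)
    return [full_day + timedelta(days=n) for n in range((target - full_day).days + 1)]


def is_expired(backup_day: date, today: date, retention_days: int = RETENTION_DAYS) -> bool:
    """True when the backup is older than the retention period."""
    return (today - backup_day).days > retention_days
```

For example, restoring the state of 4 March under these assumptions needs four backups: the full from 1 March and the incrementals from 2, 3 and 4 March.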
In addition to the regular file/directory copy, a plugin called "bpipe" allows streaming data directly from a unix pipe on the client to Bacula. This functionality is puppetized in WMF infrastructure and was primarily designed for MySQL backups (for which no file can be backed up directly, and streaming the output of the mysqldump or xtrabackup utilities would be desirable), but it is not the preferred method. Mysqldump allows streaming backups, but they are extremely slow because they are not parallelized. Xtrabackup would be faster when used this way, but streaming prevents post-processing of its output. This is why mydumper and xtrabackup (mariabackup) are used to write backups to disk on dbprov hosts, where they are post-processed and then sent to Bacula using traditional file backups.
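The general idea of streaming from a pipe, as opposed to copying a file that already exists on disk, can be sketched as follows. This is not how the bpipe plugin is implemented; it is a minimal illustration in which a child process stands in for a dump utility and a callback (`sink`, a hypothetical name) stands in for the backup destination.

```python
import subprocess
import sys


def stream_command(cmd, sink, chunk_size=64 * 1024):
    """Run `cmd`, pass its stdout to `sink` chunk by chunk, return total bytes.

    The output is never written to a local file: it is consumed as it is
    produced, so there is no on-disk copy to inspect or post-process.
    """
    total = 0
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        while True:
            chunk = proc.stdout.read(chunk_size)
            if not chunk:  # empty read means the child closed its stdout
                break
            sink(chunk)
            total += len(chunk)
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
    return total


# Stand-in for a dump utility: a child Python process writing 1000 bytes.
DEMO_CMD = [sys.executable, "-c", "import sys; sys.stdout.write('x' * 1000)"]
```

Because the consumer only ever sees a single sequential stream, the producer cannot be parallelized across several output files and the result cannot be post-processed before it is stored, which matches the drawbacks described above.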