User-visible files appear at http://dumps.wikimedia.org/backup-index.html
Dump activity involves a monitor node (running status sweeps) and arbitrarily many worker nodes running the dumps.
Full project dumps with all content history are run once a month, starting on the 1st or 2nd of the month; dumps with only current content are run near the end of the month.
Dumps are run in "stages". The stages for the first run of the month look like this: stub XML files (files with all of the metadata for pages and revisions but without any page content) are dumped first. Once that has been done on all small wikis, all tables are dumped. The same two steps are then run for the 'big' wikis. Then the current page content for articles is dumped, first for small wikis and then for big ones. And so on.
The stages for the second run of the month are identical to the above, but without the full page history content dumps.
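As a rough illustration of the stage ordering described above, the following sketch writes the stages out as an explicit list and derives the second run from the first. The job names, the "small"/"big" group labels, and the placement of the full history step at the end are assumptions made for this example; they are not taken from the production stages list.

```python
# Hypothetical stage list: (job, wiki group) pairs in the order they would run.
# Names are illustrative only; the real stages list lives in the dumps code.
FIRST_RUN_STAGES = [
    ("stubs", "small"),
    ("tables", "small"),
    ("stubs", "big"),
    ("tables", "big"),
    ("current-content", "small"),
    ("current-content", "big"),
    # Assumed position: the text only says "and so on" after current content.
    ("full-history", "small"),
    ("full-history", "big"),
]


def second_run_stages(first_run_stages):
    """Return the same stages minus the full page history content dumps."""
    return [stage for stage in first_run_stages if stage[0] != "full-history"]


if __name__ == "__main__":
    for job, group in second_run_stages(FIRST_RUN_STAGES):
        print(job, group)
```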
A 'dump scheduler' manages the running of these steps on each available host, given the number of available cores and the order we want the jobs to run in.
A worker script makes a single pass through the set of available wikis for each dump step, starting with the wiki that has gone the longest without a dump. The dump scheduler starts several of these workers on each host, according to the number of free cores configured, and starts the script anew for the same or a later stage, as determined by the stages list.
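The "longest without a dump first" ordering can be sketched as a simple sort. This is a minimal illustration, not the production logic: the function name and the idea of passing last-dump timestamps as a dictionary are assumptions, as is the choice to put wikis that have never been dumped at the front.

```python
def order_wikis(last_dump_times):
    """Order wikis so the one that has gone longest without a dump comes first.

    last_dump_times maps a wiki's database name to the timestamp (seconds
    since the epoch) of its last dump, or None if it has never been dumped.
    Never-dumped wikis sort first (an assumption made for this sketch).
    """
    def sort_key(wiki):
        stamp = last_dump_times[wiki]
        return (stamp is not None, stamp if stamp is not None else 0.0)

    return sorted(last_dump_times, key=sort_key)


if __name__ == "__main__":
    times = {"enwiki": 300.0, "frwiki": 100.0, "newwiki": None}
    print(order_wikis(times))
```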
For each wiki, the worker script simply runs the python script worker.py on the given wiki. To ensure that multiple workers don't try to dump the same wiki at the same time, the python script locks the wiki before proceeding. Stale locks are eventually removed by a monitor script; if you try to run a dump of a wiki by hand when one is already in progress, you will see an error message to the effect that a wiki dump is already running, along with the name of the lock file.
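The per-wiki locking described above can be approximated with an exclusively created lock file. The sketch below is not the code in worker.py; the lock file naming, the `WikiLockedError` exception, and the function names are all hypothetical and serve only to show the idea of refusing to start when a lock already exists.

```python
import os


class WikiLockedError(Exception):
    """Raised when a lock file for the wiki already exists."""


def lock_wiki(lock_dir, wiki):
    """Create a lock file for the wiki, failing if one is already present."""
    path = os.path.join(lock_dir, wiki + ".lock")
    try:
        # O_EXCL makes creation fail if the file exists, so checking for the
        # lock and taking it happen in one atomic step.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise WikiLockedError(
            "a dump of %s is already running; lock file: %s" % (wiki, path)
        )
    with os.fdopen(fd, "w") as lock_file:
        # Record the owning process ID to help with later debugging.
        lock_file.write(str(os.getpid()))
    return path


def unlock_wiki(path):
    """Remove the lock file once the dump step has finished."""
    os.remove(path)
```

A second call to `lock_wiki` for the same wiki raises `WikiLockedError` with the lock file name in the message, mirroring the error a person sees when running a dump by hand while another is in progress.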
On one host, the monitor script runs a python script for each wiki that checks for and removes stale lock files from dump processes that have died, and updates the central index.html file which shows the dumps in progress and the status of the dumps that have completed (i.e. http://dumps.wikimedia.org/backup-index.html). That is its sole function.
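A stale lock sweep of the kind the monitor performs might look roughly like the following. This is a sketch under assumptions: it treats any `.lock` file whose modification time is older than a cutoff as stale, and the function name, the file suffix, and the age-based rule are all invented for illustration rather than taken from the monitor code.

```python
import os
import time


def remove_stale_locks(lock_dir, max_age_seconds, now=None):
    """Delete lock files not modified within max_age_seconds.

    Returns the list of removed paths. The age-based staleness rule is an
    assumption for this sketch.
    """
    now = time.time() if now is None else now
    removed = []
    for name in sorted(os.listdir(lock_dir)):
        if not name.endswith(".lock"):
            continue
        path = os.path.join(lock_dir, name)
        try:
            if now - os.path.getmtime(path) > max_age_seconds:
                os.remove(path)
                removed.append(path)
        except FileNotFoundError:
            # The owning worker removed its lock while we were looking.
            continue
    return removed
```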
Check /operations/dumps.git, branch 'master' for the python code in use. Some tools are in the 'ariel' branch but all dumps code run in production is in master or, for a few scripts not directly related to dumps production, in our puppet repo.
Getting a copy:
git clone https://gerrit.wikimedia.org/r/p/operations/dumps.git
git checkout master
Getting a copy as a committer:
git clone ssh://<user>@gerrit.wikimedia.org:29418/operations/dumps.git
git checkout master
See also Dumps/Software dependencies.
The scripts call mysqldump, getSlaveServer.php, eval.php, dumpBackup.php, and dumpTextPass.php directly for dump generation. These in turn require backup.inc and backupPrefetch.inc and may call ActiveAbstract/AbstractFilter.php and fetchText.php.
The generation of XML files relies on Export.php under the hood and of course the entire MW infrastructure.
The worker.py script relies on a few C programs for various bz2 operations: checkforbz2footer and recompressxml, both in /usr/local/bin/. These are in the git repo in branch 'ariel'.
Sites are identified by raw database name currently. A 'friendly' name/hostname can be added for convenience of searching in future.