Documentation for end users of the data dumps is at meta:Data dumps.
For a list of various information sources about the dumps, see Dumps/Other information sources.
- For documentation on the "adds/changes" dumps, see Dumps/Adds-changes dumps.
- For downloading older media dumps, go to archive.org.
- For current dumps issues, see the Dumps-generation project in Phabricator.
- For current redesign plans and discussion, see Dumps/Dumps 2.0: Redesign.
- For historical information about the dumps, see Dumps/History.
User-visible files appear at http://dumps.wikimedia.org/backup-index.html
Dump activity involves a monitor node (running status sweeps) and arbitrarily many worker nodes running the dumps.
We want mirrors! For more information see Dumps/Mirror status.
English language Wikipedia dumps are run once a month; all other wikis are dumped twice a month. The first run includes page content for all revisions; the second dump run skips this step.
Dumps are run in "stages". This doesn't really impact the English language Wikipedia dumps, as one stage simply runs directly after the other on a dedicated server, so that all dump steps get done without interruption by other wiki runs.
For the other dumps, the stages for the first run of the month look like this: stub xml files (files with all of the metadata for pages and revisions but without any page content) are dumped first. After that's been done on all small wikis, all tables are dumped. This is then done for the 'big' wikis (see list here). Then the current page content for articles is dumped, first for small wikis and then for big ones. And so on.
The stages for the second run of the month are identical to the above, but without the full page history content dumps.
A 'dump scheduler' manages the running of these steps on each available host, given the number of available cores and the order we want the jobs to run in.
There is a worker script which goes through the set of available wikis to dump a single time for each dump step, starting with the wiki that has gone the longest without a dump. The dump scheduler starts up several of these on each host, according to the number of free cores configured, starting the script anew for the same or a later stage, as determined in the stages list.
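The "longest without a dump goes first" ordering described above can be sketched as follows. This is a minimal illustration, not the worker script's actual code; the function name and the input format (a mapping from wiki database name to last dump date) are assumptions made for the example.

```python
from datetime import date


def order_wikis_by_staleness(last_dump_dates):
    """Return wiki names sorted so the wiki with the oldest dump comes first.

    last_dump_dates maps a wiki database name to the date of its most
    recent dump, or None if it has never been dumped (hypothetical
    input format chosen for this sketch).
    """
    def sort_key(item):
        name, last = item
        # Never-dumped wikis (None) sort ahead of everything else;
        # the name is a tie-breaker so the order is deterministic.
        return (last is not None, last or date.min, name)

    return [name for name, _ in sorted(last_dump_dates.items(), key=sort_key)]


if __name__ == "__main__":
    print(order_wikis_by_staleness({
        "enwiki": date(2015, 1, 3),
        "frwiki": date(2014, 12, 20),
        "newwiki": None,
    }))
```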
For each wiki, the worker script simply runs the python script worker.py on the given wiki. To ensure that multiple workers don't try to dump the same wiki at the same time, the python script locks the wiki before proceeding. Stale locks are eventually removed by a monitor script; if you try to run a dump of a wiki by hand when one is already in progress, you will see an error message to the effect that a wiki dump is already running, along with the name of the lock file.
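A per-wiki lock of this kind is commonly implemented with an exclusively created lock file. The sketch below shows that general technique; it is not taken from worker.py, and the function names, the exception class, and the ".lock" file naming are all hypothetical.

```python
import os


class WikiLockedError(Exception):
    """Raised when another process already holds the lock for a wiki."""


def acquire_lock(lock_dir, wiki):
    """Atomically create a lock file for a wiki and return its path."""
    path = os.path.join(lock_dir, wiki + ".lock")  # hypothetical naming scheme
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so only one
        # process can succeed in creating it.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise WikiLockedError(
            "a dump of %s is already running (lock file: %s)" % (wiki, path))
    with os.fdopen(fd, "w") as handle:
        # Record the owning process id to help with later debugging.
        handle.write("%d\n" % os.getpid())
    return path


def release_lock(path):
    """Remove the lock file once the dump of the wiki has finished."""
    os.remove(path)
```

A second call to acquire_lock for the same wiki raises WikiLockedError until release_lock has been called, which mirrors the "already running" error described above.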
On one host, the monitor script runs a python script for each wiki that checks for and removes stale lock files from dump processes that have died, and updates the central index.html file which shows the dumps in progress and the status of the dumps that have completed (i.e. http://dumps.wikimedia.org/backup-index.html ). That is its sole function.
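As an illustration of stale lock cleanup, the sketch below removes lock files whose modification time is older than a cutoff. Judging staleness by file age is an assumption made for this example, as are the function name, the ".lock" suffix, and the max_age_seconds parameter; the monitor's real criteria are in the dumps repository.

```python
import os
import time


def remove_stale_locks(lock_dir, max_age_seconds, now=None):
    """Delete lock files not modified within max_age_seconds.

    Returns the sorted list of lock file names that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    for name in sorted(os.listdir(lock_dir)):
        if not name.endswith(".lock"):  # hypothetical naming scheme
            continue
        path = os.path.join(lock_dir, name)
        # A live dump process is assumed to touch its lock regularly,
        # so an old modification time suggests the process has died.
        if now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed.append(name)
    return removed
```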
Check /operations/dumps.git, branch 'ariel' for the python code in use. Eventually this will make its way back into master; it's still a bit gross right now.
Getting a copy:
git clone https://gerrit.wikimedia.org/r/p/operations/dumps.git
git checkout ariel
Getting a copy as a committer:
git clone ssh://<user>@gerrit.wikimedia.org:29418/operations/dumps.git
git checkout ariel
See also Dumps/Software dependencies.
The scripts call mysqldump, getSlaveServer.php, eval.php, dumpBackup.php, and dumpTextPass.php directly for dump generation. These in turn require backup.inc and backupPrefetch.inc and may call ActiveAbstract/AbstractFilter.php and fetchText.php.
The generation of XML files relies on Export.php under the hood and of course the entire MW infrastructure.
The worker.py script relies on a few C programs for various bz2 operations: checkforbz2footer and recompressxml, both in /usr/local/bin/. These are in the git repo, see .
Adding a new worker box
Install the host and add it to the snapshot stanza in site.pp. If it's running cron jobs, include role::snapshot::cron::primary for that host. Also add it to modules/dataset/files/exports in puppet. Once puppet runs, run `exportfs -r` on the dataset hosts (currently dataset1001, ms1001).
- Dumps run out of /srv/dumps on each server. Make sure that you have a copy of the code from that directory (monitor, worker, *.py, dumps/*py) from any other worker node.
- If you are replacing the node that runs en wiki, make sure that the wikidump.conf.hugewikis file is in /srv/dumps/confs.
- That should be it.
Starting dump runs
- For each worker node doing regular dumps (snapshot1002, snapshot1004):
- be on the host as root
su - datasets
- for the first dump run of the month:
python ./dumpscheduler.py --slots 8 --commands /srv/dumps/stages/stages_normal --cache /srv/dumps/cache/running_cache.txt --directory /srv/dumps --verbose
- for the second dump run of the month:
python ./dumpscheduler.py --slots 8 --commands /srv/dumps/stages/stages_partial --cache /srv/dumps/cache/running_cache_partial.txt --directory /srv/dumps --verbose
For en wiki:
- be on snapshot1001 as root
su - datasets
python ./dumpscheduler.py --slots 27 --commands /srv/dumps/stages/stages_normal_hugewikis --cache /srv/dumps/cache/running_cache.txt --directory /srv/dumps --verbose
Dealing with problems
If the hosts serving the dumps run low on disk space, you can reduce the number of backups that are kept. Change the value for 'keep' in the conf file generation in puppet to a lower number.
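To see what lowering 'keep' implies, the sketch below works out which dump runs would fall outside the newest 'keep' runs. It is only an illustration: the function name is hypothetical, and it assumes run directories are named by date in YYYYMMDD form so that sorting the names sorts them chronologically.

```python
def dumps_to_delete(run_dates, keep):
    """Return the run directory names that fall outside the newest `keep`.

    run_dates is a list of directory names in YYYYMMDD form (an
    assumption for this sketch); the result is sorted oldest first.
    """
    if keep < 0:
        raise ValueError("keep must not be negative")
    newest_first = sorted(run_dates, reverse=True)
    # Everything after the first `keep` entries is surplus.
    return sorted(newest_first[keep:])


if __name__ == "__main__":
    runs = ["20150101", "20150215", "20150120", "20141201"]
    print(dumps_to_delete(runs, keep=2))
```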
Logs will be kept of each run. You can find them in the private directory ( /mnt/data/xmldatadumps/private/<wikiname>/<date>/ ) for the particular dump, filename dumplog.txt. You can look at them to see if there are any error messages that were generated for a given run.
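A quick way to skim such a log is to pull out the lines that look like errors. The sketch below does that for one wiki and run date; the function name is hypothetical, and matching on the words "error" and "fail" is an assumption, since the exact wording of the log messages is not specified here.

```python
import os


def find_error_lines(private_root, wiki, rundate):
    """Return (line_number, text) pairs from dumplog.txt that look like errors."""
    path = os.path.join(private_root, wiki, rundate, "dumplog.txt")
    hits = []
    with open(path, encoding="utf-8", errors="replace") as log:
        for number, line in enumerate(log, start=1):
            lowered = line.lower()
            # Assumption: problems show up as lines mentioning
            # "error" or "fail" in some capitalisation.
            if "error" in lowered or "fail" in lowered:
                hits.append((number, line.rstrip("\n")))
    return hits
```

For the layout described above, private_root would be /mnt/data/xmldatadumps/private, with the wiki's database name and the run's date directory as the other two arguments.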
The worker script sends email if a dump does not complete successfully. It currently sends email to firstname.lastname@example.org which is an alias. If you want to follow and fix failures, add yourself to that alias.
When one or more steps of a dump fail, the index.html file for that dump includes a notation of the failure and sometimes more information about it. Note that one step of a dump failing does not prevent other steps from running unless they depend on the data from that failed step as input.
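The rule that a failed step only blocks the steps that depend on it can be sketched like this. The step names, the depends_on mapping, and the run_step callable are all hypothetical; the actual steps and their dependencies are defined in the dumps code.

```python
def run_steps(order, depends_on, run_step):
    """Run steps in order, skipping any step whose inputs are missing.

    order      -- list of step names, in the order they should run
    depends_on -- dict mapping a step to the steps whose output it needs
    run_step   -- callable taking a step name, returning True on success
    Returns a dict mapping each step to "done", "failed" or "skipped".
    """
    status = {}
    for step in order:
        needed = depends_on.get(step, [])
        if any(status.get(dep) != "done" for dep in needed):
            # An input step failed or was skipped, so this one cannot run.
            status[step] = "skipped"
        elif run_step(step):
            status[step] = "done"
        else:
            status[step] = "failed"
    return status


if __name__ == "__main__":
    # Hypothetical steps: "content" needs "stubs", "tables" is independent.
    outcome = run_steps(
        ["stubs", "tables", "content"],
        {"content": ["stubs"]},
        lambda step: step != "stubs",
    )
    print(outcome)
```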
See Dumps/Rerunning a job for how to rerun all or part of a given dump. This also explains what files may need to be cleaned up before rerunning.
Dumps not running
See the section above, 'Starting dump runs' if you need to restart a run across all wikis from the beginning.
If the monitor does not appear to be running (the index.html file showing the dumps status is never updated), go to one of the regular worker nodes (snapshot1002, 1004) as root in a screen session, su - datasets, cd /srv/dumps, and run
If the host crashes while the dump scheduler is running, the status files are left as-is and the display shows any dumps on any wikis as still running until the monitor node decides the lock files for those wikis are stale enough to mark them as aborted.
To restart the scheduler from where it left off:
- be on the host as root
- start a screen session
- su - datasets
- give the appropriate dumpscheduler command (see 'Starting dump runs' above), but instead of the --commands option with a file path, give the option --restore with no value.
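The substitution in the last step can be sketched as a small helper that rewrites the argument list of a start command. The helper name is hypothetical; the options themselves are the ones shown under 'Starting dump runs'.

```python
def to_restore_command(argv):
    """Return a copy of argv with '--commands <path>' replaced by '--restore'."""
    result = []
    skip_next = False
    for arg in argv:
        if skip_next:
            # This is the file path that followed --commands; drop it.
            skip_next = False
            continue
        if arg == "--commands":
            result.append("--restore")
            skip_next = True
            continue
        result.append(arg)
    return result


if __name__ == "__main__":
    start = [
        "python", "./dumpscheduler.py", "--slots", "8",
        "--commands", "/srv/dumps/stages/stages_normal",
        "--cache", "/srv/dumps/cache/running_cache.txt",
        "--directory", "/srv/dumps", "--verbose",
    ]
    print(" ".join(to_restore_command(start)))
```

All other options, including --cache, are passed through unchanged.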
If the worker script encounters more than three failed dumps in a row (currently configured as such? or did I hardcode that?) it will exit; this avoids generation of piles of broken dumps which later would need to be cleaned up. Once the underlying problem is fixed, you can go to the screen session of the host running those wikis and rerun the previous command in all the windows.
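The "stop after too many failures in a row" behaviour can be sketched as a counter that resets on every success. This is an illustration only: the function name, the dump_one callable, and the max_consecutive_failures parameter are hypothetical, and whether the real limit is configured or hardcoded is left open above.

```python
def run_until_too_many_failures(wikis, dump_one, max_consecutive_failures=3):
    """Dump wikis in turn; stop once more than the allowed number fail in a row.

    dump_one is a callable taking a wiki name and returning True on
    success (a stand-in for running the dump of that wiki).
    Returns the list of wikis that were attempted.
    """
    attempted = []
    failures_in_a_row = 0
    for wiki in wikis:
        attempted.append(wiki)
        if dump_one(wiki):
            # Any success resets the streak.
            failures_in_a_row = 0
        else:
            failures_in_a_row += 1
            if failures_in_a_row > max_consecutive_failures:
                break
    return attempted
```

With the default of three, the loop gives up on the fourth consecutive failure, so a systemic problem produces at most four broken dumps per worker before it exits.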
Running a specific dump on request
See Dumps/Rerunning a job for how to run a specific dump. This is done for special cases only.
Deploying new code
See Dumps/How to deploy for this.
Bugs, known limitations, etc.
See Dumps/Known issues and wish list for this.
Sites are identified by raw database name currently. A 'friendly' name/hostname can be added for convenience of searching in future.
- dumpHTML: static HTML dumps