'''We want mirrors!  For more information see [[Dumps/Mirror status]].'''
Docs for end-users of the data dumps at [[meta:Data dumps]].




*For documentation on the "adds/changes" dumps, see [[Dumps/Adds-changes dumps]].
*For downloading older media dumps, go to [https://archive.org/details/wikimedia-mediatar?&sort=-downloads&page=2 archive.org].
*For current dumps issues, see the [https://phabricator.wikimedia.org/project/sprint/board/1519/ Dumps-generation project] in Phabricator.
:* See [[Dumps/Known issues and wish list]] for a much older wishlist.
*For current redesign plans and discussion, see [[Dumps/Dumps 2.0: Redesign]].
*For historical information about the dumps, see [[Dumps/History]].
*For info on HTML dumps, see [[dumpHTML]].
*For a list of various information sources about the dumps, see [[Dumps/Other information sources]].
''The following info is for folks who hack on, maintain and administer the dumps and the dump servers.''


{| cellspacing="0" cellpadding="0" style="clear: {{{clear|right}}}; margin-bottom: .5em; float: right; padding: .5em 0 .8em 1.4em; background: none; width: {{{width|{{{1|auto}}}}}};"
| __TOC__
|}
== Overview ==


User-visible files appear at http://dumps.wikimedia.org/backup-index.html


Dump activity involves a monitor node (running status sweeps) and arbitrarily many worker nodes running the dumps.
=== Current architecture ===


Rather than bore you with that here, see [[Dumps/Current Architecture]].
 
=== Current hosts ===


For which hosts are serving data, see [[Dumps/Dump servers]]. For which hosts are generating dumps, see [[Dumps/Snapshot hosts]].


=== Dump runs and stages ===


English language Wikipedia dumps are run once a month; all other wikis are dumped twice a month. The first run includes page content for all revisions; the second dump run skips this step.


Dumps are run in "stages". This doesn't really impact the English language Wikipedia dumps, as one stage simply runs directly after the other on a dedicated server, so that all dump steps get done without interruption by other wiki runs.


For the other dumps, the stages for the first run of the month look like this: stub XML files (files with all of the metadata for pages and revisions but without any page content) are dumped first. After that's been done on all small wikis, all tables are dumped. This is then done for the 'big' wikis (see list [[ here]]). Then the current page content for articles is dumped, first for small wikis and then for big ones. And so on.


The stages for the second run of the month are identical to the above, but without the full page history content dumps.
 
A '[https://phabricator.wikimedia.org/diffusion/ODUM/browse/master/xmldumps-backup/dumpscheduler.py dump scheduler]' manages the running of these steps on each available host, given the number of available cores and the order we want the jobs to run in.
 
=== Worker nodes ===
 
There is a [https://phabricator.wikimedia.org/diffusion/ODUM/browse/master/xmldumps-backup/worker worker script] which goes through the set of available wikis to dump a single time for each dump step, starting with the wiki that has gone the longest without a dump. The dump scheduler starts up several of these on each host, according to the number of free cores configured, starting the script anew for the same or a later stage, as determined in the stages list.


For each wiki, the worker script simply runs the python script [https://phabricator.wikimedia.org/diffusion/ODUM/browse/master/xmldumps-backup/worker.py worker.py] on the given wiki. To ensure that multiple workers don't try to dump the same wiki at the same time, the python script locks the wiki before proceeding.  Stale locks are eventually removed by a monitor script; if you try to run a dump of a wiki by hand when one is already in progress, you will see an error message to the effect that a wiki dump is already running, along with the name of the lock file.
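
If you need to run a single pass by hand for one wiki (for testing, say), you can invoke worker.py yourself as the dumps user. The flags below are an assumption based on the usual invocation; check <code>python worker.py --help</code> and [[Dumps/Rerunning a job]] for the authoritative options:

<pre>
# Sketch only: run one dump job for a single wiki by hand.
# The --configfile and --job option names, and the config file name, are
# assumptions; verify them with "python worker.py --help" before use.
su - datasets
cd /srv/deployment/dumps/dumps/xmldumps-backup
python worker.py --configfile /etc/dumps/confs/<conffile> --job <jobname> <wikiname>
</pre>

If a dump of that wiki is already in progress, the script will refuse to run and tell you the name of the lock file, as described above.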


=== Monitor node ===


On one host, the [https://phabricator.wikimedia.org/diffusion/ODUM/browse/master/xmldumps-backup/monitor monitor script] runs a [https://phabricator.wikimedia.org/diffusion/ODUM/browse/master/xmldumps-backup/monitor.py python script] for each wiki that checks for and removes stale lock files from dump processes that have died, and updates the central <code>index.html</code> file which shows the dumps in progress and the status of the dumps that have completed (i.e. <code>http://dumps.wikimedia.org/backup-index.html</code> ). That is its sole function.
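
One quick sanity check that the monitor is alive (and still rewriting the central status page) is to look at that page's timestamp; this assumes the web server exposes a Last-Modified header for the static file:

<pre>
# When was backup-index.html last regenerated? If this is many hours old,
# the monitor is probably not running.
curl -sI http://dumps.wikimedia.org/backup-index.html | grep -i '^last-modified'
</pre>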
 
=== Code ===
 
Check [https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=tree;f=xmldumps-backup;hb=master /operations/dumps.git, branch 'master'] for the python code in use. Some tools are in the 'ariel' branch, but all dumps code run in production is in master or, for a few supporting scripts not directly related to dumps production, in our puppet repo; you can find those in the snapshot module.
 
Getting a copy:
: <code>git clone https://gerrit.wikimedia.org/r/p/operations/dumps.git</code>
: <code>git checkout master</code>


Getting a copy as a committer:
: <code>git clone ssh://<user>@gerrit.wikimedia.org:29418/operations/dumps.git</code>
: <code>git checkout master</code>


=== Programs used ===


See also [[Dumps/Software dependencies]].


The scripts call mysqldump, getSlaveServer.php, eval.php, dumpBackup.php, and dumpTextPass.php directly for dump generation. These in turn require backup.inc and backupPrefetch.inc and may call ActiveAbstract/AbstractFilter.php and fetchText.php.
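
Roughly, a two-pass XML dump with these scripts looks like the sketch below: dumpBackup.php writes a stub file (metadata only), and dumpTextPass.php then fills in the revision text, prefetching from an older dump where it can. This is an illustration of the general MediaWiki tooling, not the exact option set worker.py builds in production:

<pre>
# Sketch only: hand-run a two-pass dump for one wiki. Paths and file names
# are placeholders; production runs are wrapped and add many more options.
php maintenance/dumpBackup.php --wiki=<wikiname> --current --stub --quiet \
    --output=gzip:/tmp/<wikiname>-stub-articles.xml.gz
php maintenance/dumpTextPass.php --wiki=<wikiname> --quiet \
    --stub=gzip:/tmp/<wikiname>-stub-articles.xml.gz \
    --prefetch=bzip2:/path/to/previous/<wikiname>-pages-articles.xml.bz2 \
    --output=bzip2:/tmp/<wikiname>-pages-articles.xml.bz2
</pre>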


The generation of XML files relies on Export.php under the hood and of course the entire MW infrastructure.


The worker.py script relies on a few C programs for various bz2 operations: checkforbz2footer and recompressxml, both in /usr/local/bin/. These are in the git repo in branch 'ariel', see [https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=tree;f=xmldumps-backup/mwbzutils;h=e76ee6cb52fd40e570e2e62a969f8b57902de1b9;hb=ariel].
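
For example, checkforbz2footer is handy for telling whether a bz2 file has been fully written out. The invocation below is from memory (a single file argument, exit status zero when the footer is found); run the tool with no arguments to see its actual usage:

<pre>
# Assumption: takes one bz2 file as argument, exits 0 if the end-of-stream
# footer is present (i.e. the file is not truncated or still being written).
/usr/local/bin/checkforbz2footer <dumpfile>.bz2 && echo "footer present" || echo "incomplete"
</pre>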

=== Configuration files ===

Configuration file setup is handled in the snapshot puppet module. You can check the config files themselves at /etc/dumps/confs on any snapshot host.


== Setup ==


=== Adding a new snapshot host ===


Install and add to site.pp in the snapshot stanza (see snapshot1005-7).  Add the relevant hiera entries, documented in site.pp, according to whether the server will run en wiki dumps (only one server should do so), or misc cron jobs (one host should do so, not the same host running en wiki dumps).


Dumps run out of /srv/deployment/dumps/dumps/xmldumps-backup on each server. Deployment is done via scap3 from the deployment server.


=== Starting dump runs ===


# Do nothing.  These jobs run out of cron.


== Troubleshooting ==

=== Broken dumps ===

The dumps can break in a few interesting ways.

# They no longer appear to be running. Is the monitor running? See below. If it is running, perhaps all the workers are stuck on a stage waiting for a previous stage that failed.
#: Shoot them all and let the cron job sort it out. You can also look at the error notifications section and see if anything turns up; fix the underlying problem and wait for cron.
# A dump for a particular wiki has been aborted. This may be due to me shooting the script because it was behaving badly, or because a host was powercycled in the middle of a run.
#: The next cron job should fix this up.
# A dump on a particular wiki has failed.
#: Check the information on error notifications, track down the underlying issue (db outage? MW deploy of bad code? Other?), fix it, and wait for cron to rerun it.


=== Out of space ===
If the hosts serving the dumps run low on disk space, you can reduce the number of backups that are kept. Change the value for 'keep' in the configuration files in puppet to a lower number.
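
To see where that setting currently lands on a snapshot host before touching puppet, grepping the generated config files is enough:

<pre>
# Show the current 'keep' (number of old dump runs to retain) values in the
# puppet-generated config files on a snapshot host.
grep -rn 'keep' /etc/dumps/confs/
</pre>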


=== Error notifications ===
Email is ordinarily sent if a dump does not complete successfully, going to ops-dumps@wikimedia.org, which is an alias. If you want to follow and fix failures, add yourself to that alias.

Logs are kept of each run. From any snapshot host, you can find them in the private directory for the particular dump (<code>/mnt/data/xmldatadumps/private/<wikiname>/<date>/dumplog.txt</code>). From these you may glean more reasons for the failure.

TBD: Logs that capture the rest will be available at /var/log/dumps/somethingorother and may also contain clues.


When one or more steps of a dump fail, the index.html file for that dump includes a notation of the failure and sometimes more information about it. Note that one step of a dump failing does not prevent other steps from running unless they depend on the data from that failed step as input.


See [[Dumps/Rerunning a job]] for how to rerun all or part of a given dump. This also explains what files may need to be cleaned up before rerunning.
 
=== Dumps not running ===
See the section above, 'Starting dump runs' if you need to restart a run across all wikis from the beginning.


=== Monitoring is broken ===

If the monitor does not appear to be running (the index.html file showing the dumps status is never updated), check which host should have it running (see the hiera host entries for the snapshots and look for the one with <code>monitor: true</code>). This is a service that should be restarted with systemd or upstart, depending on the OS version, so you'll want to see what change broke it.


If the host crashes while the dump scheduler is running, the status files are left as-is, and the display shows any dumps on those wikis as still running until the monitor node decides the lock file for those wikis is stale enough to mark them as aborted.
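
A quick way to see what is actually alive on the host that is supposed to run the monitor (the service's unit name varies, so this just matches on the script names):

<pre>
# Is the monitor script running here, and how many workers are active?
pgrep -af monitor.py
pgrep -cf worker.py
</pre>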
=== Rerunning dumps ===

You really, really don't want to do this. These jobs run out of cron. All by themselves. Trust me. Once the underlying problem (bad MW code, unhappy db server, out of space, etc.) is fixed, it will get taken care of.

Okay, you don't trust me, or something's really broken. See [[Dumps/Rerunning a job]] if you absolutely have to rerun a wiki/job.
 
To restart the scheduler from where it left off:
 
Really, you can just wait for cron to pick it up; it checks twice a day for aborted runs, unless the job has fallen outside of the run date range. You can check that date range by looking at the cron job entry on any snapshot host for the appropriate entry for fulldumps.sh.
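
For example, as root on a snapshot host (the cron job may live in the datasets user's crontab or in a system crontab, depending on how it was puppetized):

<pre>
# Find the cron entry for fulldumps.sh and its date range arguments.
crontab -u datasets -l | grep -i fulldumps
grep -ri fulldumps /etc/cron* 2>/dev/null
</pre>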
 
If you're outside the range, just do this:
 
# be on each appropriate host as root
# start a screen session
# su - datasets
# bash fulldumps.sh starting_date_of_range todays_date wikitype(regular or huge) dumptype(full or partial)
 
Example: <code>bash fulldumps.sh 01 17 regular full</code>
 
This would pick up the full dumps for everything except enwiki, on the specific host you are running on, for the run that starts on the first of the month, assuming that you are trying to run it on the 17th of the month or earlier.
 
This date cutoff may seem a little odd; it is built in so that the script does not try to start a dump run from scratch so late in the month that it cannot complete before the next run.
 
If the worker script encounters more than three failed dumps in a row (currently configured as such? or did I hardcode that?) it will exit; this avoids generating piles of broken dumps which would later need to be cleaned up. Once the underlying problem is fixed, you can go to the screen session on the host running those wikis and rerun the previous command in all the windows.
 
=== Running a specific dump on request ===
See [[Dumps/Rerunning a job]] for how to run a specific dump.  This is done for special cases only.
 
== Deploying new code ==
 
See [[Dumps/How to deploy]] for this. In short: ssh to the deployment host, then
# <code>cd /srv/deployment/dumps/dumps</code>
# <code>git pull</code>
# <code>scap deploy</code>

Note: you likely need to be in the ops ldap group to do the scap. Also note that changes pushed will not take effect until the next dump run; any current run uses the existing dump code to complete.
 
== Bugs, known limitations, etc. ==
 
See [[Dumps/Known issues and wish list]] for this.
 
== File layout ==
 
* <base>/
** [http://dumps.wikimedia.org/index.html index.html] - Information about the server
** [http://dumps.wikimedia.org/backup-index.html backup-index.html] - List of all databases and their last-touched status
** [http://dumps.wikimedia.org/afwiki/ <db>/]
*** <date>/
**** [http://dumps.wikimedia.org/afwiki/20060122/ index.html] - List of items in the database


Sites are currently identified by raw database name. A 'friendly' name/hostname may be added in the future for convenience of searching.
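
Since the layout is just static files over HTTP, you can poke at it with curl; for example, to pull the last-run status line for one wiki out of the central index (afwiki is only an example):

<pre>
# Show the status line for a single wiki from the central dumps index.
curl -s http://dumps.wikimedia.org/backup-index.html | grep -i afwiki
</pre>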


== See also ==
* [[dumpHTML]]: static HTML dumps


[[Category:How-To]]
[[Category:Risk management]]
[[Category:dumps]]
