{{Navigation Wikimedia infrastructure|expand=mw}}{{Hatnote|See [[Help:Toolforge/Dumps]] for information on using Dumps data from [[Portal:Toolforge|Toolforge]].}}
These docs are for '''maintainers''' of the various dumps. Information for '''users''' of the dumps can be found on [[meta:Data dumps|metawiki]]. Information for '''developers''' can be found on [[mw:SQL/XML_Dumps|mediawiki.org]].

For a list of various information sources about the dumps, see [[Dumps/Other information sources]].
=== Daily checks ===
Dumps maintainers should watch or check a few things every day:
* email to the ops-dumps mail alias (get on it! [[SRE/Clinic_Duty#Mail_aliases]])
* [https://lists.wikimedia.org/pipermail/xmldatadumps-l/ xmldatadumps-l mailing list]
* [https://phabricator.wikimedia.org/tag/dumps-generation/ phabricator dumps workboard]
* [https://dumps.wikimedia.org/ the current dumps run, if not idle]
* icinga for dumps hosts: [https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=snapshot1 snapshots], [https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=dumpsdata dumpsdata hosts]


=== Dumps types ===
We produce several types of dumps. For information about deployment of updates, architecture of the dumps, and troubleshooting each dump type, check the appropriate entry below.
* [[Dumps/XML-SQL Dumps|xml/sql dumps]] which contain '''revision metadata and content''' for public Wikimedia projects, along with contents of select '''sql tables'''
* [[Dumps/Adds-changes_dumps|adds/changes dumps]] which contain a '''daily xml dump of new pages''' or pages with '''new revisions''' since the previous run, for public Wikimedia projects
* [[Dumps/WikidataDumps|Wikidata entity dumps]] which contain dumps of ''' 'entities' (Qxxx)''' in various formats, and a dump of '''lexemes''', run once a week.
* [[Dumps/CategoriesRDF|category dumps]] which contain weekly full and daily incremental '''category lists''' for public Wikimedia projects, in '''rdf format'''
* [[Dumps/OtherMisc|other miscellaneous dumps]] including '''content translation''' dumps, '''cirrus search''' dumps, and '''global block''' information.


Other datasets are also provided for download, such as page view counts; these datasets are managed by other folks and are not documented here.

=== Hardware ===
* [[Dumps/Snapshot hosts | Dumps snapshot hosts]] that run scripts to generate the dumps
* [[Dumps/Dumpsdata hosts | Dumps datastores]] where the snapshot hosts write intermediate and final dump output files, which are later published to our web servers
* [[Dumps/Dump servers | Dumps servers]] that provide the dumps to the public, to our mirrors, and via nfs to Wikimedia Cloud Services and stats host users

=== Status ===
User-visible files appear at http://dumps.wikimedia.org/backup-index.html. For which hosts are serving data, see [[Dumps/Dump servers]]. For which hosts are generating dumps, see [[Dumps/Snapshot hosts]].

=== Adding new dumps ===
If you are interested in adding a new dumpset, please check the [[Dumps/New dumps and datasets|guidelines]] (still in draft form).

If you are working with wikibase dumps of some sort, you might want to look at a code walkthrough; see [[Dumps/Wikibase dumps overview]].

=== Testing changes to the dumps or new scripts ===
See [[Dumps/Testing]] for more about this.

=== Mirrors ===
We want mirrors!  For more information see [[Dumps/Mirror status]].

If you are adding a mirror, see [[Dumps/Mirror status|Dumps Mirror setup]].

== Overview ==

Dump activity involves a monitor node (running status sweeps) and arbitrarily many worker nodes running the dumps.

=== Architecture ===

English language Wikipedia dumps are run once a month; all other wikis are dumped twice a month.  The first run includes page content for all revisions; the second dump run skips this step.

Dumps are run in "stages".  This doesn't really impact the English language Wikipedia dumps, as one stage simply runs directly after the other on a dedicated server, so that all dump steps get done without interruption by other wiki runs.

For the other wikis, the stages for the first run of the month look like this: stub xml files (files with all of the metadata for pages and revisions but without any page content) are dumped first.  After that's been done on all small wikis, all tables are dumped.  This is then done for the 'big' wikis (see list [[here]]).  Then the current page content for articles is dumped, first for small wikis and then for big ones.  And so on.

The stages for the second run of the month are identical to the above, but without the full page history content dumps.
A '[https://phabricator.wikimedia.org/diffusion/ODUM/browse/master/xmldumps-backup/dumpscheduler.py dump scheduler]' manages the running of these steps on each available host, given the number of available cores and the order we want the jobs to run in.
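
The general idea, as a very rough Python sketch (not the actual <code>dumpscheduler.py</code>; the commands and core count below are placeholders):

<syntaxhighlight lang="python">
# Rough sketch of the scheduling idea only: run a list of stage commands,
# keeping at most max_cores of them going at once, in the order given.
# The real dumpscheduler.py reads its commands and core counts from config.
import subprocess
import time

def run_stages(commands, max_cores):
    queued = list(commands)
    running = []
    while queued or running:
        running = [p for p in running if p.poll() is None]   # drop finished jobs
        while queued and len(running) < max_cores:            # fill any free slots, in order
            running.append(subprocess.Popen(queued.pop(0), shell=True))
        time.sleep(5)

# placeholder commands; in production these would be dump stage invocations
run_stages(["echo stubs-step", "echo tables-step", "echo articles-step"], max_cores=2)
</syntaxhighlight>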
 
=== Worker nodes ===
 
There is a [https://phabricator.wikimedia.org/diffusion/ODUM/browse/master/xmldumps-backup/worker worker script] which goes through the set of available wikis a single time for a given dump step, starting with the wiki that has gone the longest without a dump. The dump scheduler starts up several of these on each host, according to the number of free cores configured, starting the script anew for the same or a later stage, as determined in the stages list.
 
For each wiki, the worker script simply runs the python script [https://phabricator.wikimedia.org/diffusion/ODUM/browse/master/xmldumps-backup/worker.py worker.py] on the given wiki. To ensure that multiple workers don't try to dump the same wiki at the same time, the python script locks the wiki before proceeding.  Stale locks are eventually removed by a monitor script; if you try to run a dump of a wiki by hand when one is already in progress, you will see an error message to the effect that a wiki dump is already running, along with the name of the lock file.
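
The locking idea looks roughly like this in Python (a sketch only; the lock file location and naming here are made up, not what the production code uses):

<syntaxhighlight lang="python">
# Sketch of per-wiki locking: create the lock file only if it does not
# already exist, and bail out if some other process holds it.
# The lock directory below is hypothetical.
import os
import sys

def lock_wiki(wiki, lockdir="/tmp/dumplocks"):
    os.makedirs(lockdir, exist_ok=True)
    lockfile = os.path.join(lockdir, wiki + ".lock")
    try:
        # O_EXCL makes the open fail if the lock file already exists
        fd = os.open(lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        sys.exit("dump of %s already running, lock file: %s" % (wiki, lockfile))
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))   # record who holds the lock
    return lockfile
</syntaxhighlight>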
 
=== Monitor node ===
 
On one host, the [https://phabricator.wikimedia.org/diffusion/ODUM/browse/master/xmldumps-backup/monitor monitor script] runs a [https://phabricator.wikimedia.org/diffusion/ODUM/browse/master/xmldumps-backup/monitor.py python script] for each wiki that checks for and removes stale lock files from dump processes that have died, and updates the central <code>index.html</code> file which shows the dumps in progress and the status of the dumps that have completed (i.e. <code>http://dumps.wikimedia.org/backup-index.html</code> ). That is its sole function.
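
The stale-lock sweep boils down to checking the age of each lock file; a minimal sketch (the directory and cutoff are invented for illustration, not the production values):

<syntaxhighlight lang="python">
# Sketch of stale-lock cleanup: remove lock files that have not been touched
# for a while, on the assumption that the dump process holding them died.
import os
import time

def remove_stale_locks(lockdir="/tmp/dumplocks", max_age_secs=3600):
    now = time.time()
    for name in os.listdir(lockdir):
        if not name.endswith(".lock"):
            continue
        path = os.path.join(lockdir, name)
        if now - os.path.getmtime(path) > max_age_secs:
            print("removing stale lock", path)
            os.remove(path)
</syntaxhighlight>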
 
=== Code ===
 
Check [https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=tree;f=xmldumps-backup;hb=master /operations/dumps.git, branch 'master'] for the python code in use.  Some tools are in the 'ariel' branch, but all dumps code running in production is in master or, for a few scripts not directly related to dumps production, in our puppet repo.
 
Getting a copy:
: <code>git clone https://gerrit.wikimedia.org/r/p/operations/dumps.git</code>
: <code>git checkout master</code>
 
Getting a copy as a committer:
: <code>git clone ssh://<user>@gerrit.wikimedia.org:29418/operations/dumps.git</code>
: <code>git checkout master</code>
 
=== Programs used ===
 
See also [[Dumps/Software dependencies]].
 
The scripts call mysqldump, getSlaveServer.php, eval.php, dumpBackup.php, and dumpTextPass.php directly for dump generation. These in turn require backup.inc and backupPrefetch.inc and may call ActiveAbstract/AbstractFilter.php and fetchText.php.
 
The generation of XML files relies on Export.php under the hood and of course the entire MW infrastructure.
 
The worker.py script relies on a few C programs for various bz2 operations: checkforbz2footer and recompressxml, both in /usr/local/bin/. These are in the git repo in branch 'ariel', see [https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=tree;f=xmldumps-backup/mwbzutils;h=e76ee6cb52fd40e570e2e62a969f8b57902de1b9;hb=ariel].
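
For background, <code>checkforbz2footer</code> verifies that a bz2 file ends with the bzip2 end-of-stream marker, which is handy for spotting truncated output. A much-simplified Python sketch of the idea follows; the real utility does a bit-level search, since bzip2 streams are not byte-aligned, so this byte-aligned version can miss valid footers:

<syntaxhighlight lang="python">
# Simplified illustration: look for the bzip2 end-of-stream magic
# (0x177245385090) in the last bytes of the file.
import os

BZ2_FOOTER_MAGIC = bytes.fromhex("177245385090")

def has_bz2_footer(path, tail_bytes=1024):
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - tail_bytes))
        tail = f.read()
    return BZ2_FOOTER_MAGIC in tail
</syntaxhighlight>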
 
== Setup ==
 
=== Adding a new worker box ===
 
Install the host and add it to site.pp in the snapshot stanza (see snapshot1005-7).  Add the relevant hiera entries, documented in site.pp, according to whether the server will run en wiki dumps (only one server should do so) or misc cron jobs (one host should do so, and not the same host that runs the en wiki dumps).
 
Dumps run out of /srv/deployment/dumps/dumps on each server.  Deployment is done via scap3 from the deployment server.
 
=== Starting dump runs ===
 
# Do nothing.  These jobs run out of cron.
 
== Dealing with problems ==
 
===Space ===
If the hosts serving the dumps run low on disk space, you can reduce the number of backups that are kept.  Change the value for 'keep' in [[the conf file generation]] in puppet to a lower number.
 
===Failed runs===
Logs will be kept of each run. You can find them in the private directory (<code>/mnt/data/xmldatadumps/private/<wikiname>/<date>/</code>) for the particular dump, filename <code>dumplog.txt</code>.  You can look at them to see if there are any error messages that were generated for a given run.
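
If you want a quick scan across the logs for a given wiki, something like the following can help (a convenience sketch only; the path is the private area described above, and the patterns matched are just examples):

<syntaxhighlight lang="python">
# Quick scan of dump logs for lines that look like errors.
# The base path is the private dump area mentioned above.
import glob

def scan_logs(wiki, base="/mnt/data/xmldatadumps/private"):
    for logfile in sorted(glob.glob("%s/%s/*/dumplog.txt" % (base, wiki))):
        with open(logfile, errors="replace") as f:
            for line in f:
                if "error" in line.lower() or "failed" in line.lower():
                    print(logfile, ":", line.rstrip())

scan_logs("elwiktionary")   # any wiki database name
</syntaxhighlight>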
 
The worker script sends email if a dump does not complete successfully. It currently sends email to ops-dumps@wikimedia.org which is an alias. If you want to follow and fix failures, add yourself to that alias.
 
When one or more steps of a dump fail, the index.html file for that dump includes a notation of the failure and sometimes more information about it. Note that one step of a dump failing does not prevent other steps from running unless they depend on the data from that failed step as input.
 
See [[Dumps/Rerunning a job]] for how to rerun all or part of a given dump. This also explains what files may need to be cleaned up before rerunning.
 
===Dumps not running===
See the section above, 'Starting dump runs' if you need to restart a run across all wikis from the beginning.
 
If the monitor does not appear to be running (the index.html file showing the dumps status is never updated), check which host should have it running (see the hiera host entries for the snapshots and look for the one with monitor: true).  The monitor is a service that should be restarted automatically by systemd or upstart, depending on the OS version, so you'll want to work out what change broke it.
 
If the host crashes while the dump scheduler is running, the status files are left as-is and the display shows any dumps on any wikis as still running until the monitor node decides the lock file for those wikis is stale enough to mark them as aborted.
 
To restart the scheduler from where it left off:
 
Really, you can just wait for cron to pick it up; it checks twice a day for aborted runs, unless the job has fallen outside of the run date range.  You can check that date range by looking at the cron job entry for fulldumps.sh on any snapshot host.
 
If you're outside the range, just do this:
 
# be on each appropriate host as root
# start a screen session
# su - datasets
# bash fulldumps.sh starting_date_of_range todays_date wikitype(regular or huge) dumptype(full or partial)
 
Example: <code>bash fulldumps.sh 01 17 regular full</code>
 
This would pick up the full dumps for everything except enwiki, on the specific host you are running it on, for the run that starts on the first of the month, assuming that you are trying to run it on the 17th of the month or earlier.
 
This date cutoff may seem a little odd; it is built in so that the script does not try to start a dump run from scratch so late in the month that it cannot complete by the next run.
 
If the worker script encounters more than three failed dumps in a row (this threshold may be configured or hardcoded; check the code), it will exit; this avoids generating piles of broken dumps which would later need to be cleaned up.  Once the underlying problem is fixed, you can go to the screen session of the host running those wikis and rerun the previous command in all the windows.
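
The guard is essentially a consecutive-failure counter, something like this sketch (not the actual worker code; <code>run_one_dump</code> is a placeholder for whatever runs a single wiki's dump):

<syntaxhighlight lang="python">
# Sketch of the "give up after too many failures in a row" behaviour
# described above. run_one_dump(wiki) stands in for running one wiki's
# dump and returning True on success.
MAX_CONSECUTIVE_FAILURES = 3

def run_dumps(wikis, run_one_dump):
    failures_in_a_row = 0
    for wiki in wikis:
        if run_one_dump(wiki):
            failures_in_a_row = 0
        else:
            failures_in_a_row += 1
            if failures_in_a_row > MAX_CONSECUTIVE_FAILURES:
                # stop rather than produce a pile of broken dumps
                raise SystemExit("too many consecutive dump failures, giving up")
</syntaxhighlight>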
 
===Running a specific dump on request===
See [[Dumps/Rerunning a job]] for how to run a specific dump.  This is done for special cases only.
 
== Deploying new code ==
 
See [[Dumps/How to deploy]] for this.
 
== Bugs, known limitations, etc. ==
 
See [[Dumps/Known issues and wish list]] for this.
 
== File layout ==
 
* <base>/
** [http://dumps.wikimedia.org/index.html index.html] - Information about the server
** [http://dumps.wikimedia.org/backup-index.html backup-index.html] - List of all databases and their last-touched status
** [http://dumps.wikimedia.org/afwiki/ <db>/]
*** <date>/
**** [http://dumps.wikimedia.org/afwiki/20060122/ index.html] - List of items in the database
 
Sites are currently identified by raw database name. A 'friendly' name/hostname can be added in the future for convenience of searching.
 
== See also ==
* [[dumpHTML]]: static HTML dumps
* [[Dumps/Dumps 2.0: Redesign]]: current redesign plans and discussion
* [[Dumps/History]]: historical information about the dumps
* the [https://phabricator.wikimedia.org/project/sprint/board/1519/ Dumps-generation project] in Phabricator: current dumps issues
* [https://archive.org/details/wikimedia-mediatar?&sort=-downloads&page=2 archive.org]: downloads of older media dumps
 
[[Category:How-To]]
[[Category:Risk management]]
[[Category:dumps]]
