{{Navigation Wikimedia infrastructure|expand=mw}}{{Hatnote|See [[Help:Toolforge/Dumps]] for information on using Dumps data from [[Portal:Toolforge|Toolforge]].}}
These docs are for '''maintainers''' of the various dumps. Information for '''users''' of the dumps can be found on [[meta:Data dumps|metawiki]]. Information for '''developers''' can be found on [[mw:SQL/XML_Dumps|mediawiki.org]].

For a list of various information sources about the dumps, see [[Dumps/Other information sources]].
*For documentation on the "adds/changes" dumps, see [[Dumps/Adds-changes dumps]].
*For documentation on the media dumps, see [[Dumps/media]].
*For older development notes, see [[Dumps/Development 2012]].
*For current dumps issues, see the [https://phabricator.wikimedia.org/tag/dumps-generation/ Dumps-generation project] in Phabricator.
*For current redesign plans and discussion, see [[Dumps/Dumps 2.0: Redesign]].
*For historical information about the dumps, see [[Dumps/History]].

{| cellspacing="0" cellpadding="0" style="clear: right; margin-bottom: .5em; float: right; padding: .5em 0 .8em 1.4em; background: none; width: auto;"
| __TOC__
|}

=== Daily checks ===
Dumps maintainers should watch or check a few things every day:
* email to the ops-dumps mail alias (get on it! See [[SRE/Clinic_Duty#Mail_aliases]].)
* the [https://lists.wikimedia.org/postorius/lists/xmldatadumps-l.lists.wikimedia.org/ xmldatadumps-l mailing list]
* the [https://phabricator.wikimedia.org/tag/dumps-generation/ Phabricator dumps workboard]
* [https://dumps.wikimedia.org/ the current dumps run], if not idle
* Icinga for the dumps hosts: [https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=snapshot1 snapshot hosts] and [https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=dumpsdata dumpsdata hosts]

=== Dumps types ===
We produce several types of dumps. For information about deployment of updates, architecture of the dumps, and troubleshooting each dump type, check the appropriate entry below.
* [[Dumps/XML-SQL Dumps|xml/sql dumps]], which contain '''revision metadata and content''' for public Wikimedia projects, along with the contents of select '''sql tables'''
* [[Dumps/Adds-changes_dumps|adds/changes dumps]], which contain a '''daily xml dump of new pages''' or pages with '''new revisions''' since the previous run, for public Wikimedia projects
* [[Dumps/WikidataDumps|Wikidata entity dumps]], which contain dumps of '''entities (Qxxx)''' in various formats, and a dump of '''lexemes''', run once a week
* [[Dumps/CategoriesRDF|category dumps]], which contain weekly full and daily incremental '''category lists''' for public Wikimedia projects, in '''rdf format'''
* [[Dumps/OtherMisc|other miscellaneous dumps]], including '''content translation''' dumps, '''cirrus search''' dumps, and '''global block''' information

Other datasets are also provided for download, such as page view counts; these datasets are managed by other folks and are not documented here.

=== Hardware ===
* [[Dumps/Snapshot hosts|Dumps snapshot hosts]], which run the scripts that generate the dumps
* [[Dumps/Dumpsdata hosts|Dumps datastores]], where the snapshot hosts write intermediate and final dump output files that are later published to our web servers
* [[Dumps/Dump servers|Dumps servers]], which provide the dumps to the public, to our mirrors, and via NFS to Wikimedia Cloud Services and stats host users

=== Adding new dumps ===
If you are interested in adding a new dump set, please check the [[Dumps/New dumps and datasets|guidelines]] (still in draft form).

If you are working with wikibase dumps of some sort, you might want to look at a code walkthrough; see [[Dumps/Wikibase dumps overview]].

=== Testing changes to the dumps or new scripts ===
See [[Dumps/Testing]] for more about this.

=== Mirrors ===
We want mirrors! If you are adding a mirror, see [[Dumps/Mirror status|Dumps mirror setup]].

== Overview ==
User-visible files appear at https://dumps.wikimedia.org/backup-index.html

Dump activity involves a monitor node (running status sweeps) and arbitrarily many worker nodes running the dumps.

=== Status ===
For which hosts are serving data, see [[Dumps/Dump servers]]. For which hosts are generating which dumps, see [[Dumps/Snapshot hosts]].

=== Worker nodes ===
The worker processes go through the set of available wikis to dump automatically. Dumps are run on a "longest without a dump runs next" schedule. The plan is to have a complete dump of each wiki every two weeks, except for enwiki, which should have a complete dump once a month.

The shell script <code>worker</code>, which starts one of these processes, simply runs the python script <code>worker.py</code> in an endless loop. Multiple such workers can run at the same time on different hosts, as well as on the same host.

The <code>worker.py</code> script creates a lock file on the filesystem containing the dumps (as of this writing, <code>/mnt/data/xmldatadumps/</code>) in the subdirectory <code>private/name-of-wiki/lock</code>.  No other process will try to write dumps for that project while the lock file is in place.

Local copies of the shell script and the python script live on the snapshot hosts in the directory <code>/backups</code> and are run in screen sessions on the various hosts, as the user "backup".
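To make the locking behavior concrete, here is a minimal Python sketch of the idea. It is illustrative only, not the actual <code>worker.py</code> code; the paths are taken from the description above.

<syntaxhighlight lang="python">
import os
import sys

# Filesystem containing the dumps, per the description above.
DUMPS_ROOT = "/mnt/data/xmldatadumps"


def lock_path(wiki):
    """Per-wiki lock file under private/<name-of-wiki>/lock."""
    return os.path.join(DUMPS_ROOT, "private", wiki, "lock")


def try_lock(wiki):
    """Create the lock file for a wiki; fail if another worker already holds it."""
    path = lock_path(wiki)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    try:
        # O_EXCL makes creation atomic, so only one worker can win the race.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    # Record who holds the lock; useful when checking for stale locks later.
    os.write(fd, "{} {}\n".format(os.uname().nodename, os.getpid()).encode())
    os.close(fd)
    return True


def unlock(wiki):
    os.remove(lock_path(wiki))


if __name__ == "__main__":
    wiki = sys.argv[1]
    if not try_lock(wiki):
        print("{} is already being dumped elsewhere, skipping".format(wiki))
    else:
        try:
            print("the dump steps for {} would run here".format(wiki))
        finally:
            unlock(wiki)
</syntaxhighlight>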
=== Monitor node ===
 
The monitor node checks for and removes stale lock files from dump processes that have died, and updates the central <code>index.html</code> file which shows the dumps in progress and the status of the dumps that have completed (i.e. <code>https://dumps.wikimedia.org/backup-index.html</code>). ''It does not start or stop worker processes.''

The shell script <code>monitor</code>, which starts the process, simply runs the python script <code>monitor.py</code> in an endless loop.

As with the worker nodes, local copies of the shell script and the python script live on the snapshot hosts in the directory <code>/backups</code>, but they are currently run out of <code>/backups-atg</code> (since this code is not yet in trunk), in a screen session on one host, as the user "backup".
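A rough sketch of the stale-lock sweep, for orientation only; the real <code>monitor.py</code> logic, the staleness threshold, and the index.html generation are more involved, and the threshold below is an assumption.

<syntaxhighlight lang="python">
import os
import time

DUMPS_ROOT = "/mnt/data/xmldatadumps"  # as for the workers
STALE_AGE = 3600                       # assumed threshold, in seconds


def remove_stale_locks():
    """Remove per-wiki lock files that have not been touched recently."""
    private = os.path.join(DUMPS_ROOT, "private")
    now = time.time()
    for wiki in os.listdir(private):
        lockfile = os.path.join(private, wiki, "lock")
        try:
            age = now - os.stat(lockfile).st_mtime
        except OSError:
            continue  # no lock file for this wiki
        if age > STALE_AGE:
            # The worker that created this lock has presumably died.
            os.remove(lockfile)
            print("removed stale lock for {} (age {}s)".format(wiki, int(age)))


if __name__ == "__main__":
    while True:
        remove_stale_locks()
        # Regenerating the central backup-index.html would also happen here.
        time.sleep(60)
</syntaxhighlight>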
 
=== Code ===
 
Check [https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=tree;f=xmldumps-backup;hb=ariel operations/dumps.git, branch 'ariel'] for the python code in use.  Eventually this will make its way back into master; it still needs some cleanup first.
 
Getting a copy:
: <code>git clone https://gerrit.wikimedia.org/r/p/operations/dumps.git</code>
: <code>cd dumps</code>
: <code>git checkout ariel</code>
 
Getting a copy as a committer:
: <code>git clone ssh://<user>@gerrit.wikimedia.org:29418/operations/dumps.git</code>
: <code>cd dumps</code>
: <code>git checkout ariel</code>
 
=== Programs used ===
 
See also [[Dumps/Software dependencies]].
 
The scripts call mysqldump, getSlaveServer.php, eval.php, dumpBackup.php, and dumpTextPass.php directly for dump generation. These in turn require backup.inc and backupPrefetch.inc and may call ActiveAbstract/AbstractFilter.php and fetchText.php.
 
The generation of XML files relies on Export.php under the hood and of course the entire MW infrastructure.
 
The <code>worker.py</code> script relies on a few C programs for various bz2 operations: <code>checkforbz2footer</code> and <code>recompressxml</code>, both installed in <code>/usr/local/bin/</code>. These are in the git repo; see [https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=tree;f=xmldumps-backup/mwbzutils;h=e76ee6cb52fd40e570e2e62a969f8b57902de1b9;hb=ariel the mwbzutils directory].
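To show how these pieces fit together, here is a much simplified sketch of a two-pass XML dump for one wiki, driven from Python via <code>subprocess</code>. The exact flags and file names are illustrative assumptions rather than the production invocations; check the dumps repo for the real command lines.

<syntaxhighlight lang="python">
import subprocess


def run_content_dump(wiki, stub_file, prefetch_file, output_file):
    """Illustrative two-pass XML dump: stubs first, then revision text."""
    # Pass 1: dumpBackup.php writes revision metadata ("stubs") only.
    subprocess.check_call([
        "php", "maintenance/dumpBackup.php",
        "--wiki=" + wiki,
        "--stub",
        "--output=gzip:" + stub_file,
    ])
    # Pass 2: dumpTextPass.php fills in the text for each revision, reusing
    # text from a previous dump (prefetch) where possible and fetching the
    # rest from the database (via fetchText.php) otherwise.
    subprocess.check_call([
        "php", "maintenance/dumpTextPass.php",
        "--wiki=" + wiki,
        "--stub=gzip:" + stub_file,
        "--prefetch=gzip:" + prefetch_file,
        "--output=bzip2:" + output_file,
    ])
</syntaxhighlight>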
 
== Setup ==
 
=== Adding a new worker box ===
 
Install the host and add it to site.pp, copying one of the existing snapshot stanzas in puppet.  This does, among other things:
# Set up the base MW install, without apache running
# Add the worker to <code>/etc/exports</code> on [[dataset2]]
# Add <code>/mnt/data</code> to <code>/etc/fstab</code> on the worker host
# Build the utfnormal php module (done for lucid)
 
For now:
# Backups are running test code out of <code>/backups-atg</code> on each host, so grab a copy of that from any existing host and copy it into <code>/backups-atg</code> on the new host. This includes the conf files; you don't need to set them up separately.
#: '''In transition, being moved to /backups. To be updated as soon as the move is complete.'''
# Check over the configuration file and make sure it looks sane: all the paths point to things that exist, and so on. For full details, see [https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=blob_plain;f=xmldumps-backup/README.config;hb=ariel the README.config file in the git repo]. A small sanity-check helper is sketched after this list.
#* We run enwiki on its own host.  If this host is going to do that work, check <code>/backups-atg/wikidump.conf.enwiki</code>.
#* The next 8 or so largest wikis run on their own separate host so they don't backlog the smaller wikis.  For those, check <code>/backups-atg/wikidump.conf.bigwikis</code>.
#* The remaining wikis run on one host.  Check <code>/backups-atg/wikidump.conf</code> for those.
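The "paths point to things that exist" check mentioned above can be scripted. Below is a minimal sketch of such a helper (hypothetical; it is not part of the dumps repo), assuming the configuration files use the ini-style format described in README.config.

<syntaxhighlight lang="python">
import configparser
import os
import sys


def check_paths(conf_file):
    """Report config values that look like filesystem paths but do not exist."""
    conf = configparser.ConfigParser(interpolation=None)
    conf.read(conf_file)
    problems = []
    for section in conf.sections():
        for option, value in conf.items(section):
            # Heuristic: treat absolute-path-looking values as paths to verify.
            if value.startswith("/") and not os.path.exists(value):
                problems.append((section, option, value))
    return problems


if __name__ == "__main__":
    for section, option, value in check_paths(sys.argv[1]):
        print("[{}] {} = {} -- does not exist".format(section, option, value))
</syntaxhighlight>

Run it against the config file you are about to use, e.g. <code>/backups-atg/wikidump.conf.bigwikis</code>, before starting any workers.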
<!--We will eventually do...
# '''git pul something for public repo ...  /backups'''
# '''git pull something else for private repo with config files in it... /backups/conf'
# mv wikidump.conf ../.-->
 
== Dealing with problems ==
 
=== Space ===
If the host serving the dumps runs low on disk space, you can reduce the number of backups that are kept.  Edit the appropriate <code>/backups-atg/wikidump.conf*</code> file on the host running the set of dumps you would like to adjust (enwiki = <code>wikidump.conf.enwiki</code>, the next 8 or so big wikis = <code>wikidump.conf.bigwikis</code>, the rest = <code>wikidump.conf</code>) and change the line that says "keep=<some value>" to a smaller number.
 
===Failed runs===
Logs are kept for each run; you can find them in the directory for the particular dump, in the file <code>dumplog.txt</code>.  Look there for any error messages generated during a given run.
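For a quick look across many wikis at once, a small helper like the following can pull error lines out of the per-run logs. It is a hypothetical convenience script, not part of the dumps code, and it assumes the per-wiki, per-date directory layout described under "File layout" below.

<syntaxhighlight lang="python">
import glob
import os
import sys


def scan_dump_logs(base_dir):
    """Print error lines from each run's dumplog.txt under base_dir/<db>/<date>/."""
    pattern = os.path.join(base_dir, "*", "*", "dumplog.txt")
    for logfile in sorted(glob.glob(pattern)):
        with open(logfile, errors="replace") as f:
            bad = [line.rstrip() for line in f if "error" in line.lower()]
        if bad:
            print(logfile)
            for line in bad:
                print("    " + line)


if __name__ == "__main__":
    scan_dump_logs(sys.argv[1])  # e.g. the root of the dumps output tree
</syntaxhighlight>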
 
The worker script can send email if a dump does not complete successfully.  (Better enable this.)  It currently sends email to...
 
When one or more steps of a dump fail, the index.html file for that dump includes a notation of the failure and sometimes more information about it. Note that one step of a dump failing does not prevent other steps from running unless they depend on the data from that failed step as input.
 
See [[Dumps/Rerunning a job]] for how to rerun all or part of a given dump. This also explains what files may need to be cleaned up before rerunning.
 
===Dumps not running===
This covers restarting the dumps after a host reboot, after a reboot of the dataset host with the nfs share where dumps are written (which may cause dumps to hang), or after the dumps stop running for other reasons.

If the host crashes while the script is running, the status files are left as-is and the display shows the run as still in progress until the monitor node decides the lock file is stale enough to mark it as aborted.  To restart, start a screen session on the host as root and fire up the appropriate number of worker scripts with the appropriate config file option.  See [[Dumps/Snapshot hosts]] for which hosts do what; it lists which commands get run on each host and in how many windows.  If the monitor script is not running, restart it in a separate window of the same screen session; see the Dump servers page for the command and for which host it runs on.

If the worker script encounters more than three failed dumps in a row (currently configured as such? or did I hardcode that?), it will exit; this avoids generating piles of broken dumps which would later need to be cleaned up.  Once the underlying problem is fixed, you can go to the screen session on the host running those wikis and rerun the previous command in all the windows.  See [[Dumps/Snapshot hosts]] for which hosts do what if you're not sure.
 
===Running a specific dump on request===
See [[Dumps/Rerunning a job]] for how to run a specific dump.  This is done for special cases only.
 
== Deploying new code ==
 
See [[Dumps/How to deploy]] for this.
 
== Bugs, known limitations, etc. ==
 
See [[Dumps/Known issues and wish list]] for this.
 
== File layout ==
 
* <base>/
** [http://dumps.wikimedia.org/index.html index.html] - Information about the server
** [http://dumps.wikimedia.org/backup-index.html backup-index.html] - List of all databases and their last-touched status
** [http://dumps.wikimedia.org/afwiki/ <db>/]
*** <date>/
**** [http://dumps.wikimedia.org/afwiki/20060122/ index.html] - List of items in the database
 
Sites are currently identified by raw database name. A 'friendly' name/hostname may be added in the future to make searching more convenient.
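For scripts that need to locate a particular run, the layout above maps directly onto URLs (or onto paths on the NFS share). A trivial sketch, using the afwiki run linked above:

<syntaxhighlight lang="python">
BASE = "https://dumps.wikimedia.org"


def run_index_url(db, date):
    """URL of the index page for one dump run, following the layout above."""
    return "{}/{}/{}/index.html".format(BASE, db, date)


print(run_index_url("afwiki", "20060122"))
# https://dumps.wikimedia.org/afwiki/20060122/index.html
</syntaxhighlight>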
 
== See also ==
* [[dumpHTML]]: static HTML dumps
 
[[Category:How-To]]
[[Category:Risk management]]
[[Category:dumps]]
