'''We want mirrors!  For more information see [[Dumps/Mirror status]].'''
{{Navigation Wikimedia infrastructure|expand=mw}}{{Hatnote|See [[Help:Toolforge/Dumps]] for information on using Dumps data from [[Portal:Toolforge|Toolforge]].}}
These docs are for '''maintainers''' of the various dumps. Information for '''users''' of the dumps can be found on [[meta:Data dumps|metawiki]]. Information for '''developers''' can be found on [[mw:SQL/XML_Dumps|mediawiki.org]].


If you're a Toolforge user and want to use the dumps, check out [[Help:Shared storage]] for information on where to find the files.
=== Daily checks ===
Dumps maintainers should watch or check a few things every day:
* email to the ops-dumps mail alias (get on it! [[SRE/Clinic_Duty#Mail_aliases]])
* [https://lists.wikimedia.org/pipermail/xmldatadumps-l/ xmldatadumps-l mailing list]
* [https://phabricator.wikimedia.org/tag/dumps-generation/ phabricator dumps workboard]
* [https://dumps.wikimedia.org/ the current dumps run, if not idle]
* icinga for dumps hosts: [https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=snapshot1 snapshot hosts], [https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=dumpsdata dumpsdata hosts]


For a list of various information sources about the dumps, see [[Dumps/Other information sources]].
=== Dumps types ===
We produce several types of dumps. For information about deployment of updates, architecture of the dumps, and troubleshooting each dump type, check the appropriate entry below.


* [[Dumps/XML-SQL Dumps|xml/sql dumps]] which contain '''revision metadata and content''' for public Wikimedia projects, along with contents of select '''sql tables'''
* [[Dumps/Adds-changes_dumps|adds/changes dumps]] which contain a '''daily xml dump of new pages''' or pages with '''new revisions''' since the previous run, for public Wikimedia projects
* [[Dumps/WikidataDumps|Wikidata entity dumps]] which contain dumps of ''' 'entities' (Qxxx)''' in various formats, and a dump of '''lexemes''', run once a week
* [[Dumps/CategoriesRDF|category dumps]] which contain weekly full and daily incremental '''category lists''' for public Wikimedia projects, in '''rdf format'''
* [[Dumps/OtherMisc|other miscellaneous dumps]] including '''content translation''' dumps, '''cirrus search''' dumps, and '''global block''' information

Other useful links:
* For downloading older media dumps, go to [https://archive.org/details/wikimedia-mediatar?&sort=-downloads archive.org] (see [[Dumps/Archive.org]] for details).
* For current dumps issues, see the [https://phabricator.wikimedia.org/project/sprint/board/1519/ Dumps-generation project] in Phabricator; see [[Dumps/Known issues and wish list]] for a much older wishlist.
* For current redesign plans and discussion, see [[Dumps/Dumps 2.0: Redesign]].
* For historical information about the dumps, see [[Dumps/History]].
* For info on HTML dumps, see [[dumpHTML]].


Other datasets are also provided for download, such as page view counts; these datasets are managed by other folks and are not documented here.

''The following info is for folks who hack on, maintain and administer the dumps and the dump servers.''


{| cellspacing="0" cellpadding="0" style="clear: {{{clear|right}}}; margin-bottom: .5em; float: right; padding: .5em 0 .8em 1.4em; background: none; width: {{{width|{{{1|auto}}}}}};"
| __TOC__
|}

=== Hardware ===
* [[Dumps/Snapshot hosts|Dumps snapshot hosts]] that run scripts to generate the dumps
* [[Dumps/Dumpsdata hosts|Dumps datastores]] where the snapshot hosts write intermediate and final dump output files, which are later published to our web servers
* [[Dumps/Dump servers|Dumps servers]] that provide the dumps to the public, to our mirrors, and via nfs to Wikimedia Cloud Services and stats host users


=== Adding new dumps ===

If you are interested in adding a new dumpset, please check the [[Dumps/New dumps and datasets|guidelines]] (still in draft form).

If you are working with wikibase dumps of some sort, you might want to look at a code walkthrough; see [[Dumps/Wikibase dumps overview]].

=== Testing changes to the dumps or new scripts ===

See [[Dumps/Testing]] for more about this.

=== Mirrors ===

If you are adding a mirror, see [[Dumps/Mirror status|Dumps mirror setup]].

== Setup ==

=== Current architecture ===

Rather than bore you with that here, see [[Dumps/Current Architecture]].

=== Current hosts ===

For which hosts are serving data, see [[Dumps/Dump servers]]. For which hosts are generating dumps, see [[Dumps/Snapshot hosts]]. For which hosts are providing space via NFS for the generated dumps, see [[Dumps/Dumpsdata hosts]].

=== Adding a new snapshot host ===

Install the host and add it to site.pp in the snapshot stanza (see snapshot1005-7). Add the relevant hiera entries, documented in site.pp, according to whether the server will run the en wiki dumps (only one server should do so) or the misc cron jobs (one host should do so, and not the same host that runs the en wiki dumps).

Dumps run out of /srv/deployment/dumps/dumps/xmldumps-backup on each server.  Deployment is done via scap3 from the deployment server.
 
=== Starting dump runs ===
 
# Do nothing.  These jobs run out of cron.
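If you want to double-check that the scheduled jobs are in place on a snapshot host, a generic look at the cron entries is enough; the exact file names and the user the jobs run as are puppet-managed and not guaranteed here:
<pre>
# find dump-related cron entries installed by puppet (file names and users vary)
grep -ril dump /etc/cron.d/ /etc/crontab 2>/dev/null
</pre>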
 
== Troubleshooting ==
 
=== Fixing code ===
 
The dumps code is all in the repo [https://gerrit.wikimedia.org/r/gitweb?p=operations/dumps.git;a=tree;f=xmldumps-backup;hb=master /operations/dumps.git, branch 'master'].  Various supporting scripts that are not part of the dumps proper are in puppet; you can find those in the snapshot module.
 
Getting a copy as a committer:
: <code>git clone ssh://<user>@gerrit.wikimedia.org:29418/operations/dumps.git</code>
: <code>cd dumps</code>
: <code>git checkout master</code>
 
ssh to the deployment host, then:
# <code>cd /srv/deployment/dumps/dumps</code>
# <code>git pull</code>
# <code>scap deploy</code>
 
Note: you likely need to be in the ops ldap group to do the scap.
Also note that changes pushed will not take effect until the next dump run; any run already in progress completes with the existing dump code.
 
=== Fixing configuration files ===
 
Configuration file setup is handled in the snapshot puppet module. You can check the config files themselves at /etc/dumps/confs on any snapshot host.
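For a quick look at what is actually deployed, you can list the generated files directly on a snapshot host; the exact file names vary, so take them from the listing:
<pre>
# inspect the puppet-generated dump configuration files on a snapshot host
ls -l /etc/dumps/confs/
less /etc/dumps/confs/[name from the listing above]
</pre>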
 
=== Out of space ===
 
See [[Dumps/Dumpsdata hosts#Space issues]] if we are running out of space on the hosts where the dumps are written as generated.
 
See [[Dumps/Dump servers#Space issues]] if we are running out of space on the dumps web or rsync servers.
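For a quick check of how full things are, plain <code>df</code> on the relevant mount point is enough; <code>/mnt/data</code> is the NFS-mounted datastore path as seen from the snapshot hosts (the dumpsdata and web servers have their own paths):
<pre>
# free space on the dumps NFS mount, as seen from a snapshot host
df -h /mnt/data
</pre>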
 
=== Broken dumps ===
 
The dumps can break in a few interesting ways.
 
# They no longer appear to be running. Is the monitor running? See below. If it is running, perhaps all the workers are stuck on a stage, waiting for a previous stage that failed.
#: Shoot them all and let the cron job sort it out. You can also look at the error notifications section and see if anything turns up; fix the underlying problem and wait for cron (see the sketch after this list).
# A dump for a particular wiki has been aborted.  This may be due to me shooting the script because it was behaving badly, or because a host was powercycled in the middle of a run.
#: The next cron job should fix this up.
# A dump on a particular wiki has failed.
#: Check the information on error notifications, track down the underlying issue (db outage? MW deploy of bad code? Other?), fix it, and wait for cron to rerun it.
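For item 1 above, a minimal sketch of "shooting them all" on a snapshot host; the script name <code>worker.py</code> is an assumption about the dump runner's entry point, so verify with <code>ps</code> before killing anything:
<pre>
# list dump worker processes that look stuck (script name is an assumption; verify first)
pgrep -af worker.py
# if they really are wedged, kill them and let the next cron run clean up
pkill -f worker.py
</pre>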
 
=== Error notifications ===
 
Email is ordinarily sent to ops-dumps@wikimedia.org (an alias) if a dump does not complete successfully. If you want to follow and fix failures, add yourself to that alias.
 
Logs are kept of each run. From any snapshot host, you can find the log for a given wiki and run at <code>/mnt/data/xmldatadumps/private/<wikiname>/<date>/dumplog.txt</code>. From these you may glean more about the reasons for the failure.

Logs that capture the rest of the output are available in /var/log/dumps/ and may also contain clues.
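When tracking down a failure, grepping the per-run log and the general logs for error text is usually the quickest start (the wiki name and run date below are placeholders, as above):
<pre>
# search a specific wiki/run log for failures (substitute the wiki name and run date)
grep -iE 'error|fail|exception' /mnt/data/xmldatadumps/private/[wikiname]/[date]/dumplog.txt
# scan the general dump logs on the snapshot host for recent errors
grep -riE 'error|fail' /var/log/dumps/ | tail -n 50
</pre>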
 
When one or more steps of a dump fail, the index.html file for that dump includes a notation of the failure and sometimes more information about it. Note that one step of a dump failing does not prevent other steps from running unless they depend on the data from that failed step as input.
 
=== Monitoring is broken ===
 
If the monitor does not appear to be running (the index.html file showing the dumps status is never updated), check which host should have it running: look for the host with <code>profile::dumps::generation::worker::monitor</code> in its role (at this writing, snapshot1007).  The monitor runs as a service managed by systemd or upstart, depending on the OS version, and should be restarted automatically, so if it stays down you'll want to find out what change broke it.
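To locate that host, you can grep a checkout of the puppet repo for the profile name, and then look for the service unit on the host itself; the unit name is not guaranteed, hence the broad search:
<pre>
# in a checkout of operations/puppet: find where the monitor profile is pulled in
grep -rl 'profile::dumps::generation::worker::monitor' .
# on the snapshot host itself: look for a dumps monitor service unit (name may vary)
systemctl list-units --all | grep -i dump
</pre>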
 
=== Rerunning dumps ===
 
You really really don't want to do this.  These jobs run out of cron.  All by themselves. Trust me. Once the underlying problem (bad MW code, unhappy db server, out of space, etc) is fixed, it will get taken care of.
 
Okay, you don't trust me, or something's really broken.  See [[Dumps/Rerunning a job]] if you absolutely have to rerun a wiki/job.
 
=== A dump server (snapshot host) dies ===
 
If it can be brought back up within a day, don't bother to take any measures, just get the box back in service. If there are deployments scheduled in the meantime, you may want to remove it from scap targets for mediawiki: edit hieradata/common/scap/dsh.yaml for that.
 
If it's the testbed host (check the role in site.pp), just leave everything alone; no services will be impacted.
 
If it will take more than a day to be fixed, swap it for the testbed/canary box, and remove it from scap targets for mediawiki (a rough command sketch follows this list):
* open manifests/site.pp and find the stanza for the broken snapshot host, grab that role
* now look for the snapshot host with <code>role(dumps::generation::worker::testbed)</code>, and put the broken host's role there
* in hieradata/hosts, git mv <code>brokenhost</code> to that testbed hostname, if there is such a file
* edit hieradata/common/scap/dsh.yaml to remove the broken host as a mediawiki scap target
* merge all the things
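A rough sketch of those puppet repo edits, with placeholder hostnames (snapshotNNNN for the broken host, snapshotMMMM for the testbed); double-check the real roles and file names before committing:
<pre>
# in a checkout of operations/puppet
$EDITOR manifests/site.pp                        # swap the roles of snapshotNNNN and snapshotMMMM
git mv hieradata/hosts/snapshotNNNN.yaml hieradata/hosts/snapshotMMMM.yaml   # only if such a file exists
$EDITOR hieradata/common/scap/dsh.yaml           # remove snapshotNNNN from the mediawiki scap targets
git commit -a                                    # then send for review and merge as usual
</pre>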
 
=== A dumpsdata host dies ===
 
Coming soon...
 
=== A labstore host dies (web or nfs server for dumps) ===
 
These are managed by Wikimedia Cloud Services.  Should this situation arise, someone on that team should carry out the procedure below.
 
At this writing there are two labstore boxes we care about: one serves web to the public plus NFS to the stats hosts; the other serves NFS to Cloud VPS instances/Toolforge.
 
*Determine which box went down. You can look at hieradata/common.yaml and the values for dumps_dist_active_web, dumps_dist_nfs_servers, and dumps_dist_active_vps for this.
*Remove the host from dumps_dist_nfs_servers.
*Change dumps_dist_active_vps to the other server, if the dead server was the vps NFS server.
*Change dumps_dist_active_web to the other server, if the dead server was NOT the vps NFS server (this means it was the stats NFS server, which is all that this setting controls).
*Forcibly unmount the NFS mount for the dead host everywhere you can in Toolforge. Try Cumin first; if that fails, try clush for Toolforge. See [[#Notes on NFS issues and ToolForge load]] for more about this.
** Hint: If using clush under pressure, try: <pre>clush -w @all 'sudo umount -fl /mnt/nfs/dumps-[FQDN of down server]'</pre> on tools-clushmaster-02.tools.eqiad.wmflabs
*If the dead server was the web server:
**Switch the values in hieradata/hosts/<deadhostname> and hieradata/hosts/<vps nfs hostname> so that the other server has do_acme: true.  Without this, https will likely fail due to an expired certificate.
**Change the 'dumps' entry [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/refs/heads/master/templates/wikimedia.org here], and deploy to gdns according to https://wikitech.wikimedia.org/wiki/DNS#authdns-update
**Once that change has had some time to propagate (check the TTL), test to see that it successfully picked up a cert (checking https://dumps.wikimedia.org should work).  Trying puppet runs on the working server might be helpful here.
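One way to check that the newly active web server picked up a good certificate once DNS has propagated (standard openssl client, nothing dumps-specific):
<pre>
# confirm the certificate served for dumps.wikimedia.org is valid and not expired
echo | openssl s_client -connect dumps.wikimedia.org:443 -servername dumps.wikimedia.org 2>/dev/null \
  | openssl x509 -noout -subject -dates
</pre>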
 
==== Notes on NFS issues and ToolForge load ====
Both hosts' NFS filesystems are mounted on all hosts that use either server for NFS, and the clients determine which NFS filesystem to use based on a symlink that varies from cluster to cluster. The dumps_dist_active_web setting only affects the symlink to the NFS filesystem on the stats hosts. Likewise, the dumps_dist_active_vps setting only affects the symlink to the NFS filesystem on the VPSes (including Toolforge).
 
If the server is the vps NFS server (the value of dumps_dist_active_vps), [[Toolforge]] is probably losing its mind by now.  The best that can be done is to remove it from dumps_dist_nfs_servers, change dumps_dist_active_vps to the working server, and '''unmount that NFS share everywhere you possibly can'''.  The earlier this is done, the better.  Load will be climbing like mad the entire time on any [[Cloud_vps|Cloud VPS]] server, including Toolforge nodes; this may or may not stop once you have unmounted everything.
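To see which share a client is actually using, compare the mounted filesystems with the symlink; the mount point pattern below comes from the clush hint above, while the symlink location is an assumption and varies per cluster:
<pre>
# on a client (e.g. a Toolforge node): which dumps NFS shares are mounted right now?
mount | grep dumps
# and which one the cluster's symlink currently points at (path is an assumption; varies per cluster)
ls -l /public/dumps/
</pre>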
 
 
[[Category:How-To]]
[[Category:Risk management]]
[[Category:dumps]]
