Dumps/New dumps and datasets
This page is currently a draft.
More information and discussion about changes to this draft on the talk page.
This is a work in progress and may be completely rewritten/thrown out/flushed down the toilet. You Have Been Warned.
New dumps or datasets
Adding new dumps
So you want to generate dumps for a new extension or for new content; what should you do?
These guidelines describe what is necessary to get your dumps and datasets generated and added to our public webserver.
- Talk to us first. How big will these dumps grow? How long will they take to run on one CPU? How much memory will they need? What resources will they need over the next three to five years? We need this information so that we can plan properly for server capacity.
- Dumps that communicate with MediaWiki databases must use the vslow (dump) db server group, as described in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/master/wmf-config/db-eqiad.php or analogous files. We do this because dump queries are typically long-running, and long-running queries cause problems when run on the databases used by the application servers. Typically one can use code like the following to get a connection to a database server in the right group:
$lb = wfGetLBFactory()->newMainLB();
$db = $lb->getConnection( DB_REPLICA, 'dump' );
- If your dump process retrieves revision content and not just metadata, it must be written to run in two passes: one pass to write out the metadata, and a second pass to write the content, re-using revision content from the previous run where available. Because retrieval of revision content from our external storage is very expensive, reusing previously retrieved content whenever possible is paramount, both for the speed of the dump run and for reducing the load on the external storage database servers.
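A minimal sketch of the two-pass idea in shell. The file layout, the "prefetch" directory name, and the fetch_from_storage stand-in are illustrative assumptions, not the real dump code:

```shell
#!/bin/sh
# Pass 1: write metadata (here, just a list of revision IDs).
mkdir -p current prefetch
printf '%s\n' 101 102 103 > current/revision_ids.txt

# Simulate a previous run that already fetched content for revision 101.
echo "content of rev 101" > prefetch/101.txt

# Stand-in for the expensive external storage lookup.
fetch_from_storage() {
    echo "content of rev $1"
}

# Pass 2: reuse content from the previous run when available; only
# revisions missing from the prefetch directory hit external storage.
while read -r rev; do
    if [ -f "prefetch/$rev.txt" ]; then
        cp "prefetch/$rev.txt" "current/$rev.txt"        # cheap reuse
    else
        fetch_from_storage "$rev" > "current/$rev.txt"   # expensive fetch
    fi
done < current/revision_ids.txt
```

The real dumps use the same principle with XML stub and content files rather than one file per revision.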
- Expect database servers to be depooled for maintenance without warning during your dump run. This means that any given dump job should be broken down into small tasks that take no longer than a few hours, and that can be rerun automatically up to some number of retries.
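A sketch of the retry pattern in shell. The run_one_task function is a stand-in that fails twice before succeeding, simulating a depooled database server; a real job step goes in its place:

```shell
#!/bin/sh
# Hypothetical small dump task; replace with a real job step.
run_one_task() {
    # Fail the first two times to simulate a depooled db server.
    attempts=$(cat attempts 2>/dev/null || echo 0)
    attempts=$((attempts + 1))
    echo "$attempts" > attempts
    [ "$attempts" -ge 3 ]
}

max_retries=5
i=0
until run_one_task; do
    i=$((i + 1))
    if [ "$i" -ge "$max_retries" ]; then
        echo "task failed after $max_retries retries" >&2
        exit 1
    fi
    sleep 1   # back off before retrying (real jobs might wait longer)
done
echo "task succeeded"
```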
- If there are consistency checks that can be done on your data to verify that the output is valid, run them. A bug in deployed code can cause all kinds of things to go wrong; you can, for example, check that files have the right starting and ending content (for XML files), and that compressed files were written completely (gzip or bzip2 files).
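For example, gzip and bzip2 can verify their own files with the -t flag, and for XML output the opening and closing tags can be checked directly. A small self-contained demonstration (the file name and XML content are made up for the example):

```shell
#!/bin/sh
# Create a small compressed file, then verify it as a dump job could.
echo '<mediawiki>...</mediawiki>' | gzip > pages.xml.gz

# gzip -t (or bzip2 -t) checks that the file was written completely;
# a truncated file fails this test.
gzip -t pages.xml.gz || { echo "truncated gzip output" >&2; exit 1; }

# For XML dumps, also check the expected opening and closing tags;
# a crashed run usually leaves the closing tag missing.
gzip -dc pages.xml.gz | head -1 | grep -q '<mediawiki' || exit 1
gzip -dc pages.xml.gz | tail -1 | grep -q '</mediawiki>' || exit 1
echo "pages.xml.gz passed consistency checks"
```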
- Dumps of content stored by extensions (Flow, FlaggedRevs, etc.) should be part of the bi-monthly dump run, and should be generated in XML format with a corresponding schema; see https://www.mediawiki.org/xml/export-0.10.xsd for an example.
- All other dumps are run on a weekly basis, more or less, and their files are stored in a directory tree with the following structure: other/dumpname/date/files where all files for all wikis generated on the same date go into the same directory. These dumps are listed on the page of 'other' dumps, see https://dumps.wikimedia.org/other/
- If you want to add your weekly dumps to the index.html page there, you may submit a gerrit patchset to puppet, see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/dumps/files/web/html/other_index.html.
- If you want an index.html page for your dumps, it should be provided with a gerrit patch to puppet, adding a file to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/dumps/files/web/html (see the files in that directory for existing examples).
- There are shell scripts with helper functions available for weekly dump jobs of this sort. You can look at an example, https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/snapshot/files/cron/dump-global-blocks.sh. These may change in the future, but if that happens, all such jobs will be migrated at once and you probably won't be asked to do anything except give your ok.
- Cron jobs to run weekly dumps should not generate output except on error. They should not, however, redirect all complaints to /dev/null; if there is an error, we need to know about it so we can ask you to fix it.
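One common way to get this behavior, sketched here with a run_quietly helper (a name chosen for the example, not part of our scripts): capture all output to a temporary log, and print it only if the job fails, so cron mails nothing on success but the full log on error:

```shell
#!/bin/sh
# Run "$@" silently; on failure, dump its captured output to stderr.
run_quietly() {
    log=$(mktemp)
    if "$@" > "$log" 2>&1; then
        rm -f "$log"
    else
        status=$?
        cat "$log" >&2
        rm -f "$log"
        return "$status"
    fi
}

run_quietly sh -c 'echo all fine'            # success: prints nothing
run_quietly sh -c 'echo broken; exit 1' \
    || echo "dump job failed, log above" >&2  # failure: log reaches cron mail
```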
- It's best to schedule your dump's weekly cron job so that it doesn't overlap with others (except for the Wikidata dumps), to the extent possible. For now, this means double-checking the dates and times in <> and estimating your own job's run time.
We are available to support you in all of these things, and to coordinate merging and deployment of all puppet patches. The basic dump script should come from your team; after that we can work with you to iron out the rest of the details.
Adding new datasets
- Other datasets provided by you may be served to the public. The datasets should be checked to make sure they contain no private or sensitive information; the number of old datasets you want to keep should be specified; and an estimate of how much space they will take should be provided, both now and over the next three to five years. These will be listed on the 'other' index.html page at https://dumps.wikimedia.org/other/, and you may add your content to that page via a gerrit change to puppet, see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/dumps/files/web/html/other_index.html.
- If your dataset is to be copied to the public webserver via rsync, you should add a gerrit patch to puppet that does the fetch, see files in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/dumps/manifests/web/fetches/ for examples. This will require setting up rsyncd on the source host and configuring it to allow access by the dumps web server, as well as setting up ferm rules to permit rsync through. We recommend rsyncs be done no more often than once a day.
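On the source host side, the rsyncd configuration amounts to a read-only module that only the dumps web server may reach. A sketch of such a stanza; the module name, path, and address are placeholders, not the real values:

```
# /etc/rsyncd.conf on the source host (all names below are placeholders)
[mydataset]
    path = /srv/mydataset
    read only = yes
    hosts allow = <dumps-webserver-ip>
```

In practice both this stanza and the matching ferm rule are managed through puppet rather than edited by hand.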
- Index.html pages for your new datasets can be provided by adding an html page to our puppet repo, see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/dumps/files/web/html for examples.
- All such datasets will be automatically made available to users of stats1005/6 and to users of WMF Cloud instances, via NFS.