Dumps/OtherMisc
This page documents various dumpsets that are produced daily or weekly and are not part of the generation of the xml/sql dumps.
All of these dumps run against database servers designated 'vslow, dumps', from a snapshot host dedicated to 'misc' dump generation (everything other than the xml/sql dumps).
The dump scripts are in our git puppet repo.
If a cron job encounters errors when it runs, its output is sent to ops-dumps@wikimedia.org.
-
Cirrus search dumps
:
- dumped weekly
- contains text indices, the file index (for Commons) and the metadata index (for the entire cirrus cluster) in JSON format
- run by a maintenance script in mw:Extension:CirrusSearch ( code )
- Issues: it's been quite reliable so far
-
Content Translation dumps
:
- dumped weekly
- contains parallel corpora that can be used by developers working on machine translation.
- run by a maintenance script in mw:Extension:ContentTranslation ( code )
- Issues: it has run out of memory when the language files being dumped contain too much data; these files can be split apart to resolve the problem. Example: see this phab task.
-
Media info
:
- dumped weekly
- two files for each wiki: one lists the titles of media files stored locally, the other lists the titles of media files used on the project but stored remotely (on Commons)
- run by a shell wrapper around the onallwikis.py script in the operations/dumps repo ( code )
- Issues: if the database server is unavailable, up to three retries will be attempted, after which the script will give up.
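The retry behaviour noted under Issues (an initial attempt, up to three retries, then giving up) can be sketched roughly as follows. This is a minimal illustration only and is not taken from onallwikis.py; the function name `run_with_retries`, the `wait_seconds` delay and the use of `ConnectionError` to signal an unavailable database server are all assumptions made for the example.

```python
import time


def run_with_retries(query, max_retries=3, wait_seconds=5.0, sleep=time.sleep):
    """Call query(); if it fails, retry up to max_retries times, then give up.

    Hypothetical helper: 'query' stands in for whatever call talks to the
    database server, and ConnectionError stands in for "server unavailable".
    """
    retries_done = 0
    while True:
        try:
            return query()
        except ConnectionError as err:
            if retries_done >= max_retries:
                # All retries used up: stop trying and report the failure.
                raise RuntimeError(
                    f"giving up after {max_retries} retries"
                ) from err
            retries_done += 1
            # Pause before the next attempt (the delay is an assumed value).
            sleep(wait_seconds)
```

With the default of three retries, a server that never comes back is contacted four times in total (one initial attempt plus three retries) before the helper raises.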
-
Page titles
:
- dumped daily
- contains a list of all page titles in the main namespace (NS 0) per project
- run by the onallwikis.py script in the operations/dumps repo ( code )
- Issues: if the database server is unavailable, up to three retries will be attempted, after which the script will give up.
-
Media titles
:
- dumped daily
- contains a list of all media file titles in the File namespace (NS 6) per project
- run by the onallwikis.py script in the operations/dumps repo ( code )
- Issues: if the database server is unavailable, up to three retries will be attempted, after which the script will give up.
-
Short url mappings
:
- dumped weekly
- each line contains an entry of the form short-url|long-url
- run by the onallwikis.py script in the operations/dumps repo ( code )
- Issues: if the database server is unavailable, up to three retries will be attempted, after which the script will give up.
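For consumers of the short url mappings, the two-field, pipe-separated line format described above could be read along these lines. This is a sketch under stated assumptions: the function name `parse_mappings` is hypothetical, and it assumes the first field never contains a `|`, so each line is split on the first `|` only. Any decompression or file handling is left to the caller.

```python
from typing import Iterable, Iterator, Tuple


def parse_mappings(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (short_url, target_url) pairs from pipe-separated lines.

    Hypothetical helper: blank lines and lines without a '|' separator
    are skipped rather than treated as errors.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first '|' only, in case the second field contains one.
        short_url, sep, target_url = line.partition("|")
        if not sep:
            continue  # no separator found: malformed line, skip it
        yield short_url, target_url
```

Because it accepts any iterable of strings, the same function works on an open text file or on a list of lines held in memory.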