This is a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Dumps/OtherMisc

This page documents various dumpsets that are produced daily or weekly and are not part of the generation of the xml/sql dumps.

All of these dumps run against database servers designated 'vslow, dumps', from a snapshot host dedicated to 'misc' dump generation (everything other than the xml/sql dumps).

The dump scripts are in our git puppet repo.

If a cron job encounters errors when it runs, its output is sent to ops-dumps@wikimedia.org.

  • Cirrus search dumps:
    • dumped weekly
    • contains the text indices, the file index (for Commons) and the metadata index (for the entire Cirrus cluster) in JSON format
    • run by a maintenance script in mw:Extension:CirrusSearch (code)
    • Issues: it has been quite reliable so far
  • Content Translation dumps:
    • dumped weekly
    • contains parallel corpora that can be used by developers working on machine translation
    • run by a maintenance script in mw:Extension:ContentTranslation (code)
    • Issues: the script has run out of memory when the language files being dumped contain too much data; the files can be split apart to resolve the problem. Example: see this phab task.
  • Media info:
    • dumped weekly
    • two files for each wiki: one listing the titles of media files stored locally, the other listing the titles of media files used on the project but stored remotely (on Commons)
    • run by a shell wrapper around the onallwikis.py script in the operations/dumps repo (code)
    • Issues: if the database server is unavailable, the script retries up to three times and then gives up.
  • Page titles:
    • dumped daily
    • contains a list of all page titles in the main namespace (NS 0) per project
    • run by the onallwikis.py script in the operations/dumps repo (code)
    • Issues: if the database server is unavailable, the script retries up to three times and then gives up.
  • Media titles:
    • dumped daily
    • contains a list of all titles in the File namespace (NS 6) per project
    • run by the onallwikis.py script in the operations/dumps repo (code)
    • Issues: if the database server is unavailable, the script retries up to three times and then gives up.
  • Short url mappings:
    • dumped weekly
    • each line contains an entry of the form short-url|long-url
    • run by the onallwikis.py script in the operations/dumps repo (code)
    • Issues: if the database server is unavailable, the script retries up to three times and then gives up.
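
The retry behaviour described for the onallwikis.py jobs (up to three retries, then give up) could look roughly like the sketch below. This is an illustration only, not the code in the operations/dumps repo: the names DatabaseUnavailable and run_with_retries, the wait_seconds parameter and the pause between attempts are all assumptions made for the example.

```python
import time


class DatabaseUnavailable(Exception):
    """Hypothetical error meaning the database server could not be reached."""


def run_with_retries(query, max_retries=3, wait_seconds=5, sleep=time.sleep):
    """Call query(); if the database is unavailable, retry up to
    max_retries times, then give up by re-raising the last error.

    The wait between attempts is an assumption; the page only states
    the retry limit.
    """
    retries_used = 0
    while True:
        try:
            return query()
        except DatabaseUnavailable:
            if retries_used >= max_retries:
                raise  # all retries exhausted: give up
            retries_used += 1
            sleep(wait_seconds)


if __name__ == "__main__":
    attempts = []

    def flaky_query():
        # Fails on the first two attempts, succeeds on the third.
        attempts.append(1)
        if len(attempts) < 3:
            raise DatabaseUnavailable("db server not responding")
        return ["Main_Page", "Sandbox"]

    # sleep is replaced with a no-op so the example finishes immediately.
    print(run_with_retries(flaky_query, sleep=lambda seconds: None))
```

With a limit of three retries, a query that never succeeds is attempted four times in total (the initial attempt plus three retries) before the error is passed on to the caller, at which point the cron job output would reach the address given above.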