Dumps: Difference between revisions

imported>Quiddity
(clearer link label)
imported>Krinkle
(mention operations/dumps.git, import information from Key Wikimedia software projects so as to merge it into mw:Developers/Maintainers)
 
These docs are for '''maintainers''' of the various dumps. Information for '''users''' of the dumps can be found at Meta-wiki's [[m:Data dumps]] page. Information for '''developers''' can be found on [[mw:SQL/XML_Dumps|mediawiki.org]].


== Daily checks ==
Dumps maintainers should watch or check a few things every day:
* email to the ops-dumps mail alias (get on it! [[SRE/Clinic_Duty#Mail_aliases]])
* [https://lists.wikimedia.org/postorius/lists/xmldatadumps-l.lists.wikimedia.org/ xmldatadumps-l mailing list]
* [[phab:tag/dumps-generation/|Phabricator Dumps-Generation workboard]]
* https://dumps.wikimedia.org/ (mentions the current run, unless idle)
* icinga for dumps hosts: [https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=snapshot1 snapshots], [https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=dumpsdata dumpsdata]


== Dumps types ==
We produce several types of dumps. For information about deployment of updates, architecture of the dumps, and troubleshooting each dump type, check the appropriate entry below.




Other datasets are also provided for download, such as page view counts; these datasets are managed by other folks and are not documented here.
== Service ==


=== Hardware ===
=== Mirrors ===


If you are adding a mirror, see [[Dumps/Mirror status |Dumps Mirror setup]].
 
=== Source code ===
 
* [https://gerrit.wikimedia.org/r/q/project:operations%252Fdumps operations/dumps.git]
 
[[Category: Dumps]]

Latest revision as of 21:59, 5 July 2022

These docs are for maintainers of the various dumps. Information for users of the dumps can be found at Meta-wiki's m:Data dumps page. Information for developers can be found on mediawiki.org.

Daily checks

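Dumps maintainers should watch or check a few things every day:

  • email to the ops-dumps mail alias (get on it! SRE/Clinic_Duty#Mail_aliases)
  • xmldatadumps-l mailing list
  • Phabricator Dumps-Generation workboard
  • https://dumps.wikimedia.org/ (mentions the current run, unless idle)
  • icinga for dumps hosts: snapshots, dumpsdata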
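
The Icinga part of the daily check uses two status search pages, one per host-name prefix (snapshot1 and dumpsdata). As a minimal sketch, the Python below rebuilds those two URLs from the base address and the `search_string` parameter found in the wikitext links of this page. The function names `icinga_search_url` and `daily_check_urls` are hypothetical helpers introduced only for illustration; they are not part of the dumps tooling.

```python
from urllib.parse import urlencode

# Base URL copied from the Icinga links in the daily checks list.
ICINGA_STATUS = "https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi"

# Host-name prefixes used by those links.
DUMPS_HOST_PREFIXES = ("snapshot1", "dumpsdata")


def icinga_search_url(prefix: str) -> str:
    """Return the Icinga status search URL for hosts matching `prefix`."""
    return f"{ICINGA_STATUS}?{urlencode({'search_string': prefix})}"


def daily_check_urls() -> list[str]:
    """Hypothetical helper: list the Icinga pages to open during the daily check."""
    return [icinga_search_url(prefix) for prefix in DUMPS_HOST_PREFIXES]


if __name__ == "__main__":
    # Print one URL per line so they can be opened in a browser.
    for url in daily_check_urls():
        print(url)
```

Keeping the prefixes in one tuple means a new class of dumps host only needs one extra entry; the script builds links and does not query Icinga itself.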

Dumps types

We produce several types of dumps. For information about deployment of updates, architecture of the dumps, and troubleshooting each dump type, check the appropriate entry below.

  • xml/sql dumps which contain revision metadata and content for public Wikimedia projects, along with contents of select sql tables
  • adds/changes dumps which contain a daily xml dump of new pages or pages with new revisions since the previous run, for public Wikimedia projects
  • Wikidata entity dumps which contain dumps of 'entities' (Qxxx) in various formats, and a dump of lexemes, run once a week.
  • category dumps which contain weekly full and daily incremental category lists for public Wikimedia projects, in rdf format
  • other miscellaneous dumps including content translation dumps, cirrus search dumps, and global block information.

Other datasets are also provided for download, such as page view counts; these datasets are managed by other folks and are not documented here.

Service

Hardware

  • Dumps snapshot hosts that run scripts to generate the dumps
  • Dumps datastores where the snapshot hosts write intermediate and final dump output files, which are later published to our web servers
  • Dumps servers that provide the dumps to the public, to our mirrors, and via nfs to Wikimedia Cloud Services and stats host users

Adding new dumps

If you are interested in adding a new dumpset, please check the guidelines (still in draft form).

If you are working with wikibase dumps of some sort, you might want to look at a code walkthrough; see Dumps/Wikibase dumps overview.

Testing changes to the dumps or new scripts

See Dumps/Testing for more about this.

Mirrors

If you are adding a mirror, see Dumps Mirror setup.

Source code

  • operations/dumps.git