Adds/changes dumps overview

We have an experimental service available which produces dumps of added/changed content on a daily basis for all projects that have not been closed and are not private.

The code for this service is available in our git repository (master branch), at https://phabricator.wikimedia.org/diffusion/ODUM/browse/master/xmldumps-backup/incrementals. It relies on the Python modules used by the regular dumps, at https://phabricator.wikimedia.org/diffusion/ODUM/browse/master/xmldumps-backup/dumps in the same repository.

The job runs out of cron, currently on snapshot1003, as the datasets user. Everything except initial script deployment is puppetized. Scripts live in /srv/addschanges on snapshot1003.

Directory structure:

Everything for a given run is stored in dumproot/projectname/yyyymmdd/, much as we do for the regular dumps.
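
For illustration only, here is a minimal Python sketch of how that per-run path can be put together; the run_directory helper and the example dumproot are hypothetical, not part of the actual scripts.

  import os
  from datetime import datetime, timezone

  def run_directory(dumproot, projectname, rundate=None):
      # Directory that holds one day's adds/changes run for one project,
      # e.g. dumproot/frwiki/20111124/.
      if rundate is None:
          rundate = datetime.now(timezone.utc).strftime("%Y%m%d")
      return os.path.join(dumproot, projectname, rundate)

  # Hypothetical example (the real dumproot is whatever the config specifies):
  # run_directory("/some/dumproot", "frwiki", "20111124")
  #   -> "/some/dumproot/frwiki/20111124"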

How it works

We record, in the file maxrevid.txt, the largest revision id for the given project that is older than a configurable cutoff (currently at least 12 hours). All revisions between this and the revision id recorded for the previous day will be dumped. The delay gives editors on the specific wiki some time to weed out vandalism, advertising spam and so on.
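
As a hedged illustration of that first step, assuming direct database access: the revision table and its rev_id/rev_timestamp columns are standard MediaWiki, but the helpers below are a sketch, not the actual generatemaxrevids.py code.

  import os
  from datetime import datetime, timedelta, timezone

  CUTOFF_HOURS = 12  # the real cutoff is configurable

  def max_revid_before_cutoff(cursor):
      # `cursor` is assumed to be a DB-API cursor on the wiki's database;
      # rev_timestamp uses MediaWiki's yyyymmddhhmmss format.
      cutoff = datetime.now(timezone.utc) - timedelta(hours=CUTOFF_HOURS)
      cursor.execute(
          "SELECT MAX(rev_id) FROM revision WHERE rev_timestamp < %s",
          (cutoff.strftime("%Y%m%d%H%M%S"),),
      )
      return cursor.fetchone()[0]

  def record_maxrevid(rundir, revid):
      # Written into the day's directory for the next run to read.
      with open(os.path.join(rundir, "maxrevid.txt"), "w") as outfile:
          outfile.write("%s\n" % revid)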

We generate a stubs file containing metadata in XML format for each revision added since the previous day, consulting the previous day's maxrevid.txt file to get the start of the range. We then generate a meta-history XML file which contains the text of these revisions, grouped together and sorted by page id. MD5 sums of these files are available in an md5sums.txt file. A status.txt file indicates whether the run was successful ("done") or not.
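
A rough sketch of the bookkeeping around one day's run; the file names (maxrevid.txt, md5sums.txt, status.txt) come from the description above, while the helper functions and the failure marker are illustrative assumptions.

  import hashlib
  import os

  def revision_range(project_dir, prev_date, today_date):
      # Start/end revision ids for today's dump: from the id recorded the
      # previous day (exclusive) up to the id recorded today (inclusive).
      def read_maxrevid(date):
          with open(os.path.join(project_dir, date, "maxrevid.txt")) as infile:
              return int(infile.read().strip())
      return read_maxrevid(prev_date) + 1, read_maxrevid(today_date)

  def write_checksums_and_status(rundir, filenames, succeeded):
      # md5sums.txt: one "<md5>  <filename>" line per output file.
      with open(os.path.join(rundir, "md5sums.txt"), "w") as sums:
          for name in filenames:
              with open(os.path.join(rundir, name), "rb") as data:
                  sums.write("%s  %s\n" % (hashlib.md5(data.read()).hexdigest(), name))
      # status.txt: "done" marks a successful run; the failure value here is
      # a placeholder, since only "done" is specified above.
      with open(os.path.join(rundir, "status.txt"), "w") as status:
          status.write("done\n" if succeeded else "failed\n")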

After all wikis have run, we check the directories for successful runs and write a main index.html file with links, for each project, to the stub and content files from the latest successful run.
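
Something along these lines could produce that index page; this is a hedged sketch, not the actual incrmonitor.py logic, and the HTML layout is a placeholder.

  import os

  def latest_done_run(project_dir):
      # Most recent yyyymmdd subdirectory whose status.txt reads "done".
      for date in sorted(os.listdir(project_dir), reverse=True):
          status_path = os.path.join(project_dir, date, "status.txt")
          try:
              with open(status_path) as statusfile:
                  if statusfile.read().strip() == "done":
                      return date
          except IOError:
              continue
      return None

  def write_index(dumproot):
      rows = []
      for project in sorted(os.listdir(dumproot)):
          project_dir = os.path.join(dumproot, project)
          if not os.path.isdir(project_dir):
              continue
          date = latest_done_run(project_dir)
          if date is not None:
              # Link text and layout are placeholders, not the real page.
              rows.append('<li><a href="%s/%s/">%s: latest run %s</a></li>'
                          % (project, date, project, date))
      with open(os.path.join(dumproot, "index.html"), "w") as index:
          index.write("<html><body><ul>\n%s\n</ul></body></html>\n" % "\n".join(rows))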

When stuff breaks

VERY OBSOLETE, WILL BE UPDATED SOON

You can rerun various jobs by hand for specified dates.

From the directory /srv/addschanges, as the datasets user, you can run

  • python ./generatemaxrevids.py to retrieve the maxrevids for today at the time of the run. (Avoid this when possible; let cron do it at the scheduled hour.)
  • python ./generateincrementals.py yyyymmdd to generate the stubs and revs text files for a given date; this presumes that the revids from the previous step are already recorded in the file maxrevids.txt in the directory for the given date and project. You can add --verbose to get information about what it's doing. If it complains about lock files in place, you can remove these by hand, provided that the cron job is not running at the time and no other copy of this job is running.
  • python ./incrmonitor.py to regenerate the index.html file listing all projects, after the previous two steps are complete.

Internals

Locks:

During phase one, the retrieval of the max revision id from each project in turn, the job attempts to create a lock file with a name in the format dbname-yyyymmdd-maxrevid.lock in the root of the directory tree for that project's adds/changes dumps. For example, a job running on November 24, 2011 for frwiki would try to create the lock file frwiki/20111124/frwiki-20111124-maxrevid.lock.

During phase two, the creation of the stub and content XML files, the job tries to create a lock file in the top-level directory of the adds/changes dumps for the given project, with a name in the format dbname-yyyymmdd-incrdump.lock.
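
A minimal sketch of how such a lock file can be taken atomically; the naming follows the formats above, but the O_CREAT|O_EXCL approach and the helper names are assumptions, not necessarily what the scripts do.

  import os

  def take_lock(lockdir, dbname, date, phase):
      # phase is "maxrevid" (phase one) or "incrdump" (phase two), giving
      # e.g. frwiki-20111124-maxrevid.lock or frwiki-20111124-incrdump.lock.
      lockfile = os.path.join(lockdir, "%s-%s-%s.lock" % (dbname, date, phase))
      try:
          fd = os.open(lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
      except OSError:
          return None  # lock already held (or stale; see the TODOs below)
      os.write(fd, ("%d\n" % os.getpid()).encode("ascii"))
      os.close(fd)
      return lockfile

  def release_lock(lockfile):
      if lockfile is not None:
          os.remove(lockfile)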

TODOs

Add a stale lock removal option. Need to write interpolation of maxrevids in case step one fails on some days. Need to walk through older directories and write the older stub/revision files if we see those files don't exist (this allows recovery if the job is out of action or fails for a few days).

Be able to restart a given date from a particular wiki; add an option to force removal of locks for the run.

Some numbers

Here are a few fun numbers from the November 23, 2011 run. Writing the stubs file for 167985 revisions for English Wikipedia took 2 minutes, and writing the revisions text file took 24 minutes. Writing the stubs file for 36272 revisions for German Wikipedia took less than a minute, and writing the revisions text file took 5 minutes. Writing the stubs file for 43133 revisions for Commons took 1 minute, and writing the revisions text file took 2 minutes.