WMDE/Wikidata/PropertySuggester update
Occasionally, the data for the property suggester needs to be updated from the latest JSON dumps; we usually try to do this once a month. Here’s how it works:
One-time setup
Run the following commands on Toolforge, in your home directory.
wget https://gist.githubusercontent.com/mariushoch/22f4ead44f75c5133e403f465bc279a5/raw/scheduleUpdateSuggester
wget https://gist.githubusercontent.com/mariushoch/22f4ead44f75c5133e403f465bc279a5/raw/updateSuggester.sh
chmod +x scheduleUpdateSuggester updateSuggester.sh
git clone https://github.com/wikimedia/wikibase-property-suggester-scripts.git
python3 -m venv wikibase-property-suggester-scripts/
(source wikibase-property-suggester-scripts/bin/activate && pip install --upgrade pip && pip install -r wikibase-property-suggester-scripts/requirements.txt)
(cd wikibase-property-suggester-scripts/ && source bin/activate && python setup.py build)
Run the following commands on a production maintenance host (currently mwmaint1002), in your home directory.
wget https://gist.githubusercontent.com/mariushoch/22f4ead44f75c5133e403f465bc279a5/raw/T132839-Workarounds.sh
chmod +x T132839-Workarounds.sh
TODO: Move the scripts somewhere else?
Each update
Instructions based on this gist.
- Find the latest JSON dump beneath
/public/dumps/public/wikidatawiki/entities/
. We’ll use yyyymmdd as a placeholder for its name below.
  - Note that the dumps take several days to run – the date is when the dump started, but the results will not be available that day.
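A quick way to pick out the newest dump is to sort the date-named directories; this is a sketch that assumes the entity dumps sit in yyyymmdd-named subdirectories under the path above (adjust the path if your mount differs):

```shell
# List the date-named dump directories and keep the newest one.
# The glob assumes directories are named yyyymmdd, so a plain sort works.
latest=$(ls -d /public/dumps/public/wikidatawiki/entities/[0-9]* | sort | tail -n 1)
echo "latest dump: $(basename "$latest")"
```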
- Run
./scheduleUpdateSuggester yyyymmdd
on Toolforge.
  - This will take almost three days (as of 2019-03-18).
- Check the logs at
updateSuggester.err
for progress or problems during the creation. It will first log “processed X MB” lines (up to 706838.54 MB as of 2019-03-14), then “processed Y entities” (see d:Special:Statistics for the approximate current number of entities), then “rows Z” (up to 1919000 as of 2019-03-18).
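To check how far along the run is without scrolling the whole log, you can pull out the most recent progress line; the grep pattern here is an assumption based on the log messages described above:

```shell
# Show the most recent progress line ("processed ..." or "rows ...")
# from the update log. Pattern is inferred from the messages listed above.
grep -E 'processed|rows' updateSuggester.err | tail -n 1
```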
- Hash the result:
jsub -sync y sha1sum analyzed-out
(or whatever hashing algorithm you prefer), then compress it:
jsub -sync y gzip analyzed-out
- Rsync analyzed-out.gz to your local machine and commit it to the wbs_propertypairs repo with the commit message
Add propertypairs from the yyyymmdd dump
- Download it on the maintenance host with
https_proxy=http://webproxy.eqiad.wmnet:8080 wget 'https://github.com/wmde/wbs_propertypairs/raw/master/yyyymmdd/wbs_propertypairs.csv.gz'
- Unpack it:
gzip -d wbs_propertypairs.csv.gz
- Compare the checksum to the one obtained on Toolforge
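One way to do the comparison without eyeballing two hashes is a small check like this; the EXPECTED value is a placeholder for the hash that `sha1sum analyzed-out` printed on Toolforge (the gzip round-trip does not change the file contents, so the uncompressed hashes should match):

```shell
# Paste the hash printed by `sha1sum analyzed-out` on Toolforge here
# (placeholder value, not a real hash for any dump):
EXPECTED=da39a3ee5e6b4b0d3255bfef95601890afd80709
ACTUAL=$(sha1sum wbs_propertypairs.csv | cut -d' ' -f1)
if [ "$EXPECTED" = "$ACTUAL" ]; then
  echo "checksum OK"
else
  echo "checksum MISMATCH" >&2
fi
```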
- Update the actual table:
mwscript extensions/PropertySuggester/maintenance/UpdateTable.php --wiki wikidatawiki --file wbs_propertypairs.csv
  - This will take some four minutes.
  - It will first log (to your terminal) a bunch of “deleting a batch” lines, then “X rows inserted” up to the total number of lines in the CSV file (which you can count with
wc -l wbs_propertypairs.csv
beforehand).
- Run
T132839-Workarounds.sh
(on the maintenance host).
  - This takes about three minutes.
- Log your changes:
!log Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds