WMDE/Wikidata/ORES

ORES is a machine learning platform maintained by the WMF. We use ORES extensively in Wikidata and have developed the Wikidata-related parts of it, most notably the item quality model.

How to build the item quality model

Each model lives in a repo that holds the model and its features. For item quality, that repo is articlequality. Clone it.

After installing the Python requirements (the requirements.txt file), run "make wikidatawiki_models". The Makefile shows which commands will be run and what can cause issues. It also creates the model info file, model_info/wikidatawiki.item_quality.md, and the model binary file, models/wikidatawiki.item_quality.gradient_boosting.model.

The model info file has critical information such as the ratio of false positives, thresholds, and ROC-AUC.

To test the model file, you can run Python code like this:

  import mwapi
  from revscoring import Model
  from revscoring.extractors.api.extractor import Extractor

  # Load the model binary built by the Makefile.
  with open("models/wikidatawiki.item_quality.gradient_boosting.model") as f:
      scorer_model = Model.load(f)

  # The extractor fetches revision data from the Wikidata API.
  extractor = Extractor(mwapi.Session(host="https://www.wikidata.org",
                                      user_agent="test"))

  # 123456789 is the ID of the revision to score.
  feature_values = list(extractor.extract(123456789, scorer_model.features))

  print(scorer_model.score(feature_values))
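
To check more than one revision, the same two calls can be wrapped in a loop. This is a minimal sketch, not part of articlequality: score_revisions is a hypothetical helper that takes the scorer_model and extractor objects created above and relies only on the extract and score calls already used there. Catching a broad Exception per revision is an assumption, made so that one deleted or malformed revision does not stop the whole run.

```python
def score_revisions(scorer_model, extractor, rev_ids):
    """Score each revision ID; return (scores, errors) dicts keyed by rev_id.

    Hypothetical helper: it only calls extractor.extract() and
    scorer_model.score(), the same calls used in the snippet above.
    """
    scores, errors = {}, {}
    for rev_id in rev_ids:
        try:
            feature_values = list(
                extractor.extract(rev_id, scorer_model.features))
            scores[rev_id] = scorer_model.score(feature_values)
        except Exception as error:  # assumption: failures are per-revision
            errors[rev_id] = str(error)
    return scores, errors
```

With the objects from the snippet above, score_revisions(scorer_model, extractor, [123456789]) would put the score under scores[123456789], and any revision that failed would show up in errors instead.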

Preparing the model for deployment

If you added features and are confident they will improve the model, get them reviewed and merged, then retrain the model. The binary files merged into the articlequality repo must be exactly the same as the ones in production (yeah, I know...). Find out which OS ores100x runs on (currently stretch) and retrain the model on a VM that has the same OS. Retraining, as said, is just running "make wikidatawiki_models". If this command doesn't do anything, delete the model file (and datasets/wikidatawiki.labeling_revisions.w_cache*) first.
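
The two checks in this paragraph (same OS as production, and stale files removed so make actually rebuilds) can be sketched in Python. Both functions are hypothetical helpers, not part of articlequality. The sketch assumes the VM has an /etc/os-release file and that the codename appears either in a VERSION_CODENAME field or in parentheses in VERSION; the file paths are the ones named above.

```python
import glob
import os
import re


def os_codename(os_release_text):
    """Return the release codename found in /etc/os-release style text."""
    fields = {}
    for line in os_release_text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    if fields.get("VERSION_CODENAME"):
        return fields["VERSION_CODENAME"]
    # Fall back to the name in parentheses, e.g. VERSION="9 (stretch)".
    match = re.search(r"\(([^)]+)\)", fields.get("VERSION", ""))
    return match.group(1) if match else None


def remove_stale_outputs(repo_dir="."):
    """Delete the model binary and cached datasets so make rebuilds them."""
    patterns = [
        "models/wikidatawiki.item_quality.gradient_boosting.model",
        "datasets/wikidatawiki.labeling_revisions.w_cache*",
    ]
    removed = []
    for pattern in patterns:
        for path in glob.glob(os.path.join(repo_dir, pattern)):
            os.remove(path)
            removed.append(path)
    return sorted(removed)


# Usage on the training VM (expected codename taken from the text above):
# with open("/etc/os-release") as f:
#     assert os_codename(f.read()) == "stretch"
# remove_stale_outputs(".")
```

Run this from the articlequality checkout before calling make, and update the expected codename whenever production moves to a newer OS.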

If you're adding more labeled data (and not adding features), you need to change the Makefile. Look at the dependencies of the Wikidata model in the Makefile to see where the datasets come from and how they are built. Then add another dataset, cat it in the final step, and run "make wikidatawiki_models" to trigger a retrain.
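
What the "cat" step does can be sketched in Python. This is not how the Makefile does it; merge_labelings is a hypothetical helper. It assumes the labeled datasets are JSON-lines files where each row has a rev_id field, so check the actual files first. Unlike plain cat, it keeps only the first row per rev_id. That is an addition of this sketch, meant to avoid training on the same revision twice.

```python
import json


def merge_labelings(paths, out_path, key="rev_id"):
    """Concatenate JSON-lines datasets, keeping the first row per key.

    Assumption: every non-blank line is a JSON object containing `key`.
    Returns the number of rows written.
    """
    seen = set()
    kept = 0
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as f:
                for line in f:
                    if not line.strip():
                        continue  # skip blank lines
                    row = json.loads(line)
                    if row[key] in seen:
                        continue  # this revision was already labeled
                    seen.add(row[key])
                    out.write(json.dumps(row) + "\n")
                    kept += 1
    return kept
```

Since the first occurrence wins, list the dataset you trust most first in paths.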

For deployment, follow the general guideline: ORES/Deployment

Run the model on a dump

If you retrained the model and now want scores for Wikidata revisions based on a dump, you can run the extract_scores utility in articlequality (there is no guarantee it will work; it's ancient and you might have to tweak it). Here's the bash file that produces the dump monthly on stat1005:

month=$(date +"%Y%m")
day="${month}01"
source /home/ladsgroup/p3/bin/activate
cd /home/ladsgroup/articlequality
./utility extract_scores /mnt/data/xmldatadumps/public/wikidatawiki/${day}/wikidatawiki-${day}-pages-articles[1234567890]?*.xml-*.bz2 --model models/wikidatawiki.item_quality.gradient_boosting.model --sunset ${day}000000 --processes=10 --score-at monthly --verbose > run_${month}.out 2> run_${month}.err
grep ${month} run_${month}.out > run_${month}_1.out
mv run_${month}_1.out run_${month}.out
mv run_${month}.out wikidata_quality_snapshot_${month}.tsv
gzip -k wikidata_quality_snapshot_${month}.tsv
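
The date handling and the grep step of the script can be sketched in Python for anyone who wants to post-process the output differently. These are hypothetical helpers, not part of articlequality. Note that "grep ${month}" matches the month string anywhere in a line, so a rev_id that happens to contain those digits would also match. The sketch instead checks only the timestamp column, assuming tab-separated rows with the timestamp in the fourth column as in the sample output on this page.

```python
import gzip
from datetime import date


def snapshot_names(today=None):
    """Return (month, day) strings the way the shell script computes them."""
    today = today or date.today()
    month = today.strftime("%Y%m")
    return month, month + "01"


def filter_snapshot(lines, month, timestamp_column=3):
    """Keep rows whose timestamp column starts with the given YYYYMM."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if (len(fields) > timestamp_column
                and fields[timestamp_column].startswith(month)):
            kept.append(line)
    return kept


def write_gzip(lines, path):
    """Write the kept rows to a gzip-compressed text file."""
    with gzip.open(path, "wt") as out:
        out.writelines(lines)
```

For example, month, day = snapshot_names() followed by filter_snapshot(open("run_" + month + ".out"), month) would give the rows for the current snapshot, which write_gzip can then store.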

The resulting output is something like:

page_id         item_id         rev_id          timestamp       class   weighted_sum
15791782        Q14126127       1006231840      20191001000000  C       2.9696165555741985
21934496        Q20219489       1022984964      20191001000000  D       2.029816923665906
20434497        Q18881996       900205935       20191001000000  C       3.3918374898232493
25711961        Q23708018       982636318       20191001000000  C       2.91177394658644
23914320        Q21877494       1013198591      20191001000000  C       2.9855789615867576
14840084        Q13223924       881834471       20191001000000  C       3.0241936745417326
25414320        Q23405795       914178204       20191001000000  C       2.918603674849337
21934498        Q20219490       1015121079      20191001000000  D       2.029353280459076
23914321        Q21877495       1011135621      20191001000000  B       3.555012588220033
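
A quick way to sanity-check a snapshot is to count the predicted classes and average the weighted_sum column. This is a minimal sketch using only the standard library. It assumes the file is tab-separated and starts with the header row shown above; if your file has no header, pass the column names through the fieldnames argument. summarize_snapshot is a hypothetical helper name.

```python
import csv
import io
from collections import Counter


def summarize_snapshot(tsv_file, fieldnames=None):
    """Return (class counts, mean weighted_sum) for a snapshot TSV.

    With fieldnames=None the first row is read as the header.
    """
    reader = csv.DictReader(tsv_file, fieldnames=fieldnames, delimiter="\t")
    counts = Counter()
    total = 0.0
    for row in reader:
        counts[row["class"]] += 1
        total += float(row["weighted_sum"])
    n = sum(counts.values())
    return counts, (total / n if n else None)


# Three rows copied from the sample output above, tab-separated.
sample = io.StringIO(
    "page_id\titem_id\trev_id\ttimestamp\tclass\tweighted_sum\n"
    "15791782\tQ14126127\t1006231840\t20191001000000\tC\t2.9696165555741985\n"
    "21934496\tQ20219489\t1022984964\t20191001000000\tD\t2.029816923665906\n"
    "23914321\tQ21877495\t1011135621\t20191001000000\tB\t3.555012588220033\n"
)
counts, mean_sum = summarize_snapshot(sample)
print(dict(counts), round(mean_sum, 3))
```

For the gzipped file produced by the script, open it with gzip.open(path, "rt") and pass the file object to summarize_snapshot.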