You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

User:AKhatun/Wikidata Item ORES Score Analysis: Difference between revisions

From Wikitech-static
imported>AKhatun
(→‎Quantitative analysis: Add takeaways. remove table 3, not much information here)
imported>AKhatun
No edit summary
 
Line 1: Line 1:
= ORES score =
= ORES score =


The [[mw:ORES|ORES]] scores provides us with information about the quality of an item through the <code>itemquality</code> model. ORES is available for many wikimedia sister projects, for this discussion we stick to Wikidata. ORES also has multiple models, for this discussion we stick to <code>itemquality</code> model, which, for every revision (or state) of a Wikidata item, tells us how good the item is. This model classifies each item into one of 5 classes (A-E), where A signifies a high-quality item that has all relevant statements with solid references, translations, aliases, etc. And E signifies the lowest quality item. More information about what each category means can be found here: [[Wikidata:Wikidata:Item_quality|Item_quality]]. Besides, the probability of each class is combined to find one number, called the ORES score ([[Wikidata:Wikidata:Item_quality#ORES|Item_quality#ORES]]).
The [[mw:ORES|ORES]] scores provides us with information about the quality of an item through the <code>itemquality</code> model. ORES is available for many Wikimedia sister projects, for this discussion we stick to Wikidata. ORES also has multiple models, for this discussion we stick to <code>itemquality</code> model, which, for every revision (or state) of a Wikidata item, tells us how good the item is. This model classifies each item into one of 5 classes (A-E), where A signifies a high-quality item that has all relevant statements with solid references, translations, aliases, etc. And E signifies the lowest quality item. More information about what each category means can be found here: [[Wikidata:Wikidata:Item_quality|Item_quality]]. Besides, the probability of each class is combined to find one number, called the ORES score ([[Wikidata:Wikidata:Item_quality#ORES|Item_quality#ORES]]).


Some more resources:
Some more resources:
* Web service to get the quality of Wikidata items: [https://item-quality-evaluator.toolforge.org/ item-quality-evaluator.toolforge.org]
* Web service to get the quality of Wikidata items: [https://item-quality-evaluator.toolforge.org/ item-quality-evaluator.toolforge.org]
* List of features that determine the score of an item: [[Wikidata:Wikidata:ORES/List_of_features|List_of_features]]
* List of features that determine the score of an item: [[Wikidata:Wikidata:ORES/List_of_features|List_of_features]]
* See how the probability of each class is converted to a single score and how you can see the ORES score directly on the Wikidata webpage for each item: [[Wikidata:Wikidata:Item_quality#ORES|Item_quality#ORES]]
* See how the probability of each class is converted to a single score and how you can see the ORES score directly on the Wikidata web page for each item: [[Wikidata:Wikidata:Item_quality#ORES|Item_quality#ORES]]
* ORES FAQ: [[mw:ORES/FAQ|FAQ]]
* ORES FAQ: [[mw:ORES/FAQ|FAQ]]


= Data source =  
= Data source =  


The data used for the following analysis was taken from the <code>event_sanitized.mediawiki_revision_score</code> ([[Analytics/Data_Lake/ORES|Analytics/Data_Lake/ORES]]) table till <code>year=2022/month=1/day=17/hour=11</code>. This table contains ORES prediction and probabilities per class for each new revision of Wikidata Items. The ORES model for Wikidata was deployed around 2018, so not all items have a score recorded(especially those that haven't been edited in a long time). Some older revisions use <code>0.4.0</code> version of the model as opposed to the current <code>0.5.0</code> version. Some revisions have scores with both versions, in that case we pick the latest version of the model.  
The data used for the following analysis was taken from the <code>event_sanitized.mediawiki_revision_score</code> ([[Analytics/Data_Lake/ORES|Analytics/Data_Lake/ORES]]) table till <code>year=2022/month=1/day=17/hour=11</code>. This table contains ORES prediction and probabilities per class for each new revision of Wikidata Items. The ORES model for Wikidata was deployed around 2018, so not all items have a score recorded (especially those that haven't been edited in a long time). Some older revisions use <code>0.4.0</code> version of the model as opposed to the current <code>0.5.0</code> version. Some revisions have scores with both versions, in that case we pick the latest version of the model.  


'''For each item, we take the latest revision that has a ORES score, and choose the score from the latest model version if available.'''
'''For each item, we take the latest revision that has a ORES score, and choose the score from the latest model version if available.'''
Line 69: Line 69:
* Total items in top 341 subgraphs: 89,051,118 (89.53% of Wikidata items)
* Total items in top 341 subgraphs: 89,051,118 (89.53% of Wikidata items)
** Total items in top 341 subgraphs that have a score recorded: 80,897,270 ('''90.84%''' of items in top 341 subgraphs, 81% of Wikidata items)
** Total items in top 341 subgraphs that have a score recorded: 80,897,270 ('''90.84%''' of items in top 341 subgraphs, 81% of Wikidata items)
* [https://github.com/tanny411/Wikidata-WDQS-Analysis/blob/master/ORES_score/subgraph_prediction_classes_diff.csv CSV file] with number of items per prediction class in each subgraph along with the percentage in individual subgraph, percentage in whole Wikidata, and difference with typical distribution.


=== Qualitative analysis ===
=== Qualitative analysis ===
* For the analysis below, we only consider the items of the top 50 subgraphs.
* For the analysis below, we only consider the items of the top 50 subgraphs.
* The figure below shows the deviation of distribution for each subgraph from the distribution in the whole of Wikidata.
* The figure below shows the deviation of distribution for each subgraph from the distribution in the whole of Wikidata.
* Most subgraphs have the typical distribution, mostly C and D class items.
* Most subgraphs have the typical distribution, i.e, they have mostly C and D class items.
* ''Wikimedia Category'' subgraph has a lot less high-quality items, and has more E class (lowest-quality) items. Similar is the case with ''Wikimedia disambiguation page'', ''Wikimedia Template'', ''branch post office'', and ''primary school''. '''5''' subgraph in the figure below.
* ''Wikimedia Category'' subgraph has a lot less high-quality items, and has more E class (lowest-quality) items. Similar is the case with ''Wikimedia disambiguation page'', ''Wikimedia Template'', ''branch post office'', and ''primary school''. '''5''' subgraph in the figure below.
* Some items have less high-quality items, but more D class items. '''18''' subgraphs in the figure below are in this category. They all have less of C and more of D than typical distribution. Some of the significant examples are: ''position'', ''group of stereoisomers'', ''prime number'', ''print'', ''clinical trial'', ''collection'', ''chemical compound'', etc.
* Some items have less high-quality items, but more D class items. '''18''' subgraphs in the figure below are in this category. They all have less of C and more of D than typical distribution. Some of the significant examples are: ''position'', ''group of stereoisomers'', ''prime number'', ''print'', ''clinical trial'', ''collection'', ''chemical compound'', etc.
Line 84: Line 85:


* The top 341 subgraphs were used for the following analysis. Items from these subgraphs form 90% of all Wikidata items. We only consider items that have a score in the event table, which form 81% of all Wikidata items.
* The top 341 subgraphs were used for the following analysis. Items from these subgraphs form 90% of all Wikidata items. We only consider items that have a score in the event table, which form 81% of all Wikidata items.
* Each table lists 5 top subgraphs for each prediction class (A to E). The tables also show the number of items in each category, what percent this count is in terms of entire Wikidata, in terms of respective subgraph, and also the difference in distribution from the typical scenario (where typical scenario is the distribution of the classes in all of Wikidata).
* Each table lists 5 top subgraphs for each prediction class (A to E). The tables also show the number of items in each category, what percent this count is in terms of entire Wikidata, in terms of respective subgraph, and also the difference in distribution from the typical scenario (where typical scenario is the distribution of the classes in all of Wikidata). The table with all subgraphs can be found here: [https://github.com/tanny411/Wikidata-WDQS-Analysis/blob/master/ORES_score/subgraph_prediction_classes_diff.csv CSV file].
* Takeaways from [[#table1|table 1]]:
* Takeaways from [[#table1|table 1]]:
** The first table shows subgraphs with the most items per prediction class. Note that despite ''scholarly article'' subgraph being significantly larger in size compared to its successors ''human'' and ''astronomical objects'', the ''human'' subgraph has the most A class high-quality items. That is not to say these items form the bulk of this subgraph, they are only 0.28% of the human subgraph items. Nevertheless, compared to other subgraphs, this is significant. Next come ''commune of France'', ''taxon'', ''film'', ''chemical comound'' subgarphs with the most A class items.
** The first table shows subgraphs with the most items per prediction class. Note that despite ''scholarly article'' subgraph being significantly larger in size compared to its successors ''human'' and ''astronomical objects'', the ''human'' subgraph has the most A class high-quality items. That is not to say these items form the bulk of this subgraph, they are only 0.28% of the human subgraph items. Nevertheless, compared to other subgraphs, this is significant. Next come ''commune of France'', ''taxon'', ''film'', ''chemical compound'' subgraphs with the most A class items.
** Both class B and C have the 4 largest subgraphs in the top positions possibly due to the size of these subgraph. Simialrly for D class, the top subgraphs are indeed some of the largest subgraphs.
** Both class B and C have the 4 largest subgraphs in the top positions possibly due to the size of these subgraph. Similarly for D class, the top subgraphs are indeed some of the largest subgraphs.
* Takeaways from [[#table2|table 2]]:
* Takeaways from [[#table2|table 2]]:
** In class A, ''commune of France'' seems to be the only subgraph with a significant amount (25%) of it's item having high quality. The rest of the subgraphs have ~2% or less of their items in this category. This is a large distinction.
** In class A, ''commune of France'' seems to be the only subgraph with a significant amount (25%) of it's item having high quality. The rest of the subgraphs have ~2% or less of their items in this category. This is a large distinction.

Latest revision as of 13:03, 20 January 2022

ORES score

ORES scores provide information about the quality of an item through the itemquality model. ORES is available for many Wikimedia sister projects; for this discussion we stick to Wikidata. ORES also has multiple models; here we use only the itemquality model, which, for every revision (or state) of a Wikidata item, tells us how good the item is. This model classifies each item into one of 5 classes (A-E), where A signifies a high-quality item that has all relevant statements with solid references, translations, aliases, etc., and E signifies the lowest-quality item. More information about what each class means can be found at Item_quality. In addition, the probabilities of the classes are combined into a single number, called the ORES score (Item_quality#ORES).

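Some more resources:

  • Web service to get the quality of Wikidata items: item-quality-evaluator.toolforge.org (https://item-quality-evaluator.toolforge.org/)
  • List of features that determine the score of an item: List_of_features
  • See how the probability of each class is converted to a single score and how you can see the ORES score directly on the Wikidata web page for each item: Item_quality#ORES
  • ORES FAQ: FAQ
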
Data source

The data used for the following analysis was taken from the event_sanitized.mediawiki_revision_score table (Analytics/Data_Lake/ORES), up to the partition year=2022/month=1/day=17/hour=11. This table contains the ORES prediction and the per-class probabilities for each new revision of Wikidata items. The ORES model for Wikidata was deployed around 2018, so not all items have a score recorded (especially those that haven't been edited in a long time). Some older revisions use version 0.4.0 of the model as opposed to the current version 0.5.0. Some revisions have scores from both versions; in that case we pick the score from the latest version of the model.

For each item, we take the latest revision that has an ORES score, and choose the score from the latest model version if available.
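
The selection rule above can be sketched in plain Python. This is only an illustration, not the query used for the analysis: the ScoreRow fields and the function names are hypothetical, and it assumes that a higher revision ID means a later revision and that model versions are dotted integers such as 0.4.0 and 0.5.0.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class ScoreRow(NamedTuple):
    """One scored revision (hypothetical field names, not the table schema)."""
    item_id: str
    rev_id: int
    model_version: str
    prediction: str


def version_key(version: str) -> Tuple[int, ...]:
    # "0.5.0" -> (0, 5, 0), so versions compare numerically rather than as text.
    return tuple(int(part) for part in version.split("."))


def latest_score_per_item(rows: Iterable[ScoreRow]) -> Dict[str, ScoreRow]:
    """Keep, per item, the row with the highest revision ID; ties on the
    revision are broken in favour of the newest model version."""
    best: Dict[str, ScoreRow] = {}
    for row in rows:
        current = best.get(row.item_id)
        key = (row.rev_id, version_key(row.model_version))
        if current is None or key > (current.rev_id, version_key(current.model_version)):
            best[row.item_id] = row
    return best


rows = [
    ScoreRow("Q1", 100, "0.4.0", "D"),
    ScoreRow("Q1", 100, "0.5.0", "C"),  # same revision, newer model: preferred
    ScoreRow("Q1", 90, "0.5.0", "E"),   # older revision: ignored
    ScoreRow("Q2", 50, "0.4.0", "B"),   # only an old-model score exists: kept
]
best = latest_score_per_item(rows)
```

Ordering by revision first and model version second means an item whose latest scored revision only has a 0.4.0 score still keeps that score, which matches the "if available" wording.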

Wikidata Q-Item ORES prediction

  • Q-items are in namespace 0.
  • Redirects were removed.
  • Percentage of Wikidata items that have a score recorded: 88.37%. Scores of the rest of the items were not recorded in the event table.

ORES class distribution

  • ORES predicted class (A to E) is the class that has the highest probability.
  • Most items are in the C class, meaning they are okay items: they have enough statements and some references. The second most common class is D, meaning the items have some basic statements but are lacking in references. D items are less than okay, but at least recognizable as distinct items.

ORES predicted class distribution of Wikidata items
Class | Number of items | Percent of Wikidata items
A | 63,596 | 0.064
B | 8,324,047 | 8.369
C | 42,391,757 | 42.620
D | 26,990,372 | 27.136
E | 10,131,002 | 10.186
None | 11,619,871 | 11.682

File:ORES model prediction of WD QItems.png File:ORES model versiosn of WD QItems.png
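
As a small illustration of how the predicted class follows from the per-class probabilities (the class with the highest probability), here is a minimal sketch. The function name and the example probabilities are made up for illustration and are not taken from the event table.

```python
from typing import Dict


def predicted_class(probabilities: Dict[str, float]) -> str:
    """Return the class label with the highest probability."""
    return max(probabilities, key=probabilities.get)


# Hypothetical probabilities for a single revision.
example = {"A": 0.01, "B": 0.09, "C": 0.60, "D": 0.25, "E": 0.05}
label = predicted_class(example)  # "C"
```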

ORES score distribution

  • The ORES score is calculated as (5 * probability of class A) + (4 * probability of class B) + (3 * probability of class C) + (2 * probability of class D) + (1 * probability of class E), as per Item_quality#ORES.

ORES score distribution of Wikidata items
max | min | avg | stddev | Q1 (25th percentile) | Q2 (median) | Q3 (75th percentile)
4.97 | 1.01 | 2.56 | 0.71 | 2.04 | 2.84 | 2.99
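
The weighted sum that defines the ORES score can be written out as a short sketch. The function name and the example probabilities are illustrative assumptions; only the weights (5 for A down to 1 for E) come from the formula above.

```python
from typing import Dict

# Weights from the formula: A counts 5, B 4, C 3, D 2, E 1.
WEIGHTS = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}


def ores_score(probabilities: Dict[str, float]) -> float:
    """Combine the per-class probabilities into a single score."""
    return sum(weight * probabilities[cls] for cls, weight in WEIGHTS.items())


# Hypothetical probabilities for a single revision.
example = {"A": 0.01, "B": 0.09, "C": 0.60, "D": 0.25, "E": 0.05}
score = ores_score(example)  # 0.05 + 0.36 + 1.80 + 0.50 + 0.05, about 2.76
```

Because the probabilities sum to 1, the score always lies between 1 (certainly class E) and 5 (certainly class A), which is consistent with the observed minimum of 1.01 and maximum of 4.97.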

ORES class distribution by subgraph

  • Total items in Wikidata (as of 20220103 dump): 99,464,418
    • Total items in Wikidata that have a score recorded: 87,900,774 (88.37% of Wikidata items)
  • Total items in top 341 subgraphs: 89,051,118 (89.53% of Wikidata items)
    • Total items in top 341 subgraphs that have a score recorded: 80,897,270 (90.84% of items in top 341 subgraphs, 81% of Wikidata items)
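  • CSV file (https://github.com/tanny411/Wikidata-WDQS-Analysis/blob/master/ORES_score/subgraph_prediction_classes_diff.csv) with the number of items per prediction class in each subgraph, along with the percentage within the individual subgraph, the percentage in the whole of Wikidata, and the difference from the typical distribution.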

Qualitative analysis

  • For the analysis below, we only consider the items of the top 50 subgraphs.
  • The figure below shows the deviation of distribution for each subgraph from the distribution in the whole of Wikidata.
  • Most subgraphs have the typical distribution, i.e., they have mostly C and D class items.
  • The Wikimedia category subgraph has far fewer high-quality items and more E class (lowest-quality) items. The same holds for Wikimedia disambiguation page, Wikimedia template, branch post office, and primary school. 5 subgraphs in the figure below fall into this category.
  • Some subgraphs have fewer high-quality items but more D class items. 18 subgraphs in the figure below fall into this category. They all have less of C and more of D than the typical distribution. Some of the significant examples are: position, group of stereoisomers, prime number, print, clinical trial, collection, chemical compound, etc.
  • All other subgraphs seem to have an almost identical distribution, i.e., a low number of A and B class items. None has a noticeably higher percentage of high-quality items.
  • It is safe to assume, at least for some of the largest subgraphs, that they don't have many high-quality items (A or B class), and that they have either more C, or sometimes many more D or E (low-quality) items.

File:Distribution diff of ORES per subgraph.png

Quantitative analysis

  • The top 341 subgraphs were used for the following analysis. Items from these subgraphs form 90% of all Wikidata items. We only consider items that have a score in the event table, which form 81% of all Wikidata items.
  • Each table lists 5 top subgraphs for each prediction class (A to E). The tables also show the number of items in each category, what percent this count is in terms of entire Wikidata, in terms of respective subgraph, and also the difference in distribution from the typical scenario (where typical scenario is the distribution of the classes in all of Wikidata). The table with all subgraphs can be found here: CSV file.
  • Takeaways from table 1:
    • The first table shows the subgraphs with the most items per prediction class. Note that although the scholarly article subgraph is significantly larger than the next largest subgraphs, human and astronomical object, the human subgraph has the most A class (high-quality) items. That is not to say these items form the bulk of this subgraph; they are only 0.28% of the human subgraph items. Nevertheless, compared to other subgraphs, this is significant. Next come the commune of France, taxon, film, and chemical compound subgraphs with the most A class items.
    • Both class B and class C have the 4 largest subgraphs in the top positions, possibly due to the size of these subgraphs. Similarly for class D, the top subgraphs are indeed some of the largest subgraphs.
  • Takeaways from table 2:
    • In class A, commune of France seems to be the only subgraph with a significant share (25%) of its items being high quality. The rest of the subgraphs have ~2% or less of their items in this class. This is a large distinction.
    • All other subgraphs listed consist almost entirely (as much as 99-100%) of items of a single class (B, C, D, or E). Most of them have roughly 10K items in the respective class, indicating that the subgraphs themselves are around 10K items in size; a few have 50-100K items.

Table 1: Top 5 subgraphs in each prediction class with greatest number of items
Prediction class | Subgraph | Subgraph label | # items | % items | % items in respective subgraph | Diff from typical distribution in respective prediction class
A | Q5 | human | 25,967 | 0.03 | 0.28 | 0.21
A | Q484170 | commune of France | 11,513 | 0.01 | 25.26 | 25.19
A | Q16521 | taxon | 6,461 | 0.01 | 0.19 | 0.12
A | Q11424 | film | 3,739 | 0.0 | 1.44 | 1.36
A | Q11173 | chemical compound | 1,525 | 0.0 | 0.12 | 0.05
B | Q13442814 | scholarly article | 5,168,895 | 6.39 | 15.24 | 5.76
B | Q6999 | astronomical object | 1,516,261 | 1.87 | 18.04 | 8.56
B | Q5 | human | 693,124 | 0.86 | 7.56 | -1.91
B | Q16521 | taxon | 349,153 | 0.43 | 10.46 | 0.98
B | Q7187 | gene | 184,886 | 0.23 | 26.2 | 16.72
C | Q13442814 | scholarly article | 24,539,904 | 30.34 | 72.34 | 24.08
C | Q6999 | astronomical object | 4,107,902 | 5.08 | 48.88 | 0.62
C | Q5 | human | 3,572,265 | 4.42 | 38.97 | -9.29
C | Q16521 | taxon | 2,407,628 | 2.98 | 72.12 | 23.87
C | Q4167836 | Wikimedia category | 808,759 | 1.0 | 23.35 | -24.9
D | Q5 | human | 4,247,974 | 5.25 | 46.34 | 15.62
D | Q13442814 | scholarly article | 3,816,100 | 4.72 | 11.25 | -19.48
D | Q6999 | astronomical object | 2,552,206 | 3.16 | 30.37 | -0.36
D | Q11173 | chemical compound | 1,218,961 | 1.51 | 97.97 | 67.25
D | Q4167836 | Wikimedia category | 1,107,390 | 1.37 | 31.98 | 1.25
E | Q4167836 | Wikimedia category | 1,546,826 | 1.91 | 44.67 | 33.14
E | Q4167410 | Wikimedia disambiguation page | 1,070,194 | 1.32 | 77.65 | 66.11
E | Q11266439 | Wikimedia template | 791,070 | 0.98 | 93.04 | 81.5
E | Q5 | human | 626,990 | 0.78 | 6.84 | -4.69
E | Q13442814 | scholarly article | 398,161 | 0.49 | 1.17 | -10.36
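
To make the "diff from typical distribution" column concrete, the sketch below recomputes it for the human (Q5) subgraph from the item counts listed on this page. It assumes that the typical distribution is the class distribution over all scored Wikidata items (the counts in the class distribution table); the author's exact baseline is not stated, and for some classes this recomputation differs from the published figures by a few hundredths, so treat it as an approximation. The names are illustrative.

```python
from typing import Dict

CLASSES = ("A", "B", "C", "D", "E")

# Items per predicted class over all scored Wikidata items
# (from the class distribution table; assumed to be the "typical" baseline).
WIKIDATA_COUNTS = {"A": 63596, "B": 8324047, "C": 42391757, "D": 26990372, "E": 10131002}

# Items per predicted class in the human (Q5) subgraph (from table 1).
HUMAN_COUNTS = {"A": 25967, "B": 693124, "C": 3572265, "D": 4247974, "E": 626990}


def percentages(counts: Dict[str, int]) -> Dict[str, float]:
    """Share of each class, in percent of all scored items in the group."""
    total = sum(counts.values())
    return {cls: 100.0 * counts[cls] / total for cls in CLASSES}


def diff_from_typical(subgraph: Dict[str, int], baseline: Dict[str, int]) -> Dict[str, float]:
    """Percentage-point difference between a subgraph and the baseline."""
    sub, base = percentages(subgraph), percentages(baseline)
    return {cls: sub[cls] - base[cls] for cls in CLASSES}


human_share = percentages(HUMAN_COUNTS)
human_diff = diff_from_typical(HUMAN_COUNTS, WIKIDATA_COUNTS)
```

A positive value means the subgraph has a larger share of that class than Wikidata as a whole, and the five differences for a subgraph always sum to zero.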


Table 2: Top 5 subgraphs in each prediction class with greatest percentage of items in respective subgraphs (these are also the largest positive diffs)
Prediction class | Subgraph | Subgraph label | # items | % items | % items in respective subgraph | Diff from typical distribution in respective prediction class
A | Q484170 | commune of France | 11,513 | 0.01 | 25.26 | 25.19
A | Q34770 | language | 174 | 0.0 | 1.84 | 1.77
A | Q7889 | video game | 794 | 0.0 | 1.76 | 1.69
A | Q891723 | public company | 194 | 0.0 | 1.57 | 1.5
A | Q11424 | film | 3,739 | 0.0 | 1.44 | 1.36
B | Q107103143 | Induced pluripotent stem cell line | 12,712 | 0.02 | 99.86 | 90.38
B | Q107102664 | cell line from embryonic stem cells | 16,079 | 0.02 | 99.85 | 90.38
B | Q27555384 | transformed cell line | 47,875 | 0.06 | 99.77 | 90.3
B | Q27671617 | finite cell line | 11,099 | 0.01 | 99.52 | 90.04
B | Q21014462 | cell line | 129,906 | 0.16 | 99.2 | 89.73
C | Q6453643 | decree law | 12,386 | 0.02 | 99.96 | 51.7
C | Q814254 | feature | 10,702 | 0.01 | 99.39 | 51.13
C | Q104093746 | lake or pond | 31,050 | 0.04 | 99.28 | 51.03
C | Q22969563 | bodendenkmal | 49,768 | 0.06 | 99.18 | 50.92
C | Q21199 | natural number | 10,156 | 0.01 | 97.78 | 49.52
D | Q106474968 | ethnic group by settlement in Macedonia | 14,280 | 0.02 | 100.0 | 69.28
D | Q6451276 | Congressional Research Service report | 13,777 | 0.02 | 100.0 | 69.28
D | Q7604693 | Statutory Rules of Northern Ireland | 17,121 | 0.02 | 100.0 | 69.28
D | Q100532807 | Irish Statutory Instrument | 33,420 | 0.04 | 100.0 | 69.28
D | Q1260524 | time of the day | 87,869 | 0.11 | 99.98 | 69.26
E | Q26267864 | Wikimedia KML file | 2,561 | 0.0 | 99.88 | 88.35
E | Q459297 | qanat | 13,541 | 0.02 | 99.6 | 88.06
E | Q15184295 | Wikimedia module | 49,080 | 0.06 | 99.33 | 87.8
E | Q19855165 | rural school | 67,463 | 0.08 | 99.17 | 87.64
E | Q6503489 | Law of the Republic of China | 13,117 | 0.02 | 99.17 | 87.64