User:AKhatun/Wikidata Item ORES Score Analysis

Revision as of 20:56, 18 January 2022 by imported>AKhatun (→‎ORES score distribution by subgraph: Initial analysis, diff analysis)

ORES score

The ORES score tells us about the quality of an item through the itemquality model. ORES is available for many Wikimedia sister projects; this discussion is limited to Wikidata. ORES also provides multiple models; here we use only the itemquality model, which, for every revision (or state) of a Wikidata item, estimates how good the item is. The model classifies each item into one of 5 classes (A-E), where A signifies a high-quality item that has all relevant statements with solid references, translations, aliases, etc., and E signifies the lowest-quality item. More information about what each class means can be found at Item_quality. In addition, the class probabilities are combined into a single number, called the ORES score (Item_quality#ORES).

Data source

The data used for the following analysis was taken from the event_sanitized.mediawiki_revision_score table (Analytics/Data_Lake/ORES), up to the partition year=2022/month=1/day=17/hour=11. This table contains the ORES prediction and the per-class probabilities for each new revision of Wikidata items. The ORES model for Wikidata was deployed around 2018, so not all items have a score recorded (especially those that have not been edited in a long time). Some older revisions were scored with version 0.4.0 of the model rather than the current version 0.5.0. Some revisions have scores from both versions; in that case we pick the score from the latest version of the model.

For each item, we take the latest revision that has an ORES score, and choose the score from the latest model version if available.
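The selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not the query used for the analysis: the row layout and the field names (`item_id`, `rev_id`, `model_version`, `prediction`) are hypothetical, and it assumes that a larger revision ID means a later revision.

```python
from typing import Dict, Iterable, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a version string such as '0.5.0' into a comparable tuple (0, 5, 0)."""
    return tuple(int(part) for part in version.split("."))


def pick_latest_scores(rows: Iterable[dict]) -> Dict[str, dict]:
    """Keep one scored row per item.

    Rows are ranked by (revision ID, model version), so the latest scored
    revision wins, and among scores of that revision the latest model
    version wins. Field names are hypothetical.
    """
    best: Dict[str, Tuple[tuple, dict]] = {}
    for row in rows:
        rank = (row["rev_id"], parse_version(row["model_version"]))
        current = best.get(row["item_id"])
        if current is None or rank > current[0]:
            best[row["item_id"]] = (rank, row)
    return {item_id: row for item_id, (_, row) in best.items()}


# Made-up example rows, for illustration only.
example_rows = [
    {"item_id": "Q1", "rev_id": 10, "model_version": "0.5.0", "prediction": "C"},
    {"item_id": "Q1", "rev_id": 12, "model_version": "0.4.0", "prediction": "D"},
    {"item_id": "Q1", "rev_id": 12, "model_version": "0.5.0", "prediction": "B"},
    {"item_id": "Q2", "rev_id": 7, "model_version": "0.4.0", "prediction": "E"},
]
latest = pick_latest_scores(example_rows)
```

Comparing parsed version tuples rather than raw strings avoids ordering mistakes such as "0.10.0" sorting before "0.5.0".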

Wikidata Q-Item ORES prediction

  • Q-items are Namespace 0.
  • Redirects were removed.
  • Percentage of Wikidata items that have a score recorded: 88.37%. Scores of the rest of the items were not recorded in the event table.

ORES class distribution

  • The ORES predicted class (A to E) is the class with the highest probability.
  • Most items are in class C, meaning they are okay items: they have enough statements and some references. The second most common class is D, meaning the items have some basic statements but lack references. D items are less than okay, but at least recognizable as distinct items.
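As a small illustration of how the predicted class follows from the per-class probabilities, the sketch below picks the class with the highest probability. The function name and the probability values are made up for the example and do not come from the data.

```python
def predicted_class(probabilities: dict) -> str:
    """Return the class label (A to E) with the highest probability."""
    return max(probabilities, key=probabilities.get)


# Made-up probabilities for a single revision, for illustration only.
example_probabilities = {"A": 0.01, "B": 0.09, "C": 0.60, "D": 0.25, "E": 0.05}
example_prediction = predicted_class(example_probabilities)
```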
ORES predicted class distribution of Wikidata items
Class   Number of items   Percent of Wikidata items
A       63596             0.064
B       8324047           8.369
C       42391757          42.620
D       26990372          27.136
E       10131002          10.186
None    11619871          11.682
File:ORES model prediction of WD QItems.png File:ORES model versiosn of WD QItems.png

ORES score distribution

  • The ORES score is calculated as (5 * probability of class A) + (4 * probability of class B) + (3 * probability of class C) + (2 * probability of class D) + (1 * probability of class E), as per Item_quality#ORES. It therefore ranges from 1 (all probability on class E) to 5 (all probability on class A).
ORES score distribution of Wikidata items
max    min    avg    stddev   Q1 (25th percentile)   Q2 (median)   Q3 (75th percentile)
4.97   1.01   2.56   0.71     2.04                   2.84          2.99
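The weighted sum that defines the ORES score can be written as a short Python sketch. The weights are the ones given in the formula for this section; the function name and the example probabilities are illustrative assumptions only.

```python
# Class weights from the ORES score formula: A counts 5, down to E counting 1.
CLASS_WEIGHTS = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}


def ores_score(probabilities: dict) -> float:
    """Combine the per-class probabilities into a single score between 1 and 5."""
    return sum(weight * probabilities[label] for label, weight in CLASS_WEIGHTS.items())


# Made-up probabilities for a single revision, for illustration only.
example_probabilities = {"A": 0.01, "B": 0.09, "C": 0.60, "D": 0.25, "E": 0.05}
example_score = ores_score(example_probabilities)
# 5*0.01 + 4*0.09 + 3*0.60 + 2*0.25 + 1*0.05 = 2.76
```

Because the probabilities sum to 1, the score stays between 1 and 5, which is consistent with the observed minimum of 1.01 and maximum of 4.97.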

ORES score distribution by subgraph

  • Total items in Wikidata (as of 20220103 dump): 99,464,418
    • Total items in Wikidata that have a score recorded: 87,900,774 (88.37% of Wikidata items)
  • Total items in top 341 subgraphs: 89,051,118 (89.53% of Wikidata items)
    • Total items in top 341 subgraphs that have a score recorded: 80,897,270 (90.84% of items in top 341 subgraphs, 81% of Wikidata items)

For the analysis below, we only consider the items of the top subgraphs that have a score (81% of Wikidata items).
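The coverage percentages quoted above follow directly from the item counts, as this short sketch shows. The counts are taken from the list above; the variable and function names are illustrative.

```python
# Item counts quoted in this section (Wikidata dump of 20220103).
TOTAL_ITEMS = 99_464_418
SCORED_ITEMS = 87_900_774
TOP_SUBGRAPH_ITEMS = 89_051_118
SCORED_TOP_SUBGRAPH_ITEMS = 80_897_270


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


coverage = {
    "scored items / all items": percent(SCORED_ITEMS, TOTAL_ITEMS),
    "top-subgraph items / all items": percent(TOP_SUBGRAPH_ITEMS, TOTAL_ITEMS),
    "scored top-subgraph items / top-subgraph items": percent(
        SCORED_TOP_SUBGRAPH_ITEMS, TOP_SUBGRAPH_ITEMS
    ),
    "scored top-subgraph items / all items": percent(
        SCORED_TOP_SUBGRAPH_ITEMS, TOTAL_ITEMS
    ),
}

for label, value in coverage.items():
    print(f"{label}: {value:.2f}%")
```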

File:Distribution diff of ORES per subgraph.png