You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Difference between revisions of "User:AKhatun/Wikidata Analysis"

From Wikitech-static
imported>AKhatun
(Create triples analysis # Add labels analysis)
 
imported>AKhatun
(Add descriptions stats)
Line 17: Line 17:
Before we begin, the total number of triples in this specific snapshot of Wikidata is 12853426190, approximately '''12.8 billion'''. According to the [https://grafana.wikimedia.org/goto/pyO_iMRnk Grafana dashboard], Wikidata grows at a rate of '''4.77''' million triples per day. That's a lot! The rate was calculated from the number of triples at the start and end of a 90-day interval (11/3/21 to 6/6/21). During this period Wikidata grew '''3.38%'''. <br>
The following analyses assume growth of 4.77M triples per day where applicable, so take the numbers as approximate only.
=== Description ===
The number of triples with the predicate <code>schema:description</code> is 2455403027, '''19.1%''' of all triples.
{| class="wikitable sortable"
|-
! Description !! Triple Count!! Triple % !! Number of days for Wikidata to recover
|-
| English || 70422489 || 0.55 || 15
|-
| Other Languages || 2384980538 || 18.56 || 500
|-
| Total || 2455403027 || 19.1 || 515
|}
==== Additional Info ====
Some more information about descriptions.
{| class="wikitable"
|-
!Number of items that have a description
|86420757
|-
!Average descriptions per item
|28.4
|-
!Maximum description count per item
|274
|-
!Number of items with one description
|9865826 (11% of items)
|-
!Number of items with more than one description
|76554931 (88% of items)
|-
!Number of items that have an English description
|70422478
|-
!Number of items that don't have English descriptions
|15998279 (18.5%)
|}
Therefore, 18.5% of all items that have a description don't have an English description. If we were to remove all non-English descriptions, these 18.5% of items would no longer have any description.
==== Distribution of descriptions per item ====
{| class="wikitable sortable"
|+ Top 10 numbers of descriptions per item
|-
! Descriptions per item !! Count !! Count % !! Cumulative %
|-
| 1    || 9865826 || 11.42 || 11.42
|-
| 2    || 10956928 || 12.68 || 24.10
|-
| 3     || 15865887 || 18.36 || 42.46
|-
| 4    || 4841350 || 5.60 || 48.06
|-
| 5     || 2137571 || 2.47 || 50.53
|-
| 6     || 1245943 || 1.44 || 51.97
|-
| 7     || 1246877 || 1.44 || 53.41
|-
| 8     || 1023007 || 1.18 || 54.59
|-
| 9     || 1064717 || 1.23 || 55.82
|-
| 10 || 650730 || 0.75 || 56.57
|}
[[File:Desc_item_dist.png|1100px]]
==== Language distribution of descriptions ====
There are 439 different language tags in descriptions. 50% of the descriptions are in just 32 languages, and 90% of the descriptions are in 94 languages.
{| class="wikitable sortable"
|+ Top language tags in descriptions
|-
! Language tag !! Description count !! Description %
|-
| nl || 74500705 || 3.03
|-
| en || 70422489 || 2.87
|-
| de || 61191372 || 2.49
|-
| ar || 43859817 || 1.79
|-
| fr || 42688436 || 1.74
|-
| es || 39794209 || 1.62
|-
| uk || 39735526 || 1.62
|-
| ast || 37501531 || 1.53
|-
| ca || 36750923 || 1.50
|-
| it || 36597393 || 1.49
|}
Extra distribution figures in [[File:WikidataAnalysis_LabelsId.ipynb|Jupyter Notebook # Description ## Distribution of language tags]]


=== Labels ===
Line 31: Line 138:
|-
| Total || 490755363 || 3.8 || 102
|}


Line 97: Line 203:


{| class="wikitable sortable"
|+ Top language tags in labels
|-
! Language tag !! Label count !! Label %
Line 122: Line 228:
|}


Extra distribution figures in [[File:WikidataAnalysis_LabelsId.ipynb|Jupyter Notebook # Labels ## Distribution of language tags]]

Revision as of 08:37, 14 June 2021

Wikidata is an open knowledge base in the form of a graph, accessible through SPARQL queries (among other things). The graph is formed from triples of the form (Subject, Predicate, Object). These components connect to each other, forming a huge interconnected web of data. Wikidata is growing very fast, and it is time to think about scaling even further. With this aim, this page presents some analysis of Wikidata to find out:

  • Basic understanding of Wikidata
  • Amount of certain kinds of data, such as labels, descriptions, scientific articles, etc.
  • Possible disconnected (or at least not too connected) subgraphs.
  • Frequently queried subgraphs (This requires analyzing user queries)
  • Whether distinct subgraphs are connected through such queries. (This requires analyzing user queries along with wikidata)

Find out more about what is being done to help scale Wikidata in Phab:T282790.

Vertical Data Analysis

If Blazegraph (Wikidata's backend) were to fail, what could we remove from Wikidata so that it can keep functioning? Some data points found across items in Wikidata, such as labels, descriptions, and identifiers, are possible candidates. The analysis done on this vertical data is described in the following sections. The Wikidata snapshot of 20210517 was used for this analysis.

TL;DR

[Figure: Labels dist.png]

Total triples

Before we begin, the total number of triples in this specific snapshot of Wikidata is 12853426190, approximately 12.8 billion. According to the Grafana dashboard, Wikidata grows at a rate of 4.77 million triples per day. That's a lot! The rate was calculated from the number of triples at the start and end of a 90-day interval (11/3/21 to 6/6/21). During this period Wikidata grew 3.38%.
The following analyses assume growth of 4.77M triples per day where applicable, so take the numbers as approximate only.

Description

The number of triples with the predicate schema:description is 2455403027, 19.1% of all triples.

{| class="wikitable sortable"
|-
! Description !! Triple Count !! Triple % !! Number of days for Wikidata to recover
|-
| English || 70422489 || 0.55 || 15
|-
| Other Languages || 2384980538 || 18.56 || 500
|-
| Total || 2455403027 || 19.1 || 515
|}
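The percentage and recovery-day columns can be reproduced with simple arithmetic. The sketch below is only an illustration and makes one assumption: that "number of days to recover" means the triple count divided by the 4.77M triples/day growth rate. The function and variable names are made up for this example.

```python
# Figures quoted on this page (snapshot 20210517).
TOTAL_TRIPLES = 12_853_426_190   # all triples in the snapshot
GROWTH_PER_DAY = 4_770_000       # approximate growth, triples per day

# Description triple counts from the table above.
description_triples = {
    "English": 70_422_489,
    "Other Languages": 2_384_980_538,
}


def triple_share(count, total=TOTAL_TRIPLES):
    """Percentage of all triples that `count` represents."""
    return 100 * count / total


def days_to_recover(count, rate=GROWTH_PER_DAY):
    """Days of growth at `rate` needed to add `count` triples (assumed reading)."""
    return count / rate


for name, count in description_triples.items():
    print(f"{name}: {triple_share(count):.2f}% of triples, "
          f"about {days_to_recover(count):.0f} days of growth")
```

Because the growth rate is itself an estimate, the day counts should be read as rough orders of magnitude rather than exact values.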

Additional Info

Some more information about descriptions.

{| class="wikitable"
|-
!Number of items that have a description
|86420757
|-
!Average descriptions per item
|28.4
|-
!Maximum description count per item
|274
|-
!Number of items with one description
|9865826 (11% of items)
|-
!Number of items with more than one description
|76554931 (88% of items)
|-
!Number of items that have an English description
|70422478
|-
!Number of items that don't have English descriptions
|15998279 (18.5%)
|}

Therefore, 18.5% of all items that have a description don't have an English description. If we were to remove all non-English descriptions, these 18.5% of items would no longer have any description.
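The 18.5% figure follows directly from the two item counts in the table. A minimal sketch of that arithmetic (variable names are illustrative):

```python
# Item counts quoted in the "Additional Info" table.
items_with_description = 86_420_757
items_with_english_description = 70_422_478

# Items that would lose every description if only English were kept.
items_without_english = items_with_description - items_with_english_description
share_without_english = 100 * items_without_english / items_with_description

print(items_without_english)
print(f"{share_without_english:.1f}% of items with a description")
```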

Distribution of descriptions per item

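{| class="wikitable sortable"
|+ Top 10 numbers of descriptions per item
|-
! Descriptions per item !! Count !! Count % !! Cumulative %
|-
| 1 || 9865826 || 11.42 || 11.42
|-
| 2 || 10956928 || 12.68 || 24.10
|-
| 3 || 15865887 || 18.36 || 42.46
|-
| 4 || 4841350 || 5.60 || 48.06
|-
| 5 || 2137571 || 2.47 || 50.53
|-
| 6 || 1245943 || 1.44 || 51.97
|-
| 7 || 1246877 || 1.44 || 53.41
|-
| 8 || 1023007 || 1.18 || 54.59
|-
| 9 || 1064717 || 1.23 || 55.82
|-
| 10 || 650730 || 0.75 || 56.57
|}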
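As a rough illustration of how the percentage columns relate to the raw counts, the sketch below recomputes them for the first five rows, using the number of items with a description (86420757) as the denominator. Names are illustrative. The results can differ from the table in the last digit, depending on whether rounding is applied before or after accumulating.

```python
from itertools import accumulate

TOTAL_ITEMS = 86_420_757  # items that have at least one description

# descriptions per item -> number of items (first five rows of the table)
item_counts = {
    1: 9_865_826,
    2: 10_956_928,
    3: 15_865_887,
    4: 4_841_350,
    5: 2_137_571,
}

# Per-row share of items, then a running total of the raw counts.
count_pct = [100 * c / TOTAL_ITEMS for c in item_counts.values()]
cumulative_pct = [100 * r / TOTAL_ITEMS for r in accumulate(item_counts.values())]

for n, pct, cum in zip(item_counts, count_pct, cumulative_pct):
    print(f"{n} description(s): {pct:.2f}% of items, cumulative {cum:.2f}%")
```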

[Figure: Desc_item_dist.png, distribution of descriptions per item]

Language distribution of descriptions

There are 439 different language tags in descriptions. 50% of the descriptions are in just 32 languages, and 90% of the descriptions are in 94 languages.
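Coverage statements like "50% of the descriptions are in 32 languages" can be derived by sorting languages by count and accumulating until the target share is reached. The sketch below shows the idea using only the ten language counts listed on this page, so it can only answer small coverage targets. The function `languages_needed` is a hypothetical helper, not taken from the notebook.

```python
TOTAL_DESCRIPTIONS = 2_455_403_027  # all schema:description triples

# Top ten language tags by description count, as listed on this page.
top_languages = {
    "nl": 74_500_705, "en": 70_422_489, "de": 61_191_372,
    "ar": 43_859_817, "fr": 42_688_436, "es": 39_794_209,
    "uk": 39_735_526, "ast": 37_501_531, "ca": 36_750_923,
    "it": 36_597_393,
}


def languages_needed(counts, fraction, total):
    """Smallest number of most-used languages covering `fraction` of `total`.

    Returns None when the supplied languages cannot reach the target.
    """
    target = fraction * total
    running = 0
    for rank, count in enumerate(sorted(counts.values(), reverse=True), start=1):
        running += count
        if running >= target:
            return rank
    return None


print(languages_needed(top_languages, 0.10, TOTAL_DESCRIPTIONS))
print(languages_needed(top_languages, 0.50, TOTAL_DESCRIPTIONS))
```

With the full list of 439 language tags as input, the same function would give the 50% and 90% figures quoted above.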

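{| class="wikitable sortable"
|+ Top language tags in descriptions
|-
! Language tag !! Description count !! Description %
|-
| nl || 74500705 || 3.03
|-
| en || 70422489 || 2.87
|-
| de || 61191372 || 2.49
|-
| ar || 43859817 || 1.79
|-
| fr || 42688436 || 1.74
|-
| es || 39794209 || 1.62
|-
| uk || 39735526 || 1.62
|-
| ast || 37501531 || 1.53
|-
| ca || 36750923 || 1.50
|-
| it || 36597393 || 1.49
|}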

Extra distribution figures in Jupyter Notebook # Description ## Distribution of language tags

Labels

The number of triples with the predicate rdfs:label is 490755363, 3.8% of all triples.

{| class="wikitable sortable"
|-
! Label !! Triple Count !! Triple % !! Number of days for Wikidata to recover
|-
| English || 79360158 || 0.6 || 16
|-
| Other Languages || 411395205 || 3.2 || 86
|-
| Total || 490755363 || 3.8 || 102
|}

Additional Info

Some more information about labels.

{| class="wikitable"
|-
!Number of items that have a label
|93011973
|-
!Average labels per item
|5.28
|-
!Maximum label count per item
|476
|-
!Number of items with one label
|21060266 (22% of items)
|-
!Number of items with more than one label
|71951707 (77% of items)
|-
!Number of items that have an English label
|79360145
|-
!Number of items that don't have English labels
|13651828 (14.67%)
|}

Therefore, 14.7% of all items that have a label don't have an English label. If we were to remove all non-English labels, these 14.7% of items would no longer have any label.

Distribution of labels per item

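{| class="wikitable sortable"
|+ Top ten labels per item
|-
! Labels per item !! Count !! Count % !! Cumulative %
|-
| 1 || 21060266 || 22.64 || 22.64
|-
| 2 || 41568826 || 44.69 || 67.33
|-
| 3 || 9980993 || 10.73 || 78.06
|-
| 4 || 4415435 || 4.75 || 82.81
|-
| 5 || 2464210 || 2.65 || 85.46
|-
| 6 || 1788127 || 1.92 || 87.38
|-
| 7 || 1098244 || 1.18 || 88.56
|-
| 8 || 1551459 || 1.67 || 90.23
|-
| 9 || 758017 || 0.81 || 91.04
|-
| 10 || 646455 || 0.70 || 91.74
|}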

Language distribution of labels

There are 470 different language tags in labels. 40% of the labels are in only 6 languages, and 50% of the labels are in 12 languages.

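{| class="wikitable sortable"
|+ Top language tags in labels
|-
! Language tag !! Label count !! Label %
|-
| en || 79360158 || 16.17
|-
| nl || 56588139 || 11.53
|-
| ast || 14902568 || 3.04
|-
| fr || 14365247 || 2.93
|-
| de || 14079558 || 2.87
|-
| es || 12836907 || 2.62
|-
| it || 8944767 || 1.82
|-
| ga || 8407993 || 1.71
|-
| pt || 7831211 || 1.60
|-
| sv || 7610178 || 1.55
|}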

Extra distribution figures in Jupyter Notebook # Labels ## Distribution of language tags