
# What are subgraphs?

Wikidata contains data covering many areas of knowledge. These data are highly interconnected, but some patterns can be found. We identify subgraphs within Wikidata and measure how large they are, how connected they are, and how much they are used (queried).

In order to find subgraphs, the following steps were taken:

• Consider all items that are `instance of` (P31) the same item to be part of one subgraph. For example: all items that are `instance of` Q13442814 (scholarly article) form one subgraph.
• Some subgraphs were merged where the relationship was obvious. For example: all subclasses of astronomical object were treated as part of astronomical object, since they are all some kind of astronomical object. This kind of subclass merging cannot be applied everywhere without manual inspection.
• Some large subgraphs were almost completely contained in another subgraph. For example: all items under review article are also `instance of` scholarly article. In such cases, review article was not considered a separate subgraph.
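
The steps above can be sketched in a few lines of Python. This is only an illustration on toy data, not the pipeline that was actually used: the item names, `Q_STAR`, `Q_REVIEW_ARTICLE`, and the `MERGE_INTO` map are hypothetical placeholders, and the merge map stands in for the manual decisions described in the last two steps.

```python
from collections import defaultdict

P31 = "P31"  # instance of

# Hypothetical merge map (class -> subgraph it is folded into).
# Q_STAR and Q_REVIEW_ARTICLE are placeholders, not real Wikidata IDs.
MERGE_INTO = {
    "Q_STAR": "Q6999",                # a subclass of astronomical object
    "Q_REVIEW_ARTICLE": "Q13442814",  # almost fully contained in scholarly article
}

def build_subgraphs(triples, merge_into=MERGE_INTO):
    """Group items by the value of their `instance of` (P31) statements."""
    subgraphs = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate != P31:
            continue  # only P31 statements define subgraph membership
        subgraphs[merge_into.get(obj, obj)].add(subject)
    return dict(subgraphs)

# Toy triples: (subject, predicate, object)
triples = [
    ("item_a", "P31", "Q13442814"),
    ("item_b", "P31", "Q_REVIEW_ARTICLE"),
    ("item_b", "P31", "Q13442814"),
    ("item_c", "P31", "Q_STAR"),
    ("item_d", "P31", "Q5"),
    ("item_d", "P_OTHER", "some_value"),
]

subgraphs = build_subgraphs(triples)
for name in sorted(subgraphs):
    print(name, sorted(subgraphs[name]))
```

An item with several `instance of` values ends up in several subgraphs, so subgraphs built this way can overlap.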

# Subgraph sizes

Using only `instance of`, Wikidata has `82,919` subgraphs. The distribution of the sizes of these subgraphs has a clear long tail, with very few subgraphs incorporating most items in Wikidata. Subgraph size can be calculated in two ways:

• The number of items it contains
• The number of triples related to the items in a subgraph. This is what we refer to as subgraph size from here on.
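
As a minimal sketch of the two measures, the function below computes both for each subgraph. It assumes that "triples related to an item" means triples with that item as subject, which is an interpretation rather than something stated here; the input names and the toy numbers are hypothetical.

```python
def subgraph_sizes(membership, item_triple_counts):
    """Return {subgraph: (item_count, triple_count)}.

    membership: {subgraph: set of items in it}
    item_triple_counts: {item: number of triples with that item as subject}
    """
    return {
        name: (len(items), sum(item_triple_counts.get(i, 0) for i in items))
        for name, items in membership.items()
    }

# Toy data (hypothetical items and counts)
membership = {
    "Q5": {"item_a", "item_b"},
    "Q13442814": {"item_c"},
}
item_triple_counts = {"item_a": 3, "item_b": 4, "item_c": 10}

sizes = subgraph_sizes(membership, item_triple_counts)
print(sizes)
```

The two measures can rank subgraphs differently, since a subgraph with fewer items can still hold more triples if its items are described in more detail.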

Takeaways:

• Most calculations from here on will take the top 50 subgraphs, which form 85% of Wikidata
• 340 top subgraphs (0.5% of all subgraphs, after merging some) form 90% of Wikidata (91% of all items and 90% of all triples). These subgraphs have >=10,000 items each.
• The remaining 99.5% of the subgraphs have <10,000 items each, and together form 10% of Wikidata.

Below is the distribution of the number of items per subgraph.

To be more specific,

Subgraph item distribution

Each row reads: there are N subgraph(s) with more than M item(s).

| Number of subgraphs | Number of items |
|---|---|
| 54,602 | 1 |
| 23,724 | 10 |
| 6,625 | 100 |
| 1,712 | 1,000 |
| 392 | 10,000 |
| 63 | 100,000 |
| 10 | 1,000,000 |
| 1 | 10,000,000 |
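
A table of this shape can be derived from a list of per-subgraph item counts by counting how many subgraphs exceed each threshold. The sketch below shows the idea on made-up counts; `count_above` and the sample data are hypothetical and do not reproduce the real figures.

```python
THRESHOLDS = (1, 10, 100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000)

def count_above(item_counts, thresholds=THRESHOLDS):
    """For each threshold, count the subgraphs with strictly more items than it."""
    return {t: sum(1 for c in item_counts if c > t) for t in thresholds}

# Toy item counts, one entry per subgraph (hypothetical values)
item_counts = [1, 1, 5, 12, 250, 20_000_000]

distribution = count_above(item_counts)
for threshold, n in distribution.items():
    print(f"There are {n} subgraph(s) with more than {threshold:,} item(s)")
```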

Below is the subgraph size comparison of the top 340 subgraphs in Wikidata (90% of Wikidata).

Below is the subgraph size comparison of the top 50 subgraphs in Wikidata (85% of Wikidata).

Here is an interactive graph showing the comparison of subgraph sizes in terms of item count and triple count: subgraph stats.

Here are some subgraph size visualizations in WDQS:

• Size as percentage of Wikidata each subgraph occupies: query link
• Size as percentage of Wikidata items each subgraph contains: query link

## Number of days to recovery

Given the current rate of growth, how long would it take Wikidata to get back to its original size if some amount of triples were removed from it? This helps us estimate what to temporarily remove from Wikidata in the situation where the Wikidata backend maxes out. The growth rate of triples is not constant, but treating the growth shown in the Grafana dashboard as an approximately straight line, Wikidata grows at a rate of 4.77M triples per day. This rate was calculated from the number of triples at the start and end of a 90-day interval (11/3/21 to 6/6/21). The actual rate could be somewhat faster or slower than this. Dividing the number of triples in a subgraph by this rate gives a rough approximation of the number of days we can gain by removing that subgraph from Wikidata.
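
The estimate is a single division, shown here as a small Python sketch. The function name is hypothetical; the rate is the 4.77M triples per day quoted above, and the triple counts are taken from the top 50 table.

```python
GROWTH_RATE_TRIPLES_PER_DAY = 4.77e6  # linear approximation quoted above

def days_to_recover(triples_removed, rate=GROWTH_RATE_TRIPLES_PER_DAY):
    """Days for Wikidata to grow back by `triples_removed` at a constant rate."""
    return triples_removed / rate

# Triple counts from the top 50 table
scholarly_article = days_to_recover(6_539_020_889)
astronomical_object = days_to_recover(1_136_682_291)

print(f"{scholarly_article:.2f}")    # 1370.86
print(f"{astronomical_object:.2f}")  # 238.30
```

Because the growth is only approximately linear, these values are best read as orders of magnitude rather than precise forecasts.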

Top 50 Subgraphs in Wikidata

| Rank | Subgraph | Subgraph name | Number of items | % of WD items | Number of triples | % of WD triples | Number of days to recover |
|---|---|---|---|---|---|---|---|
| 1 | Q13442814 | scholarly article | 37,362,641 | 39.75 | 6,539,020,889 | 49.73 | 1370.86 |
| 2 | Q6999 | astronomical object | 8,412,914 | 8.95 | 1,136,682,291 | 8.64 | 238.3 |
| 3 | Q5 | human | 9,315,444 | 9.91 | 954,536,943 | 7.26 | 200.11 |
| 4 | Q4167836 | Wikimedia category | 4,840,195 | 5.15 | 753,127,982 | 5.73 | 157.89 |
| 5 | Q16521 | taxon | 3,180,248 | 3.38 | 367,926,462 | 2.8 | 77.13 |
| 6 | Q101352 | family name | 481,445 | 0.51 | 187,299,892 | 1.42 | 39.27 |
| 7 | Q4167410 | Wikimedia disambiguation page | 1,359,804 | 1.45 | 180,124,174 | 1.37 | 37.76 |
| 8 | Q7187 | gene | 1,196,361 | 1.27 | 122,421,508 | 0.93 | 25.66 |
| 9 | Q11266439 | Wikimedia template | 845,852 | 0.9 | 114,308,711 | 0.87 | 23.96 |
| 10 | Q11173 | chemical compound | 1,223,387 | 1.3 | 91,228,463 | 0.69 | 19.13 |
| 11 | Q8054 | protein | 986,599 | 1.05 | 88,483,828 | 0.67 | 18.55 |
| 12 | Q3305213 | painting | 539,468 | 0.57 | 56,769,083 | 0.43 | 11.9 |
| 13 | Q13100073 | village-level division in China | 588,477 | 0.63 | 51,615,572 | 0.39 | 10.82 |
| 14 | Q11424 | film | 263,070 | 0.28 | 47,176,067 | 0.36 | 9.89 |
| 15 | Q486972 | human settlement | 563,958 | 0.6 | 39,590,792 | 0.3 | 8.3 |
| 16 | Q13406463 | Wikimedia list article | 334,939 | 0.36 | 33,742,245 | 0.26 | 7.07 |
| 17 | Q13433827 | encyclopedia article | 512,141 | 0.55 | 33,373,227 | 0.25 | 7.0 |
| 18 | Q8502 | mountain | 525,553 | 0.56 | 33,340,188 | 0.25 | 6.99 |
| 19 | Q2668072 | collection | 500,968 | 0.53 | 32,670,637 | 0.25 | 6.85 |
| 20 | Q79007 | street | 578,926 | 0.62 | 30,252,119 | 0.23 | 6.34 |
| 21 | Q4022 | river | 399,552 | 0.42 | 28,833,476 | 0.22 | 6.04 |
| 22 | Q30612 | clinical trial | 356,838 | 0.38 | 27,731,502 | 0.21 | 5.81 |
| 23 | Q532 | village | 274,840 | 0.29 | 26,483,275 | 0.2 | 5.55 |
| 24 | Q17633526 | Wikinews article | 286,950 | 0.3 | 21,830,150 | 0.17 | 4.58 |
| 25 | Q482994 | album | 269,095 | 0.29 | 21,181,015 | 0.16 | 4.44 |
| 26 | Q23397 | lake | 260,135 | 0.28 | 18,053,096 | 0.14 | 3.78 |
| 27 | Q54050 | hill | 327,277 | 0.35 | 17,228,390 | 0.13 | 3.61 |
| 28 | Q16970 | church building | 211,291 | 0.22 | 16,821,530 | 0.13 | 3.53 |
| 29 | Q41176 | building | 265,925 | 0.28 | 16,293,008 | 0.12 | 3.42 |
| 30 | Q56436498 | village in India | 145,824 | 0.16 | 15,383,416 | 0.12 | 3.23 |
| 31 | Q4830453 | business | 193,858 | 0.21 | 14,101,220 | 0.11 | 2.96 |
| 32 | Q47150325 | calendar day of a given year | 189,366 | 0.2 | 14,078,486 | 0.11 | 2.95 |
| 33 | Q3947 | house | 197,736 | 0.21 | 12,468,434 | 0.1 | 2.61 |
| 34 | Q3331189 | version, edition, or translation | 157,486 | 0.17 | 10,997,589 | 0.08 | 2.31 |
| 35 | Q18593264 | item of collection or exhibition | 147,402 | 0.16 | 10,732,969 | 0.08 | 2.25 |
| 36 | Q27020041 | sports season | 158,877 | 0.17 | 10,693,504 | 0.08 | 2.24 |
| 37 | Q355304 | watercourse | 174,620 | 0.19 | 10,080,421 | 0.08 | 2.11 |
| 38 | Q7725634 | literary work | 164,860 | 0.18 | 10,049,521 | 0.08 | 2.11 |
| 39 | Q23442 | island | 148,587 | 0.16 | 9,885,277 | 0.08 | 2.07 |
| 40 | Q11060274 | print | 119,806 | 0.13 | 9,700,063 | 0.07 | 2.03 |
| 41 | Q811979 | architectural structure | 145,957 | 0.16 | 9,666,936 | 0.07 | 2.03 |
| 42 | Q5084 | hamlet | 118,188 | 0.13 | 9,013,534 | 0.07 | 1.89 |
| 43 | Q9842 | primary school | 157,451 | 0.17 | 8,916,373 | 0.07 | 1.87 |
| 44 | Q19389637 | biographical article | 151,026 | 0.16 | 8,238,397 | 0.06 | 1.73 |
| 45 | Q21014462 | cell line | 128,805 | 0.14 | 7,955,975 | 0.06 | 1.67 |
| 46 | Q47521 | stream | 124,853 | 0.13 | 6,654,366 | 0.05 | 1.4 |
| 47 | Q59199015 | group of stereoisomers | 111,599 | 0.12 | 5,843,270 | 0.04 | 1.23 |
| 48 | Q61443690 | branch post office | 129,183 | 0.14 | 5,313,033 | 0.04 | 1.11 |
| 49 | Q49008 | prime number | 127,545 | 0.14 | 5,188,768 | 0.04 | 1.09 |
| 50 | Q4164871 | position | 120,117 | 0.13 | 4,720,668 | 0.04 | 0.99 |