User:AKhatun/Wikidata Subgraph Analysis: Difference between revisions
imported>AKhatun (Add triples per item section) |
imported>AKhatun (Predicate usage in subgraphs) |
||
Line 13: | Line 13: | ||
* The number of triples related to the items in a subgraph. This is what we refer to as '''subgraph size''' from here on. | * The number of triples related to the items in a subgraph. This is what we refer to as '''subgraph size''' from here on. | ||
Takeaways: | '''Takeaways''': | ||
* Most calculations from here on will take the top 50 subgraphs, which form 85% of Wikidata | * Most calculations from here on will take the top 50 subgraphs, which form 85% of Wikidata | ||
* 340 top subgraphs (0.5% of all subgraphs, after merging some) form 90% of Wikidata (91% of all items and 90% of all triples). These subgraphs have '''>=10,000''' items each. | * 340 top subgraphs (0.5% of all subgraphs, after merging some) form 90% of Wikidata (91% of all items and 90% of all triples). These subgraphs have '''>=10,000''' items each. | ||
Line 72: | Line 72: | ||
Given the current rate of growth, how long would it take Wikidata to get back to its original size if some amount of triples were removed from it? This helps us estimate what to temporarily remove from Wikidata if the Wikidata backend maxes out. The growth rate of triples is not constant, but treating the growth as approximately linear in the [https://grafana.wikimedia.org/goto/pyO_iMRnk grafana dashboard], Wikidata grows at a rate of '''4.77M''' triples per day. This rate was calculated from the number of triples at the start and end of a 90-day interval (11/3/21 to 6/6/21). The actual rate could be somewhat faster or slower than this, so it gives only a rough approximation of the number of days we can gain by removing some parts of Wikidata. | Given the current rate of growth, how long would it take Wikidata to get back to its original size if some amount of triples were removed from it? This helps us estimate what to temporarily remove from Wikidata if the Wikidata backend maxes out. The growth rate of triples is not constant, but treating the growth as approximately linear in the [https://grafana.wikimedia.org/goto/pyO_iMRnk grafana dashboard], Wikidata grows at a rate of '''4.77M''' triples per day. This rate was calculated from the number of triples at the start and end of a 90-day interval (11/3/21 to 6/6/21). The actual rate could be somewhat faster or slower than this, so it gives only a rough approximation of the number of days we can gain by removing some parts of Wikidata. | ||
= Triples = | |||
The triples within a subgraph can be of various types. They can be: | The triples within a subgraph can be of various types. They can be: | ||
* truthy triples like <code>wdt</code> | * truthy triples like <code>wdt</code> | ||
Line 85: | Line 85: | ||
|+ Top 50 Subgraphs in Wikidata | |+ Top 50 Subgraphs in Wikidata | ||
|- | |- | ||
! Rank !! Subgraph !! Subgraph Name !! Number of items !! % of WD items !! Number of triples !! % of WD Triples !! Number of days to recover!!%of truthy statements!!%of non-wikidata direct statements!!%of full statements | ! Rank !! Subgraph !! Subgraph Name !! Number of items !! % of WD items !! Number of triples !! % of WD Triples !! Number of days to recover!!%of truthy statements!!%of non-wikidata direct statements!!%of full statements!!Number of unique properties | ||
|- | |- | ||
|1||Q13442814||scholarly article||37,362,641||39.75||6,539,020,889||49.73||1370.86||12.62||24.28||63.25 | |1||Q13442814||scholarly article||37,362,641||39.75||6,539,020,889||49.73||1370.86||12.62||24.28||63.25||722 | ||
|- | |- | ||
|2||Q6999||astronomical object||8,412,914||8.95||1,136,682,291||8.64||238.3||10.20||14.24||76.07 | |2||Q6999||astronomical object||8,412,914||8.95||1,136,682,291||8.64||238.3||10.20||14.24||76.07||578 | ||
|- | |- | ||
|3||Q5||human||9,315,444||9.91||954,536,943||7.26||200.11||13.31||20.06||60.94 | |3||Q5||human||9,315,444||9.91||954,536,943||7.26||200.11||13.31||20.06||60.94||4482 | ||
|- | |- | ||
|4||Q4167836||Wikimedia category||4,840,195||5.15||753,127,982||5.73||157.89||1.19||86.06||5.19 | |4||Q4167836||Wikimedia category||4,840,195||5.15||753,127,982||5.73||157.89||1.19||86.06||5.19||610 | ||
|- | |- | ||
|5||Q16521||taxon||3,180,248||3.38||367,926,462||2.8||77.13||10.11||37.22||42.97 | |5||Q16521||taxon||3,180,248||3.38||367,926,462||2.8||77.13||10.11||37.22||42.97||963 | ||
|- | |- | ||
|6||Q101352||family name||481,445||0.51||187,299,892||1.42||39.27||1.59||93.49||6.62 | |6||Q101352||family name||481,445||0.51||187,299,892||1.42||39.27||1.59||93.49||6.62||375 | ||
|- | |- | ||
|7||Q4167410||Wikimedia disambiguation page||1,359,804||1.45||180,124,174||1.37||37.76||0.89||88.32||3.83 | |7||Q4167410||Wikimedia disambiguation page||1,359,804||1.45||180,124,174||1.37||37.76||0.89||88.32||3.83||796 | ||
|- | |- | ||
|8||Q7187||gene||1,196,361||1.27||122,421,508||0.93||25.66||14.13||12.51||73.11 | |8||Q7187||gene||1,196,361||1.27||122,421,508||0.93||25.66||14.13||12.51||73.11||218 | ||
|- | |- | ||
|9||Q11266439||Wikimedia template||845,852||0.9||114,308,711||0.87||23.96||0.78||87.06||3.20 | |9||Q11266439||Wikimedia template||845,852||0.9||114,308,711||0.87||23.96||0.78||87.06||3.20||222 | ||
|- | |- | ||
|10||Q11173||chemical compound||1,223,387||1.3||91,228,463||0.69||19.13||12.69||35.72||50.86 | |10||Q11173||chemical compound||1,223,387||1.3||91,228,463||0.69||19.13||12.69||35.72||50.86||591 | ||
|- | |- | ||
|11||Q8054||protein||986,599||1.05||88,483,828||0.67||18.55||14.26||13.51||72.21 | |11||Q8054||protein||986,599||1.05||88,483,828||0.67||18.55||14.26||13.51||72.21||267 | ||
|- | |- | ||
|12||Q3305213||painting||539,468||0.57||56,769,083||0.43||11.9||12.10||24.46||63.35 | |12||Q3305213||painting||539,468||0.57||56,769,083||0.43||11.9||12.10||24.46||63.35||785 | ||
|- | |- | ||
|13||Q13100073||village-level division in China||588,477||0.63||51,615,572||0.39||10.82||6.84||62.73||30.41 | |13||Q13100073||village-level division in China||588,477||0.63||51,615,572||0.39||10.82||6.84||62.73||30.41||77 | ||
|- | |- | ||
|14||Q11424||film||263,070||0.28||47,176,067||0.36||9.89||14.37||12.74||66.01 | |14||Q11424||film||263,070||0.28||47,176,067||0.36||9.89||14.37||12.74||66.01||1029 | ||
|- | |- | ||
|15||Q486972||human settlement||563,958||0.6||39,590,792||0.3||8.3||10.84||22.94||49.32 | |15||Q486972||human settlement||563,958||0.6||39,590,792||0.3||8.3||10.84||22.94||49.32||1120 | ||
|- | |- | ||
|16||Q13406463||Wikimedia list article||334,939||0.36||33,742,245||0.26||7.07||2.45||78.06||10.48 | |16||Q13406463||Wikimedia list article||334,939||0.36||33,742,245||0.26||7.07||2.45||78.06||10.48||880 | ||
|- | |- | ||
|17||Q13433827||encyclopedia article||512,141||0.55||33,373,227||0.25||7.0||9.03||45.52||39.76 | |17||Q13433827||encyclopedia article||512,141||0.55||33,373,227||0.25||7.0||9.03||45.52||39.76||164 | ||
|- | |- | ||
|18||Q8502||mountain||525,553||0.56||33,340,188||0.25||6.99||11.37||27.79||50.39 | |18||Q8502||mountain||525,553||0.56||33,340,188||0.25||6.99||11.37||27.79||50.39||709 | ||
|- | |- | ||
|19||Q2668072||collection||500,968||0.53||32,670,637||0.25||6.85||15.24||12.30||72.71 | |19||Q2668072||collection||500,968||0.53||32,670,637||0.25||6.85||15.24||12.30||72.71||665 | ||
|- | |- | ||
|20||Q79007||street||578,926||0.62||30,252,119||0.23||6.34||13.86||24.24||59.88 | |20||Q79007||street||578,926||0.62||30,252,119||0.23||6.34||13.86||24.24||59.88||572 | ||
|- | |- | ||
|21||Q4022||river||399,552||0.42||28,833,476||0.22||6.04||11.34||24.99||52.15 | |21||Q4022||river||399,552||0.42||28,833,476||0.22||6.04||11.34||24.99||52.15||580 | ||
|- | |- | ||
|22||Q30612||clinical trial||356,838||0.38||27,731,502||0.21||5.81||16.05||12.33||71.78 | |22||Q30612||clinical trial||356,838||0.38||27,731,502||0.21||5.81||16.05||12.33||71.78||124 | ||
|- | |- | ||
|23||Q532||village||274,840||0.29||26,483,275||0.2||5.55||9.69||26.25||45.22 | |23||Q532||village||274,840||0.29||26,483,275||0.2||5.55||9.69||26.25||45.22||818 | ||
|- | |- | ||
|24||Q17633526||Wikinews article||286,950||0.3||21,830,150||0.17||4.58||3.68||72.76||16.46 | |24||Q17633526||Wikinews article||286,950||0.3||21,830,150||0.17||4.58||3.68||72.76||16.46||256 | ||
|- | |- | ||
|25||Q482994||album||269,095||0.29||21,181,015||0.16||4.44||12.20||22.64||53.05 | |25||Q482994||album||269,095||0.29||21,181,015||0.16||4.44||12.20||22.64||53.05||638 | ||
|- | |- | ||
|26||Q23397||lake||260,135||0.28||18,053,096||0.14||3.78||11.32||25.43||53.19 | |26||Q23397||lake||260,135||0.28||18,053,096||0.14||3.78||11.32||25.43||53.19||602 | ||
|- | |- | ||
|27||Q54050||hill||327,277||0.35||17,228,390||0.13||3.61||12.67||23.07||56.77 | |27||Q54050||hill||327,277||0.35||17,228,390||0.13||3.61||12.67||23.07||56.77||470 | ||
|- | |- | ||
|28||Q16970||church building||211,291||0.22||16,821,530||0.13||3.53||14.03||15.06||63.56 | |28||Q16970||church building||211,291||0.22||16,821,530||0.13||3.53||14.03||15.06||63.56||1036 | ||
|- | |- | ||
|29||Q41176||building||265,925||0.28||16,293,008||0.12||3.42||14.21||14.99||68.06 | |29||Q41176||building||265,925||0.28||16,293,008||0.12||3.42||14.21||14.99||68.06||1243 | ||
|- | |- | ||
|30||Q56436498||village in India||145,824||0.16||15,383,416||0.12||3.23||8.34||22.23||64.50 | |30||Q56436498||village in India||145,824||0.16||15,383,416||0.12||3.23||8.34||22.23||64.50||210 | ||
|- | |- | ||
|31||Q4830453||business||193,858||0.21||14,101,220||0.11||2.96||13.06||16.23||60.48 | |31||Q4830453||business||193,858||0.21||14,101,220||0.11||2.96||13.06||16.23||60.48||1790 | ||
|- | |- | ||
|32||Q47150325||calendar day of a given year||189,366||0.2||14,078,486||0.11||2.95||6.85||56.25||34.18 | |32||Q47150325||calendar day of a given year||189,366||0.2||14,078,486||0.11||2.95||6.85||56.25||34.18||56 | ||
|- | |- | ||
|33||Q3947||house||197,736||0.21||12,468,434||0.1||2.61||15.26||15.18||70.67 | |33||Q3947||house||197,736||0.21||12,468,434||0.1||2.61||15.26||15.18||70.67||760 | ||
|- | |- | ||
|34||Q3331189||version, edition, or translation||157,486||0.17||10,997,589||0.08||2.31||15.41||14.70||69.39 | |34||Q3331189||version, edition, or translation||157,486||0.17||10,997,589||0.08||2.31||15.41||14.70||69.39||1006 | ||
|- | |- | ||
|35||Q18593264||item of collection or exhibition||147,402||0.16||10,732,969||0.08||2.25||16.96||9.69||73.38 | |35||Q18593264||item of collection or exhibition||147,402||0.16||10,732,969||0.08||2.25||16.96||9.69||73.38||306 | ||
|- | |- | ||
|36||Q27020041||sports season||158,877||0.17||10,693,504||0.08||2.24||12.11||15.12||53.43 | |36||Q27020041||sports season||158,877||0.17||10,693,504||0.08||2.24||12.11||15.12||53.43||353 | ||
|- | |- | ||
|37||Q355304||watercourse||174,620||0.19||10,080,421||0.08||2.11||12.17||25.07||54.79 | |37||Q355304||watercourse||174,620||0.19||10,080,421||0.08||2.11||12.17||25.07||54.79||302 | ||
|- | |- | ||
|38||Q7725634||literary work||164,860||0.18||10,049,521||0.08||2.11||13.85||16.69||58.88 | |38||Q7725634||literary work||164,860||0.18||10,049,521||0.08||2.11||13.85||16.69||58.88||1093 | ||
|- | |- | ||
|39||Q23442||island||148,587||0.16||9,885,277||0.08||2.07||11.40||22.52||50.06 | |39||Q23442||island||148,587||0.16||9,885,277||0.08||2.07||11.40||22.52||50.06||829 | ||
|- | |- | ||
|40||Q11060274||print||119,806||0.13||9,700,063||0.07||2.03||14.85||9.74||76.96 | |40||Q11060274||print||119,806||0.13||9,700,063||0.07||2.03||14.85||9.74||76.96||269 | ||
|- | |- | ||
|41||Q811979||architectural structure||145,957||0.16||9,666,936||0.07||2.03||10.06||9.91||52.63 | |41||Q811979||architectural structure||145,957||0.16||9,666,936||0.07||2.03||10.06||9.91||52.63||994 | ||
|- | |- | ||
|42||Q5084||hamlet||118,188||0.13||9,013,534||0.07||1.89||10.94||18.63||55.55 | |42||Q5084||hamlet||118,188||0.13||9,013,534||0.07||1.89||10.94||18.63||55.55||423 | ||
|- | |- | ||
|43||Q9842||primary school||157,451||0.17||8,916,373||0.07||1.87||13.99||16.22||68.98 | |43||Q9842||primary school||157,451||0.17||8,916,373||0.07||1.87||13.99||16.22||68.98||410 | ||
|- | |- | ||
|44||Q19389637||biographical article||151,026||0.16||8,238,397||0.06||1.73||12.76||21.07||57.51 | |44||Q19389637||biographical article||151,026||0.16||8,238,397||0.06||1.73||12.76||21.07||57.51||131 | ||
|- | |- | ||
|45||Q21014462||cell line||128,805||0.14||7,955,975||0.06||1.67||10.62||36.18||54.60 | |45||Q21014462||cell line||128,805||0.14||7,955,975||0.06||1.67||10.62||36.18||54.60||60 | ||
|- | |- | ||
|46||Q47521||stream||124,853||0.13||6,654,366||0.05||1.4||12.95||19.67||58.11 | |46||Q47521||stream||124,853||0.13||6,654,366||0.05||1.4||12.95||19.67||58.11||280 | ||
|- | |- | ||
|47||Q59199015||group of stereoisomers||111,599||0.12||5,843,270||0.04||1.23||15.43||18.82||67.23 | |47||Q59199015||group of stereoisomers||111,599||0.12||5,843,270||0.04||1.23||15.43||18.82||67.23||216 | ||
|- | |- | ||
|48||Q61443690||branch post office||129,183||0.14||5,313,033||0.04||1.11||14.59||14.86||70.54 | |48||Q61443690||branch post office||129,183||0.14||5,313,033||0.04||1.11||14.59||14.86||70.54||22 | ||
|- | |- | ||
|49||Q49008||prime number||127,545||0.14||5,188,768||0.04||1.09||10.01||36.78||52.40 | |49||Q49008||prime number||127,545||0.14||5,188,768||0.04||1.09||10.01||36.78||52.40||101 | ||
|- | |- | ||
|50||Q4164871||position||120,117||0.13||4,720,668||0.04||0.99||12.72||32.76||52.28 | |50||Q4164871||position||120,117||0.13||4,720,668||0.04||0.99||12.72||32.76||52.28||654 | ||
|} | |} | ||
Line 193: | Line 193: | ||
[[File:subgraph_boxplot.png|1100px]] | [[File:subgraph_boxplot.png|1100px]] | ||
== | = Predicates = | ||
=== Top | Predicates are used in all subgraphs. Some predicates are used exclusively in one subgraph, while others have 99% or more of their usage in a single subgraph. Moreover, the number of unique predicates used in a subgraph tells us how diverse the statements in that subgraph are. These and some further analyses of predicates are described below. | ||
== Number of unique predicates == | |||
The number of unique predicates each subgraph uses is listed in the table above ([[#Table of top 50 subgraph information]]). Sort by the "Number of unique properties" column to find the most and least diverse subgraphs. | |||
[[File:pred_count.png]] | |||
== Predicate distribution == | |||
There are <code>~7500</code> unique predicates across the top 50 subgraphs. Of these, <code>~3500</code> (46%) are used in only one subgraph. The figures below show this distribution. | |||
{| | |||
[[File:subgraph_pred_dist.png]] | [[File:subgraph_pred_dist_2.png]] | |||
|} | |||
A CSV file of the subgraph-predicate counts is available here: [[https://github.com/tanny411/Wikidata-WDQS-Analysis/subgraph_analysis/data/subgraph_pred_df_info.csv|TBA]]. | |||
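The count of predicates used in only one subgraph can be sketched as follows. This is a minimal illustration and not the code used for this analysis: the <code>usage</code> pairs are toy data, and the helper name <code>subgraphs_per_predicate</code> is hypothetical.

```python
from collections import Counter

def subgraphs_per_predicate(usage):
    """Count, for each predicate, how many distinct subgraphs use it.

    usage: iterable of (subgraph, predicate) pairs; duplicates are ignored.
    """
    return Counter(pred for _, pred in set(usage))

# Toy data (assumption): which predicates appear in which subgraph.
usage = [
    ("human", "P569"),
    ("human", "P31"),
    ("film", "P31"),
    ("film", "P57"),
    ("human", "P569"),  # repeated pair, counted once
]

n_subgraphs = subgraphs_per_predicate(usage)
# Predicates that occur in exactly one subgraph.
single = sorted(p for p, n in n_subgraphs.items() if n == 1)
# Share of all predicates that are used in exactly one subgraph.
share = len(single) / len(n_subgraphs)
print(single, f"{100 * share:.0f}%")
```

With the real subgraph-predicate counts, the same share would correspond to the ~46% figure quoted above.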
== Top predicates == | |||
While the top predicates of every subgraph are interesting, the full list is too long for this page. The table below shows only the top 3 predicates per subgraph. A CSV file with the top 5 predicates per subgraph is available here: [[https://github.com/tanny411/Wikidata-WDQS-Analysis/subgraph_analysis/data/top_subgraph_pred.csv|TBA]]. More can be derived from [[https://github.com/tanny411/Wikidata-WDQS-Analysis/subgraph_analysis/data/subgraph_pred_df_info.csv|TBA]] with some filtering and grouping. | |||
'''Note that''': | |||
* The most common top predicates are <code>description</code>, <code>rdf:type</code>, <code>wikibase:rank</code>, and <code>references</code>. | |||
* The <code>descriptions</code> of scholarly articles alone make up 10% of Wikidata triples, with <code>cites work</code> and <code>wikibase:rank</code> each at ~6%. | |||
* Wikimedia category <code>descriptions</code> make up 4.5%, and every other entry is ~1% or less of Wikidata triples. | |||
{| class="wikitable sortable" | |||
|+ Top predicates per subgraph and their triple distribution | |||
|- | |||
! | |||
! colspan="4" | 1st top predicate | |||
! colspan="4" | 2nd top predicate | |||
! colspan="4" | 3rd top predicate | |||
|- | |||
! Subgraph | |||
! Predicate !! #of triples !! %triples in subgraph !! %triples in Wikidata | |||
! Predicate !! #of triples !! %triples in subgraph !! %triples in Wikidata | |||
! Predicate !! #of triples !! %triples in subgraph !! %triples in Wikidata | |||
|- | |||
|Wikimedia category||description||596672076||79.226||4.517||22-rdf-syntax-ns#type||22761531||3.022||0.172||rdf-schema#label||15094771||2.004||0.114 | |||
|- | |||
|Wikimedia disambiguation page||description||112264922||62.326||0.85||rdf-schema#label||39587180||21.978||0.3||22-rdf-syntax-ns#type||4146198||2.302||0.031 | |||
|- | |||
|Wikimedia list article||description||23714609||70.282||0.18||22-rdf-syntax-ns#type||1449015||4.294||0.011||instance of||1022424||3.03||0.008 | |||
|- | |||
|Wikimedia template||description||93286385||81.609||0.706||22-rdf-syntax-ns#type||2918907||2.554||0.022||instance of||2557094||2.237||0.019 | |||
|- | |||
|Wikinews article||description||14128436||64.72||0.107||22-rdf-syntax-ns#type||1114921||5.107||0.008||instance of||862083||3.949||0.007 | |||
|- | |||
|album||22-rdf-syntax-ns#type||2793067||13.187||0.021||ontology#rank||2250457||10.625||0.017||rdf-schema#label||1871641||8.836||0.014 | |||
|- | |||
|architectural structure||ontology#rank||1290969||13.354||0.01||22-rdf-syntax-ns#type||1289149||13.336||0.01||prov#wasDerivedFrom||725555||7.506||0.005 | |||
|- | |||
|astronomical object||ontology#rank||144578828||12.719||1.095||prov#wasDerivedFrom||128331955||11.29||0.972||22-rdf-syntax-ns#type||117137727||10.305||0.887 | |||
|- | |||
|biographical article||22-rdf-syntax-ns#type||1212529||14.718||0.009||ontology#rank||1061395||12.884||0.008||description||532327||6.462||0.004 | |||
|- | |||
|branch post office||22-rdf-syntax-ns#type||775436||14.595||0.006||ontology#rank||775436||14.595||0.006||prov#wasDerivedFrom||645628||12.152||0.005 | |||
|- | |||
|building||22-rdf-syntax-ns#type||2372760||14.563||0.018||ontology#rank||2262925||13.889||0.017||prov#wasDerivedFrom||1130158||6.936||0.009 | |||
|- | |||
|business||22-rdf-syntax-ns#type||1948087||13.815||0.015||ontology#rank||1669295||11.838||0.013||prov#wasDerivedFrom||919184||6.518||0.007 | |||
|- | |||
|calendar day of a given year||rdf-schema#label||6152052||43.698||0.047||instance of||1146460||8.143||0.009||22-rdf-syntax-ns#type||1040850||7.393||0.008 | |||
|- | |||
|cell line||rdf-schema#label||1060304||13.327||0.008||description||1042089||13.098||0.008||22-rdf-syntax-ns#type||833878||10.481||0.006 | |||
|- | |||
|chemical compound||description||24753956||27.134||0.187||22-rdf-syntax-ns#type||10660602||11.686||0.081||ontology#rank||10484519||11.493||0.079 | |||
|- | |||
|church building||22-rdf-syntax-ns#type||2541716||15.11||0.019||ontology#rank||2256786||13.416||0.017||prov#wasDerivedFrom||972599||5.782||0.007 | |||
|- | |||
|clinical trial||22-rdf-syntax-ns#type||4453735||16.06||0.034||ontology#rank||4453642||16.06||0.034||minimum age||1595252||5.752||0.012 | |||
|- | |||
|collection||22-rdf-syntax-ns#type||5206241||15.936||0.039||ontology#rank||5204557||15.93||0.039||prov#wasDerivedFrom||2150646||6.583||0.016 | |||
|- | |||
|encyclopedia article||description||11968615||35.863||0.091||22-rdf-syntax-ns#type||3302636||9.896||0.025||ontology#rank||2913370||8.73||0.022 | |||
|- | |||
|family name||description||59885567||31.973||0.453||rdf-schema#label||57692163||30.802||0.437||core#altLabel||51377678||27.431||0.389 | |||
|- | |||
|film||22-rdf-syntax-ns#type||7284888||15.442||0.055||ontology#rank||6480184||13.736||0.049||prov#wasDerivedFrom||3597645||7.626||0.027 | |||
|- | |||
|gene||prov#wasDerivedFrom||16328401||13.338||0.124||22-rdf-syntax-ns#type||16046507||13.108||0.121||ontology#rank||15992810||13.064||0.121 | |||
|- | |||
|group of stereoisomers||22-rdf-syntax-ns#type||874827||14.972||0.007||ontology#rank||871313||14.911||0.007||found in taxon||612204||10.477||0.005 | |||
|- | |||
|hamlet||22-rdf-syntax-ns#type||1222418||13.562||0.009||ontology#rank||924362||10.255||0.007||prov#wasDerivedFrom||813808||9.029||0.006 | |||
|- | |||
|hill||22-rdf-syntax-ns#type||2215050||12.857||0.017||ontology#rank||1857312||10.781||0.014||GeoNames ID||1568224||9.103||0.012 | |||
|- | |||
|house||22-rdf-syntax-ns#type||1893455||15.186||0.014||ontology#rank||1825282||14.639||0.014||coordinate location||746416||5.986||0.006 | |||
|- | |||
|human||22-rdf-syntax-ns#type||127210431||13.327||0.963||ontology#rank||115226473||12.071||0.872||rdf-schema#label||82571873||8.65||0.625 | |||
|- | |||
|human settlement||22-rdf-syntax-ns#type||5214907||13.172||0.039||ontology#rank||3840166||9.7||0.029||description||3222856||8.14||0.024 | |||
|- | |||
|island||22-rdf-syntax-ns#type||1282293||12.972||0.01||ontology#rank||958245||9.694||0.007||description||897631||9.08||0.007 | |||
|- | |||
|item of collection or exhibition||22-rdf-syntax-ns#type||1821311||16.969||0.014||ontology#rank||1820436||16.961||0.014||part of||712209||6.636||0.005 | |||
|- | |||
|lake||description||2406074||13.328||0.018||22-rdf-syntax-ns#type||2199894||12.186||0.017||ontology#rank||1813415||10.045||0.014 | |||
|- | |||
|literary work||22-rdf-syntax-ns#type||1478965||14.717||0.011||ontology#rank||1252618||12.464||0.009||instance of||617093||6.141||0.005 | |||
|- | |||
|mountain||description||4367836||13.101||0.033||22-rdf-syntax-ns#type||4046473||12.137||0.031||ontology#rank||3238261||9.713||0.025 | |||
|- | |||
|painting||description||10311414||18.164||0.078||22-rdf-syntax-ns#type||6848409||12.064||0.052||ontology#rank||6791199||11.963||0.051 | |||
|- | |||
|position||22-rdf-syntax-ns#type||616685||13.064||0.005||ontology#rank||580022||12.287||0.004||rdf-schema#label||544323||11.531||0.004 | |||
|- | |||
|primary school||22-rdf-syntax-ns#type||1252277||14.045||0.009||ontology#rank||1236386||13.866||0.009||prov#wasDerivedFrom||1008762||11.314||0.008 | |||
|- | |||
|prime number||description||795712||15.335||0.006||ontology#rank||644583||12.423||0.005||22-rdf-syntax-ns#type||526967||10.156||0.004 | |||
|- | |||
|print||22-rdf-syntax-ns#type||1425563||14.696||0.011||ontology#rank||1425199||14.693||0.011||prov#wasDerivedFrom||1078309||11.117||0.008 | |||
|- | |||
|protein||prov#wasDerivedFrom||12427165||14.045||0.094||22-rdf-syntax-ns#type||11368057||12.848||0.086||ontology#rank||11357567||12.836||0.086 | |||
|- | |||
|river||description||3651037||12.662||0.028||22-rdf-syntax-ns#type||3567310||12.372||0.027||ontology#rank||2829316||9.813||0.021 | |||
|- | |||
|scholarly article||description||1324177494||20.25||10.025||cites work||853611996||13.054||6.462||ontology#rank||796548851||12.181||6.03 | |||
|- | |||
|sports season||22-rdf-syntax-ns#type||1572731||14.707||0.012||ontology#rank||1156720||10.817||0.009||instance of||496113||4.639||0.004 | |||
|- | |||
|stream||22-rdf-syntax-ns#type||873978||13.134||0.007||ontology#rank||734964||11.045||0.006||GeoNames ID||587699||8.832||0.004 | |||
|- | |||
|street||22-rdf-syntax-ns#type||4236711||14.005||0.032||ontology#rank||4096946||13.543||0.031||description||2256816||7.46||0.017 | |||
|- | |||
|taxon||rdf-schema#label||69840848||18.982||0.529||description||45308808||12.315||0.343||22-rdf-syntax-ns#type||40988244||11.14||0.31 | |||
|- | |||
|version, edition, or translation||22-rdf-syntax-ns#type||1591714||14.473||0.012||ontology#rank||1538937||13.993||0.012||prov#wasDerivedFrom||712852||6.482||0.005 | |||
|- | |||
|village||22-rdf-syntax-ns#type||3307961||12.491||0.025||description||3212844||12.132||0.024||ontology#rank||2304145||8.7||0.017 | |||
|- | |||
|village in India||description||2336722||15.19||0.018||ontology#rank||1467450||9.539||0.011||22-rdf-syntax-ns#type||1392224||9.05||0.011 | |||
|- | |||
|village-level division in China||description||24717636||47.888||0.187||rdf-schema#label||4720063||9.145||0.036||22-rdf-syntax-ns#type||3533117||6.845||0.027 | |||
|- | |||
|watercourse||22-rdf-syntax-ns#type||1256378||12.464||0.01||ontology#rank||1039967||10.317||0.008||description||1009377||10.013||0.008 | |||
|} | |||
== Predicates across subgraphs == | |||
From the predicates' point of view: how widely are they used? We already know that some predicates are used in only 1 subgraph. What about the others? The following diagram shows the usage of the 60 most used predicates in Wikidata across the various subgraphs. Note that usage was calculated only for the top 50 subgraphs, which account for ~85% of Wikidata. This should give us an idea of where each of these predicates is used most. The remaining subgraphs can be considered a long tail to each of these plots. | |||
The x-axis shows the rank of the subgraph instead of the name to save space. The rank-name mapping can be found in [[#Table of top 50 subgraph information]]. The figures are large but can be zoomed in without loss of resolution for better viewing. | |||
[[File:60_preds_powerlaw.png]] | |||
[[File:60_preds_powerlaw_log.png]] | |||
=== Top usage === | |||
As mentioned above, we can isolate some predicates that have >=99% of their usage in one particular subgraph. Some are even used 100% of the time in that subgraph. The following graph shows the distribution of the highest percentage of usage that a predicate has in any single subgraph. | |||
[[File:perc_pred_usage.png]] | |||
A predicate that is used a lot in one particular subgraph may be used only a few times in other subgraphs (its second max usage count is low), or it may also be used a lot in other subgraphs (its second max usage count is quite high). In short: we also want to look at second max percentages. | |||
Here is an interactive plot showing the max percent, second max percent, color coded with the number of subgraphs the predicate is used in: [https://tanny411.github.io/Wikidata-WDQS-Analysis/predicate_usage.html predicate_usage] | |||
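The max and second max percentages can be sketched as follows. This is a minimal illustration under stated assumptions, not the code behind the plot: the counts are toy data, the helper name <code>top_two_shares</code> is hypothetical, and percentages are taken relative to the predicate's total usage across the listed subgraphs only.

```python
def top_two_shares(counts_by_subgraph):
    """Return (max %, second max %) of a predicate's triples across subgraphs.

    counts_by_subgraph: dict mapping subgraph name to the number of triples
    in which the predicate is used in that subgraph.
    """
    total = sum(counts_by_subgraph.values())
    shares = sorted(
        (100 * count / total for count in counts_by_subgraph.values()),
        reverse=True,
    )
    # A predicate used in a single subgraph has no second max.
    second = shares[1] if len(shares) > 1 else 0.0
    return shares[0], second

# Toy counts (assumption) for one predicate concentrated in one subgraph.
counts = {"scholarly article": 990, "human": 8, "film": 2}
max_pct, second_pct = top_two_shares(counts)
print(max_pct, second_pct)
```

A high max percentage combined with a low second max percentage indicates a predicate that is close to exclusive to one subgraph.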
= Rate of growth of subgraphs = | = Rate of growth of subgraphs = | ||
Here is an interactive chart showing the growth of the top 50 subgraphs over a period of one month: [https://tanny411.github.io/Wikidata-WDQS-Analysis/subgraph_growth_rate.html subgraph_growth_rate] | Here is an interactive chart showing the growth of the top 50 subgraphs over a period of one month: [https://tanny411.github.io/Wikidata-WDQS-Analysis/subgraph_growth_rate.html subgraph_growth_rate] |
Revision as of 10:12, 3 November 2021
TL;DR
What are subgraphs?
Wikidata contains all kinds of data from various aspects of knowledge. All of these data are highly inter-connected, but we can find some patterns. We find subgraphs within Wikidata and find out how large these subgraphs are, how connected they are, and finally how much these subgraphs are used (queried).
In order to find subgraphs, the following steps were taken:
- Consider all items that are instance of (P31) the same item to be part of one subgraph. For example, all items that are instance of Q13442814 form one subgraph.
- Some subgraphs were merged where it was obvious. For example, all subclasses of astronomical object were considered part of astronomical object, as they were all indeed some sort of astronomical object. This method of subclass merging is not applicable everywhere without manual inspection.
- Some large subgraphs were almost completely part of another subgraph. For example, all items under review article are also instances of scholarly article. In such cases, review article was not considered a separate subgraph.
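The first step above can be sketched in Python. This is a minimal illustration, not the analysis code used for this page: the (item, class) pairs are made-up toy data with placeholder item IDs, and the helper name subgraph_item_counts is hypothetical. Merging and de-duplicating subgraphs (the second and third steps) required manual inspection and is not shown.

```python
from collections import defaultdict

def subgraph_item_counts(instance_of_pairs):
    """Group items by the class they are an instance of (P31).

    instance_of_pairs: iterable of (item, class) pairs.
    Returns a dict mapping each class to its number of distinct items.
    """
    members = defaultdict(set)
    for item, cls in instance_of_pairs:
        members[cls].add(item)
    return {cls: len(items) for cls, items in members.items()}

# Toy (item, class) pairs; the item IDs are placeholders, not real items.
pairs = [
    ("item_a", "Q13442814"),
    ("item_b", "Q13442814"),
    ("item_c", "Q5"),
    ("item_a", "Q13442814"),  # duplicate statement, counted once
]
counts = subgraph_item_counts(pairs)
print(counts)
```

Each key of the result corresponds to one candidate subgraph before any merging.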
Subgraph sizes
Using only instance of, Wikidata has 82,919 subgraphs. The distribution of the sizes of these subgraphs has a clear long tail, with very few subgraphs incorporating most items in Wikidata. Subgraph size can be calculated in two ways:
- The number of items it contains
- The number of triples related to the items in a subgraph. This is what we refer to as subgraph size from here on.
Takeaways:
- Most calculations from here on will take the top 50 subgraphs, which form 85% of Wikidata
- 340 top subgraphs (0.5% of all subgraphs, after merging some) form 90% of Wikidata (91% of all items and 90% of all triples). These subgraphs have >=10,000 items each.
- The remaining 99.5% of the subgraphs have <10,000 items each, and together form 10% of Wikidata.
Below is the distribution of the number of items in a subgraph.
File:Number of groups vs number of items.png | File:Number of groups vs number of items log.png |
To be more specific, the table below lists how many subgraphs have more than a given number of items.
Number of subgraphs | With more than this many items |
---|---|
54,602 | 1 |
23,724 | 10 |
6,625 | 100 |
1,712 | 1,000 |
392 | 10,000 |
63 | 100,000 |
10 | 1,000,000 |
1 | 10,000,000 |
Below is the subgraph size comparison of the top 340 subgraphs in Wikidata (90%).
File:Subgraph distribution triples.png File:Subgraph distribution percents.png
Below is the subgraph size comparison of the top 50 subgraphs in Wikidata (85%).
File:Top 50 subgraph distribution triples.png File:Top 50 subgraph distribution percents.png
Here is an interactive graph showing the comparison of subgraph sizes in terms of item count and triple count: subgraph stats.
Here are some subgraph size visualizations in WDQS:
- Size as percentage of Wikidata each subgraph occupies: query link
- Size as percentage of Wikidata items each subgraph contains: query link
Number of days to recovery
Given the current rate of growth, how long would it take Wikidata to get back to its original size if some amount of triples were removed from it? This helps us estimate what to temporarily remove from Wikidata if the Wikidata backend maxes out. The growth rate of triples is not constant, but treating the growth as approximately linear in the grafana dashboard, Wikidata grows at a rate of 4.77M triples per day. This rate was calculated from the number of triples at the start and end of a 90-day interval (11/3/21 to 6/6/21). The actual rate could be somewhat faster or slower than this, so it gives only a rough approximation of the number of days we can gain by removing some parts of Wikidata.
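As a worked example, the estimate amounts to dividing the number of removed triples by the daily growth rate. The sketch below assumes exactly that linear model; the function name days_to_recover is hypothetical. The values in the "Number of days to recover" column of the top 50 table are consistent with this calculation.

```python
GROWTH_PER_DAY = 4.77e6  # triples per day, the rate quoted in the text

def days_to_recover(num_triples, growth_per_day=GROWTH_PER_DAY):
    """Days for Wikidata to regrow by num_triples, assuming linear growth."""
    return num_triples / growth_per_day

# Triple counts taken from the table of top 50 subgraphs.
scholarly_article = days_to_recover(6_539_020_889)
astronomical_object = days_to_recover(1_136_682_291)

print(round(scholarly_article, 2))    # 1370.86
print(round(astronomical_object, 2))  # 238.3
```

Because the growth rate is itself an approximation, these figures are best read as orders of magnitude rather than exact day counts.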
Triples
The triples within a subgraph can be of various types. They can be:
- truthy triples like wdt
- non-wikidata direct triples like rdfs:label, schema:name, etc.
- full statements that hold other information
See more about these statement types here: RDF_Dump_Format#Statement_types.
Table of top 50 subgraph information
Rank | Subgraph | Subgraph Name | Number of items | % of WD items | Number of triples | % of WD triples | Number of days to recover | % of truthy statements | % of non-wikidata direct statements | % of full statements | Number of unique properties |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Q13442814 | scholarly article | 37,362,641 | 39.75 | 6,539,020,889 | 49.73 | 1370.86 | 12.62 | 24.28 | 63.25 | 722 |
2 | Q6999 | astronomical object | 8,412,914 | 8.95 | 1,136,682,291 | 8.64 | 238.3 | 10.20 | 14.24 | 76.07 | 578 |
3 | Q5 | human | 9,315,444 | 9.91 | 954,536,943 | 7.26 | 200.11 | 13.31 | 20.06 | 60.94 | 4482 |
4 | Q4167836 | Wikimedia category | 4,840,195 | 5.15 | 753,127,982 | 5.73 | 157.89 | 1.19 | 86.06 | 5.19 | 610 |
5 | Q16521 | taxon | 3,180,248 | 3.38 | 367,926,462 | 2.8 | 77.13 | 10.11 | 37.22 | 42.97 | 963 |
6 | Q101352 | family name | 481,445 | 0.51 | 187,299,892 | 1.42 | 39.27 | 1.59 | 93.49 | 6.62 | 375 |
7 | Q4167410 | Wikimedia disambiguation page | 1,359,804 | 1.45 | 180,124,174 | 1.37 | 37.76 | 0.89 | 88.32 | 3.83 | 796 |
8 | Q7187 | gene | 1,196,361 | 1.27 | 122,421,508 | 0.93 | 25.66 | 14.13 | 12.51 | 73.11 | 218 |
9 | Q11266439 | Wikimedia template | 845,852 | 0.9 | 114,308,711 | 0.87 | 23.96 | 0.78 | 87.06 | 3.20 | 222 |
10 | Q11173 | chemical compound | 1,223,387 | 1.3 | 91,228,463 | 0.69 | 19.13 | 12.69 | 35.72 | 50.86 | 591 |
11 | Q8054 | protein | 986,599 | 1.05 | 88,483,828 | 0.67 | 18.55 | 14.26 | 13.51 | 72.21 | 267 |
12 | Q3305213 | painting | 539,468 | 0.57 | 56,769,083 | 0.43 | 11.9 | 12.10 | 24.46 | 63.35 | 785 |
13 | Q13100073 | village-level division in China | 588,477 | 0.63 | 51,615,572 | 0.39 | 10.82 | 6.84 | 62.73 | 30.41 | 77 |
14 | Q11424 | film | 263,070 | 0.28 | 47,176,067 | 0.36 | 9.89 | 14.37 | 12.74 | 66.01 | 1029 |
15 | Q486972 | human settlement | 563,958 | 0.6 | 39,590,792 | 0.3 | 8.3 | 10.84 | 22.94 | 49.32 | 1120 |
16 | Q13406463 | Wikimedia list article | 334,939 | 0.36 | 33,742,245 | 0.26 | 7.07 | 2.45 | 78.06 | 10.48 | 880 |
17 | Q13433827 | encyclopedia article | 512,141 | 0.55 | 33,373,227 | 0.25 | 7.0 | 9.03 | 45.52 | 39.76 | 164 |
18 | Q8502 | mountain | 525,553 | 0.56 | 33,340,188 | 0.25 | 6.99 | 11.37 | 27.79 | 50.39 | 709 |
19 | Q2668072 | collection | 500,968 | 0.53 | 32,670,637 | 0.25 | 6.85 | 15.24 | 12.30 | 72.71 | 665 |
20 | Q79007 | street | 578,926 | 0.62 | 30,252,119 | 0.23 | 6.34 | 13.86 | 24.24 | 59.88 | 572 |
21 | Q4022 | river | 399,552 | 0.42 | 28,833,476 | 0.22 | 6.04 | 11.34 | 24.99 | 52.15 | 580 |
22 | Q30612 | clinical trial | 356,838 | 0.38 | 27,731,502 | 0.21 | 5.81 | 16.05 | 12.33 | 71.78 | 124 |
23 | Q532 | village | 274,840 | 0.29 | 26,483,275 | 0.2 | 5.55 | 9.69 | 26.25 | 45.22 | 818 |
24 | Q17633526 | Wikinews article | 286,950 | 0.3 | 21,830,150 | 0.17 | 4.58 | 3.68 | 72.76 | 16.46 | 256 |
25 | Q482994 | album | 269,095 | 0.29 | 21,181,015 | 0.16 | 4.44 | 12.20 | 22.64 | 53.05 | 638 |
26 | Q23397 | lake | 260,135 | 0.28 | 18,053,096 | 0.14 | 3.78 | 11.32 | 25.43 | 53.19 | 602 |
27 | Q54050 | hill | 327,277 | 0.35 | 17,228,390 | 0.13 | 3.61 | 12.67 | 23.07 | 56.77 | 470 |
28 | Q16970 | church building | 211,291 | 0.22 | 16,821,530 | 0.13 | 3.53 | 14.03 | 15.06 | 63.56 | 1036 |
29 | Q41176 | building | 265,925 | 0.28 | 16,293,008 | 0.12 | 3.42 | 14.21 | 14.99 | 68.06 | 1243 |
30 | Q56436498 | village in India | 145,824 | 0.16 | 15,383,416 | 0.12 | 3.23 | 8.34 | 22.23 | 64.50 | 210 |
31 | Q4830453 | business | 193,858 | 0.21 | 14,101,220 | 0.11 | 2.96 | 13.06 | 16.23 | 60.48 | 1790 |
32 | Q47150325 | calendar day of a given year | 189,366 | 0.2 | 14,078,486 | 0.11 | 2.95 | 6.85 | 56.25 | 34.18 | 56 |
33 | Q3947 | house | 197,736 | 0.21 | 12,468,434 | 0.1 | 2.61 | 15.26 | 15.18 | 70.67 | 760 |
34 | Q3331189 | version, edition, or translation | 157,486 | 0.17 | 10,997,589 | 0.08 | 2.31 | 15.41 | 14.70 | 69.39 | 1006 |
35 | Q18593264 | item of collection or exhibition | 147,402 | 0.16 | 10,732,969 | 0.08 | 2.25 | 16.96 | 9.69 | 73.38 | 306 |
36 | Q27020041 | sports season | 158,877 | 0.17 | 10,693,504 | 0.08 | 2.24 | 12.11 | 15.12 | 53.43 | 353 |
37 | Q355304 | watercourse | 174,620 | 0.19 | 10,080,421 | 0.08 | 2.11 | 12.17 | 25.07 | 54.79 | 302 |
38 | Q7725634 | literary work | 164,860 | 0.18 | 10,049,521 | 0.08 | 2.11 | 13.85 | 16.69 | 58.88 | 1093 |
39 | Q23442 | island | 148,587 | 0.16 | 9,885,277 | 0.08 | 2.07 | 11.40 | 22.52 | 50.06 | 829 |
40 | Q11060274 | | 119,806 | 0.13 | 9,700,063 | 0.07 | 2.03 | 14.85 | 9.74 | 76.96 | 269 |
41 | Q811979 | architectural structure | 145,957 | 0.16 | 9,666,936 | 0.07 | 2.03 | 10.06 | 9.91 | 52.63 | 994 |
42 | Q5084 | hamlet | 118,188 | 0.13 | 9,013,534 | 0.07 | 1.89 | 10.94 | 18.63 | 55.55 | 423 |
43 | Q9842 | primary school | 157,451 | 0.17 | 8,916,373 | 0.07 | 1.87 | 13.99 | 16.22 | 68.98 | 410 |
44 | Q19389637 | biographical article | 151,026 | 0.16 | 8,238,397 | 0.06 | 1.73 | 12.76 | 21.07 | 57.51 | 131 |
45 | Q21014462 | cell line | 128,805 | 0.14 | 7,955,975 | 0.06 | 1.67 | 10.62 | 36.18 | 54.60 | 60 |
46 | Q47521 | stream | 124,853 | 0.13 | 6,654,366 | 0.05 | 1.4 | 12.95 | 19.67 | 58.11 | 280 |
47 | Q59199015 | group of stereoisomers | 111,599 | 0.12 | 5,843,270 | 0.04 | 1.23 | 15.43 | 18.82 | 67.23 | 216 |
48 | Q61443690 | branch post office | 129,183 | 0.14 | 5,313,033 | 0.04 | 1.11 | 14.59 | 14.86 | 70.54 | 22 |
49 | Q49008 | prime number | 127,545 | 0.14 | 5,188,768 | 0.04 | 1.09 | 10.01 | 36.78 | 52.40 | 101 |
50 | Q4164871 | position | 120,117 | 0.13 | 4,720,668 | 0.04 | 0.99 | 12.72 | 32.76 | 52.28 | 654 |
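The "Number of days to recover" column appears to be the subgraph's triple count divided by the growth rate of 4.77M triples per day quoted earlier on this page. The following is a minimal sketch under that assumption; the function and variable names are illustrative and not taken from the original analysis.

```python
# Growth rate quoted earlier on this page: ~4.77M triples per day.
# Treating growth as linear is the page's own simplification.
TRIPLES_PER_DAY = 4_770_000


def days_to_recover(removed_triples: int, rate: float = TRIPLES_PER_DAY) -> float:
    """Days of growth needed to add back `removed_triples` at a constant rate."""
    return removed_triples / rate


# Triple counts copied from the table above.
subgraph_triples = {
    "scholarly article": 6_539_020_889,
    "astronomical object": 1_136_682_291,
    "human": 954_536_943,
}

for name, triples in subgraph_triples.items():
    print(f"{name}: {days_to_recover(triples):.2f} days")
```

For these three rows the result matches the table (1370.86, 238.3, and 200.11 days), which suggests the column was derived this way.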
Triples per item
While it is interesting to see how big a subgraph is and how many items it has, it is also helpful to know how many triples each item typically has in a given subgraph. A very basic idea can be gained from the density of a subgraph in the subgraph stats, where density = #of triples / #of items. Below is a box plot showing the triples-per-item distribution for the top 50 subgraphs. The box plot omits the min and max values and shows only the mean, median, Q1, and Q3.
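As a rough illustration, density and the four box-plot statistics can be computed as below. This is a sketch, not the code used for the analysis: the per-item counts are hypothetical toy values, and the quartile method (`inclusive`) is an assumption, since the page does not say which convention the plot uses.

```python
import statistics


def density(num_triples: int, num_items: int) -> float:
    """Average number of triples per item in a subgraph."""
    return num_triples / num_items


def box_summary(triples_per_item):
    """Mean, median, Q1, and Q3 of a list of per-item triple counts."""
    q1, median, q3 = statistics.quantiles(triples_per_item, n=4, method="inclusive")
    return {
        "mean": statistics.mean(triples_per_item),
        "median": median,
        "q1": q1,
        "q3": q3,
    }


# Density of the scholarly article subgraph, using the table's totals.
print(density(6_539_020_889, 37_362_641))

# Hypothetical per-item counts: one very large item pulls the mean far
# above the median, which density alone cannot show.
toy_counts = [10, 20, 30, 40, 1000]
print(box_summary(toy_counts))
```

Density equals the mean of the per-item counts, so a skewed subgraph can have a high density even when most of its items are small. That is why the box plot also reports the median and quartiles.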
Predicates
Predicates are used in all subgraphs. Some predicates are used almost exclusively in one particular subgraph, for example 99% of the time. Moreover, the unique predicates used in a subgraph can inform us of the range of diverse statements the subgraph contains. These and some further analyses of predicates are described below.
Number of unique predicates
The number of unique predicates each subgraph uses is listed in the table above (#Table of top 50 subgraph information). Sort by the "Number of unique properties" column to view the most or least diverse subgraphs.
Predicate distribution
There are ~7500 unique predicates across the top 50 subgraphs. Among them, ~3500 (46%) are used in only one subgraph. Below is a figure showing this distribution.
Here is a CSV file with the subgraph-predicate counts: [[1]].
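The distribution above can be sketched as a count of distinct subgraphs per predicate. The input shape, rows of (subgraph, predicate, count), is an assumption about how the subgraph-predicate counts might be laid out, and the sample rows are hypothetical.

```python
from collections import Counter


def subgraphs_per_predicate(usage):
    """Map each predicate to the number of distinct subgraphs that use it.

    `usage` is an iterable of (subgraph, predicate, count) rows.
    """
    pairs = {(subgraph, predicate) for subgraph, predicate, count in usage if count > 0}
    return Counter(predicate for _, predicate in pairs)


def share_used_in_one_subgraph(usage) -> float:
    """Fraction of unique predicates that appear in exactly one subgraph."""
    spread = subgraphs_per_predicate(usage)
    single = sum(1 for n in spread.values() if n == 1)
    return single / len(spread)


# Hypothetical rows, for illustration only.
rows = [
    ("scholarly article", "cites work", 900),
    ("scholarly article", "description", 500),
    ("human", "description", 300),
    ("clinical trial", "minimum age", 40),
]
print(subgraphs_per_predicate(rows))
print(share_used_in_one_subgraph(rows))
```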
Top predicates
While it is interesting to see the top predicates for each subgraph, that is too much to show on this page. Below is a table of only the top 3 predicates per subgraph. Here is a CSV file with the top 5 predicates per subgraph: [[2]]. You can view more from [[3]] with some filtering and grouping.
Note that:
- The most common top predicates are description, rdf:type, wikibase:rank, and references.
- Only scholarly articles' descriptions are 10% of Wikidata, with cites work and wikibase:rank being ~6%.
- Wikimedia category descriptions are 4.5%, and the rest is ~1% or less of Wikidata triples.
Subgraph | 1st top predicate | 1st: #of triples | 1st: %triples in subgraph | 1st: %triples in Wikidata | 2nd top predicate | 2nd: #of triples | 2nd: %triples in subgraph | 2nd: %triples in Wikidata | 3rd top predicate | 3rd: #of triples | 3rd: %triples in subgraph | 3rd: %triples in Wikidata |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Wikimedia category | description | 596672076 | 79.226 | 4.517 | 22-rdf-syntax-ns#type | 22761531 | 3.022 | 0.172 | rdf-schema#label | 15094771 | 2.004 | 0.114 |
Wikimedia disambiguation page | description | 112264922 | 62.326 | 0.85 | rdf-schema#label | 39587180 | 21.978 | 0.3 | 22-rdf-syntax-ns#type | 4146198 | 2.302 | 0.031 |
Wikimedia list article | description | 23714609 | 70.282 | 0.18 | 22-rdf-syntax-ns#type | 1449015 | 4.294 | 0.011 | instance of | 1022424 | 3.03 | 0.008 |
Wikimedia template | description | 93286385 | 81.609 | 0.706 | 22-rdf-syntax-ns#type | 2918907 | 2.554 | 0.022 | instance of | 2557094 | 2.237 | 0.019 |
Wikinews article | description | 14128436 | 64.72 | 0.107 | 22-rdf-syntax-ns#type | 1114921 | 5.107 | 0.008 | instance of | 862083 | 3.949 | 0.007 |
album | 22-rdf-syntax-ns#type | 2793067 | 13.187 | 0.021 | ontology#rank | 2250457 | 10.625 | 0.017 | rdf-schema#label | 1871641 | 8.836 | 0.014 |
architectural structure | ontology#rank | 1290969 | 13.354 | 0.01 | 22-rdf-syntax-ns#type | 1289149 | 13.336 | 0.01 | prov#wasDerivedFrom | 725555 | 7.506 | 0.005 |
astronomical object | ontology#rank | 144578828 | 12.719 | 1.095 | prov#wasDerivedFrom | 128331955 | 11.29 | 0.972 | 22-rdf-syntax-ns#type | 117137727 | 10.305 | 0.887 |
biographical article | 22-rdf-syntax-ns#type | 1212529 | 14.718 | 0.009 | ontology#rank | 1061395 | 12.884 | 0.008 | description | 532327 | 6.462 | 0.004 |
branch post office | 22-rdf-syntax-ns#type | 775436 | 14.595 | 0.006 | ontology#rank | 775436 | 14.595 | 0.006 | prov#wasDerivedFrom | 645628 | 12.152 | 0.005 |
building | 22-rdf-syntax-ns#type | 2372760 | 14.563 | 0.018 | ontology#rank | 2262925 | 13.889 | 0.017 | prov#wasDerivedFrom | 1130158 | 6.936 | 0.009 |
business | 22-rdf-syntax-ns#type | 1948087 | 13.815 | 0.015 | ontology#rank | 1669295 | 11.838 | 0.013 | prov#wasDerivedFrom | 919184 | 6.518 | 0.007 |
calendar day of a given year | rdf-schema#label | 6152052 | 43.698 | 0.047 | instance of | 1146460 | 8.143 | 0.009 | 22-rdf-syntax-ns#type | 1040850 | 7.393 | 0.008 |
cell line | rdf-schema#label | 1060304 | 13.327 | 0.008 | description | 1042089 | 13.098 | 0.008 | 22-rdf-syntax-ns#type | 833878 | 10.481 | 0.006 |
chemical compound | description | 24753956 | 27.134 | 0.187 | 22-rdf-syntax-ns#type | 10660602 | 11.686 | 0.081 | ontology#rank | 10484519 | 11.493 | 0.079 |
church building | 22-rdf-syntax-ns#type | 2541716 | 15.11 | 0.019 | ontology#rank | 2256786 | 13.416 | 0.017 | prov#wasDerivedFrom | 972599 | 5.782 | 0.007 |
clinical trial | 22-rdf-syntax-ns#type | 4453735 | 16.06 | 0.034 | ontology#rank | 4453642 | 16.06 | 0.034 | minimum age | 1595252 | 5.752 | 0.012 |
collection | 22-rdf-syntax-ns#type | 5206241 | 15.936 | 0.039 | ontology#rank | 5204557 | 15.93 | 0.039 | prov#wasDerivedFrom | 2150646 | 6.583 | 0.016 |
encyclopedia article | description | 11968615 | 35.863 | 0.091 | 22-rdf-syntax-ns#type | 3302636 | 9.896 | 0.025 | ontology#rank | 2913370 | 8.73 | 0.022 |
family name | description | 59885567 | 31.973 | 0.453 | rdf-schema#label | 57692163 | 30.802 | 0.437 | core#altLabel | 51377678 | 27.431 | 0.389 |
film | 22-rdf-syntax-ns#type | 7284888 | 15.442 | 0.055 | ontology#rank | 6480184 | 13.736 | 0.049 | prov#wasDerivedFrom | 3597645 | 7.626 | 0.027 |
gene | prov#wasDerivedFrom | 16328401 | 13.338 | 0.124 | 22-rdf-syntax-ns#type | 16046507 | 13.108 | 0.121 | ontology#rank | 15992810 | 13.064 | 0.121 |
group of stereoisomers | 22-rdf-syntax-ns#type | 874827 | 14.972 | 0.007 | ontology#rank | 871313 | 14.911 | 0.007 | found in taxon | 612204 | 10.477 | 0.005 |
hamlet | 22-rdf-syntax-ns#type | 1222418 | 13.562 | 0.009 | ontology#rank | 924362 | 10.255 | 0.007 | prov#wasDerivedFrom | 813808 | 9.029 | 0.006 |
hill | 22-rdf-syntax-ns#type | 2215050 | 12.857 | 0.017 | ontology#rank | 1857312 | 10.781 | 0.014 | GeoNames ID | 1568224 | 9.103 | 0.012 |
house | 22-rdf-syntax-ns#type | 1893455 | 15.186 | 0.014 | ontology#rank | 1825282 | 14.639 | 0.014 | coordinate location | 746416 | 5.986 | 0.006 |
human | 22-rdf-syntax-ns#type | 127210431 | 13.327 | 0.963 | ontology#rank | 115226473 | 12.071 | 0.872 | rdf-schema#label | 82571873 | 8.65 | 0.625 |
human settlement | 22-rdf-syntax-ns#type | 5214907 | 13.172 | 0.039 | ontology#rank | 3840166 | 9.7 | 0.029 | description | 3222856 | 8.14 | 0.024 |
island | 22-rdf-syntax-ns#type | 1282293 | 12.972 | 0.01 | ontology#rank | 958245 | 9.694 | 0.007 | description | 897631 | 9.08 | 0.007 |
item of collection or exhibition | 22-rdf-syntax-ns#type | 1821311 | 16.969 | 0.014 | ontology#rank | 1820436 | 16.961 | 0.014 | part of | 712209 | 6.636 | 0.005 |
lake | description | 2406074 | 13.328 | 0.018 | 22-rdf-syntax-ns#type | 2199894 | 12.186 | 0.017 | ontology#rank | 1813415 | 10.045 | 0.014 |
literary work | 22-rdf-syntax-ns#type | 1478965 | 14.717 | 0.011 | ontology#rank | 1252618 | 12.464 | 0.009 | instance of | 617093 | 6.141 | 0.005 |
mountain | description | 4367836 | 13.101 | 0.033 | 22-rdf-syntax-ns#type | 4046473 | 12.137 | 0.031 | ontology#rank | 3238261 | 9.713 | 0.025 |
painting | description | 10311414 | 18.164 | 0.078 | 22-rdf-syntax-ns#type | 6848409 | 12.064 | 0.052 | ontology#rank | 6791199 | 11.963 | 0.051 |
position | 22-rdf-syntax-ns#type | 616685 | 13.064 | 0.005 | ontology#rank | 580022 | 12.287 | 0.004 | rdf-schema#label | 544323 | 11.531 | 0.004 |
primary school | 22-rdf-syntax-ns#type | 1252277 | 14.045 | 0.009 | ontology#rank | 1236386 | 13.866 | 0.009 | prov#wasDerivedFrom | 1008762 | 11.314 | 0.008 |
prime number | description | 795712 | 15.335 | 0.006 | ontology#rank | 644583 | 12.423 | 0.005 | 22-rdf-syntax-ns#type | 526967 | 10.156 | 0.004 |
 | 22-rdf-syntax-ns#type | 1425563 | 14.696 | 0.011 | ontology#rank | 1425199 | 14.693 | 0.011 | prov#wasDerivedFrom | 1078309 | 11.117 | 0.008 |
protein | prov#wasDerivedFrom | 12427165 | 14.045 | 0.094 | 22-rdf-syntax-ns#type | 11368057 | 12.848 | 0.086 | ontology#rank | 11357567 | 12.836 | 0.086 |
river | description | 3651037 | 12.662 | 0.028 | 22-rdf-syntax-ns#type | 3567310 | 12.372 | 0.027 | ontology#rank | 2829316 | 9.813 | 0.021 |
scholarly article | description | 1324177494 | 20.25 | 10.025 | cites work | 853611996 | 13.054 | 6.462 | ontology#rank | 796548851 | 12.181 | 6.03 |
sports season | 22-rdf-syntax-ns#type | 1572731 | 14.707 | 0.012 | ontology#rank | 1156720 | 10.817 | 0.009 | instance of | 496113 | 4.639 | 0.004 |
stream | 22-rdf-syntax-ns#type | 873978 | 13.134 | 0.007 | ontology#rank | 734964 | 11.045 | 0.006 | GeoNames ID | 587699 | 8.832 | 0.004 |
street | 22-rdf-syntax-ns#type | 4236711 | 14.005 | 0.032 | ontology#rank | 4096946 | 13.543 | 0.031 | description | 2256816 | 7.46 | 0.017 |
taxon | rdf-schema#label | 69840848 | 18.982 | 0.529 | description | 45308808 | 12.315 | 0.343 | 22-rdf-syntax-ns#type | 40988244 | 11.14 | 0.31 |
version, edition, or translation | 22-rdf-syntax-ns#type | 1591714 | 14.473 | 0.012 | ontology#rank | 1538937 | 13.993 | 0.012 | prov#wasDerivedFrom | 712852 | 6.482 | 0.005 |
village | 22-rdf-syntax-ns#type | 3307961 | 12.491 | 0.025 | description | 3212844 | 12.132 | 0.024 | ontology#rank | 2304145 | 8.7 | 0.017 |
village in India | description | 2336722 | 15.19 | 0.018 | ontology#rank | 1467450 | 9.539 | 0.011 | 22-rdf-syntax-ns#type | 1392224 | 9.05 | 0.011 |
village-level division in China | description | 24717636 | 47.888 | 0.187 | rdf-schema#label | 4720063 | 9.145 | 0.036 | 22-rdf-syntax-ns#type | 3533117 | 6.845 | 0.027 |
watercourse | 22-rdf-syntax-ns#type | 1256378 | 12.464 | 0.01 | ontology#rank | 1039967 | 10.317 | 0.008 | description | 1009377 | 10.013 | 0.008 |
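A row of the table above can be reproduced roughly as follows. This is a sketch with illustrative names. The predicate counts and subgraph size come from the Wikimedia category rows on this page, but the Wikidata-wide total is an assumed round figure inferred from the table's percentages, so the last value in each tuple is only approximate.

```python
def top_predicates(predicate_counts, subgraph_triples, wikidata_triples, n=3):
    """Return the n most used predicates as
    (predicate, count, % of subgraph triples, % of Wikidata triples)."""
    ranked = sorted(predicate_counts.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return [
        (
            predicate,
            count,
            round(100 * count / subgraph_triples, 3),
            round(100 * count / wikidata_triples, 3),
        )
        for predicate, count in ranked
    ]


# Counts copied from the "Wikimedia category" row above.
wikimedia_category = {
    "rdf-schema#label": 15_094_771,
    "description": 596_672_076,
    "22-rdf-syntax-ns#type": 22_761_531,
}
SUBGRAPH_TRIPLES = 753_127_982  # from the table of top 50 subgraphs

# ASSUMPTION: rough total number of triples in Wikidata, not stated on this page.
WIKIDATA_TRIPLES = 13_200_000_000

for row in top_predicates(wikimedia_category, SUBGRAPH_TRIPLES, WIKIDATA_TRIPLES):
    print(row)
```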
Predicates across subgraphs
From the predicates' point of view: how widely are they used? We already know some predicates are used in only 1 subgraph. What about the others? The following diagram shows the usage of the 60 most used predicates in Wikidata across various subgraphs. Note that usage was calculated only for the top 50 subgraphs, which account for ~85% of Wikidata. This should give us an idea of the high-usage cases of each of these predicates. The rest can be considered a long tail to each of these plots.
The x-axis shows the rank of the subgraph instead of the name to save space. The rank-name mapping can be found in #Table of top 50 subgraph information. The figures are large but can be zoomed in without loss of resolution for better viewing.
File:60 preds powerlaw.png File:60 preds powerlaw log.png
Top usage
As mentioned above, we can isolate some predicates for which >=99% of usage is in one particular subgraph. Some are even used 100% of the time in that subgraph. The following graph shows the distribution of each predicate's highest percentage usage in a single subgraph.
A predicate that is used a lot in one particular subgraph may be used very few times in other subgraphs (the second max usage count is low), or it may also be used a lot in other subgraphs (the second max usage count is quite high). In short: we also want to look at the second max percentages.
Here is an interactive plot showing the max percent, second max percent, color coded with the number of subgraphs the predicate is used in: predicate_usage
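The three quantities in the plot can be sketched as below. The percentage is assumed here to be each subgraph's share of the predicate's total usage across the subgraphs considered; the page does not spell out the exact denominator. The sample counts are hypothetical.

```python
def max_and_second_max_share(counts_by_subgraph):
    """Return (max %, second max %, number of subgraphs) for one predicate.

    `counts_by_subgraph` maps subgraph name -> number of triples that use
    the predicate in that subgraph.
    """
    total = sum(counts_by_subgraph.values())
    shares = sorted(
        (100 * count / total for count in counts_by_subgraph.values()),
        reverse=True,
    )
    second = shares[1] if len(shares) > 1 else 0.0
    return shares[0], second, len(shares)


# Hypothetical predicate concentrated in a single subgraph.
concentrated = {"scholarly article": 990, "human": 8, "taxon": 2}
print(max_and_second_max_share(concentrated))

# Hypothetical predicate split between two subgraphs.
shared = {"river": 500, "lake": 450, "stream": 50}
print(max_and_second_max_share(shared))
```

A high max share with a low second max share points to a predicate that is effectively specific to one subgraph. A high second max share means the predicate is shared between at least two large users.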
Rate of growth of subgraphs
Here is an interactive chart showing the growth of the top 50 subgraphs over a period of one month: subgraph_growth_rate