You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

User:AKhatun/Wikidata Subgraph Query Analysis: Difference between revisions

imported>AKhatun
(→‎Query count and time: Add time to recover in table)
Revision as of 18:53, 7 December 2021

The analysis on subgraphs in Wikidata showed how large each subgraph is and how connected the subgraphs are. This page presents the results of analyzing the queries that relate to these subgraphs. The questions that needed to be answered were:

  • How many queries (as a percentage) access each subgraph?
  • How many queries access multiple subgraphs at once, i.e. how much overlap can we expect between subgraphs?
  • How long do these queries take?
  • How many user agents access each subgraph? Do they access many subgraphs, or are they confined to a small set? Do some of them dominate the queries in multiple subgraphs?
  • Are there chunks of similar queries in these subgraphs, i.e. how diverse are the queries in each subgraph?

TL;DR

What are subgraph related queries

We define some parameters to identify whether a query touches a subgraph, based on the items and properties the query uses. Some queries may touch multiple subgraphs. See more on what a subgraph means here. Note: subgraphs overlap.

The parameters that define which subgraph a query belongs to are:

  1. The query uses the subgraph's Qid. Example: queries containing Q5 are part of the Q5 subgraph.
  2. The query uses items that are instances of a particular subgraph.
  3. The query uses items that occur 99% of the time in a particular subgraph.
  4. The query uses properties that occur 99% of the time in a particular subgraph.
  5. The query uses literals that occur 99% of the time in a particular subgraph. Literals can occur with or without language tags, and both versions are compared to check for a match. Note that whole literals are matched between queries and Wikidata. Queries that ask for partial matches, using regex for example, are not included. The assumption is that such queries are likely to contain other items from a subgraph and are caught anyway. This of course excludes generic queries that search the entirety of Wikidata for a certain label or other string.
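The five criteria above can be sketched as a small matching function. This is only an illustration, not the code used for the analysis: the `SubgraphIndex` class, its field names, and the sample sets are hypothetical, and the 99% filtering is assumed to have been done beforehand when the sets were built.

```python
from dataclasses import dataclass, field


@dataclass
class SubgraphIndex:
    """Hypothetical lookup sets for one subgraph, assumed to be precomputed from a dump."""
    qid: str
    instance_items: set = field(default_factory=set)       # items that are instances of the subgraph
    frequent_items: set = field(default_factory=set)       # items found >= 99% of the time in it
    frequent_properties: set = field(default_factory=set)  # properties found >= 99% of the time in it
    frequent_literals: set = field(default_factory=set)    # whole literals found >= 99% of the time in it


def strip_language_tag(literal: str) -> str:
    """Return '"text"' for '"text"@en'; literals without a tag are returned unchanged."""
    if literal.startswith('"') and '"@' in literal:
        return literal[: literal.rindex('"@') + 1]
    return literal


def match_reasons(items: set, properties: set, literals: set, sg: SubgraphIndex) -> list:
    """List which of the five criteria tie a query to the subgraph (empty list: no match)."""
    reasons = []
    if sg.qid in items:
        reasons.append("qid")
    if items & sg.instance_items:
        reasons.append("instance_items")
    if items & sg.frequent_items:
        reasons.append("items")
    if properties & sg.frequent_properties:
        reasons.append("properties")
    # Compare each literal both as written and with its language tag removed.
    if any(lit in sg.frequent_literals or strip_language_tag(lit) in sg.frequent_literals
           for lit in literals):
        reasons.append("literals")
    return reasons


# Illustrative sets only; they are not taken from the real analysis.
q5 = SubgraphIndex(qid="Q5", instance_items={"Q42"}, frequent_properties={"P569"},
                   frequent_literals={'"Douglas Adams"'})
print(match_reasons({"Q42"}, {"P31"}, {'"Douglas Adams"@en'}, q5))
```

Returning the list of reasons rather than a boolean mirrors the per-criterion breakdown columns reported in the table further down, since a single query can match a subgraph for several reasons at once.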

The following analysis uses the Wikidata dump of 20211101 and the public WDQS SPARQL queries from October 2021. All query-related values below are monthly counts.

Query count and time

  • All queries here refer to queries with status code 200 or 500, i.e. valid queries that either succeeded or timed out.
  • WDQS receives ~220M queries a month.
  • Total query time for all queries in a month is ~16,000 hours.

The table below lists the top 50 most queried subgraphs with subgraph size and query time information. A breakdown of what caused the match is also present, which corresponds to the parameters mentioned in #What are subgraph related queries. It also ranks the subgraphs by size, query count, and query time consumed.

A more complete list of 341 subgraphs, which together form ~90% of Wikidata triples, is available here: [csvlink|all_subgraph_data.csv]

Top 50 most queried subgraphs in Wikidata with subgraph size information
Subgraph rank by size | Subgraph rank by query count | Subgraph rank by query time | Subgraph | Subgraph label | %of triples | %of entities | Days to recover (4.77M rate) | Query count | %count of all queries | Query time (hr) | %time of all queries | %count of query from Qid | %count of query from instance items | %count of query from items | %count of query from properties | %count of query from literals
3 | 1 | 1 | Q5 | human | 7.324 | 9.986 | 203 | 68,659,369 | 31.058 | 6314 | 0.393 | 1.827 | 17.705 | 10.324 | 20.176 | 1.11
5 | 2 | 4 | Q16521 | taxon | 2.871 | 3.427 | 79 | 56,437,140 | 25.529 | 495 | 0.031 | 22.986 | 1.251 | 23.665 | 0.965 | 0.496
6 | 3 | 3 | Q101352 | family name | 1.546 | 0.509 | 43 | 5,564,173 | 2.517 | 640 | 0.04 | 0.064 | 2.425 | 2.34 | 0.016 | 0.032
15 | 4 | 2 | Q11424 | film | 0.364 | 0.281 | 10 | 4,757,084 | 2.152 | 1613 | 0.1 | 0.563 | 1.308 | 1.089 | 0.008 | 0.407
34 | 5 | 7 | Q4830453 | business | 0.108 | 0.207 | 3 | 4,041,395 | 1.828 | 343 | 0.021 | 0.953 | 0.788 | 0.416 | 0.0 | 0.101
7 | 6 | 9 | Q4167410 | Wikimedia disambiguation page | 1.374 | 1.459 | 38 | 3,737,550 | 1.691 | 223 | 0.014 | 0.195 | 0.484 | 0.554 | 0.0 | 0.938
177 | 7 | 20 | Q34770 | language | 0.013 | 0.011 | 0 | 1,713,196 | 0.775 | 73 | 0.005 | 0.008 | 0.757 | 0.009 | 0.0 | 0.005
1 | 8 | 13 | Q13442814 | scholarly article | 49.668 | 39.794 | 1375 | 1,649,268 | 0.746 | 142 | 0.009 | 0.005 | 0.261 | 0.278 | 0.124 | 0.386
4 | 9 | 17 | Q4167836 | Wikimedia category | 5.85 | 5.165 | 162 | 1,383,343 | 0.626 | 96 | 0.006 | 0.019 | 0.594 | 0.152 | 0.0 | 0.01
10 | 10 | 14 | Q11173 | chemical compound | 0.693 | 1.302 | 19 | 1,307,852 | 0.592 | 133 | 0.008 | 0.022 | 0.548 | 0.449 | 0.001 | 0.014
20 | 11 | 22 | Q13406463 | Wikimedia list article | 0.252 | 0.352 | 7 | 1,283,160 | 0.58 | 73 | 0.005 | 0.018 | 0.409 | 0.357 | 0.0 | 0.048
63 | 12 | 6 | Q5398426 | television series | 0.055 | 0.062 | 2 | 1,206,285 | 0.546 | 366 | 0.023 | 0.05 | 0.332 | 0.252 | 0.0 | 0.128
243 | 13 | 24 | Q14204246 | Wikimedia project page | 0.008 | 0.033 | 0 | 1,114,113 | 0.504 | 62 | 0.004 | 0.009 | 0.227 | 0.016 | 0.0 | 0.275
92 | 14 | 11 | Q6881511 | enterprise | 0.036 | 0.052 | 1 | 943,613 | 0.427 | 164 | 0.01 | 0.034 | 0.338 | 0.144 | 0.0 | 0.042
26 | 15 | 29 | Q484170 | commune of France | 0.18 | 0.043 | 5 | 866,766 | 0.392 | 46 | 0.003 | 0.006 | 0.278 | 0.004 | 0.098 | 0.007
165 | 16 | 12 | Q891723 | public company | 0.015 | 0.013 | 0 | 837,595 | 0.379 | 157 | 0.01 | 0.034 | 0.277 | 0.061 | 0.0 | 0.054
12 | 17 | 19 | Q3305213 | painting | 0.432 | 0.578 | 12 | 834,752 | 0.378 | 79 | 0.005 | 0.012 | 0.332 | 0.187 | 0.005 | 0.012
91 | 18 | 16 | Q43229 | organization | 0.037 | 0.08 | 1 | 806,840 | 0.365 | 123 | 0.008 | 0.128 | 0.213 | 0.097 | 0.0 | 0.012
89 | 19 | 8 | Q4164871 | position | 0.037 | 0.128 | 1 | 788,077 | 0.356 | 332 | 0.021 | 0.004 | 0.343 | 0.016 | 0.0 | 0.003
28 | 20 | 30 | Q482994 | album | 0.161 | 0.287 | 4 | 776,845 | 0.351 | 37 | 0.002 | 0.012 | 0.287 | 0.209 | 0.0 | 0.016
86 | 21 | 23 | Q47461344 | written work | 0.038 | 0.078 | 1 | 774,947 | 0.351 | 67 | 0.004 | 0.244 | 0.085 | 0.039 | 0.0 | 0.003
62 | 22 | 35 | Q7889 | video game | 0.056 | 0.047 | 2 | 741,401 | 0.335 | 30 | 0.002 | 0.006 | 0.195 | 0.256 | 0.005 | 0.007
16 | 23 | 21 | Q486972 | human settlement | 0.302 | 0.602 | 8 | 721,789 | 0.327 | 73 | 0.005 | 0.095 | 0.22 | 0.107 | 0.0 | 0.006
8 | 24 | 18 | Q7187 | gene | 0.927 | 1.273 | 26 | 628,916 | 0.284 | 94 | 0.006 | 0.107 | 0.063 | 0.007 | 0.021 | 0.113
25 | 25 | 46 | Q532 | village | 0.201 | 0.292 | 6 | 584,789 | 0.265 | 21 | 0.001 | 0.001 | 0.246 | 0.109 | 0.0 | 0.013
70 | 26 | 27 | Q732577 | publication | 0.048 | 0.076 | 1 | 512,416 | 0.232 | 53 | 0.003 | 0.229 | 0.003 | 0.23 | 0.0 | 0.0
42 | 27 | 45 | Q7725634 | literary work | 0.077 | 0.176 | 2 | 468,204 | 0.212 | 22 | 0.001 | 0.017 | 0.16 | 0.104 | 0.0 | 0.007
138 | 28 | 57 | Q18340514 | events in a specific year or time period | 0.019 | 0.048 | 1 | 463,683 | 0.21 | 17 | 0.001 | 0.0 | 0.2 | 0.056 | 0.0 | 0.005
54 | 29 | 60 | Q215380 | musical group | 0.063 | 0.087 | 2 | 461,181 | 0.209 | 17 | 0.001 | 0.009 | 0.164 | 0.073 | 0.0 | 0.008
2 | 30 | 28 | Q6999 | astronomical object | 8.75 | 8.942 | 242 | 448,032 | 0.203 | 51 | 0.003 | 0.0 | 0.175 | 0.085 | 0.015 | 0.003
41 | 31 | 56 | Q22808320 | Wikimedia human name disambiguation page | 0.078 | 0.075 | 2 | 433,986 | 0.196 | 17 | 0.001 | 0.0 | 0.174 | 0.154 | 0.0 | 0.001
53 | 32 | 63 | Q134556 | single | 0.065 | 0.103 | 2 | 431,003 | 0.195 | 16 | 0.001 | 0.001 | 0.167 | 0.138 | 0.0 | 0.004
37 | 33 | 32 | Q3331189 | version, edition, or translation | 0.087 | 0.19 | 2 | 410,352 | 0.186 | 34 | 0.002 | 0.103 | 0.053 | 0.118 | 0.004 | 0.028
31 | 34 | 41 | Q16970 | church building | 0.129 | 0.226 | 4 | 396,936 | 0.18 | 25 | 0.002 | 0.005 | 0.172 | 0.112 | 0.0 | 0.001
71 | 35 | 25 | Q86850539 | Whitaker's Latin frequency type C | 0.048 | 0.011 | 1 | 355,247 | 0.161 | 56 | 0.003 | 0.0 | 0.0 | 0.0 | 0.0 | 0.16
11 | 36 | 65 | Q8054 | protein | 0.67 | 1.05 | 19 | 349,573 | 0.158 | 16 | 0.001 | 0.079 | 0.034 | 0.002 | 0.02 | 0.066
49 | 37 | 167 | Q2225692 | fourth-level administrative division in Indonesia | 0.07 | 0.088 | 2 | 344,964 | 0.156 | 5 | 0.0 | 0.0 | 0.147 | 0.098 | 0.0 | 0.009
223 | 38 | 87 | Q571 | book | 0.009 | 0.022 | 0 | 340,900 | 0.154 | 12 | 0.001 | 0.114 | 0.016 | 0.01 | 0.0 | 0.023
112 | 39 | 76 | Q476028 | association football club | 0.026 | 0.038 | 1 | 320,422 | 0.145 | 14 | 0.001 | 0.006 | 0.12 | 0.029 | 0.0 | 0.003
21 | 40 | 10 | Q2668072 | collection | 0.248 | 0.534 | 7 | 312,822 | 0.142 | 166 | 0.01 | 0.056 | 0.084 | 0.058 | 0.0 | 0.001
113 | 41 | 54 | Q15632617 | fictional human | 0.026 | 0.056 | 1 | 306,319 | 0.139 | 18 | 0.001 | 0.006 | 0.1 | 0.05 | 0.0 | 0.003
121 | 42 | 42 | Q3957 | town | 0.023 | 0.015 | 1 | 294,685 | 0.133 | 24 | 0.001 | 0.047 | 0.079 | 0.014 | 0.0 | 0.002
133 | 43 | 58 | Q506240 | television film | 0.02 | 0.019 | 1 | 290,899 | 0.132 | 17 | 0.001 | 0.009 | 0.098 | 0.07 | 0.0 | 0.02
136 | 44 | 5 | Q15416 | television program | 0.019 | 0.05 | 1 | 286,609 | 0.13 | 386 | 0.024 | 0.024 | 0.084 | 0.072 | 0.0 | 0.01
72 | 45 | 79 | Q105543609 | musical work/composition | 0.048 | 0.099 | 1 | 285,889 | 0.129 | 13 | 0.001 | 0.004 | 0.095 | 0.061 | 0.004 | 0.009
64 | 46 | 38 | Q811979 | architectural structure | 0.055 | 0.119 | 2 | 282,739 | 0.128 | 28 | 0.002 | 0.09 | 0.035 | 0.024 | 0.0 | 0.001
23 | 47 | 51 | Q4022 | river | 0.219 | 0.425 | 6 | 280,190 | 0.127 | 20 | 0.001 | 0.002 | 0.12 | 0.045 | 0.0 | 0.002
32 | 48 | 31 | Q41176 | building | 0.125 | 0.287 | 3 | 271,666 | 0.123 | 36 | 0.002 | 0.034 | 0.084 | 0.065 | 0.002 | 0.001
45 | 49 | 50 | Q55488 | railway station | 0.075 | 0.104 | 2 | 258,862 | 0.117 | 20 | 0.001 | 0.001 | 0.109 | 0.072 | 0.0 | 0.001
192 | 50 | 143 | Q3464665 | television series season | 0.011 | 0.02 | 0 | 254,318 | 0.115 | 6 | 0.0 | 0.031 | 0.077 | 0.009 | 0.0 | 0.0
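As a sanity check on the columns of the table above, the short sketch below recomputes a few shares from the Q5 and Q16521 rows. The helper name `percent` is hypothetical. The arithmetic suggests that the "%count of all queries" column is a true percentage, while the "%time of all queries" column appears to be a fraction of the ~16,000 total hours (0.393 corresponds to roughly 39%); that reading is an inference from the numbers, not something the page states.

```python
def percent(part: float, whole: float) -> float:
    """Share of `part` in `whole`, as a percentage rounded to 3 decimals."""
    return round(100.0 * part / whole, 3)


# Values copied from the Q5 (human) and Q16521 (taxon) rows.
q5_count, q5_pct, q5_hours = 68_659_369, 31.058, 6314
taxon_count = 56_437_140

# Total monthly query count implied by the Q5 row; the text quotes ~220M.
implied_total = q5_count / (q5_pct / 100)
print(f"implied total queries: {implied_total:,.0f}")

# The taxon share recomputed from that total should be close to the 25.529 in the table.
print("taxon % of queries:", percent(taxon_count, implied_total))

# Query time: ~16,000 hours is the approximate monthly total quoted in the text.
TOTAL_HOURS = 16_000
q5_time_fraction = q5_hours / TOTAL_HOURS
print("Q5 fraction of query time:", round(q5_time_fraction, 3))
```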

More on query time

The query time can be broken down into classes for better visualization. Below is a figure with the query class distribution (number of queries per query time class per subgraph) for the top 50 subgraphs. Some of the takeaways are:

  • Most subgraphs have most of their queries in the 10-100ms range.
  • The second most common class is 100ms to 1s.
  • collection and photograph have most of their queries (~150k) timed at 1-10s. Around 10 more subgraphs have a small number of queries (~10-20k) in this time range.
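The query time classes discussed above can be reproduced with a simple bucketing step. The sketch below is an assumption about how such a breakdown could be computed: the class edges below 10ms and above 10s, the input format of `(subgraph, duration_ms)` pairs, and the sample durations are all made up for illustration.

```python
import bisect
from collections import Counter, defaultdict

# Upper edges (in ms) separating the classes; only 10-100ms, 100ms-1s and 1-10s
# are named in the text, the outer two classes are assumed.
BOUNDS_MS = [10, 100, 1_000, 10_000]
LABELS = ["<10ms", "10-100ms", "100ms-1s", "1-10s", ">=10s"]


def time_class(duration_ms: float) -> str:
    """Map a query duration to its class; each lower edge is inclusive."""
    return LABELS[bisect.bisect_right(BOUNDS_MS, duration_ms)]


def class_distribution(rows):
    """Count queries per time class for each subgraph from (subgraph, duration_ms) pairs."""
    dist = defaultdict(Counter)
    for subgraph, duration_ms in rows:
        dist[subgraph][time_class(duration_ms)] += 1
    return dist


# Made-up durations, for illustration only.
sample = [("Q5", 42), ("Q5", 250), ("Q5", 57), ("Q2668072", 3_400)]
dist = class_distribution(sample)
print(dict(dist["Q5"]))
print(dict(dist["Q2668072"]))
```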

File:Top 50 query time class.png

User agent

Analysis based on user agents is an approximation because user agents do not map exactly to distinct users. For example, many people use the same bot or script without changing the user agent, and the same person or bot may use multiple user-agent strings. Nevertheless, the available data gives a useful picture.

User agent count

  • Total number of unique user agents across all subgraphs: 981,180
  • The tables below list the subgraphs with the most and the fewest distinct user agents. Every subgraph has at least 10 user agents, so even the least-queried of these large subgraphs is used by multiple users.
  • The largest numbers of user agents appear in a variety of subgraph types. Among them, gene, protein, biological process, and molecular function appear to be related. It is possible that the same queries match several of these subgraphs. More on subgraph connectivity in #Subgraph Connectivity.
Subgraphs with most user-agents
Subgraph | Subgraph label | %Query | #User agents | %User agent
Q11424 | film | 2.152 | 251420 | 0.256
Q8054 | protein | 0.158 | 234659 | 0.239
Q7187 | gene | 0.284 | 187029 | 0.191
Q2996394 | biological process | 0.072 | 124415 | 0.127
Q14860489 | molecular function | 0.044 | 89445 | 0.091
Q5 | human | 31.058 | 55377 | 0.056
Q898273 | protein domain | 0.019 | 38484 | 0.039
Q16521 | taxon | 25.529 | 25193 | 0.026
Q86850539 | Whitaker's Latin frequency type C | 0.161 | 20158 | 0.021
Q4167410 | Wikimedia disambiguation page | 1.691 | 13818 | 0.014
Q14204246 | Wikimedia project page | 0.504 | 13443 | 0.014
Q476028 | association football club | 0.145 | 12086 | 0.012
Q235557 | file format | 0.045 | 7701 | 0.008
Q1520033 | count noun | 0.05 | 7662 | 0.008
Q417841 | protein family | 0.007 | 4906 | 0.005
Q484170 | commune of France | 0.392 | 4764 | 0.005
Q4830453 | business | 1.828 | 4383 | 0.004
Q4164871 | position | 0.356 | 4319 | 0.004
Q7278 | political party | 0.109 | 4073 | 0.004
Q3918 | university | 0.104 | 3565 | 0.004

Subgraphs with least user-agents
Subgraph | Subgraph label | %Query | #User agents | %User agent
Q106006703 | local regulations of the People's Republic of China | 0.0 | 11 | 0.0
Q67015940 | Government Boys' Primary School | 0.0 | 13 | 0.0
Q7604693 | Statutory Rules of Northern Ireland | 0.0 | 13 | 0.0
Q106474968 | ethnic group by settlement in Macedonia | 0.003 | 15 | 0.0
Q6453643 | decree law | 0.0 | 15 | 0.0
Q97695005 | committee group motion | 0.0 | 15 | 0.0
Q100532807 | Irish Statutory Instrument | 0.0 | 16 | 0.0
Q10429085 | report | 0.0 | 19 | 0.0
Q99045339 | written question | 0.0 | 20 | 0.0
Q1505023 | Interpellation | 0.0 | 20 | 0.0
Q96739634 | individual motion | 0.0 | 21 | 0.0
Q67035425 | ASTM standard | 0.0 | 21 | 0.0
Q61278455 | health sub-centre | 0.001 | 23 | 0.0
Q26267864 | Wikimedia KML file | 0.005 | 23 | 0.0
Q3508250 | Syndicat intercommunal | 0.02 | 24 | 0.0
Q107102664 | cell line from embryonic stem cells | 0.0 | 24 | 0.0
Q7604686 | UK Statutory Instrument | 0.0 | 27 | 0.0
Q6451276 | Congressional Research Service report | 0.001 | 28 | 0.0
Q61443650 | sub post office | 0.0 | 33 | 0.0
Q26894053 | basketball team season | 0.009 | 34 | 0.0
  • There are 50 subgraphs with more than 1000 user agents, and 300 subgraphs with fewer than 1000 user agents. Most subgraphs are therefore not queried by a very wide range of user agents. The distribution of user-agent counts below 1000 is shown in the figure below. It clearly shows the small number of user agents in most subgraphs.

File:Ua lessthan1k dist.png

User agent distribution in subgraphs

  • Next, the user agent vs query count distribution was analyzed for some of the top subgraphs. While user agent count gives us an idea of how many users may be using a subgraph, it is not clear whether all of them query the subgraph equally, or very few user agents perform most of the queries.
  • ~30 out of 341 subgraphs have a single user agent that issues >=50% of all queries to that subgraph.
  • 6 subgraphs have a single user agent issuing around 80-90% of the queries.
  • So queries dominated by a single source are not widespread among subgraphs, but they are present in a few subgraphs nonetheless.
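The dominance check described above amounts to finding, for each subgraph, the share of queries issued by its busiest user agent. The sketch below shows one way to compute it; the function name and the `(subgraph, user_agent, query_count)` input format are assumptions. The sample counts for Q16521 come from the tables on this page, and the "other" row is a hypothetical lump holding the remainder so that the subgraph total matches.

```python
from collections import Counter, defaultdict


def top_agent_share(rows):
    """Return {subgraph: (top user agent, its % of the subgraph's queries)}."""
    per_subgraph = defaultdict(Counter)
    for subgraph, user_agent, query_count in rows:
        per_subgraph[subgraph][user_agent] += query_count

    result = {}
    for subgraph, counts in per_subgraph.items():
        user_agent, top_count = counts.most_common(1)[0]
        share = round(100 * top_count / sum(counts.values()), 3)
        result[subgraph] = (user_agent, share)
    return result


rows = [
    ("Q16521", "mix-n-match", 50_622_670),
    ("Q16521", "Hub", 1_984_437),
    ("Q16521", "UA # 13", 1_294_113),
    ("Q16521", "other", 2_535_920),  # hypothetical lump: all remaining taxon queries
]
shares = top_agent_share(rows)
print(shares["Q16521"])

# Subgraphs where one user agent issues at least half of the queries.
dominated = [sg for sg, (_, share) in shares.items() if share >= 50]
print(dominated)
```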

The figure below shows the top 2 user-agent query percents for 341 subgraphs. This shows whether there is a dominating pattern in a subgraph with the top user agents per subgraph.

The figure below shows 100 subgraphs with their user agent query usage distribution in percents. Usage greater than 50% is marked in red. A birds-eye view of the plots shows how some subgraphs have a dominating user agent while most subgraphs have at least 1 or 2 user agents that query the most. The rest of the user agents form the long tail of the distribution.

Top user agents in subgraphs

  • The top user agents in various subgraphs are listed below. Q5 (human) and Q16521 (taxon) are analyzed further at the end of the page, as they are the most queried subgraphs.
Top user agents in various subgraphs
Subgraph Subgraph label User agent Query count (in subgraph) Query percent (within subgraph) Query percent overall
Q16521 taxon mix-n-match 50622670 89.697 22.899
Q5 human UA # 2 9017930 13.134 4.079
Q5 human mix-n-match 8548335 12.45 3.867
Q5 human UA # 3 5059258 7.369 2.289
Q5 human UA # 4 4020496 5.856 1.819
Q5 human UA # 5 3828747 5.576 1.732
Q101352 family name UA # 5 3828747 68.811 1.732
Q5 human UA # 6 2685807 3.912 1.215
Q5 human UA # 7 2434486 3.546 1.101
Q4830453 business UA # 8 2403677 59.476 1.087
Q5 human UA # 9 2020598 2.943 0.914
Q16521 taxon Hub 1984437 3.516 0.898
Q5 human UA # 11 1877700 2.735 0.849
Q5 human UA # 12 1781161 2.594 0.806
Q16521 taxon UA # 13 1294113 2.293 0.585

User agent vs Subgraph

So far we have explored the user-agent count and distribution per subgraph. It is also important to look at how each user agent's queries are spread across subgraphs. In other words,

  • Do users have a very specific use case, so that their queries span only a few subgraphs, or are their queries spread across many subgraphs?
  • Are there user agents that query the most in multiple subgraphs? This could be due to the nature of the use case or simply because several subgraphs overlap a lot.

We start by looking at how many user agents access how many subgraphs. From the table below, we see that most user agents (89% of them) query only one subgraph, while some user agents query a large number of subgraphs. A clearer picture is seen in the plot below.

Relationship between subgraphs and user agents
#of Subgraphs (X) #of User agents querying X subgraphs %of User agents querying X subgraphs
1 875724 89.252
2 91962 9.373
5 3562 0.363
3 2388 0.243
6 1539 0.157
7 799 0.081
9 628 0.064
8 463 0.047
4 460 0.047
12 332 0.034
16 308 0.031
15 282 0.029
10 281 0.029
17 242 0.025
18 235 0.024
14 202 0.021
11 184 0.019
19 177 0.018
13 167 0.017
20 119 0.012
21 75 0.008
22 47 0.005
25 46 0.005
23 39 0.004
24 39 0.004
27 32 0.003
26 28 0.003
28 26 0.003
29 25 0.003
30 20 0.002
31 17 0.002
35 16 0.002
37 16 0.002
47 15 0.002
34 15 0.002
61 13 0.001
32 12 0.001
50 12 0.001
36 11 0.001
44 11 0.001
49 10 0.001
65 9 0.001
56 9 0.001
72 9 0.001
51 9 0.001
121 9 0.001
95 9 0.001
124 9 0.001
42 9 0.001
39 9 0.001
File:Ua vs subgraph.png
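
A table like the one above can be derived by counting the distinct subgraphs each user agent touches and then building a histogram of those counts. The sketch below illustrates the idea on toy data; the `(user_agent, subgraph)` input format and the function names are assumptions for the example, not the original analysis code.

```python
from collections import Counter, defaultdict


def subgraphs_per_user_agent(records):
    """Map each user agent to the number of distinct subgraphs it queried.

    `records` is an iterable of (user_agent, subgraph) pairs (hypothetical format).
    """
    seen = defaultdict(set)
    for user_agent, subgraph in records:
        seen[user_agent].add(subgraph)
    return {ua: len(subgraphs) for ua, subgraphs in seen.items()}


def user_agent_distribution(counts_by_ua):
    """Rows of (#subgraphs X, #user agents querying X, % of user agents).

    Rows are sorted by number of user agents, largest first, like the table above.
    """
    total = len(counts_by_ua)
    hist = Counter(counts_by_ua.values())
    rows = sorted(hist.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(x, n, round(100.0 * n / total, 3)) for x, n in rows]


# Toy data: three user agents query one subgraph, one queries two.
toy_records = [
    ("ua_1", "Q5"), ("ua_2", "Q5"), ("ua_3", "Q5"),
    ("ua_3", "Q16521"), ("ua_4", "Q5"), ("ua_4", "Q5"),
]
toy_rows = user_agent_distribution(subgraphs_per_user_agent(toy_records))
print(toy_rows)
```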

Next we isolate, from each subgraph, the user agents that query drastically more (>=10% difference) than other user agents in the same subgraph and perform at least 100k queries (0.05% of all queries) a month. A list of ~30 such user agents was found. The subgraph distributions of all these user agents were plotted to find the large buckets where they tend to query. The plot is shown below, followed by some explicit observations.

File:Imp ua dist censored.png

Percentages below are percent of all monthly queries.

  • mix-n-match (UA #17):
    • a lot of taxon queries (Q16521), 23%
    • a lot of human queries (Q5), 4%
  • UA #6:
    • 1% in business (Q4830453)
  • UA #14:
    • 1% in human (Q5)
    • 0.5% in film (Q11424)
  • UA #23:
    • 1.73% in family name (Q101352)
    • 1.73% in human (Q5)
    • both have exactly the same query count, meaning these could be the same
      queries touching both the human and family name subgraphs
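
The selection of these prominent user agents can be sketched as below. The text does not spell out how the ">=10% difference" is measured, so this example assumes it means that the top user agent's share of a subgraph's queries exceeds the runner-up's share by at least 10 percentage points, and that the 100k threshold applies to the top user agent's query count within that subgraph. Both are assumptions, as are the function name and the toy numbers.

```python
from collections import Counter


def dominant_user_agents(counts_by_subgraph, min_gap_pct=10.0, min_queries=100_000):
    """Return {subgraph: user_agent} for user agents that stand out in a subgraph.

    `counts_by_subgraph` maps subgraph -> {user_agent: monthly query count}
    (a hypothetical input format).
    """
    selected = {}
    for subgraph, counts in counts_by_subgraph.items():
        if not counts:
            continue
        total = sum(counts.values())
        ranked = Counter(counts).most_common(2)
        top_ua, top_count = ranked[0]
        runner_up_count = ranked[1][1] if len(ranked) > 1 else 0
        # Assumed meaning of ">=10% difference": gap in percentage points.
        gap_pct = 100.0 * (top_count - runner_up_count) / total
        if gap_pct >= min_gap_pct and top_count >= min_queries:
            selected[subgraph] = top_ua
    return selected


# Toy data, not taken from the analysis.
toy_counts = {
    "subgraph_a": {"ua_1": 600_000, "ua_2": 300_000, "ua_3": 100_000},  # clear leader
    "subgraph_b": {"ua_1": 120_000, "ua_2": 115_000},                   # gap too small
    "subgraph_c": {"ua_4": 900, "ua_5": 100},                           # too few queries
}
toy_selected = dominant_user_agents(toy_counts)
print(toy_selected)
```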

For reference:

  • 100% is 221,067,674 queries
  • 10% is 22,106,767 queries
  • 1% is 2,210,676 queries
  • 0.1% is 221,067 queries
  • 0.05% is 110,533 queries
  • 0.01% is 22,106 queries
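
The reference values above follow directly from the monthly total. As a small worked example, the helper below (a hypothetical name, `percent_to_queries`) converts a percentage into a query count, truncating fractions as the list above appears to do. `Decimal` is used so that values such as 0.05 are handled exactly.

```python
from decimal import Decimal

TOTAL_QUERIES = 221_067_674  # monthly total stated in the list above


def percent_to_queries(percent, total=TOTAL_QUERIES):
    """Convert a percentage given as a string (e.g. "0.05") to a truncated query count."""
    return int(Decimal(percent) * total / 100)


for pct in ["100", "10", "1", "0.1", "0.05", "0.01"]:
    print(f"{pct}% is {percent_to_queries(pct):,} queries")
```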

Subgraph connectivity through queries

Subgraph connectivity was explored to some extent using only Wikidata in Wikidata_Subgraph_Analysis. That analysis was based on which items or properties were common between subgraphs and how many direct connections were present between them. A visualization was created to show the strength of this connectivity between subgraphs here: wikidata_graph. This section aims to analyze the connectivity of subgraphs through queries, i.e., how often subgraphs are queried together.

  • Subgraph queries: Queries that touch at least one of the top 341 subgraphs make up 72% of all queries.
  • First we look at how many subgraphs most queries access. The tables below show the query groups that access the most and the fewest subgraphs.
  • 70% of all queries (97% of subgraph queries) touch 1 or 2 subgraphs. 64% of all queries (90% of subgraph queries) touch only 1 subgraph.
Queries with most subgraphs accessed
#of Subgraphs #of Queries
341 25
333 1
315 2
313 3
258 1
181 3
152 1
142 1
133 2
130 2
129 1
128 2
127 4
126 4
125 9
Queries with least subgraphs accessed
#of Subgraphs #of Queries %of Queries
1 142507736 64.463
2 12464811 5.638
3 1767253 0.799
4 586173 0.265
5 364445 0.165
6 221485 0.1
7 188012 0.085
8 112922 0.051
9 102524 0.046
10 68871 0.031
11 50341 0.023
12 38102 0.017
13 34075 0.015
14 24003 0.011
15 17935 0.008

File:NumQuery vs numSubgraph.png
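
The grouping of queries by the number of subgraphs they access can be sketched as follows. The input format (one list of subgraph IDs per query) and the function name are assumptions for illustration. Percentages are taken relative to all queries, including those that touch none of the top subgraphs, which is how the "%of Queries" column above appears to be computed.

```python
from collections import Counter


def subgraph_count_distribution(query_subgraphs, total_queries):
    """Return {#subgraphs: (#queries, % of all queries)}.

    `query_subgraphs` holds one collection of subgraph IDs per query
    (hypothetical format). Queries touching no subgraph are not listed
    but still count towards `total_queries`.
    """
    hist = Counter(len(set(subgraphs)) for subgraphs in query_subgraphs)
    hist.pop(0, None)  # drop queries that touch none of the subgraphs
    return {
        k: (n, round(100.0 * n / total_queries, 3))
        for k, n in sorted(hist.items())
    }


# Toy data: five queries, one of which touches no subgraph at all.
toy_queries = [["Q5"], ["Q5"], ["Q5", "Q101352"], ["Q16521"], []]
toy_distribution = subgraph_count_distribution(toy_queries, total_queries=len(toy_queries))
print(toy_distribution)
```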

  • It is hard to see which subgraphs occur together from the data above. So the subgraphs that occurred together in a query were broken into pairs, and the pairs of subgraphs that occur together the most were listed.
  • There are 57,970 subgraph pairs that occur together in queries. The total possible subgraph pair count is (340*341)/2 = 57,970. This shows that every subgraph is connected to every other subgraph through queries! Of course, the number of queries varies widely between pairs.
  • A list of the most queried subgraph pairs is shown below.
Top pairs of subgraphs that are queried together
Subgraph 1 Subgraph 1 label Subgraph 2 Subgraph 2 label #of Queries %of Queries
Q101352 family name Q5 human 4935675 2.233
Q4830453 business Q6881511 enterprise 883757 0.4
Q11424 film Q5 human 771698 0.349
Q4830453 business Q891723 public company 735902 0.333
Q3305213 painting Q4167410 Wikimedia disambiguation page 629633 0.285
Q4164871 position Q5 human 541257 0.245
Q47461344 written work Q732577 publication 493402 0.223
Q11424 film Q14204246 Wikimedia project page 483338 0.219
Q6881511 enterprise Q891723 public company 480426 0.217
Q4167410 Wikimedia disambiguation page Q5 human 466217 0.211
Q14204246 Wikimedia project page Q4167410 Wikimedia disambiguation page 436192 0.197
Q13406463 Wikimedia list article Q5 human 394815 0.179
Q4830453 business Q5 human 354945 0.161
Q13442814 scholarly article Q4167410 Wikimedia disambiguation page 316720 0.143
Q13442814 scholarly article Q5 human 282237 0.128
Q13406463 Wikimedia list article Q18340514 events in a specific year or time period 274841 0.124
Q3331189 version, edition, or translation Q5 human 273761 0.124
Q571 book Q5 human 259234 0.117
Q16521 taxon Q5 human 222118 0.1
Q4167410 Wikimedia disambiguation page Q811979 architectural structure 204572 0.093
Q4167410 Wikimedia disambiguation page Q838948 work of art 200810 0.091
Q5398426 television series Q5 human 197997 0.09
Q47461344 written work Q5 human 194750 0.088
Q43229 organization Q4830453 business 179640 0.081
Q5 human Q6881511 enterprise 172486 0.078
Q43229 organization Q5 human 171567 0.078
Q2225692 fourth-level administrative division in Indonesia Q532 village 171086 0.077
Q215380 musical group Q5 human 168318 0.076
Q15632617 fictional human Q5 human 163992 0.074
Q3305213 painting Q838948 work of art 161979 0.073
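
Breaking each query's set of subgraphs into unordered pairs, as described above, can be sketched with `itertools.combinations`. The input format and the function name are assumptions for illustration. Sorting the IDs before pairing gives each pair a single canonical form, so (A,B) and (B,A) are counted together. The last line checks the total number of possible pairs among 341 subgraphs.

```python
from collections import Counter
from itertools import combinations
from math import comb


def count_subgraph_pairs(query_subgraphs):
    """Count how many queries touch each unordered pair of subgraphs.

    `query_subgraphs` holds one collection of subgraph IDs per query
    (hypothetical format).
    """
    pair_counts = Counter()
    for subgraphs in query_subgraphs:
        # Sorting gives each pair a canonical order, so (A, B) == (B, A).
        for pair in combinations(sorted(set(subgraphs)), 2):
            pair_counts[pair] += 1
    return pair_counts


# Toy data: the human/family name pair co-occurs in two queries.
toy_queries = [["Q5", "Q101352"], ["Q101352", "Q11424", "Q5"], ["Q5"]]
toy_pairs = count_subgraph_pairs(toy_queries)
print(toy_pairs.most_common(1))

# Number of distinct unordered pairs among 341 subgraphs: (340*341)/2.
print(comb(341, 2))
```

A query that touches k subgraphs contributes k*(k-1)/2 pairs, so the few queries that touch hundreds of subgraphs each contribute tens of thousands of pairs.
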
  • The distribution of the number of times each subgraph pair in Wikidata occurs in queries is shown below. Note that the pair (A,B) is the same as the pair (B,A), so there is no duplication in the plots. Since the distribution is extremely skewed, three plots with various limits on the number of occurrences are shown. Only a small number of pairs occur together often (these can be seen in the table above), whereas a huge number of pairs occur together only a few times.

File:Subgraph pair dist.png

  • Below is a heatmap of the number of queries, where both the x and y axes represent subgraph indices (names of subgraphs are not shown due to space).
  • The diagonal shows queries that use only 1 subgraph, represented as Q5-Q5 or Q42-Q42, for example. Others are represented as Q5-Q42 or Q42-Q5, for example (the plot is symmetrical).
  • The many vertical and horizontal lines indicate that lots of subgraphs pair with many other subgraphs. More analysis on this below.

File:Subgraph pair heatmap.png
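
The matrix behind such a heatmap could be built roughly as sketched below: single-subgraph queries are counted on the diagonal, and every co-occurring pair is counted in both (i, j) and (j, i) so that the matrix is symmetric. The function name, the input format, and the choice of a plain nested list are assumptions for the example; the labels Q5 and Q42 follow the illustrative labels used in the bullets above.

```python
from itertools import combinations


def build_pair_matrix(query_subgraphs, subgraph_order):
    """Build a symmetric co-occurrence matrix indexed by `subgraph_order`.

    `query_subgraphs` holds one collection of subgraph IDs per query
    (hypothetical format). IDs not in `subgraph_order` are ignored.
    """
    index = {subgraph: i for i, subgraph in enumerate(subgraph_order)}
    n = len(subgraph_order)
    matrix = [[0] * n for _ in range(n)]
    for subgraphs in query_subgraphs:
        ids = sorted(index[s] for s in set(subgraphs) if s in index)
        if len(ids) == 1:
            # Queries using only one subgraph go on the diagonal.
            matrix[ids[0]][ids[0]] += 1
        else:
            for a, b in combinations(ids, 2):
                matrix[a][b] += 1
                matrix[b][a] += 1  # mirror the count to keep the matrix symmetric
    return matrix


# Toy data: two Q5-only queries, one Q42-only query, one query touching both.
toy_matrix = build_pair_matrix(
    [["Q5"], ["Q5"], ["Q5", "Q42"], ["Q42"]],
    subgraph_order=["Q5", "Q42"],
)
print(toy_matrix)
```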

Triples analysis