
Difference between revisions of "User:AKhatun/Wikidata Subgraph Query Analysis"

imported>AKhatun
(Add intro)
 
imported>AKhatun
(Add query time analysis)


== TL;DR ==
== What are subgraph related queries ==
We define some parameters to identify whether a query touches a subgraph, based on the items and properties the query uses. Some queries may touch multiple subgraphs. See more on what a subgraph means [[User:AKhatun/Wikidata_Subgraph_Analysis|here]]. Note: subgraphs overlap.
 
The parameters that define which subgraph a query belongs to are:
# If the query uses the subgraph's Qid. Example: queries containing Q5 are part of the Q5 subgraph.
# If the query uses items that are <code>instance of</code> a particular subgraph.
# If the query uses items that occur 99% of the time in a particular subgraph.
# If the query uses properties that occur 99% of the time in a particular subgraph.
# If the query uses literals that occur 99% of the time in a particular subgraph. Literals can occur with or without language tags, and both versions are compared when checking for a match. Note that only whole literals are matched between queries and Wikidata. Queries that ask for partial matches, using regex for example, are not included. The assumption is that such queries are likely to contain other items from the subgraph and are caught anyway. This of course excludes generic queries that search the entirety of Wikidata for a certain label or other string.
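
The five criteria above can be sketched roughly as follows. This is a minimal illustration and not the code used for the analysis: the <code>SubgraphProfile</code> class, its field names, and the assumption that items, properties, and literals have already been extracted from each query are all hypothetical.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class SubgraphProfile:
    """Hypothetical precomputed lookup sets for one subgraph."""
    qid: str
    instance_items: set[str] = field(default_factory=set)       # items that are instance of the subgraph
    frequent_items: set[str] = field(default_factory=set)       # items occurring 99% of the time in it
    frequent_properties: set[str] = field(default_factory=set)  # properties occurring 99% of the time in it
    frequent_literals: set[str] = field(default_factory=set)    # literals occurring 99% of the time in it


def strip_lang(literal: str) -> str:
    """Drop a trailing language tag, e.g. '"Douglas Adams"@en' -> '"Douglas Adams"'."""
    head, sep, _tag = literal.rpartition('"@')
    return head + '"' if sep else literal


def match_reasons(profile: SubgraphProfile, items: set[str],
                  properties: set[str], literals: set[str]) -> list[str]:
    """Return which of the five criteria link a query to the subgraph."""
    reasons = []
    if profile.qid in items:
        reasons.append("qid")
    if items & profile.instance_items:
        reasons.append("instance_items")
    if items & profile.frequent_items:
        reasons.append("items")
    if properties & profile.frequent_properties:
        reasons.append("properties")
    # Compare whole literals both with and without their language tag.
    candidates = literals | {strip_lang(lit) for lit in literals}
    if candidates & profile.frequent_literals:
        reasons.append("literals")
    return reasons


human = SubgraphProfile(qid="Q5", instance_items={"Q42"},
                        frequent_literals={'"Douglas Adams"'})
print(match_reasons(human, {"Q42"}, {"P31"}, {'"Douglas Adams"@en'}))
```

A query belongs to a subgraph when the returned list is non-empty, and because each subgraph is checked independently, one query can match several subgraphs.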
 
The following analysis uses the Wikidata dump of <code>20211101</code> and the public WDQS SPARQL queries from October 2021. '''All query-related values below are monthly counts.'''
 
== Query count and time ==
 
* "Queries" here means queries with status code 200 or 500, i.e., well-formed queries that either succeeded or timed out.
* WDQS receives ~220M queries a month.
* Total query time across all queries is ~16,000 hours a month.
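
As a rough sanity check on these two totals, the implied mean time per query can be worked out directly. The snippet below is only illustrative arithmetic on the rounded figures above; the constant and function names are made up for this example.

```python
QUERIES_PER_MONTH = 220_000_000   # ~220M queries a month (rounded figure from above)
TOTAL_QUERY_HOURS = 16_000        # ~16,000 hours of query time a month (rounded)


def mean_query_seconds(total_hours: float, query_count: int) -> float:
    """Average time per query in seconds, given total hours and a query count."""
    return total_hours * 3600 / query_count


print(f"{mean_query_seconds(TOTAL_QUERY_HOURS, QUERIES_PER_MONTH):.3f} s per query")
```

With these rounded inputs the mean is about 0.26 seconds per query.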
 
The table below lists the 50 most queried subgraphs, with subgraph size and query time information. It also breaks down which criterion caused each match, corresponding to the parameters listed in [[#What are subgraph related queries]], and ranks the subgraphs by size, query count, and query time consumed.
 
A more complete list of 341 subgraphs, which together make up ~90% of Wikidata triples, is available here: [csvlink|all_subgraph_data.csv]
 
{| class="wikitable sortable"
|+ Top 50 most queried subgraphs in Wikidata, with subgraph size and query time information
|-
! Subgraph rank by size !! Subgraph rank by query count !! Subgraph rank by query time !! Subgraph !! Subgraph label !! % of triples !! % of entities !! Query count !! % of all queries !! Query time (hr) !! Fraction of all query time !! % of all queries matched by Qid !! % of all queries matched by instance items !! % of all queries matched by items !! % of all queries matched by properties !! % of all queries matched by literals
|-
|3||1||1||Q5||human||7.324||9.986||68,659,369||31.058||6314||0.393||1.827||17.705||10.324||20.176||1.11
|-
|5||2||4||Q16521||taxon||2.871||3.427||56,437,140||25.529||495||0.031||22.986||1.251||23.665||0.965||0.496
|-
|6||3||3||Q101352||family name||1.546||0.509||5,564,173||2.517||640||0.04||0.064||2.425||2.34||0.016||0.032
|-
|15||4||2||Q11424||film||0.364||0.281||4,757,084||2.152||1613||0.1||0.563||1.308||1.089||0.008||0.407
|-
|34||5||7||Q4830453||business||0.108||0.207||4,041,395||1.828||343||0.021||0.953||0.788||0.416||0.0||0.101
|-
|7||6||9||Q4167410||Wikimedia disambiguation page||1.374||1.459||3,737,550||1.691||223||0.014||0.195||0.484||0.554||0.0||0.938
|-
|177||7||20||Q34770||language||0.013||0.011||1,713,196||0.775||73||0.005||0.008||0.757||0.009||0.0||0.005
|-
|1||8||13||Q13442814||scholarly article||49.668||39.794||1,649,268||0.746||142||0.009||0.005||0.261||0.278||0.124||0.386
|-
|4||9||17||Q4167836||Wikimedia category||5.85||5.165||1,383,343||0.626||96||0.006||0.019||0.594||0.152||0.0||0.01
|-
|10||10||14||Q11173||chemical compound||0.693||1.302||1,307,852||0.592||133||0.008||0.022||0.548||0.449||0.001||0.014
|-
|20||11||22||Q13406463||Wikimedia list article||0.252||0.352||1,283,160||0.58||73||0.005||0.018||0.409||0.357||0.0||0.048
|-
|63||12||6||Q5398426||television series||0.055||0.062||1,206,285||0.546||366||0.023||0.05||0.332||0.252||0.0||0.128
|-
|243||13||24||Q14204246||Wikimedia project page||0.008||0.033||1,114,113||0.504||62||0.004||0.009||0.227||0.016||0.0||0.275
|-
|92||14||11||Q6881511||enterprise||0.036||0.052||943,613||0.427||164||0.01||0.034||0.338||0.144||0.0||0.042
|-
|26||15||29||Q484170||commune of France||0.18||0.043||866,766||0.392||46||0.003||0.006||0.278||0.004||0.098||0.007
|-
|165||16||12||Q891723||public company||0.015||0.013||837,595||0.379||157||0.01||0.034||0.277||0.061||0.0||0.054
|-
|12||17||19||Q3305213||painting||0.432||0.578||834,752||0.378||79||0.005||0.012||0.332||0.187||0.005||0.012
|-
|91||18||16||Q43229||organization||0.037||0.08||806,840||0.365||123||0.008||0.128||0.213||0.097||0.0||0.012
|-
|89||19||8||Q4164871||position||0.037||0.128||788,077||0.356||332||0.021||0.004||0.343||0.016||0.0||0.003
|-
|28||20||30||Q482994||album||0.161||0.287||776,845||0.351||37||0.002||0.012||0.287||0.209||0.0||0.016
|-
|86||21||23||Q47461344||written work||0.038||0.078||774,947||0.351||67||0.004||0.244||0.085||0.039||0.0||0.003
|-
|62||22||35||Q7889||video game||0.056||0.047||741,401||0.335||30||0.002||0.006||0.195||0.256||0.005||0.007
|-
|16||23||21||Q486972||human settlement||0.302||0.602||721,789||0.327||73||0.005||0.095||0.22||0.107||0.0||0.006
|-
|8||24||18||Q7187||gene||0.927||1.273||628,916||0.284||94||0.006||0.107||0.063||0.007||0.021||0.113
|-
|25||25||46||Q532||village||0.201||0.292||584,789||0.265||21||0.001||0.001||0.246||0.109||0.0||0.013
|-
|70||26||27||Q732577||publication||0.048||0.076||512,416||0.232||53||0.003||0.229||0.003||0.23||0.0||0.0
|-
|42||27||45||Q7725634||literary work||0.077||0.176||468,204||0.212||22||0.001||0.017||0.16||0.104||0.0||0.007
|-
|138||28||57||Q18340514||events in a specific year or time period||0.019||0.048||463,683||0.21||17||0.001||0.0||0.2||0.056||0.0||0.005
|-
|54||29||60||Q215380||musical group||0.063||0.087||461,181||0.209||17||0.001||0.009||0.164||0.073||0.0||0.008
|-
|2||30||28||Q6999||astronomical object||8.75||8.942||448,032||0.203||51||0.003||0.0||0.175||0.085||0.015||0.003
|-
|41||31||56||Q22808320||Wikimedia human name disambiguation page||0.078||0.075||433,986||0.196||17||0.001||0.0||0.174||0.154||0.0||0.001
|-
|53||32||63||Q134556||single||0.065||0.103||431,003||0.195||16||0.001||0.001||0.167||0.138||0.0||0.004
|-
|37||33||32||Q3331189||version, edition, or translation||0.087||0.19||410,352||0.186||34||0.002||0.103||0.053||0.118||0.004||0.028
|-
|31||34||41||Q16970||church building||0.129||0.226||396,936||0.18||25||0.002||0.005||0.172||0.112||0.0||0.001
|-
|71||35||25||Q86850539||Whitaker's Latin frequency type C||0.048||0.011||355,247||0.161||56||0.003||0.0||0.0||0.0||0.0||0.16
|-
|11||36||65||Q8054||protein||0.67||1.05||349,573||0.158||16||0.001||0.079||0.034||0.002||0.02||0.066
|-
|49||37||167||Q2225692||fourth-level administrative division in Indonesia||0.07||0.088||344,964||0.156||5||0.0||0.0||0.147||0.098||0.0||0.009
|-
|223||38||87||Q571||book||0.009||0.022||340,900||0.154||12||0.001||0.114||0.016||0.01||0.0||0.023
|-
|112||39||76||Q476028||association football club||0.026||0.038||320,422||0.145||14||0.001||0.006||0.12||0.029||0.0||0.003
|-
|21||40||10||Q2668072||collection||0.248||0.534||312,822||0.142||166||0.01||0.056||0.084||0.058||0.0||0.001
|-
|113||41||54||Q15632617||fictional human||0.026||0.056||306,319||0.139||18||0.001||0.006||0.1||0.05||0.0||0.003
|-
|121||42||42||Q3957||town||0.023||0.015||294,685||0.133||24||0.001||0.047||0.079||0.014||0.0||0.002
|-
|133||43||58||Q506240||television film||0.02||0.019||290,899||0.132||17||0.001||0.009||0.098||0.07||0.0||0.02
|-
|136||44||5||Q15416||television program||0.019||0.05||286,609||0.13||386||0.024||0.024||0.084||0.072||0.0||0.01
|-
|72||45||79||Q105543609||musical work/composition||0.048||0.099||285,889||0.129||13||0.001||0.004||0.095||0.061||0.004||0.009
|-
|64||46||38||Q811979||architectural structure||0.055||0.119||282,739||0.128||28||0.002||0.09||0.035||0.024||0.0||0.001
|-
|23||47||51||Q4022||river||0.219||0.425||280,190||0.127||20||0.001||0.002||0.12||0.045||0.0||0.002
|-
|32||48||31||Q41176||building||0.125||0.287||271,666||0.123||36||0.002||0.034||0.084||0.065||0.002||0.001
|-
|45||49||50||Q55488||railway station||0.075||0.104||258,862||0.117||20||0.001||0.001||0.109||0.072||0.0||0.001
|-
|192||50||143||Q3464665||television series season||0.011||0.02||254,318||0.115||6||0.0||0.031||0.077||0.009||0.0||0.0
|}
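
To show how the count and time columns relate, the sketch below recomputes the shares and a mean time per query for a few rows copied from the table. It uses the rounded monthly totals (~220M queries, ~16,000 hours), so results differ slightly from the table. Note that the time-share values in the table behave like fractions rather than percentages (6314 of ~16,000 hours is about 0.39). The function and key names are hypothetical.

```python
# (label, query count, query time in hours), copied from the table above
ROWS = [
    ("human", 68_659_369, 6314),
    ("taxon", 56_437_140, 495),
    ("film", 4_757_084, 1613),
    ("television program", 286_609, 386),
]
TOTAL_QUERIES = 220_000_000  # rounded monthly total
TOTAL_HOURS = 16_000         # rounded monthly total


def summarize(label: str, count: int, hours: float) -> dict:
    """Derive share of queries, share of time, and mean seconds per query."""
    return {
        "label": label,
        "count_share_pct": 100 * count / TOTAL_QUERIES,
        "time_share": hours / TOTAL_HOURS,
        "mean_seconds": hours * 3600 / count,
    }


for row in ROWS:
    s = summarize(*row)
    print(f"{s['label']:<20} {s['count_share_pct']:6.2f}%  "
          f"{s['time_share']:.3f}  {s['mean_seconds']:.2f} s/query")
```

With these inputs, <code>human</code> comes out at roughly 31% of queries and roughly 0.39 of query time, close to the table's 31.058 and 0.393. The derived mean shows why the time ranking differs from the count ranking: <code>television program</code> averages nearly 5 seconds per query, against about 0.3 seconds for <code>human</code>.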
== More on query time ==
 
The query time can be broken down into classes for better visualization. Below is a figure with the query class distribution (number of queries per query time class per subgraph) for the top 50 subgraphs. Some of the takeaways are:
* Most subgraphs have most of their queries in the 10-100ms class.
* The second most common class is 100ms to 1s.
* <code>collection</code> and <code>photograph</code> have the most queries (~150k) in the 1-10s class. Around 10 more subgraphs have a few (~10-20k) queries in this time range.
 
[[File:top_50_query_time_class.png|1100px]]
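
A distribution like this could be produced by bucketing each query's time and counting per subgraph. The sketch below is an assumption about how that might be done, not the original code: the 10-100ms, 100ms-1s, and 1-10s classes come from the text above, while the <code><10ms</code> and <code>>10s</code> classes, the boundary handling, and the input format are guesses.

```python
from bisect import bisect_right
from collections import Counter

# Class boundaries in milliseconds; a value equal to a boundary falls in the higher class.
BOUNDS_MS = [10, 100, 1_000, 10_000]
LABELS = ["<10ms", "10-100ms", "100ms-1s", "1-10s", ">10s"]


def time_class(query_time_ms: float) -> str:
    """Map a query time in milliseconds to its class label."""
    return LABELS[bisect_right(BOUNDS_MS, query_time_ms)]


def class_distribution(records) -> Counter:
    """Count queries per (subgraph, time class).

    `records` is an iterable of (subgraph, query_time_ms) pairs (hypothetical format).
    """
    counts = Counter()
    for subgraph, query_time_ms in records:
        counts[(subgraph, time_class(query_time_ms))] += 1
    return counts


sample = [("Q5", 42), ("Q5", 350), ("Q2668072", 2_500), ("Q2668072", 4_000)]
for (subgraph, label), n in sorted(class_distribution(sample).items()):
    print(subgraph, label, n)
```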
 
== User agent ==
== Triples analysis ==

Revision as of 17:19, 1 December 2021

The analysis of subgraphs in Wikidata showed how large each subgraph is and how connected the subgraphs are. This page presents the results of analyzing the queries that relate to these subgraphs. The questions to be answered were:

  • How many queries (and what percentage) access each subgraph?
  • How many queries access multiple subgraphs at once, i.e., how much overlap can we expect between subgraphs?
  • How long do these queries take?
  • How many user agents access each subgraph? Do they access many subgraphs, or are they confined to a small set? Do some of them dominate the queries in multiple subgraphs?
  • Are there chunks of similar queries in these subgraphs, i.e., how diverse are the queries in each subgraph?
