
User:AKhatun/Wikidata Subgraph Query Analysis: Difference between revisions

imported>AKhatun
(Add query time analysis)
imported>AKhatun
(User agent information)
Line 144: Line 144:


== User agent ==
Analysis of user agents is only an approximation, because user-agent strings do not map exactly to distinct users. For example, many people use the same bot or script without changing the user agent, and the same person or bot may use multiple user-agent strings. Even so, the available data gives a rough picture of usage.
* Total number of unique user agents across all subgraphs: 981,180
* First, the subgraphs with the most and the fewest distinct user agents are listed below. Even the subgraph with the fewest user agents has more than 10, so all of these large subgraphs are used by multiple users.
* The largest numbers of user agents occur in a variety of subgraph types. Among them, gene, protein, biological process, and molecular function appear to be similar, so the same queries may be counted in several of these subgraphs. More on subgraph connectivity in [[#Subgraph Connectivity]].
{|
|
{| class="wikitable"
|+ Subgraphs with most user-agents
|-
! Subgraph !! Subgraph label !! %Query !! #User agents !! %User agent
|-
|Q11424||film||2.152||251420||0.256
|-
|Q8054||protein||0.158||234659||0.239
|-
|Q7187||gene||0.284||187029||0.191
|-
|Q2996394||biological process||0.072||124415||0.127
|-
|Q14860489||molecular function||0.044||89445||0.091
|-
|Q5||human||31.058||55377||0.056
|-
|Q898273||protein domain||0.019||38484||0.039
|-
|Q16521||taxon||25.529||25193||0.026
|-
|Q86850539||Whitaker's Latin frequency type C||0.161||20158||0.021
|-
|Q4167410||Wikimedia disambiguation page||1.691||13818||0.014
|-
|Q14204246||Wikimedia project page||0.504||13443||0.014
|-
|Q476028||association football club||0.145||12086||0.012
|-
|Q235557||file format||0.045||7701||0.008
|-
|Q1520033||count noun||0.05||7662||0.008
|-
|Q417841||protein family||0.007||4906||0.005
|-
|Q484170||commune of France||0.392||4764||0.005
|-
|Q4830453||business||1.828||4383||0.004
|-
|Q4164871||position||0.356||4319||0.004
|-
|Q7278||political party||0.109||4073||0.004
|-
|Q3918||university||0.104||3565||0.004
|}
|
{| class="wikitable"
|+ Subgraphs with least user-agents
|-
! Subgraph !! Subgraph label !! %Query !! #User agents !! %User agent
|-
|Q106006703||local regulations of the People's Republic of China||0.0||11||0.0
|-
|Q67015940||Government Boys' Primary School||0.0||13||0.0
|-
|Q7604693||Statutory Rules of Northern Ireland||0.0||13||0.0
|-
|Q106474968||ethnic group by settlement in Macedonia||0.003||15||0.0
|-
|Q6453643||decree law||0.0||15||0.0
|-
|Q97695005||committee group motion||0.0||15||0.0
|-
|Q100532807||Irish Statutory Instrument||0.0||16||0.0
|-
|Q10429085||report||0.0||19||0.0
|-
|Q99045339||written question||0.0||20||0.0
|-
|Q1505023||Interpellation||0.0||20||0.0
|-
|Q96739634||individual motion||0.0||21||0.0
|-
|Q67035425||ASTM standard||0.0||21||0.0
|-
|Q61278455||health sub-centre||0.001||23||0.0
|-
|Q26267864||Wikimedia KML file||0.005||23||0.0
|-
|Q3508250||Syndicat intercommunal||0.02||24||0.0
|-
|Q107102664||cell line from embryonic stem cells||0.0||24||0.0
|-
|Q7604686||UK Statutory Instrument||0.0||27||0.0
|-
|Q6451276||Congressional Research Service report||0.001||28||0.0
|-
|Q61443650||sub post office||0.0||33||0.0
|-
|Q26894053||basketball team season||0.009||34||0.0
|}
|}
* There are 50 subgraphs with more than 1000 user agents, and 300 subgraphs with fewer than 1000 user agents. Most subgraphs are therefore not queried by a very wide set of users. The distribution of user-agent counts below 1000 is shown in the figure below; it shows that most subgraphs have a small number of user agents.
[[File:ua_lessthan1k_dist.png]]
* Next, the distribution of query counts across user agents was analyzed for some of the top subgraphs. The user agent count gives an idea of how many users may be using a subgraph, but it does not show whether they all query the subgraph equally or whether a few user agents issue most of the queries.
* ~30 out of 341 subgraphs have a single user agent that issues >=50% of all queries to that subgraph.
* 6 subgraphs have a single user agent that issues around 80-90% of the queries.
* So domination by a single query source is not widespread among subgraphs, but it is present in a few subgraphs nonetheless.
The figure below shows the query percentages of the top 2 user agents for each of the 341 subgraphs. It shows whether the top user agents dominate the queries to a subgraph.
[[File:top2UA_per-subgraph.png|1100px|This figure shows the query percentages of the top 2 user agents for each of the 341 subgraphs. It shows whether the top user agents dominate the queries to a subgraph.]]
The figure below shows, for 100 subgraphs, the distribution of query usage across user agents in percent. Usage greater than 50% is marked in red. A bird's-eye view of the plots shows that some subgraphs have a dominating user agent, while most subgraphs have at least 1 or 2 user agents that issue the most queries. The rest of the user agents form the long-tail 10% of the distribution.
[[File:subgraph_ua_hist.png|1100px|This figure shows, for 100 subgraphs, the distribution of query usage across user agents in percent. Usage greater than 50% is marked in red. A bird's-eye view of the plots shows that some subgraphs have a dominating user agent, while most subgraphs have at least 1 or 2 user agents that issue the most queries. The rest of the user agents form the long-tail 10% of the distribution.]]
== User agent vs Subgraph ==
== Subgraph connectivity through queries ==
== Triples analysis ==

Revision as of 15:52, 2 December 2021

Analysis on Subgraphs in Wikidata showed how large each of the subgraphs in Wikidata is and how connected they are. This page shows the results of the analysis of the queries that relate to these subgraphs. The questions that needed to be answered were:

  • How many queries (and what percentage) access each subgraph?
  • How many queries access multiple subgraphs at once? I.e., how much overlap can we expect between subgraphs?
  • How long do these queries take?
  • How many user agents access each subgraph? Do they access many subgraphs, or are they confined to a small set of subgraphs? Do some of them dominate the queries in multiple subgraphs?
  • Are there chunks of similar queries in these subgraphs? I.e., how diverse are the queries in each subgraph?

TL;DR

What are subgraph related queries

We define some parameters to identify whether a query touches a subgraph, based on the items and properties the query uses. Some queries may touch multiple subgraphs. See more on what a subgraph means here. Note: Subgraphs overlap.

The parameters that define which subgraph a query belongs to are:

  1. The query uses the subgraph's Qid. Example: queries containing Q5 are part of the Q5 subgraph.
  2. The query uses items that are instances of a particular subgraph.
  3. The query uses items that occur 99% of the time in a particular subgraph.
  4. The query uses properties that occur 99% of the time in a particular subgraph.
  5. The query uses literals that occur 99% of the time in a particular subgraph. Literals can occur with or without language tags, and both versions are compared to check for a match. Note that whole literals are matched between queries and Wikidata. Queries that ask for partial matches, using regex for example, are not included. The assumption is that such queries are likely to contain other items from the subgraph and are caught anyway. This of course excludes generic queries that search the entirety of Wikidata for a certain label or other string.
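
The five criteria above can be sketched as a small matching function. This is only an illustration, not the code used for the analysis: it assumes the query has already been parsed into sets of items, properties, and literals, and that per-subgraph lookup sets have been precomputed from the dump. The names `SubgraphIndex`, `ParsedQuery`, `literal_text`, and `match_reasons`, and the contents of the example sets, are hypothetical.

```python
import re
from dataclasses import dataclass, field

# Matches a quoted literal with an optional language tag, e.g. "film"@en.
_LITERAL = re.compile(r'"(.*)"(?:@[A-Za-z0-9-]+)?', re.DOTALL)


def literal_text(literal: str) -> str:
    """Return the literal's text without quotes or language tag."""
    m = _LITERAL.fullmatch(literal)
    return m.group(1) if m else literal


@dataclass
class SubgraphIndex:
    """Hypothetical precomputed lookup sets for one subgraph."""
    qid: str
    instance_items: set = field(default_factory=set)     # criterion 2
    frequent_items: set = field(default_factory=set)     # criterion 3
    frequent_props: set = field(default_factory=set)     # criterion 4
    frequent_literals: set = field(default_factory=set)  # criterion 5, untagged text


@dataclass
class ParsedQuery:
    """Hypothetical result of parsing one SPARQL query."""
    items: set
    properties: set
    literals: set


def match_reasons(query: ParsedQuery, sg: SubgraphIndex) -> list:
    """List which of the five criteria link the query to the subgraph."""
    reasons = []
    if sg.qid in query.items:
        reasons.append("qid")
    if query.items & sg.instance_items:
        reasons.append("instance_item")
    if query.items & sg.frequent_items:
        reasons.append("item")
    if query.properties & sg.frequent_props:
        reasons.append("property")
    # Whole-literal comparison, ignoring language tags on the query side.
    if {literal_text(l) for l in query.literals} & sg.frequent_literals:
        reasons.append("literal")
    return reasons


# Illustrative data only; the set contents are assumptions.
human = SubgraphIndex(qid="Q5", instance_items={"Q42"},
                      frequent_literals={"Douglas Adams"})
query = ParsedQuery(items={"Q5"}, properties={"P31"},
                    literals={'"Douglas Adams"@en'})
print(match_reasons(query, human))
```

A query is counted for a subgraph when the returned list is non-empty, and the individual reasons correspond to the per-criterion breakdown columns reported for each subgraph.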

The following analysis uses the Wikidata dump of 20211101 and the WDQS public SPARQL queries of October 2021. All query-related values below are monthly counts.

Query count and time

  • All queries here refer to queries with status code 200 or 500, i.e., well-formed queries that either succeeded or timed out.
  • WDQS receives ~220M queries a month.
  • Total query time for all queries for a month is ~16,000 hours.
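
As a rough worked example, the two monthly totals above imply a mean time per query. The sketch below does that arithmetic using the rounded figures from the text, plus the Q5 (human) query count and query time from the top-50 table; the function name is illustrative, and the results are only as precise as the rounded inputs.

```python
def mean_query_ms(total_hours: float, query_count: int) -> float:
    """Mean time per query in milliseconds, from total hours and query count."""
    return total_hours * 3600 * 1000 / query_count


# Overall: ~16,000 hours over ~220M queries (rounded figures from the text).
mean_all_ms = mean_query_ms(16_000, 220_000_000)

# Q5 (human): 6314 hours over 68,659,369 queries (from the top-50 table).
mean_q5_ms = mean_query_ms(6314, 68_659_369)

print(f"all queries: {mean_all_ms:.0f} ms")  # all queries: 262 ms
print(f"Q5 queries:  {mean_q5_ms:.0f} ms")   # Q5 queries:  331 ms
```

A mean is pulled upward by a small number of slow or timed-out queries, so it says little about the typical query.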

The table below lists the top 50 most queried subgraphs with subgraph size and query time information. A breakdown of what caused the match is also present, which corresponds to the parameters mentioned in #What are subgraph related queries. It also ranks the subgraphs by size, query count, and query time consumed.

A more complete list containing 341 subgraphs, that form ~90% of Wikidata triples, is available here: [csvlink|all_subgraph_data.csv]

{| class="wikitable"
|+ Top 50 most queried subgraphs in Wikidata with subgraph size information
|-
! Subgraph rank by size !! Subgraph rank by query count !! Subgraph rank by query time !! Subgraph !! Subgraph label !! %of triples !! %of entities !! Query count !! %count of all queries !! Query time (hr) !! %time of all queries !! %count of query from Qid !! %count of query from instance items !! %count of query from items !! %count of query from properties !! %count of query from literals
|-
|3||1||1||Q5||human||7.324||9.986||68,659,369||31.058||6314||0.393||1.827||17.705||10.324||20.176||1.11
|-
|5||2||4||Q16521||taxon||2.871||3.427||56,437,140||25.529||495||0.031||22.986||1.251||23.665||0.965||0.496
|-
|6||3||3||Q101352||family name||1.546||0.509||5,564,173||2.517||640||0.04||0.064||2.425||2.34||0.016||0.032
|-
|15||4||2||Q11424||film||0.364||0.281||4,757,084||2.152||1613||0.1||0.563||1.308||1.089||0.008||0.407
|-
|34||5||7||Q4830453||business||0.108||0.207||4,041,395||1.828||343||0.021||0.953||0.788||0.416||0.0||0.101
|-
|7||6||9||Q4167410||Wikimedia disambiguation page||1.374||1.459||3,737,550||1.691||223||0.014||0.195||0.484||0.554||0.0||0.938
|-
|177||7||20||Q34770||language||0.013||0.011||1,713,196||0.775||73||0.005||0.008||0.757||0.009||0.0||0.005
|-
|1||8||13||Q13442814||scholarly article||49.668||39.794||1,649,268||0.746||142||0.009||0.005||0.261||0.278||0.124||0.386
|-
|4||9||17||Q4167836||Wikimedia category||5.85||5.165||1,383,343||0.626||96||0.006||0.019||0.594||0.152||0.0||0.01
|-
|10||10||14||Q11173||chemical compound||0.693||1.302||1,307,852||0.592||133||0.008||0.022||0.548||0.449||0.001||0.014
|-
|20||11||22||Q13406463||Wikimedia list article||0.252||0.352||1,283,160||0.58||73||0.005||0.018||0.409||0.357||0.0||0.048
|-
|63||12||6||Q5398426||television series||0.055||0.062||1,206,285||0.546||366||0.023||0.05||0.332||0.252||0.0||0.128
|-
|243||13||24||Q14204246||Wikimedia project page||0.008||0.033||1,114,113||0.504||62||0.004||0.009||0.227||0.016||0.0||0.275
|-
|92||14||11||Q6881511||enterprise||0.036||0.052||943,613||0.427||164||0.01||0.034||0.338||0.144||0.0||0.042
|-
|26||15||29||Q484170||commune of France||0.18||0.043||866,766||0.392||46||0.003||0.006||0.278||0.004||0.098||0.007
|-
|165||16||12||Q891723||public company||0.015||0.013||837,595||0.379||157||0.01||0.034||0.277||0.061||0.0||0.054
|-
|12||17||19||Q3305213||painting||0.432||0.578||834,752||0.378||79||0.005||0.012||0.332||0.187||0.005||0.012
|-
|91||18||16||Q43229||organization||0.037||0.08||806,840||0.365||123||0.008||0.128||0.213||0.097||0.0||0.012
|-
|89||19||8||Q4164871||position||0.037||0.128||788,077||0.356||332||0.021||0.004||0.343||0.016||0.0||0.003
|-
|28||20||30||Q482994||album||0.161||0.287||776,845||0.351||37||0.002||0.012||0.287||0.209||0.0||0.016
|-
|86||21||23||Q47461344||written work||0.038||0.078||774,947||0.351||67||0.004||0.244||0.085||0.039||0.0||0.003
|-
|62||22||35||Q7889||video game||0.056||0.047||741,401||0.335||30||0.002||0.006||0.195||0.256||0.005||0.007
|-
|16||23||21||Q486972||human settlement||0.302||0.602||721,789||0.327||73||0.005||0.095||0.22||0.107||0.0||0.006
|-
|8||24||18||Q7187||gene||0.927||1.273||628,916||0.284||94||0.006||0.107||0.063||0.007||0.021||0.113
|-
|25||25||46||Q532||village||0.201||0.292||584,789||0.265||21||0.001||0.001||0.246||0.109||0.0||0.013
|-
|70||26||27||Q732577||publication||0.048||0.076||512,416||0.232||53||0.003||0.229||0.003||0.23||0.0||0.0
|-
|42||27||45||Q7725634||literary work||0.077||0.176||468,204||0.212||22||0.001||0.017||0.16||0.104||0.0||0.007
|-
|138||28||57||Q18340514||events in a specific year or time period||0.019||0.048||463,683||0.21||17||0.001||0.0||0.2||0.056||0.0||0.005
|-
|54||29||60||Q215380||musical group||0.063||0.087||461,181||0.209||17||0.001||0.009||0.164||0.073||0.0||0.008
|-
|2||30||28||Q6999||astronomical object||8.75||8.942||448,032||0.203||51||0.003||0.0||0.175||0.085||0.015||0.003
|-
|41||31||56||Q22808320||Wikimedia human name disambiguation page||0.078||0.075||433,986||0.196||17||0.001||0.0||0.174||0.154||0.0||0.001
|-
|53||32||63||Q134556||single||0.065||0.103||431,003||0.195||16||0.001||0.001||0.167||0.138||0.0||0.004
|-
|37||33||32||Q3331189||version, edition, or translation||0.087||0.19||410,352||0.186||34||0.002||0.103||0.053||0.118||0.004||0.028
|-
|31||34||41||Q16970||church building||0.129||0.226||396,936||0.18||25||0.002||0.005||0.172||0.112||0.0||0.001
|-
|71||35||25||Q86850539||Whitaker's Latin frequency type C||0.048||0.011||355,247||0.161||56||0.003||0.0||0.0||0.0||0.0||0.16
|-
|11||36||65||Q8054||protein||0.67||1.05||349,573||0.158||16||0.001||0.079||0.034||0.002||0.02||0.066
|-
|49||37||167||Q2225692||fourth-level administrative division in Indonesia||0.07||0.088||344,964||0.156||5||0.0||0.0||0.147||0.098||0.0||0.009
|-
|223||38||87||Q571||book||0.009||0.022||340,900||0.154||12||0.001||0.114||0.016||0.01||0.0||0.023
|-
|112||39||76||Q476028||association football club||0.026||0.038||320,422||0.145||14||0.001||0.006||0.12||0.029||0.0||0.003
|-
|21||40||10||Q2668072||collection||0.248||0.534||312,822||0.142||166||0.01||0.056||0.084||0.058||0.0||0.001
|-
|113||41||54||Q15632617||fictional human||0.026||0.056||306,319||0.139||18||0.001||0.006||0.1||0.05||0.0||0.003
|-
|121||42||42||Q3957||town||0.023||0.015||294,685||0.133||24||0.001||0.047||0.079||0.014||0.0||0.002
|-
|133||43||58||Q506240||television film||0.02||0.019||290,899||0.132||17||0.001||0.009||0.098||0.07||0.0||0.02
|-
|136||44||5||Q15416||television program||0.019||0.05||286,609||0.13||386||0.024||0.024||0.084||0.072||0.0||0.01
|-
|72||45||79||Q105543609||musical work/composition||0.048||0.099||285,889||0.129||13||0.001||0.004||0.095||0.061||0.004||0.009
|-
|64||46||38||Q811979||architectural structure||0.055||0.119||282,739||0.128||28||0.002||0.09||0.035||0.024||0.0||0.001
|-
|23||47||51||Q4022||river||0.219||0.425||280,190||0.127||20||0.001||0.002||0.12||0.045||0.0||0.002
|-
|32||48||31||Q41176||building||0.125||0.287||271,666||0.123||36||0.002||0.034||0.084||0.065||0.002||0.001
|-
|45||49||50||Q55488||railway station||0.075||0.104||258,862||0.117||20||0.001||0.001||0.109||0.072||0.0||0.001
|-
|192||50||143||Q3464665||television series season||0.011||0.02||254,318||0.115||6||0.0||0.031||0.077||0.009||0.0||0.0
|}

More on query time

The query time can be broken down to classes for better visualization. Below is a figure with the query class distribution (number of queries per query time class per subgraph) for the top 50 subgraphs. Some of the takeaways are:

  • Most subgraphs have most of their queries in the 10-100ms range.
  • The second most common class is 100ms to 1s.
  • collection and photograph have the most queries (~150k) in the 1-10s class. Around 10 more subgraphs have a small number of queries (~10-20k) in this time range.
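
The bucketing behind such a distribution can be sketched as follows. The 10-100ms, 100ms-1s, and 1-10s classes come from the text; the `<10ms` and `>=10s` classes, the exact boundary handling, and the input format of `(subgraph, query_time_ms)` pairs are assumptions made for illustration.

```python
from bisect import bisect_right
from collections import Counter, defaultdict

# Upper bounds (ms) separating the classes; the first and last class are assumed.
BOUNDS_MS = [10, 100, 1_000, 10_000]
LABELS = ["<10ms", "10-100ms", "100ms-1s", "1-10s", ">=10s"]


def time_class(query_time_ms: float) -> str:
    """Map a query time in milliseconds to its class label."""
    return LABELS[bisect_right(BOUNDS_MS, query_time_ms)]


def class_distribution(rows):
    """Count queries per time class for each subgraph.

    rows: iterable of (subgraph, query_time_ms) pairs.
    """
    dist = defaultdict(Counter)
    for subgraph, ms in rows:
        dist[subgraph][time_class(ms)] += 1
    return dist


# Made-up sample rows, for illustration only.
rows = [("Q5", 42.0), ("Q5", 350.0), ("Q5", 12.0), ("Q2668072", 2500.0)]
dist = class_distribution(rows)
print(dict(dist["Q5"]))        # {'10-100ms': 2, '100ms-1s': 1}
print(dict(dist["Q2668072"]))  # {'1-10s': 1}
```

Because a query can touch multiple subgraphs, the same query may be counted once under each subgraph it matches.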

File:Top 50 query time class.png

User agent

Analysis of user agents is only an approximation, because user-agent strings do not map exactly to distinct users. For example, many people use the same bot or script without changing the user agent, and the same person or bot may use multiple user-agent strings. Even so, the available data gives a rough picture of usage.

  • Total number of unique user agents across all subgraphs: 981,180
  • First, the subgraphs with the most and the fewest distinct user agents are listed below. Even the subgraph with the fewest user agents has more than 10, so all of these large subgraphs are used by multiple users.
  • The largest numbers of user agents occur in a variety of subgraph types. Among them, gene, protein, biological process, and molecular function appear to be similar, so the same queries may be counted in several of these subgraphs. More on subgraph connectivity in #Subgraph Connectivity.
{| class="wikitable"
|+ Subgraphs with most user-agents
|-
! Subgraph !! Subgraph label !! %Query !! #User agents !! %User agent
|-
|Q11424||film||2.152||251420||0.256
|-
|Q8054||protein||0.158||234659||0.239
|-
|Q7187||gene||0.284||187029||0.191
|-
|Q2996394||biological process||0.072||124415||0.127
|-
|Q14860489||molecular function||0.044||89445||0.091
|-
|Q5||human||31.058||55377||0.056
|-
|Q898273||protein domain||0.019||38484||0.039
|-
|Q16521||taxon||25.529||25193||0.026
|-
|Q86850539||Whitaker's Latin frequency type C||0.161||20158||0.021
|-
|Q4167410||Wikimedia disambiguation page||1.691||13818||0.014
|-
|Q14204246||Wikimedia project page||0.504||13443||0.014
|-
|Q476028||association football club||0.145||12086||0.012
|-
|Q235557||file format||0.045||7701||0.008
|-
|Q1520033||count noun||0.05||7662||0.008
|-
|Q417841||protein family||0.007||4906||0.005
|-
|Q484170||commune of France||0.392||4764||0.005
|-
|Q4830453||business||1.828||4383||0.004
|-
|Q4164871||position||0.356||4319||0.004
|-
|Q7278||political party||0.109||4073||0.004
|-
|Q3918||university||0.104||3565||0.004
|}
{| class="wikitable"
|+ Subgraphs with least user-agents
|-
! Subgraph !! Subgraph label !! %Query !! #User agents !! %User agent
|-
|Q106006703||local regulations of the People's Republic of China||0.0||11||0.0
|-
|Q67015940||Government Boys' Primary School||0.0||13||0.0
|-
|Q7604693||Statutory Rules of Northern Ireland||0.0||13||0.0
|-
|Q106474968||ethnic group by settlement in Macedonia||0.003||15||0.0
|-
|Q6453643||decree law||0.0||15||0.0
|-
|Q97695005||committee group motion||0.0||15||0.0
|-
|Q100532807||Irish Statutory Instrument||0.0||16||0.0
|-
|Q10429085||report||0.0||19||0.0
|-
|Q99045339||written question||0.0||20||0.0
|-
|Q1505023||Interpellation||0.0||20||0.0
|-
|Q96739634||individual motion||0.0||21||0.0
|-
|Q67035425||ASTM standard||0.0||21||0.0
|-
|Q61278455||health sub-centre||0.001||23||0.0
|-
|Q26267864||Wikimedia KML file||0.005||23||0.0
|-
|Q3508250||Syndicat intercommunal||0.02||24||0.0
|-
|Q107102664||cell line from embryonic stem cells||0.0||24||0.0
|-
|Q7604686||UK Statutory Instrument||0.0||27||0.0
|-
|Q6451276||Congressional Research Service report||0.001||28||0.0
|-
|Q61443650||sub post office||0.0||33||0.0
|-
|Q26894053||basketball team season||0.009||34||0.0
|}
  • There are 50 subgraphs with more than 1000 user agents, and 300 subgraphs with fewer than 1000 user agents. Most subgraphs are therefore not queried by a very wide set of users. The distribution of user-agent counts below 1000 is shown in the figure below; it shows that most subgraphs have a small number of user agents.

File:Ua lessthan1k dist.png

  • Next, the distribution of query counts across user agents was analyzed for some of the top subgraphs. The user agent count gives an idea of how many users may be using a subgraph, but it does not show whether they all query the subgraph equally or whether a few user agents issue most of the queries.
  • ~30 out of 341 subgraphs have a single user agent that issues >=50% of all queries to that subgraph.
  • 6 subgraphs have a single user agent that issues around 80-90% of the queries.
  • So domination by a single query source is not widespread among subgraphs, but it is present in a few subgraphs nonetheless.
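
The dominance check described in the list above can be sketched as follows. This is a minimal illustration rather than the analysis code: it assumes the query logs have already been reduced to `(subgraph, user_agent, query_count)` rows, and the function names and sample user-agent strings are made up.

```python
from collections import Counter, defaultdict


def top_agent_shares(rows, top_n=2):
    """Percentage of a subgraph's queries issued by its top user agents.

    rows: iterable of (subgraph, user_agent, query_count) tuples.
    Returns {subgraph: [(user_agent, percent), ...]} with at most top_n entries.
    """
    per_subgraph = defaultdict(Counter)
    for subgraph, user_agent, count in rows:
        per_subgraph[subgraph][user_agent] += count

    shares = {}
    for subgraph, counts in per_subgraph.items():
        total = sum(counts.values())
        shares[subgraph] = [(ua, 100 * n / total)
                            for ua, n in counts.most_common(top_n)]
    return shares


def dominated(shares, threshold=50.0):
    """Subgraphs whose top user agent issues at least `threshold` percent."""
    return [sg for sg, top in shares.items() if top and top[0][1] >= threshold]


# Made-up counts and user-agent names, for illustration only.
rows = [
    ("Q5", "bot-a", 60), ("Q5", "bot-b", 30), ("Q5", "script-c", 10),
    ("Q11424", "bot-a", 40), ("Q11424", "bot-d", 35), ("Q11424", "bot-e", 25),
]
shares = top_agent_shares(rows)
print(shares["Q5"])       # [('bot-a', 60.0), ('bot-b', 30.0)]
print(dominated(shares))  # ['Q5']
```

With `top_n=2` this yields the two values per subgraph that a top-2 plot would show, and raising `threshold` to 80 would select the more extreme cases.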

The figure below shows the query percentages of the top 2 user agents for each of the 341 subgraphs. It shows whether the top user agents dominate the queries to a subgraph.

The figure below shows, for 100 subgraphs, the distribution of query usage across user agents in percent. Usage greater than 50% is marked in red. A bird's-eye view of the plots shows that some subgraphs have a dominating user agent, while most subgraphs have at least 1 or 2 user agents that issue the most queries. The rest of the user agents form the long-tail 10% of the distribution.

User agent vs Subgraph

Subgraph connectivity through queries

Triples analysis