
User:AKhatun/Wikidata Subgraph Query Analysis

Revision as of 19:51, 6 December 2021 by imported>AKhatun (→‎Subgraph connectivity through queries: Fix numbers)

Analysis on Subgraphs in Wikidata showed how large each of the subgraphs in Wikidata is and how connected they are. This page shows the results of analyzing the queries that relate to these subgraphs. The questions that needed to be answered were:

  • How many queries (as a percentage) access each subgraph?
  • How many queries access multiple subgraphs at once? i.e., how much overlap can we expect in subgraphs?
  • How long do these queries take?
  • How many user agents access each subgraph? Do they access many subgraphs, or are they confined to a small set of subgraphs? Do some of them dominate queries in multiple subgraphs?
  • Are there chunks of similar queries in these subgraphs? i.e., how diverse are the queries in each subgraph?

TL;DR

What are subgraph related queries

We define some parameters to identify whether a query touches on a subgraph, based on the items and properties the query uses. Some queries may touch on multiple subgraphs. See more on what a subgraph means here. Note: Subgraphs overlap.

The parameters that define which subgraph a query belongs to are:

  1. If the query uses the subgraph's Qid. Example: queries containing Q5 are part of the Q5 subgraph.
  2. If the query uses items that are instances of a particular subgraph.
  3. If the query uses items that occur 99% of the time in a particular subgraph.
  4. If the query uses properties that occur 99% of the time in a particular subgraph.
  5. If the query uses literals that occur 99% of the time in a particular subgraph. Literals can occur with or without language tags, and both versions are compared to check for a match. Note that whole literals are matched between queries and Wikidata. Queries that ask for partial matches, using regex for example, are not included. The assumption is that such queries are likely to contain other items from a subgraph and are caught anyway. This of course excludes generic queries that search the entirety of Wikidata for a certain label or other string.
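
The five rules above can be sketched as set lookups. This is only an illustration, not the pipeline used for this analysis: it assumes each query has already been parsed into its items, properties, and literals, and that the lookup tables were precomputed from the Wikidata dump. The names `SubgraphIndex` and `match_subgraphs`, and the sample data, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SubgraphIndex:
    """Hypothetical lookup tables, assumed to be precomputed from the dump."""
    subgraph_qids: set = field(default_factory=set)    # rule 1: Qids that name a subgraph
    instance_of: dict = field(default_factory=dict)    # rule 2: item -> subgraphs it is an instance of
    items_99: dict = field(default_factory=dict)       # rule 3: item -> subgraphs holding 99% of its uses
    properties_99: dict = field(default_factory=dict)  # rule 4: property -> subgraphs
    literals_99: dict = field(default_factory=dict)    # rule 5: (text, lang or None) -> subgraphs


def match_subgraphs(items, properties, literals, index):
    """Return {subgraph: set of rule names that caused the match}."""
    matches = {}

    def add(subgraph, rule):
        matches.setdefault(subgraph, set()).add(rule)

    for item in items:
        if item in index.subgraph_qids:
            add(item, "qid")
        for subgraph in index.instance_of.get(item, ()):
            add(subgraph, "instance_item")
        for subgraph in index.items_99.get(item, ()):
            add(subgraph, "item")
    for prop in properties:
        for subgraph in index.properties_99.get(prop, ()):
            add(subgraph, "property")
    for text, lang in literals:
        # Compare both the language-tagged and the untagged form of the literal.
        for key in {(text, lang), (text, None)}:
            for subgraph in index.literals_99.get(key, ()):
                add(subgraph, "literal")
    return matches


# Made-up sample data for illustration only.
index = SubgraphIndex(
    subgraph_qids={"Q5", "Q11424"},
    instance_of={"Q42": {"Q5"}},
    properties_99={"P225": {"Q16521"}},
)
result = match_subgraphs({"Q5", "Q42"}, {"P225"}, [], index)
print(result)
```

A single query can land in several subgraphs this way, and keeping the rule names per match is what would allow a per-rule breakdown like the one in the table further down.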

The following analysis uses the Wikidata dump of 20211101 and the WDQS public SPARQL queries from October 2021. All query-related values below are monthly counts.

Query count and time

  • All queries here are queries with status code 200 or 500, i.e. well-formed queries that either succeeded or timed out.
  • WDQS receives ~220M queries a month.
  • Total query time for all queries for a month is ~16,000 hours.

The table below lists the top 50 most queried subgraphs with subgraph size and query time information. A breakdown of what caused the match is also present, which corresponds to the parameters mentioned in #What are subgraph related queries. It also ranks the subgraphs by size, query count, and query time consumed.

A more complete list containing 341 subgraphs, which together form ~90% of Wikidata triples, is available here: [csvlink|all_subgraph_data.csv]

Top 50 most queried subgraphs in Wikidata with subgraph size information
Subgraph rank by size Subgraph rank by query count Subgraph rank by query time Subgraph Subgraph label %of triples %of entities Query count %count of all queries Query time (hr) Fraction of all query time %count of query from Qid %count of query from instance items %count of query from items %count of query from properties %count of query from literals
3 1 1 Q5 human 7.324 9.986 68,659,369 31.058 6314 0.393 1.827 17.705 10.324 20.176 1.11
5 2 4 Q16521 taxon 2.871 3.427 56,437,140 25.529 495 0.031 22.986 1.251 23.665 0.965 0.496
6 3 3 Q101352 family name 1.546 0.509 5,564,173 2.517 640 0.04 0.064 2.425 2.34 0.016 0.032
15 4 2 Q11424 film 0.364 0.281 4,757,084 2.152 1613 0.1 0.563 1.308 1.089 0.008 0.407
34 5 7 Q4830453 business 0.108 0.207 4,041,395 1.828 343 0.021 0.953 0.788 0.416 0.0 0.101
7 6 9 Q4167410 Wikimedia disambiguation page 1.374 1.459 3,737,550 1.691 223 0.014 0.195 0.484 0.554 0.0 0.938
177 7 20 Q34770 language 0.013 0.011 1,713,196 0.775 73 0.005 0.008 0.757 0.009 0.0 0.005
1 8 13 Q13442814 scholarly article 49.668 39.794 1,649,268 0.746 142 0.009 0.005 0.261 0.278 0.124 0.386
4 9 17 Q4167836 Wikimedia category 5.85 5.165 1,383,343 0.626 96 0.006 0.019 0.594 0.152 0.0 0.01
10 10 14 Q11173 chemical compound 0.693 1.302 1,307,852 0.592 133 0.008 0.022 0.548 0.449 0.001 0.014
20 11 22 Q13406463 Wikimedia list article 0.252 0.352 1,283,160 0.58 73 0.005 0.018 0.409 0.357 0.0 0.048
63 12 6 Q5398426 television series 0.055 0.062 1,206,285 0.546 366 0.023 0.05 0.332 0.252 0.0 0.128
243 13 24 Q14204246 Wikimedia project page 0.008 0.033 1,114,113 0.504 62 0.004 0.009 0.227 0.016 0.0 0.275
92 14 11 Q6881511 enterprise 0.036 0.052 943,613 0.427 164 0.01 0.034 0.338 0.144 0.0 0.042
26 15 29 Q484170 commune of France 0.18 0.043 866,766 0.392 46 0.003 0.006 0.278 0.004 0.098 0.007
165 16 12 Q891723 public company 0.015 0.013 837,595 0.379 157 0.01 0.034 0.277 0.061 0.0 0.054
12 17 19 Q3305213 painting 0.432 0.578 834,752 0.378 79 0.005 0.012 0.332 0.187 0.005 0.012
91 18 16 Q43229 organization 0.037 0.08 806,840 0.365 123 0.008 0.128 0.213 0.097 0.0 0.012
89 19 8 Q4164871 position 0.037 0.128 788,077 0.356 332 0.021 0.004 0.343 0.016 0.0 0.003
28 20 30 Q482994 album 0.161 0.287 776,845 0.351 37 0.002 0.012 0.287 0.209 0.0 0.016
86 21 23 Q47461344 written work 0.038 0.078 774,947 0.351 67 0.004 0.244 0.085 0.039 0.0 0.003
62 22 35 Q7889 video game 0.056 0.047 741,401 0.335 30 0.002 0.006 0.195 0.256 0.005 0.007
16 23 21 Q486972 human settlement 0.302 0.602 721,789 0.327 73 0.005 0.095 0.22 0.107 0.0 0.006
8 24 18 Q7187 gene 0.927 1.273 628,916 0.284 94 0.006 0.107 0.063 0.007 0.021 0.113
25 25 46 Q532 village 0.201 0.292 584,789 0.265 21 0.001 0.001 0.246 0.109 0.0 0.013
70 26 27 Q732577 publication 0.048 0.076 512,416 0.232 53 0.003 0.229 0.003 0.23 0.0 0.0
42 27 45 Q7725634 literary work 0.077 0.176 468,204 0.212 22 0.001 0.017 0.16 0.104 0.0 0.007
138 28 57 Q18340514 events in a specific year or time period 0.019 0.048 463,683 0.21 17 0.001 0.0 0.2 0.056 0.0 0.005
54 29 60 Q215380 musical group 0.063 0.087 461,181 0.209 17 0.001 0.009 0.164 0.073 0.0 0.008
2 30 28 Q6999 astronomical object 8.75 8.942 448,032 0.203 51 0.003 0.0 0.175 0.085 0.015 0.003
41 31 56 Q22808320 Wikimedia human name disambiguation page 0.078 0.075 433,986 0.196 17 0.001 0.0 0.174 0.154 0.0 0.001
53 32 63 Q134556 single 0.065 0.103 431,003 0.195 16 0.001 0.001 0.167 0.138 0.0 0.004
37 33 32 Q3331189 version, edition, or translation 0.087 0.19 410,352 0.186 34 0.002 0.103 0.053 0.118 0.004 0.028
31 34 41 Q16970 church building 0.129 0.226 396,936 0.18 25 0.002 0.005 0.172 0.112 0.0 0.001
71 35 25 Q86850539 Whitaker's Latin frequency type C 0.048 0.011 355,247 0.161 56 0.003 0.0 0.0 0.0 0.0 0.16
11 36 65 Q8054 protein 0.67 1.05 349,573 0.158 16 0.001 0.079 0.034 0.002 0.02 0.066
49 37 167 Q2225692 fourth-level administrative division in Indonesia 0.07 0.088 344,964 0.156 5 0.0 0.0 0.147 0.098 0.0 0.009
223 38 87 Q571 book 0.009 0.022 340,900 0.154 12 0.001 0.114 0.016 0.01 0.0 0.023
112 39 76 Q476028 association football club 0.026 0.038 320,422 0.145 14 0.001 0.006 0.12 0.029 0.0 0.003
21 40 10 Q2668072 collection 0.248 0.534 312,822 0.142 166 0.01 0.056 0.084 0.058 0.0 0.001
113 41 54 Q15632617 fictional human 0.026 0.056 306,319 0.139 18 0.001 0.006 0.1 0.05 0.0 0.003
121 42 42 Q3957 town 0.023 0.015 294,685 0.133 24 0.001 0.047 0.079 0.014 0.0 0.002
133 43 58 Q506240 television film 0.02 0.019 290,899 0.132 17 0.001 0.009 0.098 0.07 0.0 0.02
136 44 5 Q15416 television program 0.019 0.05 286,609 0.13 386 0.024 0.024 0.084 0.072 0.0 0.01
72 45 79 Q105543609 musical work/composition 0.048 0.099 285,889 0.129 13 0.001 0.004 0.095 0.061 0.004 0.009
64 46 38 Q811979 architectural structure 0.055 0.119 282,739 0.128 28 0.002 0.09 0.035 0.024 0.0 0.001
23 47 51 Q4022 river 0.219 0.425 280,190 0.127 20 0.001 0.002 0.12 0.045 0.0 0.002
32 48 31 Q41176 building 0.125 0.287 271,666 0.123 36 0.002 0.034 0.084 0.065 0.002 0.001
45 49 50 Q55488 railway station 0.075 0.104 258,862 0.117 20 0.001 0.001 0.109 0.072 0.0 0.001
192 50 143 Q3464665 television series season 0.011 0.02 254,318 0.115 6 0.0 0.031 0.077 0.009 0.0 0.0
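
As a sanity check, the share columns in the table above can be recomputed from the raw counts. The sketch below uses the monthly query total listed under "For reference" further down this page and the approximate 16,000-hour total; the helper name `percent_of` is hypothetical.

```python
TOTAL_QUERIES = 221_067_674   # monthly query total listed under "For reference"
TOTAL_HOURS = 16_000          # approximate monthly query time (~16,000 hours)


def percent_of(part, total):
    """Share of `part` in `total`, as a percentage rounded to 3 decimals."""
    return round(100 * part / total, 3)


human_pct = percent_of(68_659_369, TOTAL_QUERIES)    # Q5 query count
taxon_pct = percent_of(56_437_140, TOTAL_QUERIES)    # Q16521 query count
human_time_fraction = round(6314 / TOTAL_HOURS, 2)   # Q5 query time in hours
print(human_pct, taxon_pct, human_time_fraction)
```

The two count shares reproduce the 31.058 and 25.529 in the table. With the approximate 16,000-hour total, Q5 comes out near 0.39, in line with the 0.393 in the table, which suggests that the time column holds a fraction of 1 rather than a percentage.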

More on query time

The query time can be broken down into classes for better visualization. Below is a figure with the query class distribution (number of queries per query time class per subgraph) for the top 50 subgraphs. Some of the takeaways are:

  • Most subgraphs have most of their queries in the range of 10-100ms.
  • The second most common class is 100ms to 1s.
  • collection and photograph have most of their queries (~150k) timed at 1-10s. Around 10 more subgraphs have a small number of queries (~10-20k) in this time range.

File:Top 50 query time class.png
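
A minimal sketch of how queries could be bucketed into time classes per subgraph. Only the 10-100ms, 100ms-1s, and 1-10s classes are named in the text; the lowest and highest classes, the exact boundaries, and the function names are assumptions made for illustration.

```python
from collections import Counter

# Upper bounds in milliseconds. The "<10ms" and ">=10s" classes are assumed.
CLASS_BOUNDS = [
    (10, "<10ms"),
    (100, "10-100ms"),
    (1_000, "100ms-1s"),
    (10_000, "1-10s"),
]


def time_class(query_time_ms):
    """Map a query time in milliseconds to its class label."""
    for upper, label in CLASS_BOUNDS:
        if query_time_ms < upper:
            return label
    return ">=10s"


def class_distribution(rows):
    """rows: iterable of (subgraph, query_time_ms) -> Counter keyed by (subgraph, class)."""
    return Counter((subgraph, time_class(ms)) for subgraph, ms in rows)


# Made-up sample rows for illustration only.
sample = [("Q5", 42), ("Q5", 57), ("Q5", 350), ("Q2668072", 4_200)]
dist = class_distribution(sample)
print(dist)
```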

User agent

Analysis of user agents is approximate because user-agent strings do not map cleanly to distinct users. For example, many people use the same bot or script without changing the user agent, and the same person or bot may use multiple user-agent strings. Even so, the available data gives a rough picture.

  • Total number of unique user agents across all subgraphs: 981,180
  • First, the subgraphs with the most and the fewest distinct user agents are listed below. Even the subgraph with the fewest user agents has at least 10, so the large subgraphs are all used by multiple users.
  • The largest numbers of user agents appear in a variety of subgraph types. Among them, gene, protein, biological process, and molecular function appear to be related, and it is possible that the same queries touch several of these subgraphs. More on subgraph connectivity in #Subgraph Connectivity.
Subgraphs with most user-agents
Subgraph Subgraph label %Query #User agents Fraction of user agents
Q11424 film 2.152 251420 0.256
Q8054 protein 0.158 234659 0.239
Q7187 gene 0.284 187029 0.191
Q2996394 biological process 0.072 124415 0.127
Q14860489 molecular function 0.044 89445 0.091
Q5 human 31.058 55377 0.056
Q898273 protein domain 0.019 38484 0.039
Q16521 taxon 25.529 25193 0.026
Q86850539 Whitaker's Latin frequency type C 0.161 20158 0.021
Q4167410 Wikimedia disambiguation page 1.691 13818 0.014
Q14204246 Wikimedia project page 0.504 13443 0.014
Q476028 association football club 0.145 12086 0.012
Q235557 file format 0.045 7701 0.008
Q1520033 count noun 0.05 7662 0.008
Q417841 protein family 0.007 4906 0.005
Q484170 commune of France 0.392 4764 0.005
Q4830453 business 1.828 4383 0.004
Q4164871 position 0.356 4319 0.004
Q7278 political party 0.109 4073 0.004
Q3918 university 0.104 3565 0.004
Subgraphs with least user-agents
Subgraph Subgraph label %Query #User agents Fraction of user agents
Q106006703 local regulations of the People's Republic of China 0.0 11 0.0
Q67015940 Government Boys' Primary School 0.0 13 0.0
Q7604693 Statutory Rules of Northern Ireland 0.0 13 0.0
Q106474968 ethnic group by settlement in Macedonia 0.003 15 0.0
Q6453643 decree law 0.0 15 0.0
Q97695005 committee group motion 0.0 15 0.0
Q100532807 Irish Statutory Instrument 0.0 16 0.0
Q10429085 report 0.0 19 0.0
Q99045339 written question 0.0 20 0.0
Q1505023 Interpellation 0.0 20 0.0
Q96739634 individual motion 0.0 21 0.0
Q67035425 ASTM standard 0.0 21 0.0
Q61278455 health sub-centre 0.001 23 0.0
Q26267864 Wikimedia KML file 0.005 23 0.0
Q3508250 Syndicat intercommunal 0.02 24 0.0
Q107102664 cell line from embryonic stem cells 0.0 24 0.0
Q7604686 UK Statutory Instrument 0.0 27 0.0
Q6451276 Congressional Research Service report 0.001 28 0.0
Q61443650 sub post office 0.0 33 0.0
Q26894053 basketball team season 0.009 34 0.0
  • There are 50 subgraphs with more than 1000 user agents, and 300 subgraphs with fewer than 1000 user agents. Most subgraphs are therefore not queried very widely. The distribution of user agent counts below 1000 is shown in the figure below. It shows clearly how few user agents most subgraphs have.

File:Ua lessthan1k dist.png

  • Next, the distribution of query counts across user agents was analyzed for some of the top subgraphs. The user agent count gives an idea of how many users may be using a subgraph, but it does not show whether they all query the subgraph equally or whether a few user agents perform most of the queries.
  • ~30 out of 341 subgraphs have a single user agent that accounts for >=50% of all queries to that subgraph.
  • 6 subgraphs have a single user agent that accounts for around 80-90% of the queries.
  • So domination by a single query source is not widespread among subgraphs, but it is present in a few subgraphs nonetheless.
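
The dominance check described in the bullets above could be computed roughly as follows. This is a sketch that assumes query counts are already aggregated per (subgraph, user agent) pair; the function names and the sample counts are hypothetical.

```python
from collections import defaultdict


def top_agent_share(counts):
    """counts: {(subgraph, user_agent): n_queries}.

    Returns {subgraph: (top_user_agent, percent_of_subgraph_queries)}.
    """
    totals = defaultdict(int)
    best = {}
    for (subgraph, agent), n in counts.items():
        totals[subgraph] += n
        if subgraph not in best or n > best[subgraph][1]:
            best[subgraph] = (agent, n)
    return {sg: (agent, 100 * n / totals[sg]) for sg, (agent, n) in best.items()}


def dominated_subgraphs(counts, threshold=50.0):
    """Subgraphs whose top user agent accounts for at least `threshold` percent."""
    shares = top_agent_share(counts)
    return sorted(sg for sg, (_, share) in shares.items() if share >= threshold)


# Made-up counts for illustration only.
counts = {
    ("Q5", "ua-a"): 80, ("Q5", "ua-b"): 20,
    ("Q11424", "ua-a"): 30, ("Q11424", "ua-b"): 30, ("Q11424", "ua-c"): 40,
}
shares = top_agent_share(counts)
dominated = dominated_subgraphs(counts)
print(shares, dominated)
```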

The figure below shows the query percentages of the top 2 user agents for each of the 341 subgraphs. It shows whether the top user agents dominate the queries in a subgraph.

The figure below shows the distribution of query usage across user agents, in percent, for 100 subgraphs. Usage greater than 50% is marked in red. A bird's-eye view of the plots shows that some subgraphs have a dominating user agent, while most subgraphs have 1 or 2 user agents that query the most. The rest of the user agents form the long tail of the distribution.

User agent vs Subgraph

So far we have explored the user agent count and distribution per subgraph. It is also important to look at how each user agent's queries spread across subgraphs. In other words,

  • Do users have a very specific use case, so that their queries span only a few subgraphs? Or are their queries spread across many subgraphs?
  • Are there user agents that query the most in multiple subgraphs? This could be due to the nature of the use case, or simply because several subgraphs overlap a lot.

We start by looking at how many user agents access how many subgraphs. The table below shows that most user agents (89% of them) query only one subgraph, while some user agents query a large number of subgraphs. The plot below gives a clearer picture.

Relationship between subgraphs and user agents
#of Subgraphs (X) #of User agents querying X subgraphs %of User agents querying X subgraphs
1 875724 89.252
2 91962 9.373
5 3562 0.363
3 2388 0.243
6 1539 0.157
7 799 0.081
9 628 0.064
8 463 0.047
4 460 0.047
12 332 0.034
16 308 0.031
15 282 0.029
10 281 0.029
17 242 0.025
18 235 0.024
14 202 0.021
11 184 0.019
19 177 0.018
13 167 0.017
20 119 0.012
21 75 0.008
22 47 0.005
25 46 0.005
23 39 0.004
24 39 0.004
27 32 0.003
26 28 0.003
28 26 0.003
29 25 0.003
30 20 0.002
31 17 0.002
35 16 0.002
37 16 0.002
47 15 0.002
34 15 0.002
61 13 0.001
32 12 0.001
50 12 0.001
36 11 0.001
44 11 0.001
49 10 0.001
65 9 0.001
56 9 0.001
72 9 0.001
51 9 0.001
121 9 0.001
95 9 0.001
124 9 0.001
42 9 0.001
39 9 0.001
File:Ua vs subgraph.png
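
A table like the one above could be derived by counting the distinct subgraphs per user agent and then building a histogram over those counts. The sketch below assumes the input is a stream of (user agent, subgraph) pairs, one per matched query; the function name and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def agents_by_subgraph_count(pairs):
    """pairs: iterable of (user_agent, subgraph).

    Returns {n_subgraphs: (n_user_agents, percent_of_user_agents)}.
    """
    seen = defaultdict(set)
    for agent, subgraph in pairs:
        seen[agent].add(subgraph)   # repeated queries to the same subgraph count once
    histogram = Counter(len(subgraphs) for subgraphs in seen.values())
    total_agents = len(seen)
    return {
        k: (n, round(100 * n / total_agents, 3))
        for k, n in sorted(histogram.items())
    }


# Made-up pairs for illustration only.
pairs = [("a", "Q5"), ("a", "Q5"), ("b", "Q5"), ("b", "Q11424"), ("c", "Q16521")]
table = agents_by_subgraph_count(pairs)
print(table)

# The first row of the table above, recomputed from the counts on this page.
single_subgraph_pct = round(100 * 875_724 / 981_180, 3)
```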

Next we isolate user agents from each subgraph that query drastically more (>=10% difference) than the other user agents in the same subgraph and perform at least 100k queries (0.05% of all queries) a month. About 30 such user agents were found. The subgraph distributions of all these user agents were plotted to find the large buckets where they tend to query. The plot is shown below, followed by some explicit observations.

File:Imp ua dist censored.png
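
One possible reading of the selection rule above, sketched in code: a user agent stands out in a subgraph if its share of that subgraph's queries exceeds the runner-up's share by at least 10 percentage points, and the user agent issues at least 100k queries a month overall. Both the interpretation of ">=10% difference" and the choice to apply the 100k threshold to the agent's total are assumptions; the names and sample counts are hypothetical.

```python
from collections import defaultdict


def standout_agents(counts, min_gap=10.0, min_queries=100_000):
    """counts: {(subgraph, user_agent): n_queries}.

    Returns {user_agent: [subgraphs where it leads by >= min_gap points]}.
    """
    per_subgraph = defaultdict(dict)
    per_agent = defaultdict(int)
    for (subgraph, agent), n in counts.items():
        per_subgraph[subgraph][agent] = n
        per_agent[agent] += n

    result = defaultdict(list)
    for subgraph, agents in per_subgraph.items():
        total = sum(agents.values())
        ranked = sorted(agents.items(), key=lambda kv: kv[1], reverse=True)
        top_agent, top_n = ranked[0]
        second_n = ranked[1][1] if len(ranked) > 1 else 0
        gap = 100 * (top_n - second_n) / total   # gap in percentage points
        if gap >= min_gap and per_agent[top_agent] >= min_queries:
            result[top_agent].append(subgraph)
    return dict(result)


# Made-up counts for illustration only.
counts = {
    ("Q5", "ua-a"): 150_000, ("Q5", "ua-b"): 50_000,
    ("Q11424", "ua-b"): 60_000, ("Q11424", "ua-c"): 55_000,
}
standouts = standout_agents(counts)
print(standouts)
```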

Percentages below are percent of all monthly queries.

  • mix n match (UA #17):
    • a lot of taxon queries (Q16521), 23%
    • a lot of human queries (Q5), 4%
  • UA #6:
    • 1% in Business (Q4830453)
  • UA #14:
    • 1% in human (Q5)
    • 0.5% in film (Q11424)
  • UA #23:
    • 1.73% in family name (Q101352)
    • 1.73% in human (Q5)
    • both have exactly the same counts, meaning they could be the same queries touching both the human and family name subgraphs

For reference:

  • 100% is 221,067,674 queries
  • 10% is 22,106,767 queries
  • 1% is 2,210,676 queries
  • 0.1% is 221,067 queries
  • 0.05% is 110,533 queries
  • 0.01% is 22,106 queries

Subgraph connectivity through queries

Subgraph connectivity was explored to some extent using only Wikidata in Wikidata_Subgraph_Analysis. This was based on which items or properties were common between subgraphs and how many direct connections were present between them. A visualization was created to show the strength of this connectivity between subgraphs here: wikidata_graph. This section aims to analyze the connectivity of subgraphs through the queries, i.e., how often subgraphs are queried together.

  • Subgraph queries: The queries that touch on at least one of the top 341 subgraphs make up 72% of all queries.
  • First we look at how many subgraphs most queries access. The tables below show the query groups that access the most and the fewest subgraphs.
  • 70% of all queries (97% of subgraph queries) touch on 1 or 2 subgraphs. 64% of all queries (90% of subgraph queries) touch on only 1 subgraph.
Queries with most subgraphs accessed
#of Subgraphs #of Queries
341 25
333 1
315 2
313 3
258 1
181 3
152 1
142 1
133 2
130 2
129 1
128 2
127 4
126 4
125 9
Queries with least subgraphs accessed
#of Subgraphs #of Queries %of Queries
1 142507736 64.463
2 12464811 5.638
3 1767253 0.799
4 586173 0.265
5 364445 0.165
6 221485 0.1
7 188012 0.085
8 112922 0.051
9 102524 0.046
10 68871 0.031
11 50341 0.023
12 38102 0.017
13 34075 0.015
14 24003 0.011
15 17935 0.008

File:NumQuery vs numSubgraph.png
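
The "of subgraph queries" percentages quoted above follow from dividing a share of all queries by the share of queries that touch any subgraph. A short worked check, using the rounded 72% figure from the text (so the results are approximate) and a hypothetical helper name:

```python
TOTAL_QUERIES = 221_067_674   # monthly query total listed under "For reference"
SUBGRAPH_SHARE = 72.0         # percent of all queries touching at least one subgraph (rounded)


def share_within_subgraph_queries(percent_of_all, subgraph_share=SUBGRAPH_SHARE):
    """Convert a percentage of all queries into a percentage of subgraph queries."""
    return 100 * percent_of_all / subgraph_share


one_subgraph = 100 * 142_507_736 / TOTAL_QUERIES    # queries touching exactly 1 subgraph
two_subgraphs = 100 * 12_464_811 / TOTAL_QUERIES    # queries touching exactly 2 subgraphs

print(round(one_subgraph, 3), round(two_subgraphs, 3))
print(round(share_within_subgraph_queries(one_subgraph)))
print(round(share_within_subgraph_queries(one_subgraph + two_subgraphs)))
```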

  • It is hard to see which subgraphs occur together from the data above. So the subgraphs that occurred together in a query were broken into pairs, and the pairs that occur together the most were listed.
  • There are 57,970 subgraph pairs that occur together in queries. The total possible subgraph pair count is (340*341)/2 = 57,970. This shows that every subgraph is connected to every other subgraph through queries! Of course, the number of queries varies widely between pairs.
  • A list of some of the most queried subgraph pairs is shown below.
Top pairs of subgraphs that are queried together
Subgraph 1 Subgraph 2 Query
Subgraph Subgraph label Subgraph Subgraph label #of Query %of Query
Q101352 family name Q5 human 4935675 2.233
Q4830453 business Q6881511 enterprise 883757 0.4
Q11424 film Q5 human 771698 0.349
Q4830453 business Q891723 public company 735902 0.333
Q3305213 painting Q4167410 Wikimedia disambiguation page 629633 0.285
Q4164871 position Q5 human 541257 0.245
Q47461344 written work Q732577 publication 493402 0.223
Q11424 film Q14204246 Wikimedia project page 483338 0.219
Q6881511 enterprise Q891723 public company 480426 0.217
Q4167410 Wikimedia disambiguation page Q5 human 466217 0.211
Q14204246 Wikimedia project page Q4167410 Wikimedia disambiguation page 436192 0.197
Q13406463 Wikimedia list article Q5 human 394815 0.179
Q4830453 business Q5 human 354945 0.161
Q13442814 scholarly article Q4167410 Wikimedia disambiguation page 316720 0.143
Q13442814 scholarly article Q5 human 282237 0.128
Q13406463 Wikimedia list article Q18340514 events in a specific year or time period 274841 0.124
Q3331189 version, edition, or translation Q5 human 273761 0.124
Q571 book Q5 human 259234 0.117
Q16521 taxon Q5 human 222118 0.1
Q4167410 Wikimedia disambiguation page Q811979 architectural structure 204572 0.093
Q4167410 Wikimedia disambiguation page Q838948 work of art 200810 0.091
Q5398426 television series Q5 human 197997 0.09
Q47461344 written work Q5 human 194750 0.088
Q43229 organization Q4830453 business 179640 0.081
Q5 human Q6881511 enterprise 172486 0.078
Q43229 organization Q5 human 171567 0.078
Q2225692 fourth-level administrative division in Indonesia Q532 village 171086 0.077
Q215380 musical group Q5 human 168318 0.076
Q15632617 fictional human Q5 human 163992 0.074
Q3305213 painting Q838948 work of art 161979 0.073
  • The distribution of the number of times each subgraph pair in Wikidata occurs in queries is shown below. Note that the (A,B) pair is the same as the (B,A) pair, so there is no duplication in the plots. Since the distribution is extremely skewed, three plots with different limits on the number of occurrences are shown. Only a small number of pairs occur together often (these can be seen in the table above), whereas a huge number of pairs occur together only a few times.

File:Subgraph pair dist.png
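
Breaking each query's subgraphs into unordered pairs, as described in this section, can be sketched with `itertools.combinations`. The input format (one set of subgraph Qids per query) and the function name are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import comb


def count_pairs(query_subgraphs):
    """query_subgraphs: iterable of sets of subgraph Qids, one set per query."""
    pairs = Counter()
    for subgraphs in query_subgraphs:
        # sorted() fixes one canonical order, so (A, B) and (B, A) share a key.
        pairs.update(combinations(sorted(subgraphs), 2))
    return pairs


# Made-up queries for illustration only.
queries = [{"Q5", "Q101352"}, {"Q101352", "Q5", "Q11424"}, {"Q5"}]
pair_counts = count_pairs(queries)
print(pair_counts.most_common(1))

# Number of possible unordered pairs among 341 subgraphs: 341 * 340 / 2.
possible_pairs = comb(341, 2)
print(possible_pairs)
```

Queries that touch a single subgraph contribute no pairs, and a query touching k subgraphs contributes k*(k-1)/2 pairs, which is why a handful of very broad queries is enough to connect every pair at least once.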

  • Below is a heatmap of the number of queries, where both the x and y axes represent subgraph indices (names of subgraphs are not shown due to space).
  • The diagonal shows queries that use only 1 subgraph, represented as Q5-Q5 or Q42-Q42 for example. Others are represented as Q5-Q42 or Q42-Q5 for example (the plot is symmetrical).
  • The many vertical and horizontal lines indicate that lots of subgraphs pair with many other subgraphs. More analysis on this below.

File:Subgraph pair heatmap.png
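
The matrix behind a heatmap like this could be built as below: single-subgraph queries go on the diagonal and every co-occurring pair is counted in both symmetric cells. This is a plain-Python sketch under those assumptions, not the code used to produce the figure; `pair_matrix` and the sample data are hypothetical.

```python
def pair_matrix(query_subgraphs, order):
    """Build a symmetric count matrix.

    query_subgraphs: iterable of sets of subgraph Qids, one set per query.
    order: list of subgraph Qids that fixes the axis indices.
    """
    index = {subgraph: i for i, subgraph in enumerate(order)}
    size = len(order)
    matrix = [[0] * size for _ in range(size)]
    for subgraphs in query_subgraphs:
        positions = sorted(index[s] for s in subgraphs if s in index)
        if len(positions) == 1:
            # Diagonal cell: the query used only this one subgraph.
            matrix[positions[0]][positions[0]] += 1
        for offset, a in enumerate(positions):
            for b in positions[offset + 1:]:
                matrix[a][b] += 1
                matrix[b][a] += 1   # mirror to keep the matrix symmetric
    return matrix


# Made-up queries, using the Q5 and Q42 labels from the example above.
matrix = pair_matrix([{"Q5"}, {"Q5", "Q42"}, {"Q5"}], order=["Q5", "Q42"])
print(matrix)
```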

Triples analysis