
User:AKhatun/Wikidata Subgraph Query Analysis: Difference between revisions


Revision as of 20:57, 26 January 2022

The analysis on subgraphs in Wikidata showed how large each subgraph is and how connected the subgraphs are. This page presents the results of analyzing the queries that relate to these subgraphs. The questions to be answered were:

  • What percentage of queries access each subgraph?
  • How many queries access multiple subgraphs at once? I.e., how much overlap can we expect between subgraphs?
  • How long do these queries take?
  • How many user-agents access each subgraph? How many of them access lots of subgraphs, or are they confined to a small set of subgraphs? Do some of them dominate queries in multiple subgraphs?
  • Are there chunks of similar queries in these subgraphs? I.e., how diverse are the queries in each subgraph?

Shorter TL;DR

  • The monthly queries that touch the top 341 subgraphs were analyzed. The percentage of queries for each subgraph changes slightly from month to month. Only two subgraphs have significantly more queries than the others: the human subgraph has ~30% of the queries, and the taxon subgraph has ~20%.
  • Query times are divided into 5 classes. Most query times are low (10ms to 100ms, the second class of query time), and some are 100ms to 1s (the third class of query time).
  • Most subgraphs don’t have a lot of user-agents accessing them. Some of them have a few user-agents doing most of the queries. The trend of a dominating single source of queries is not widespread among subgraphs.
  • Most user agents (89% of them) query only one subgraph. A few user agents query many subgraphs.
  • 70% of all queries touch on 1 or 2 subgraphs. 64% of all queries touch on only 1 subgraph.
  • No significant correlation was found between query time and the number of subgraphs a query tries to access.
  • For human subgraph:
    • The queries that were estimated to be related to the human subgraph accounted for 31.94% of all queries in Wikidata. 25.78% of queries used only the human subgraph and the rest 6.16% of queries used a mix of human and various other subgraphs.
    • The total query time of the human subgraph is 34% of the total query time. The average time per query is 0.3 seconds (300 ms). Most queries in this subgraph are small and simple.
    • More than 50% of the queries in the human subgraph are made by ~10 user agents. So only a small number of user-agents make most of the queries.
    • The top 10 types of queries account for 60% of the queries in the human subgraph. So a small number of query types cover the bulk of the queries.
    • Some user agents do a moderate amount of small simple queries of various types, but most user agents do only 1 type of query.
  • For taxon subgraph:
    • The queries that were estimated to be related to the taxon subgraph accounted for 14.26% of all queries in Wikidata. 13.57% of queries used only the taxon subgraph and the remaining 0.69% used a mix of taxon and various other subgraphs.
    • Q-ID (Q16521) match is almost the only reason for queries to be in the taxon subgraph.
    • Most properties that match the taxon subgraph are some sort of external IDs.
    • The total query time of taxon subgraph is ~3% of total query time. The average time per query is 0.064 seconds (64 ms). Almost all queries in this subgraph are small and simple.
    • The top user agent performs 85% of taxon subgraph queries. Basically, a single user agent does most of the queries.
    • The variety of query types in the taxon subgraph is much smaller (1.1K) than in the human subgraph (11K). The top 3 types of queries alone account for ~90% of the queries in the taxon subgraph.
    • Most user agents make only 1 type of query.
    • In sum, most queries in the taxon subgraph are small and simple, take between 10 and 100ms to run, and mostly come from 1 user agent.

TL;DR

  • The monthly queries that touch/access each of the top 341 subgraphs were determined. The percentage of queries for each subgraph changes slightly from month to month. The list of the top 50 subgraphs for Nov 2021 can be found here: #Query count and time, and a comparison of the Oct and Nov 2021 data is shown here: #Comparison of subgraph queries across time.
  • Only two subgraphs have significantly more queries than the others: the human subgraph has ~30% of the queries, and the taxon subgraph has ~20%.
  • Most subgraphs have most query times in the range of 10ms to 100ms. The second most common class is 100ms to 1s. A small number of subgraphs have more time-consuming queries (1s - 10s).
  • Most subgraphs don’t have a lot of user-agents accessing them. Some of them have a few user-agents doing most of the queries. ~30 out of 341 subgraphs have a user agent that makes >=50% of all queries of that particular subgraph. 6 subgraphs have a user agent making 80-90% of the queries. So the trend of a dominating single source of queries is not widespread among subgraphs but is present in a few subgraphs nonetheless.
  • Most user agents (89% of them) query only one subgraph. Some user agents query a lot of subgraphs as well. Explicit observations about some user-agents are shown in the #User agent vs Subgraph section.
  • Looking at the connection among subgraphs through queries we see: 70% of all queries (97% of queries in 341 subgraphs) touch on 1 or 2 subgraphs. 64% of all queries (90% of queries in 341 subgraphs) touch on only 1 subgraph.
  • No significant correlation was found between query time and the number of subgraphs a query tries to access. But queries that access more subgraphs (although fewer in number) do appear slightly more often in the larger query time classes.
  • In-depth analysis was done on Human and Taxon subgraph queries since they account for the most queries per subgraph.
  • For human subgraph:
    • The queries that were estimated to be related to the human subgraph accounted for 31.94% of all queries in Wikidata. 25.78% of queries used only the human subgraph and the rest 6.16% of queries used a mix of human and various other subgraphs.
    • A lot of the queries are associated with human subgraph due to the properties they use, the instance of items, and URIs in subject or object.
    • There are some high-usage items, but mostly a long list of low-usage items. For properties, only ~10 properties account for most of the queries in the human subgraph. The matched URIs form a smooth logarithmic pattern, with most URIs being Wikipedia article links in various languages.
    • The total query time of the human subgraph is 34% of the total query time. The average time per query is 0.3 seconds (300 ms). Most queries in this subgraph are small and simple.
    • More than 50% of the queries in human subgraph are done by ~10 user agents. The top 2 user agents perform 20% of the human subgraph queries (7% of all queries). The first user agent does small queries that don’t require much time (10-100ms), whereas the second user agent performs queries that take comparatively more time (100ms - 1s). Note that user-agent strings are not directly representative of distinct users. The second top user-agent here, for example, has multiple variations of user-agent strings which were considered different at this time for simplicity.
    • In terms of query type (type is determined by the operations used in a query): the top 10 types of queries account for 60% of the queries of human subgraph. The rest form a long tail of less-used query types.
    • Most user agents do only 1 type of query. 8 user agents perform more than 500 types of queries. But looking into these types, it seems most of these types don’t have a lot of queries, and are mostly small simple queries. In short: Some user agents do a moderate amount of small simple queries of various types, and most user agents do only 1 type of query.
  • For taxon subgraph:
    • The queries that were estimated to be related to the taxon subgraph accounted for 14.26% of all queries in Wikidata. 13.57% of queries used only the taxon subgraph and the remaining 0.69% used a mix of taxon and various other subgraphs.
    • Q-ID match is almost the only reason for queries to be in the taxon subgraph (12% out of 14%). Basically, these are the queries that have Q16521 in them.
    • Only a few (~3) items match a significant number of queries, forming a logarithmic distribution of item use across queries. The distribution of properties used in taxon subgraph queries is also extremely skewed towards only ~10 properties, most of which are some sort of external ID. Most URIs matched are simply the taxon Q-ID (Q16521). Of the unique URIs, ~30% are various Wikipedia links.
    • The total query time of taxon subgraph is ~3% of total query time. The average time per query is 0.064 seconds (64 ms). Almost all queries in this subgraph are small and simple.
    • The top user agent performs 85% of taxon subgraph queries, and the next 4 user agents combined perform 9% of the queries. The rest of the user-agents perform <1% of the queries in taxon subgraph. In terms of time, the top user agent, which made 85% of the queries, accounts for 33% of time consumed. But two more user agents perform some comparatively heavy queries. With less than 1% of queries each, they account for 26% and 11% of query time respectively.
    • The variety of query types in the taxon subgraph is much smaller (1.1K) than in the human subgraph (11K). The top 3 types of queries alone account for ~90% of the queries in the taxon subgraph. The rest form a long tail of small query counts.
    • Most user agents make only 1 type of query. Only 3 user agents make queries of >100 types, and 5 user agents make queries of 50-100 types. The top user agents (in terms of query time and count) mostly make <5 types of queries. The exceptions are the two user agents that accounted for comparatively more query time: they perform 25-35 types of queries, which is still quite a small number.
    • In sum, most queries in the taxon subgraph are small and simple, take between 10 and 100ms to run, and mostly come from 1 or 2 user agents.

What are subgraph related queries

We define some parameters to identify whether a query touches on a subgraph based on the items and properties a query uses. Some queries may even touch multiple subgraphs. See more on what a subgraph means here. Note: Subgraphs have overlaps.

The parameters that define which subgraph a query belongs to are:

  1. If the query uses the subgraph's Q-ID. Example: queries containing Q5 are part of the Q5 subgraph.
  2. If the query uses items that are instances of a particular subgraph.
  3. If the query uses items that occur 99% of the time in a particular subgraph.
  4. If the query uses properties that occur 99% of the time in a particular subgraph.
  5. If the query uses literals that occur 99% of the time in a particular subgraph. The literals can occur with or without language tags; both versions are compared to check for a match. Note that only whole literals are matched between queries and Wikidata. Queries that ask for partial matches, using regex for example, are not included. The assumption is that such queries are likely to contain other items from the subgraph and are caught anyway.
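
The five parameters above can be sketched as a set of membership checks. This is only an illustration, not the code used for the analysis: the `SubgraphProfile` structure, the function names, and the example items and properties are hypothetical, and the lookup sets are assumed to have been precomputed from the Wikidata dump.

```python
from dataclasses import dataclass, field


@dataclass
class SubgraphProfile:
    """Hypothetical per-subgraph lookup sets, assumed precomputed from a dump."""
    qid: str
    instance_items: set = field(default_factory=set)       # items that are instances of the subgraph
    frequent_items: set = field(default_factory=set)       # items occurring 99% of the time in it
    frequent_properties: set = field(default_factory=set)  # properties occurring 99% of the time in it
    frequent_literals: set = field(default_factory=set)    # whole literals occurring 99% of the time in it


def strip_lang(literal: str) -> str:
    """Drop a trailing language tag such as '@en' from a quoted literal."""
    if literal.startswith('"') and '"@' in literal:
        return literal[: literal.rindex('"@') + 1]
    return literal


def match_reasons(items, properties, literals, profile):
    """Return which of the five parameters tie a query to the subgraph."""
    items, properties = set(items), set(properties)
    # Compare literals both with and without their language tags.
    literals = set(literals) | {strip_lang(lit) for lit in literals}
    reasons = []
    if profile.qid in items:
        reasons.append("qid")
    if items & profile.instance_items:
        reasons.append("instance_items")
    if items & profile.frequent_items:
        reasons.append("items")
    if properties & profile.frequent_properties:
        reasons.append("properties")
    if literals & profile.frequent_literals:
        reasons.append("literals")
    return reasons


# Illustrative values only: Q42 and P569 are placeholders for "an item that is
# an instance of the subgraph" and "a property concentrated in the subgraph".
human = SubgraphProfile(qid="Q5", instance_items={"Q42"}, frequent_properties={"P569"})
print(match_reasons({"Q42"}, {"P31", "P569"}, set(), human))
```

A query belongs to a subgraph when at least one reason is returned, and it can belong to several subgraphs at once because the same check is repeated for every profile.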

The following analysis uses Wikidata dump of 20211101 and WDQS public SPARQL queries of 10/2021 unless otherwise stated. All query related numbers below are monthly values.

Query count and time

  • All queries here refer to queries with status code 200 or 500, i.e., correct queries, whether successful or timed out.
  • WDQS receives ~220M queries a month.
  • Total query time for all queries for a month is ~16,000 hours.

The table below lists the top 50 most queried subgraphs with subgraph size and query time information of 11/2021. A breakdown of what caused the match is also present, which corresponds to the parameters mentioned in #What are subgraph related queries. It also ranks the subgraphs by size, query count, and query time consumed. A more complete list containing 341 subgraphs, that form ~90% of Wikidata triples, is available here: subgraph data for November 2021, and subgraph data for October 2021. The difference between values from October and November is shown in the next table for comparison purposes. In some places, the query count percentages differ slightly.

Top 50 most queried subgraphs in Wikidata with subgraph size information
Subgraph rank by size | Subgraph rank by query count | Subgraph rank by query time | Subgraph | Subgraph label | % of triples | % of entities | Days to recover (4.77M rate) | Query count | % count of all queries | Query time (hr) | % time of all queries | Avg time/query (s) | % of queries using only this subgraph | % of queries using this and other subgraphs | % count of queries from Q-ID | % count of queries from instance items | % count of queries from items | % count of queries from properties | % count of queries from literals
3 1 1 Q5 human 7.254 10.045 204 60,868,572 31.941 5248 34.195 0.31 80.718 19.282 2.541 18.199 12.198 19.457 1.435
5 2 11 Q16521 taxon 2.885 3.5 81 27,172,995 14.259 480 3.131 0.064 95.179 4.821 12.19 0.746 12.862 0.87 0.433
34 3 7 Q4830453 business 0.107 0.208 3 9,228,037 4.842 554 3.607 0.216 68.097 31.903 1.646 2.95 2.24 0.001 0.177
6 4 5 Q101352 family name 1.646 0.511 46 5,990,617 3.144 659 4.292 0.396 20.867 79.133 0.041 3.057 2.791 0.018 0.038
15 5 2 Q11424 film 0.359 0.284 10 5,067,305 2.659 1541 10.042 1.095 74.15 25.85 0.451 1.469 1.348 0.003 0.543
1 6 13 Q13442814 scholarly article 48.935 39.815 1378 4,944,995 2.595 263 1.713 0.191 76.272 23.728 0.017 1.942 1.938 0.405 0.396
7 7 3 Q4167410 Wikimedia disambiguation page 1.354 1.464 38 3,292,873 1.728 765 4.982 0.836 22.922 77.078 0.164 0.192 0.472 0.0 1.163
2 8 25 Q6999 astronomical object 8.684 8.943 245 2,444,109 1.283 79 0.516 0.117 92.622 7.378 0.003 1.218 1.222 0.023 0.004
92 9 14 Q6881511 enterprise 0.036 0.052 1 1,937,486 1.017 234 1.528 0.436 2.584 97.416 0.083 0.812 0.538 0.0 0.071
26 10 29 Q484170 commune of France 0.179 0.048 5 1,934,902 1.015 70 0.455 0.13 78.322 21.678 0.024 0.869 0.085 0.115 0.01
19 11 22 Q13406463 Wikimedia list article 0.249 0.355 7 1,766,742 0.927 117 0.765 0.239 17.64 82.36 0.034 0.372 0.628 0.0 0.137
63 12 12 Q5398426 television series 0.055 0.063 2 1,379,486 0.724 411 2.68 1.073 66.963 33.037 0.048 0.376 0.369 0.0 0.167
37 13 47 Q7725634 literary work 0.087 0.203 2 1,377,546 0.723 42 0.273 0.11 6.75 93.25 0.39 0.181 0.243 0.0 0.009
16 14 4 Q486972 human settlement 0.298 0.612 8 1,328,064 0.697 699 4.557 1.896 42.354 57.646 0.328 0.39 0.236 0.0 0.005
163 15 15 Q891723 public company 0.015 0.013 0 1,175,813 0.617 219 1.426 0.67 10.916 89.084 0.042 0.415 0.185 0.001 0.092
90 16 6 Q43229 organization 0.037 0.082 1 1,067,340 0.56 600 3.908 2.023 40.99 59.01 0.259 0.227 0.146 0.0 0.021
13 17 24 Q3305213 painting 0.426 0.579 12 926,701 0.486 86 0.558 0.333 24.622 75.378 0.017 0.426 0.284 0.002 0.008
87 18 36 Q47461344 written work 0.037 0.078 1 881,216 0.462 53 0.345 0.216 4.058 95.942 0.289 0.079 0.114 0.0 0.003
25 19 32 Q532 village 0.199 0.294 6 872,310 0.458 61 0.399 0.253 39.133 60.867 0.003 0.417 0.198 0.0 0.015
4 20 28 Q4167836 Wikimedia category 5.806 5.175 164 808,536 0.424 74 0.484 0.331 81.351 18.649 0.037 0.363 0.292 0.0 0.024
61 21 51 Q7889 video game 0.055 0.048 2 753,351 0.395 37 0.244 0.179 62.267 37.733 0.006 0.181 0.314 0.002 0.01
20 22 41 Q8502 mountain 0.248 0.559 7 749,283 0.393 47 0.306 0.225 67.13 32.87 0.002 0.369 0.351 0.0 0.001
28 23 33 Q482994 album 0.16 0.288 5 704,746 0.37 59 0.388 0.304 27.474 72.526 0.012 0.15 0.189 0.0 0.098
89 24 17 Q4164871 position 0.037 0.128 1 645,434 0.339 175 1.141 0.977 12.504 87.496 0.003 0.305 0.025 0.0 0.011
8 25 16 Q7187 gene 0.911 1.273 26 604,364 0.317 208 1.354 1.238 31.577 68.423 0.084 0.1 0.022 0.015 0.127
11 26 26 Q11173 chemical compound 0.684 1.302 19 588,469 0.309 76 0.496 0.466 68.901 31.099 0.135 0.11 0.092 0.002 0.014
55 27 54 Q215380 musical group 0.062 0.087 2 585,266 0.307 37 0.241 0.227 39.016 60.984 0.01 0.205 0.16 0.0 0.011
31 28 39 Q16970 church building 0.128 0.227 4 577,677 0.303 48 0.315 0.301 43.769 56.231 0.003 0.288 0.214 0.0 0.002
71 29 55 Q732577 publication 0.047 0.076 1 569,536 0.299 37 0.238 0.231 4.203 95.797 0.283 0.015 0.296 0.0 0.0
22 30 43 Q79007 street 0.23 0.626 6 535,623 0.281 44 0.289 0.298 47.589 52.411 0.028 0.246 0.218 0.001 0.001
23 31 34 Q4022 river 0.216 0.425 6 520,347 0.273 56 0.365 0.388 53.592 46.408 0.002 0.254 0.192 0.0 0.002
242 32 8 Q14204246 Wikimedia project page 0.008 0.033 0 498,708 0.262 548 3.572 3.957 8.77 91.23 0.026 0.19 0.038 0.0 0.064
36 33 63 Q3947 house 0.096 0.216 3 465,249 0.244 33 0.212 0.252 58.051 41.949 0.0 0.238 0.223 0.0 0.002
32 34 31 Q41176 building 0.124 0.29 3 463,636 0.243 65 0.423 0.504 37.511 62.489 0.042 0.189 0.168 0.001 0.002
307 35 62 Q783794 company 0.005 0.012 0 459,638 0.241 33 0.213 0.256 44.132 55.868 0.081 0.146 0.1 0.0 0.006
29 36 48 Q23397 lake 0.136 0.279 4 456,054 0.239 42 0.273 0.331 59.859 40.141 0.002 0.227 0.211 0.0 0.001
119 37 42 Q3957 town 0.023 0.015 1 450,870 0.237 46 0.297 0.364 44.245 55.755 0.057 0.162 0.034 0.0 0.003
64 38 40 Q811979 architectural structure 0.054 0.12 2 445,779 0.234 48 0.313 0.388 12.038 87.962 0.097 0.126 0.117 0.0 0.001
80 39 59 Q34442 road 0.041 0.073 1 440,960 0.231 34 0.22 0.276 14.176 85.824 0.008 0.129 0.171 0.0 0.001
275 40 180 Q21198342 manga series 0.007 0.015 0 437,382 0.23 11 0.074 0.093 28.665 71.335 0.01 0.052 0.2 0.0 0.003
72 41 23 Q86850539 Whitaker's Latin frequency type C 0.047 0.011 1 436,103 0.229 95 0.622 0.788 10.35 89.65 0.0 0.0 0.0 0.0 0.228
138 42 139 Q18340514 events in a specific year or time period 0.019 0.048 1 431,649 0.227 16 0.104 0.133 10.729 89.271 0.0 0.21 0.068 0.0 0.004
261 43 53 Q2085381 publisher 0.007 0.015 0 420,459 0.221 37 0.243 0.319 52.906 47.094 0.001 0.21 0.068 0.0 0.004
44 44 38 Q55488 railway station 0.074 0.104 2 410,774 0.216 49 0.319 0.43 25.81 74.19 0.001 0.172 0.163 0.0 0.002
108 45 27 Q33506 museum 0.027 0.044 1 409,716 0.215 75 0.486 0.655 28.194 71.806 0.017 0.184 0.134 0.0 0.001
181 46 19 Q34770 language 0.013 0.011 0 402,013 0.211 145 0.947 1.302 33.166 66.834 0.009 0.169 0.02 0.0 0.017
112 47 86 Q15632617 fictional human 0.025 0.056 1 395,934 0.208 25 0.166 0.232 17.231 82.769 0.007 0.138 0.09 0.0 0.004
42 48 119 Q22808320 Wikimedia human name disambiguation page 0.077 0.075 2 381,873 0.2 19 0.125 0.181 67.093 32.907 0.0 0.164 0.142 0.0 0.001
143 49 75 Q11032 newspaper 0.017 0.043 0 380,153 0.199 28 0.181 0.263 55.697 44.303 0.002 0.169 0.143 0.0 0.019
38 50 117 Q3331189 version, edition, or translation 0.087 0.191 2 374,597 0.197 19 0.126 0.186 10.191 89.809 0.117 0.037 0.134 0.0 0.038
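
Some columns of the table can be derived from others, which is a useful sanity check when reading it. The sketch below assumes (it is not stated explicitly on this page) that the average time per query is the query time in hours converted to seconds and divided by the query count, and that the "only this subgraph" percentage is relative to the subgraph's own queries. The function names are hypothetical; the input values are copied from the Q5 and Q16521 rows.

```python
def avg_seconds_per_query(query_time_hours, query_count):
    """Average query time in seconds from total hours and query count."""
    return query_time_hours * 3600 / query_count


def split_share(pct_of_all_queries, pct_only_this_subgraph):
    """Split a subgraph's share of all queries into 'only this subgraph'
    and 'this and other subgraphs', both as percentages of all queries."""
    only = pct_of_all_queries * pct_only_this_subgraph / 100
    return only, pct_of_all_queries - only


# Q5 (human): 5248 hours over 60,868,572 queries
print(round(avg_seconds_per_query(5248, 60_868_572), 2))
# Q16521 (taxon): 480 hours over 27,172,995 queries
print(round(avg_seconds_per_query(480, 27_172_995), 3))
# Q5: 31.941% of all queries, 80.718% of which use only this subgraph
print([round(x, 2) for x in split_share(31.941, 80.718)])
```

Under these assumptions the results reproduce the 0.31 s and 0.064 s averages in the table, as well as the 25.78% and 6.16% split quoted for the human subgraph in the TL;DR.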

Comparison of subgraph queries across time

Comparison of subgraph queries across time (Oct, Nov 2021)
Subgraph rank by size | Subgraph | Subgraph label | % of entities | % of triples | Oct query count | Oct % count of queries | Oct query time (hr) | Oct % time of queries | Nov query count | Nov % count of queries | Nov query time (hr) | Nov % time of queries
3 Q5 human 9.986 7.324 68,659,369 31.058 6,314 39.3 60,868,572 31.941 5,248 34.195
5 Q16521 taxon 3.427 2.871 56,437,140 25.529 495 3.1 27,172,995 14.259 480 3.131
34 Q4830453 business 0.207 0.108 4,041,395 1.828 343 2.1 9,228,037 4.842 554 3.607
6 Q101352 family name 0.509 1.546 5,564,173 2.517 640 4.0 5,990,617 3.144 659 4.292
15 Q11424 film 0.281 0.364 4,757,084 2.152 1,613 10.0 5,067,305 2.659 1,541 10.042
1 Q13442814 scholarly article 39.794 49.668 1,649,268 0.746 142 0.9 4,944,995 2.595 263 1.713
7 Q4167410 Wikimedia disambiguation page 1.459 1.374 3,737,550 1.691 223 1.4 3,292,873 1.728 765 4.982
2 Q6999 astronomical object 8.942 8.75 448,032 0.203 51 0.3 2,444,109 1.283 79 0.516
92 Q6881511 enterprise 0.052 0.036 943,613 0.427 164 1.0 1,937,486 1.017 234 1.528
26 Q484170 commune of France 0.043 0.18 866,766 0.392 46 0.3 1,934,902 1.015 70 0.455
20 Q13406463 Wikimedia list article 0.352 0.252 1,283,160 0.58 73 0.5 1,766,742 0.927 117 0.765
63 Q5398426 television series 0.062 0.055 1,206,285 0.546 366 2.3 1,379,486 0.724 411 2.68
42 Q7725634 literary work 0.176 0.077 468,204 0.212 22 0.1 1,377,546 0.723 42 0.273
16 Q486972 human settlement 0.602 0.302 721,789 0.327 73 0.5 1,328,064 0.697 699 4.557
165 Q891723 public company 0.013 0.015 837,595 0.379 157 1.0 1,175,813 0.617 219 1.426
91 Q43229 organization 0.08 0.037 806,840 0.365 123 0.8 1,067,340 0.56 600 3.908
12 Q3305213 painting 0.578 0.432 834,752 0.378 79 0.5 926,701 0.486 86 0.558
86 Q47461344 written work 0.078 0.038 774,947 0.351 67 0.4 881,216 0.462 53 0.345
25 Q532 village 0.292 0.201 584,789 0.265 21 0.1 872,310 0.458 61 0.399
4 Q4167836 Wikimedia category 5.165 5.85 1,383,343 0.626 96 0.6 808,536 0.424 74 0.484
62 Q7889 video game 0.047 0.056 741,401 0.335 30 0.2 753,351 0.395 37 0.244
19 Q8502 mountain 0.559 0.253 227,393 0.103 16 0.1 749,283 0.393 47 0.306
28 Q482994 album 0.287 0.161 776,845 0.351 37 0.2 704,746 0.37 59 0.388
89 Q4164871 position 0.128 0.037 788,077 0.356 332 2.1 645,434 0.339 175 1.141
8 Q7187 gene 1.273 0.927 628,916 0.284 94 0.6 604,364 0.317 208 1.354
10 Q11173 chemical compound 1.302 0.693 1,307,852 0.592 133 0.8 588,469 0.309 76 0.496
54 Q215380 musical group 0.087 0.063 461,181 0.209 17 0.1 585,266 0.307 37 0.241
31 Q16970 church building 0.226 0.129 396,936 0.18 25 0.2 577,677 0.303 48 0.315
70 Q732577 publication 0.076 0.048 512,416 0.232 53 0.3 569,536 0.299 37 0.238
22 Q79007 street 0.62 0.231 225,188 0.102 20 0.1 535,623 0.281 44 0.289
23 Q4022 river 0.425 0.219 280,190 0.127 20 0.1 520,347 0.273 56 0.365
243 Q14204246 Wikimedia project page 0.033 0.008 1,114,113 0.504 62 0.4 498,708 0.262 548 3.572
36 Q3947 house 0.216 0.098 118,886 0.054 9 0.1 465,249 0.244 33 0.212
32 Q41176 building 0.287 0.125 271,666 0.123 36 0.2 463,636 0.243 65 0.423
310 Q783794 company 0.012 0.005 124,932 0.057 19 0.1 459,638 0.241 33 0.213
29 Q23397 lake 0.278 0.138 130,027 0.059 14 0.1 456,054 0.239 42 0.273
121 Q3957 town 0.015 0.023 294,685 0.133 24 0.1 450,870 0.237 46 0.297
64 Q811979 architectural structure 0.119 0.055 282,739 0.128 28 0.2 445,779 0.234 48 0.313
80 Q34442 road 0.073 0.041 215,771 0.098 14 0.1 440,960 0.231 34 0.22
280 Q21198342 manga series 0.014 0.007 208,503 0.094 5 0.0 437,382 0.23 11 0.074
71 Q86850539 Whitaker's Latin frequency type C 0.011 0.048 355,247 0.161 56 0.3 436,103 0.229 95 0.622
138 Q18340514 events in a specific year or time period 0.048 0.019 463,683 0.21 17 0.1 431,649 0.227 16 0.104
264 Q2085381 publisher 0.014 0.007 179,442 0.081 23 0.1 420,459 0.221 37 0.243
45 Q55488 railway station 0.104 0.075 258,862 0.117 20 0.1 410,774 0.216 49 0.319
108 Q33506 museum 0.044 0.028 252,308 0.114 54 0.3 409,716 0.215 75 0.486
177 Q34770 language 0.011 0.013 1,713,196 0.775 73 0.5 402,013 0.211 145 0.947
113 Q15632617 fictional human 0.056 0.026 306,319 0.139 18 0.1 395,934 0.208 25 0.166
41 Q22808320 Wikimedia human name disambiguation page 0.075 0.078 433,986 0.196 17 0.1 381,873 0.2 19 0.125
144 Q11032 newspaper 0.043 0.017 230,085 0.104 11 0.1 380,153 0.199 28 0.181
37 Q3331189 version, edition, or translation 0.19 0.087 410,352 0.186 34 0.2 374,597 0.197 19 0.126
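
When comparing the two months, it helps to separate the change in a subgraph's share of queries (in percentage points) from the change in its absolute query count. A minimal sketch follows; the function names are hypothetical and the inputs are copied from the Q5 and Q16521 rows above.

```python
def pct_point_change(oct_pct, nov_pct):
    """Change in share of all queries, in percentage points."""
    return nov_pct - oct_pct


def relative_change(oct_count, nov_count):
    """Relative change in absolute query count, in percent."""
    return (nov_count - oct_count) / oct_count * 100


# Q16521 (taxon): share and count both drop sharply
print(round(pct_point_change(25.529, 14.259), 2), round(relative_change(56_437_140, 27_172_995), 1))
# Q5 (human): the count drops while the share rises slightly
print(round(pct_point_change(31.058, 31.941), 2), round(relative_change(68_659_369, 60_868_572), 1))
```

The human row shows why both views are useful: its query count is lower in November, yet its share of all queries is slightly higher, which is consistent with a lower overall query volume in that month.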

More on query time

The query time can be broken down to classes for better visualization. Below is a figure with the query class distribution (number of queries per query time class per subgraph) for the top 50 subgraphs w.r.t query time consumed. Some of the takeaways are:

  • Most subgraphs have most queries in the range of 10ms to 100ms
  • Second most common class is 100ms to 1s
  • The collection and photograph subgraphs have the most queries (~150k) in the 1s to 10s range. Around 10 more subgraphs have a small number of queries (~10-20k) in this time range.
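
Assigning a query to a time class can be sketched as a lookup against class boundaries. The page only names the 10ms to 100ms, 100ms to 1s, and 1s to 10s classes explicitly; the "<10ms" and ">10s" classes, the treatment of boundary values, and the function names below are assumptions made for illustration.

```python
import bisect
from collections import Counter

# Class boundaries in seconds. The first and last classes are assumed.
BOUNDS = [0.01, 0.1, 1.0, 10.0]
LABELS = ["<10ms", "10ms-100ms", "100ms-1s", "1s-10s", ">10s"]


def time_class(seconds):
    """Map a query time in seconds to one of the 5 classes
    (a time equal to a boundary goes to the higher class)."""
    return LABELS[bisect.bisect_right(BOUNDS, seconds)]


def class_distribution(query_times):
    """Count queries per time class for one subgraph."""
    return Counter(time_class(t) for t in query_times)


# Made-up query times, for illustration only.
print(class_distribution([0.005, 0.02, 0.064, 0.3, 3.9]))
```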

Distribution comparison

Following is the query time distribution of all queries, regardless of subgraph.

File:Query time dist all queries.png

We then compare this distribution to the distributions listed above for the top subgraphs. To compare, we plot the differences of percentages in each subgraph with the percentage in all queries. That is,

For each subgraph:
 For each query time class:
  (Percent of query count for this time class in this subgraph) - (Percent of query count for this time class in all queries)
This shows how the distribution in each individual subgraph differs from the overall distribution.
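
The loop above could be written as follows. This is a sketch rather than the original analysis code: the function names are hypothetical and the counts in the example are invented.

```python
def to_percent(counts):
    """Convert per-class query counts into percentages of the total."""
    total = sum(counts.values())
    return {cls: 100 * n / total for cls, n in counts.items()}


def distribution_difference(subgraph_counts, overall_counts):
    """Per time class: percent in this subgraph minus percent in all queries.
    Positive values mean the class is over-represented in the subgraph."""
    sub = to_percent(subgraph_counts)
    overall = to_percent(overall_counts)
    return {cls: sub.get(cls, 0.0) - overall[cls] for cls in overall}


# Invented counts, for illustration only.
overall = {"10ms-100ms": 50, "100ms-1s": 50}
subgraph = {"10ms-100ms": 75, "100ms-1s": 25}
print(distribution_difference(subgraph, overall))
```

The differences within one subgraph always sum to zero, so a surplus in one time class is balanced by a deficit in others.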

User agent

Analysis of user agents is an approximation because user-agent strings don't exactly represent distinct users. For example, lots of people use the same bot or script without changing the user-agent, and the same person or bot may use multiple user-agent strings. Nevertheless, the available data still gives a useful estimate.

User agent count

  • Total number of unique user agents across all subgraphs: 981,180
  • The subgraphs with the most and the fewest distinct user-agents are listed in the tables below. Every subgraph has more than 10 user-agents (the minimum is 11), so each of these large subgraphs is used by at least a handful of users.
  • The subgraphs with the largest numbers of user-agents are of various types. Among them, gene, protein, biological process, and molecular function appear to be related, so it is possible that the same queries are counted in several of these subgraphs. More on subgraph connectivity in #Subgraph Connectivity.
Subgraphs with most user-agents
Subgraph | Subgraph label | % Query | # User agents | User agents (fraction of all)
Q11424 | film | 2.152 | 251,420 | 0.256
Q8054 | protein | 0.158 | 234,659 | 0.239
Q7187 | gene | 0.284 | 187,029 | 0.191
Q2996394 | biological process | 0.072 | 124,415 | 0.127
Q14860489 | molecular function | 0.044 | 89,445 | 0.091
Q5 | human | 31.058 | 55,377 | 0.056
Q898273 | protein domain | 0.019 | 38,484 | 0.039
Q16521 | taxon | 25.529 | 25,193 | 0.026
Q86850539 | Whitaker's Latin frequency type C | 0.161 | 20,158 | 0.021
Q4167410 | Wikimedia disambiguation page | 1.691 | 13,818 | 0.014
Q14204246 | Wikimedia project page | 0.504 | 13,443 | 0.014
Q476028 | association football club | 0.145 | 12,086 | 0.012
Q235557 | file format | 0.045 | 7,701 | 0.008
Q1520033 | count noun | 0.05 | 7,662 | 0.008
Q417841 | protein family | 0.007 | 4,906 | 0.005
Q484170 | commune of France | 0.392 | 4,764 | 0.005
Q4830453 | business | 1.828 | 4,383 | 0.004
Q4164871 | position | 0.356 | 4,319 | 0.004
Q7278 | political party | 0.109 | 4,073 | 0.004
Q3918 | university | 0.104 | 3,565 | 0.004
Subgraphs with least user-agents
Subgraph | Subgraph label | % Query | # User agents | User agents (fraction of all)
Q106006703 | local regulations of the People's Republic of China | 0.0 | 11 | 0.0
Q67015940 | Government Boys' Primary School | 0.0 | 13 | 0.0
Q7604693 | Statutory Rules of Northern Ireland | 0.0 | 13 | 0.0
Q106474968 | ethnic group by settlement in Macedonia | 0.003 | 15 | 0.0
Q6453643 | decree law | 0.0 | 15 | 0.0
Q97695005 | committee group motion | 0.0 | 15 | 0.0
Q100532807 | Irish Statutory Instrument | 0.0 | 16 | 0.0
Q10429085 | report | 0.0 | 19 | 0.0
Q99045339 | written question | 0.0 | 20 | 0.0
Q1505023 | Interpellation | 0.0 | 20 | 0.0
Q96739634 | individual motion | 0.0 | 21 | 0.0
Q67035425 | ASTM standard | 0.0 | 21 | 0.0
Q61278455 | health sub-centre | 0.001 | 23 | 0.0
Q26267864 | Wikimedia KML file | 0.005 | 23 | 0.0
Q3508250 | Syndicat intercommunal | 0.02 | 24 | 0.0
Q107102664 | cell line from embryonic stem cells | 0.0 | 24 | 0.0
Q7604686 | UK Statutory Instrument | 0.0 | 27 | 0.0
Q6451276 | Congressional Research Service report | 0.001 | 28 | 0.0
Q61443650 | sub post office | 0.0 | 33 | 0.0
Q26894053 | basketball team season | 0.009 | 34 | 0.0
  • There are 50 subgraphs with more than 1000 user agents, and 300 subgraphs with fewer than 1000 user agents. Most subgraphs are therefore not queried by many distinct users. The distribution of user-agent counts below 1000 is shown in the figure below. It clearly shows how small the number of users is in most subgraphs.

File:Ua lessthan1k dist.png

User agent distribution in subgraphs

  • Next, the user agent vs query count distribution was analyzed for some of the top subgraphs. While the user agent count gives us an idea of how many users may be using a subgraph, it does not tell us whether all of them query the subgraph equally or whether a few user agents perform most of the queries.
  • ~30 out of 341 subgraphs have a single user agent that makes >=50% of all queries in that subgraph.
  • 6 subgraphs have a single user agent making around 80-90% of the queries.
  • So the trend of a dominating single source of queries is not widespread among subgraphs, but is present in a few subgraphs nonetheless.
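
As a rough illustration of the dominance check described above, the following Python sketch computes the share of the top user agent in each subgraph and flags subgraphs where that share is >=50%. The input rows, the user agent names, and the function name are hypothetical; the original analysis ran on the full query logs and its code is not shown on this page.

```python
from collections import Counter, defaultdict

# Hypothetical (subgraph, user_agent) pairs, one per query; not real data.
queries = [
    ("Q16521", "ua_a"), ("Q16521", "ua_a"), ("Q16521", "ua_a"), ("Q16521", "ua_b"),
    ("Q5", "ua_a"), ("Q5", "ua_b"), ("Q5", "ua_c"), ("Q5", "ua_d"),
]


def top_user_agent_share(rows):
    """Return {subgraph: (top_user_agent, percent_of_subgraph_queries)}."""
    per_subgraph = defaultdict(Counter)
    for subgraph, user_agent in rows:
        per_subgraph[subgraph][user_agent] += 1
    result = {}
    for subgraph, counts in per_subgraph.items():
        user_agent, n = counts.most_common(1)[0]
        result[subgraph] = (user_agent, 100.0 * n / sum(counts.values()))
    return result


shares = top_user_agent_share(queries)
# Subgraphs where one user agent makes at least half of the queries.
dominated = sorted(s for s, (_, pct) in shares.items() if pct >= 50.0)
print(shares)
print(dominated)
```

In this toy input only the taxon subgraph ends up in `dominated`, since one user agent makes 3 of its 4 queries.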

The figure below shows the query percentages of the top 2 user agents for each of the 341 subgraphs. This shows whether the top user agents dominate the queries in a subgraph.

The figure below shows 100 subgraphs with their user agent query usage distribution in percent. Usage greater than 50% is marked in red. A bird's-eye view of the plots shows that some subgraphs have a dominating user agent, while most other subgraphs have at least 1 or 2 user agents that query the most. The rest of the user agents form the long tail of the distribution.

Top user agents in subgraphs

  • The top user agents in various subgraphs are listed below. More analysis on Q5 (human) and Q16521 (taxon) is done at the end of the page, as they are the most queried subgraphs.
Top user agents in various subgraphs
Subgraph Subgraph label User agent Query count (in subgraph) Query percent (within subgraph) Query percent overall
Q16521 taxon mix-n-match 50,622,670 89.697 22.899
Q5 human UA # 2 9,017,930 13.134 4.079
Q5 human mix-n-match 8,548,335 12.45 3.867
Q5 human UA # 3 5,059,258 7.369 2.289
Q5 human UA # 4 4,020,496 5.856 1.819
Q5 human UA # 5 3,828,747 5.576 1.732
Q101352 family name UA # 5 3,828,747 68.811 1.732
Q5 human UA # 6 2,685,807 3.912 1.215
Q5 human UA # 7 2,434,486 3.546 1.101
Q4830453 business UA # 8 2,403,677 59.476 1.087
Q5 human UA # 9 2,020,598 2.943 0.914
Q16521 taxon Hub 1,984,437 3.516 0.898
Q5 human UA # 11 1,877,700 2.735 0.849
Q5 human UA # 12 1,781,161 2.594 0.806
Q16521 taxon UA # 13 1,294,113 2.293 0.585

User agent vs Subgraph

So far we have explored the user-agent count and distribution per subgraph. It is also important to look at how each user agent's queries spread across subgraphs. In other words,

  • Do users have a very specific use case, so that their queries span only a few subgraphs? Or are their queries spread across a lot of subgraphs?
  • Are there some user agents that query the most in multiple subgraphs? This could be due to the nature of the use case or simply because some subgraphs overlap a lot.

We start by looking at how many user agents access how many subgraphs. From the table below, we see that most user agents (89% of them) query one subgraph only. Some user agents query a lot of subgraphs as well. A clearer picture is seen in the plot below.

Relationship between subgraphs and user agents
#of Subgraphs (X) #of User agents querying X subgraphs %of User agents querying X subgraphs
1 875,724 89.252
2 91,962 9.373
5 3,562 0.363
3 2,388 0.243
6 1,539 0.157
7 799 0.081
9 628 0.064
8 463 0.047
4 460 0.047
12 332 0.034
16 308 0.031
15 282 0.029
10 281 0.029
17 242 0.025
18 235 0.024
14 202 0.021
11 184 0.019
19 177 0.018
13 167 0.017
20 119 0.012
21 75 0.008
22 47 0.005
25 46 0.005
23 39 0.004
24 39 0.004
27 32 0.003
26 28 0.003
28 26 0.003
29 25 0.003
30 20 0.002
31 17 0.002
35 16 0.002
37 16 0.002
47 15 0.002
34 15 0.002
61 13 0.001
32 12 0.001
50 12 0.001
36 11 0.001
44 11 0.001
49 10 0.001
65 9 0.001
56 9 0.001
72 9 0.001
51 9 0.001
121 9 0.001
95 9 0.001
124 9 0.001
42 9 0.001
39 9 0.001
File:Ua vs subgraph.png
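
The table above can be thought of as a histogram over the number of distinct subgraphs each user agent touches. The following is a minimal Python sketch of that computation on made-up rows; the row format and the function name are assumptions for illustration and not the code used for the analysis.

```python
from collections import Counter, defaultdict

# Hypothetical (user_agent, subgraph) pairs, one per query; not real data.
rows = [
    ("ua_a", "Q5"), ("ua_a", "Q5"),
    ("ua_b", "Q5"), ("ua_b", "Q16521"),
    ("ua_c", "Q11424"),
    ("ua_d", "Q5"),
]


def subgraphs_per_user_agent(rows):
    """Return {number_of_subgraphs: (user_agent_count, percent_of_user_agents)}."""
    seen = defaultdict(set)
    for user_agent, subgraph in rows:
        seen[user_agent].add(subgraph)
    distribution = Counter(len(subgraphs) for subgraphs in seen.values())
    total = len(seen)
    return {
        k: (n, round(100.0 * n / total, 3))
        for k, n in sorted(distribution.items())
    }


print(subgraphs_per_user_agent(rows))
```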

Next we isolate user agents from each subgraph that query drastically more (>=10% difference) than other user agents in the same subgraph, and that perform at least 100k queries (0.05% of all queries) a month. A list of ~30 such user agents was found. The subgraph distributions of all these user agents were plotted to find the large buckets where they tend to query. The plot is shown below, followed by some explicit observations.

File:Imp ua dist censored.png
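
The selection rule used to isolate these user agents can be sketched as below. This is only one possible reading of the rule: it assumes that the ">=10% difference" is measured in percentage points between the top user agent and the runner-up within a subgraph, and that the 100k threshold applies to the user agent's query count in that subgraph. The counts, names, and function are hypothetical.

```python
def dominating_user_agents(counts, min_gap_pct=10.0, min_queries=100_000):
    """counts: {subgraph: {user_agent: monthly_query_count}}.

    Returns (subgraph, user_agent, gap_in_percentage_points) for every
    subgraph whose top user agent passes both thresholds.
    """
    selected = []
    for subgraph, per_ua in counts.items():
        if not per_ua:
            continue
        total = sum(per_ua.values())
        ranked = sorted(per_ua.items(), key=lambda kv: kv[1], reverse=True)
        top_ua, top_n = ranked[0]
        runner_up_n = ranked[1][1] if len(ranked) > 1 else 0
        gap_pct = 100.0 * (top_n - runner_up_n) / total
        if gap_pct >= min_gap_pct and top_n >= min_queries:
            selected.append((subgraph, top_ua, round(gap_pct, 1)))
    return selected


# Hypothetical monthly counts; not real data.
counts = {
    "sg_1": {"ua_a": 900_000, "ua_b": 50_000, "ua_c": 50_000},    # clear leader
    "sg_2": {"ua_a": 120_000, "ua_b": 110_000, "ua_c": 130_000},  # no clear leader
    "sg_3": {"ua_d": 40_000, "ua_e": 10_000},                     # too few queries
}
selected = dominating_user_agents(counts)
print(selected)
```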

Percentages below are percent of all monthly queries.

  • mix n match (UA #17):
    • a lot of taxon queries (Q16521), 23%
    • a lot of human queries (Q5), 4%
  • UA #6:
    • 1% in Business (Q4830453)
  • UA #14:
    • 1% in human (Q5)
    • 0.5% in film (Q11424)
  • UA #23:
    • 1.73% in family name (Q101352)
    • 1.73% in human (Q5)
    • both have exactly the same count, meaning they could be the same queries that touch both the human and family name subgraphs

For reference:

  • 100% is 221,067,674 queries
  • 10% is 22,106,767 queries
  • 1% is 2,210,676 queries
  • 0.1% is 221,067 queries
  • 0.05% is 110,533 queries
  • 0.01% is 22,106 queries
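
The reference counts above can be reproduced with a small Python sketch. It assumes the counts are truncated rather than rounded, which is what the listed numbers suggest; the function name is made up for illustration.

```python
from fractions import Fraction

TOTAL_QUERIES = 221_067_674  # the 100% figure from the reference list


def percent_to_queries(percent: str, total: int = TOTAL_QUERIES) -> int:
    """Convert a percentage (given as a string) into a truncated query count."""
    # Fraction keeps the arithmetic exact; int() truncates toward zero.
    return int(total * Fraction(percent) / 100)


for pct in ["100", "10", "1", "0.1", "0.05", "0.01"]:
    print(f"{pct}% is {percent_to_queries(pct):,} queries")
```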

Subgraph connectivity through queries

Subgraph connectivity was explored to some extent using only Wikidata in Wikidata_Subgraph_Analysis. This was based on what items or properties were common between subgraphs and how many direct connections were present between them. A visualization was created to show the strength of this connectivity between subgraphs here: wikidata_graph. This section aims to analyze the connectivity of subgraphs through the queries, i.e, how often are some subgraphs queried together.

  • Subgraph Queries: The total number of queries that touch at least one of the top 341 subgraphs is 72% of all queries.
  • First we look at how many subgraphs most queries access. The tables below show the query groups that access the most and the fewest subgraphs.
  • 70% of all queries (97% of subgraph queries) touch 1 or 2 subgraphs. 64% of all queries (90% of subgraph queries) touch only 1 subgraph.
Queries with most subgraphs accessed
#of Subgraphs #of Queries
341 25
333 1
315 2
313 3
258 1
181 3
152 1
142 1
133 2
130 2
129 1
128 2
127 4
126 4
125 9
Queries with least subgraphs accessed
#of Subgraphs #of Queries %of Queries
1 142,507,736 64.463
2 12,464,811 5.638
3 1,767,253 0.799
4 586,173 0.265
5 364,445 0.165
6 221,485 0.1
7 188,012 0.085
8 112,922 0.051
9 102,524 0.046
10 68,871 0.031
11 50,341 0.023
12 38,102 0.017
13 34,075 0.015
14 24,003 0.011
15 17,935 0.008

File:NumQuery vs numSubgraph.png

  • It is hard to see which subgraphs occur together from the data above. So the sets of subgraphs that occurred together were broken into pairs, and the pairs of subgraphs that occur together the most were listed.
  • There are 57,970 subgraph pairs that occur together in queries. The total possible subgraph pair count is (340*341)/2 = 57,970. This shows that every subgraph is connected to every other subgraph through queries! Of course, the number of queries per pair varies widely.
  • A list of some of the most queried subgraph pairs is shown below.
Top pairs of subgraphs that are queried together
Subgraph 1 Subgraph 2 Query
Subgraph Subgraph label Subgraph Subgraph label #of Query %of Query
Q101352 family name Q5 human 4,649,345 2.44
Q4830453 business Q6881511 enterprise 1,858,183 0.975
Q11424 film Q5 human 1,096,150 0.575
Q5 human Q7725634 literary work 1,067,191 0.56
Q4830453 business Q891723 public company 973,565 0.511
Q13406463 Wikimedia list article Q5 human 970,047 0.509
Q16521 taxon Q5 human 890,304 0.467
Q4167410 Wikimedia disambiguation page Q5 human 840,151 0.441
Q4830453 business Q5 human 680,786 0.357
Q3305213 painting Q4167410 Wikimedia disambiguation page 606,434 0.318
Q6881511 enterprise Q891723 public company 572,986 0.301
Q13442814 scholarly article Q5 human 527,538 0.277
Q47461344 written work Q732577 publication 514,321 0.27
Q4164871 position Q5 human 480,484 0.252
Q13442814 scholarly article Q4167410 Wikimedia disambiguation page 446,490 0.234
Q482994 album Q5 human 409,139 0.215
Q13406463 Wikimedia list article Q16521 taxon 401,466 0.211
Q13406463 Wikimedia list article Q4167410 Wikimedia disambiguation page 349,421 0.183
Q14204246 Wikimedia project page Q4167410 Wikimedia disambiguation page 341,845 0.179
Q43229 organization Q5 human 337,868 0.177
Q5398426 television series Q5 human 323,501 0.17
Q215380 musical group Q5 human 320,532 0.168
Q47461344 written work Q5 human 313,149 0.164
Q5 human Q6881511 enterprise 285,110 0.15
Q3331189 version, edition, or translation Q5 human 283,741 0.149
Q5 human Q86850539 Whitaker's Latin frequency type C 280,866 0.147
Q11424 film Q13406463 Wikimedia list article 272,316 0.143
Q13406463 Wikimedia list article Q18340514 events in a specific year or time period 270,710 0.142
Q16521 taxon Q4167410 Wikimedia disambiguation page 266,507 0.14
Q4167410 Wikimedia disambiguation page Q86850539 Whitaker's Latin frequency type C 249,340 0.131
  • The distribution of the number of times each subgraph pair in Wikidata occurs in queries is shown below. Note that the (A,B) pair is the same as the (B,A) pair, so there is no duplication in the plots. Since the distribution is extremely skewed, three plots with various limits on the number of occurrences are shown. Only a small number of pairs occur together often (these can be seen in the table above), whereas a huge number of pairs occur together only a few times.

File:Subgraph pair dist.png
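
Breaking the set of subgraphs touched by each query into unordered pairs can be sketched in Python as follows. The per-query subgraph sets are invented for illustration. Sorting each set before pairing is an assumption that keeps (A,B) and (B,A) under a single key; the pair table above looks consistent with such a string ordering, but the original code is not shown here.

```python
from collections import Counter
from itertools import combinations
from math import comb

# Hypothetical sets of subgraphs touched by individual queries; not real data.
query_subgraphs = [
    {"Q5"},
    {"Q5", "Q101352"},
    {"Q5", "Q101352"},
    {"Q5", "Q11424", "Q13406463"},
]


def count_pairs(subgraph_sets):
    """Count how often each unordered pair of subgraphs occurs in one query."""
    pairs = Counter()
    for subgraphs in subgraph_sets:
        # sorted() gives every pair a canonical order, so (A,B) == (B,A).
        for a, b in combinations(sorted(subgraphs), 2):
            pairs[(a, b)] += 1
    return pairs


pair_counts = count_pairs(query_subgraphs)
print(pair_counts.most_common())

# Upper bound on distinct pairs among 341 subgraphs: (340*341)/2.
print(comb(341, 2))
```

Queries that touch a single subgraph contribute no pairs here; they correspond to the diagonal of a pair matrix.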

  • Below is a heat-map of the number of queries, where both x and y axis represent subgraph indices (names of subgraphs not shown due to space). The subgraphs are sorted by most queried subgraphs.
  • The diagonal shows queries that use only 1 subgraph, represented as Q5-Q5 or Q42-Q42, for example. Others are represented as Q5-Q42 or Q42-Q5.
  • The plot is symmetrical.
  • The many vertical and horizontal lines indicate that lots of subgraphs happen to pair with many other subgraphs.
  • The exact numbers for this heat-map are present in subgraph_pair_heatmap_df as a csv file.

File:Subgraph pair heatmap.png

Number of subgraph accessed vs query time

To see whether there is a correlation between accessing more subgraphs and query time, various subsets of subgraphs were taken and their query time distributions were observed. Then, the number (and percent) of queries that access various numbers of subgraphs was plotted for each query time group. A simple scatter plot of time against number of subgraphs was not possible due to the large number of queries, but the following plots give us a good idea of the correlation. The Pearson correlation between query time and number of subgraphs accessed is 0.016, so there is a slight positive correlation, but it is too weak to be meaningful. All query time groups are dominated by queries that access 1 or 2 subgraphs. Queries accessing more subgraphs do appear comparatively more often in the "More than 10s" group.
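
For readers who want to reproduce this kind of check, here is a minimal Python sketch of the Pearson correlation coefficient, written out from its definition. The sample values are hypothetical and are not taken from the query logs; the original analysis most likely used a library routine on the full data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either sequence is constant.
    return cov / sqrt(var_x * var_y)


# Hypothetical per-query observations; not real data.
num_subgraphs = [1, 1, 2, 1, 3, 1, 2, 1]
query_time_ms = [40, 900, 35, 60, 120, 15, 700, 50]

r = pearson(num_subgraphs, query_time_ms)
print(round(r, 3))
```

As a rule of thumb, the square of the coefficient is the share of variance a linear fit would explain. For the reported value of 0.016 that is about 0.00026, i.e. roughly 0.03%.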

The following analysis was done with data from November 2021. Thus there are slight differences in numbers from the above analysis, which was done with October 2021 data.

File:Various subset time classes.png

File:TimeGroupWise numSubgraphAccessed.png

Paired subgraph query time analysis

We observed that some queries do indeed take more time when more subgraphs are involved. But this could also occur because of the particular subgraphs being accessed. For example, even simple queries in the scholarly article subgraph may take a long time or time out simply due to the large size of the graph they have to comb through. To make this analysis complete, we look at the query times of queries that access only subgraph X and those that access X and other subgraphs. The influence of the other subgraphs persists, but we can now pick out anything clearly similar or different. If both plots (the query time classes) look similar, then we can assume the query times are the effect of subgraph X. If there are more long-running queries when the query accesses X with other subgraphs than when it accesses only X, then we can assume the cause of these long-running queries is not solely X. It could be due to other subgraphs, or simply due to the nature of the query (too complex, lots of string manipulation, regex, etc.).

The plots below show the comparison of the query time class distribution for queries that use only subgraph X versus queries that use X and other subgraphs, where X represents the top 30 large subgraphs. It indeed looks like some subgraphs have longer query times when other subgraphs are involved, such as scholarly articles, lake, hill, clinical trial, river, etc.

File:Solo vs with others subgraph top 30.png
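
The comparison behind these plots can be sketched as follows: each query is put into a time class, and the class counts are kept separately for queries that touch only subgraph X and queries that touch X together with other subgraphs. The class boundaries below are partly an assumption. This page names the 10ms to 100ms, 100ms to 1s, and more than 10s classes, and the first and fourth classes are filled in to complete the five. The sample queries are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

# Upper bounds (ms) of the first four classes; the fifth class is open-ended.
# Boundaries are treated as half-open intervals, which is an assumption.
BOUNDS_MS = [10, 100, 1_000, 10_000]
LABELS = ["<10ms", "10ms-100ms", "100ms-1s", "1s-10s", ">10s"]


def time_class(query_time_ms):
    """Map a query time in milliseconds to one of the five class labels."""
    return LABELS[bisect_right(BOUNDS_MS, query_time_ms)]


def solo_vs_mixed(queries, x):
    """queries: iterable of (set_of_subgraphs, query_time_ms).

    Returns two Counters of time classes: queries touching only x, and
    queries touching x together with at least one other subgraph.
    """
    solo, mixed = Counter(), Counter()
    for subgraphs, ms in queries:
        if x not in subgraphs:
            continue
        target = solo if len(subgraphs) == 1 else mixed
        target[time_class(ms)] += 1
    return solo, mixed


# Hypothetical queries; not real data.
queries = [
    ({"Q13442814"}, 50),
    ({"Q13442814"}, 80),
    ({"Q13442814", "Q5"}, 15_000),
    ({"Q13442814", "Q5"}, 500),
    ({"Q5"}, 20),
]
solo, mixed = solo_vs_mixed(queries, "Q13442814")
print(dict(solo), dict(mixed))
```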

Human subgraph (Q5) query analysis

The following analysis was done with query data of November, 2021.

The queries that were estimated to be related to the human subgraph accounted for 31.94% of all queries in Wikidata. 25.78% of queries used only the human subgraph and the remaining 6.16% used a mix of human and various other subgraphs. As described in #What are subgraph related queries, subgraphs are related to queries through Properties, Subject or Object URIs, Subgraph instance items, etc. Here is a breakdown for the human subgraph taken from #Query count and time. A query can be related to the human subgraph for more than one of the following reasons.

  • Number of queries: 60,868,572 (31.94%)
  • Percent of queries matching subgraph Qid, i.e, has Q5: 2.54%
  • Percent of queries matching instance items: 18%
  • Percent of queries matching subject/object URIs: 12%
  • Percent of queries matching properties: 19.45%
  • Percent of queries matching literal strings: 1.43%

Some of these categories have large percentages. It is worth looking at which items/properties/URIs are queried the most. Looking at the distribution of such items' usage in queries also shows how narrow or wide the search space is.

Here is a detailed breakdown of what kind of match caused a query to be part of the human subgraph:

Human subgraph query breakdown
item predicate URI human Q-id literal # query % all query % human query
0 1 0 0 0 17,785,347 9.333 29.219
1 1 1 0 0 12,215,379 6.41 20.068
1 0 0 0 0 10,705,360 5.618 17.588
1 0 1 0 0 7,253,287 3.806 11.916
1 1 0 0 0 3,137,130 1.646 5.154
0 0 1 0 0 2,512,142 1.318 4.127
0 0 0 0 1 1,775,347 0.932 2.917
0 1 0 1 0 1,694,236 0.889 2.783
0 0 0 1 0 930,137 0.488 1.528
1 1 0 1 0 598,261 0.314 0.983
1 1 1 1 0 508,706 0.267 0.836
0 1 0 0 1 407,610 0.214 0.67
0 0 0 1 1 350,982 0.184 0.577
0 1 1 1 0 311,340 0.163 0.511
0 1 1 0 0 226,959 0.119 0.373
1 0 0 1 0 178,650 0.094 0.294
0 1 1 1 1 135,684 0.071 0.223
1 0 1 1 0 76,736 0.04 0.126
0 1 0 1 1 56,971 0.03 0.094
1 0 1 0 1 3,451 0.002 0.006
1 0 0 0 1 2,844 0.001 0.005
0 0 1 0 1 702 0.0 0.001
1 1 1 1 1 437 0.0 0.001
1 1 0 1 1 393 0.0 0.001
0 0 1 1 0 304 0.0 0.0
1 1 0 0 1 93 0.0 0.0
1 0 0 1 1 59 0.0 0.0
1 1 1 0 1 17 0.0 0.0
1 0 1 1 1 5 0.0 0.0
0 1 1 0 1 3 0.0 0.0
Total 60,868,572 31.94 100

File:Human venn.png
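
The per-category totals can be recovered from the combination table above by summing the rows in which a given flag is 1. The Python sketch below does this with the counts copied from that table; the parsing helpers and column names are illustrative assumptions, not the original analysis code.

```python
# Rows copied from the "Human subgraph query breakdown" table:
# item, predicate, URI, human Q-id, literal flags, then the query count.
BREAKDOWN = """
0 1 0 0 0 17,785,347
1 1 1 0 0 12,215,379
1 0 0 0 0 10,705,360
1 0 1 0 0 7,253,287
1 1 0 0 0 3,137,130
0 0 1 0 0 2,512,142
0 0 0 0 1 1,775,347
0 1 0 1 0 1,694,236
0 0 0 1 0 930,137
1 1 0 1 0 598,261
1 1 1 1 0 508,706
0 1 0 0 1 407,610
0 0 0 1 1 350,982
0 1 1 1 0 311,340
0 1 1 0 0 226,959
1 0 0 1 0 178,650
0 1 1 1 1 135,684
1 0 1 1 0 76,736
0 1 0 1 1 56,971
1 0 1 0 1 3,451
1 0 0 0 1 2,844
0 0 1 0 1 702
1 1 1 1 1 437
1 1 0 1 1 393
0 0 1 1 0 304
1 1 0 0 1 93
1 0 0 1 1 59
1 1 1 0 1 17
1 0 1 1 1 5
0 1 1 0 1 3
"""
COLUMNS = ["item", "predicate", "uri", "qid", "literal"]


def parse(text):
    """Turn the pasted table into [(flags_tuple, query_count), ...]."""
    rows = []
    for line in text.strip().splitlines():
        *flags, count = line.split()
        rows.append((tuple(int(f) for f in flags), int(count.replace(",", ""))))
    return rows


def marginal(rows, column):
    """Total queries whose match set includes the given category."""
    idx = COLUMNS.index(column)
    return sum(count for flags, count in rows if flags[idx] == 1)


rows = parse(BREAKDOWN)
total = sum(count for _, count in rows)
for column in COLUMNS:
    n = marginal(rows, column)
    print(f"{column}: {n:,} queries ({100 * n / total:.2f}% of human subgraph queries)")
```

Because one query can match several categories, the marginal sums add up to more than the table total of 60,868,572. The item, predicate, and URI sums agree with the query totals quoted in the subsections that follow (34,680,808, 37,078,566, and 23,245,152).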

Instance items matched

  • Total items used: 7,969,182
  • Total queries that use these items: 34,680,808 (18% of all queries)
  • The distribution shows there are some high usage (~10k-20k queries) items, a small number of medium usage (~5k queries) items, and the rest form a long tail of small usage (<1k queries) items in the human subgraph.
Top items that cause a query to be related to Human subgraph (Q5)
Instance item Instance item label #of queries
Q22686 Donald Trump 19,759
Q1747297 Robert Oliveri 19,247
Q509260 John Zimmerman 19,193
Q6499255 Laura Nader 19,135
Q209394 Michael Wood 19,101
Q937 Albert Einstein 19,098
Q7340648 Rob Whitehurst 19,026
Q52354375 Irene Aparicio 18,970
Q6232209 John F. Cassidy 18,964
Q22986632 Lori Lynn Ross 18,954
Q3976229 Stuart Lancaster 18,953
Q106466114 Gary Michael Ritchie 18,947
Q86599148 James Spicer 18,926
Q87653156 David A. Cook 18,919
Q16015822 Jerry Fleck 18,917
Q7179427 Petur Hliddal 18,914
Q19878977 Jackie Carson 18,902
Q99859767 Kathy McCarty 18,898
Q90307934 Ann Harris 18,893
Q1070508 Cheryl Carasik 18,834
Q9682 Elizabeth II 18,816
Q6279 Joe Biden 18,277
Q64840837 Dylan Arnold 18,161
Q76 Barack Obama 18,035
Q107626126 Mauricio Lara 18,010

File:Human instance count all.png

File:Human instance count 20k.gif

Properties matched

  • Total properties used: 1,091 (Recall these are properties that occur 99% of the time in the human subgraph)
  • Total queries that use these properties: 37,078,566 (19.45% of all queries)
  • The distribution shows there are 3 properties with ~20-30M queries, 7 properties with ~1-5M queries, and the rest of the more than 1000 properties each match ~100K queries or fewer. In short, the distribution is extremely skewed by only ~10 properties that are highly related to the human subgraph.
Top properties that cause a query to be related to Human subgraph (Q5)
Property Property label #of queries
P570 date of death 30,151,024
P569 date of birth 30,084,200
P27 country of citizenship 24,186,000
P106 occupation 5,259,920
P734 family name 4,871,326
P735 given name 4,616,631
P19 place of birth 2,379,702
P2949 WikiTree person ID 1,707,373
P20 place of death 1,222,037
P4985 TMDb person ID 916,399
P39 position held 750,380
P3602 candidacy in election 599,067
P69 educated at 561,380
P26 spouse 471,589
P108 employer 384,111
P2562 married name 279,197
P937 work location 258,707
P1066 student of 158,339
P184 doctoral advisor 152,318
P1960 Google Scholar author ID 151,507
P185 doctoral student 150,982
P54 member of sports team 150,573
P1153 Scopus author ID 150,545
P119 place of burial 144,027
P3829 Publons author ID 138,839

File:Human pred count all log.png

File:Human pred count.gif

Subject/Object URI matched

  • Total URIs used: 7,926,297 (Recall these are URIs that occur 99% of the time in the human subgraph)
  • Total queries that use these URIs: 23,245,152 (12.2% of all queries)
  • The top URIs/items show the obvious and most common ways the human subgraph is queried: queries about specific people, about groups of people, and about their Wikipedia pages. More about types of queries below.
  • The distribution is a smooth logarithmic graph with only one item present in 165k queries, and the rest go down from 40k in a logarithmic pattern.
Top URIs that cause a query to be related to Human subgraph (Q5)
URI URI label #of queries
Q3391743 visual artist 165,540
Q1925963 graphic artist 38,897
Q28389 screenwriter 33,718
en.wikipedia.org/wiki/Lee_Child - 33,179
en.wikipedia.org/wiki/Emily_Wilson_(journalist) - 30,837
en.wikipedia.org/wiki/M.I.A._(rapper) - 29,388
Q10800557 film actor 29,318
en.wikipedia.org/wiki/Shannon_Lee - 29,216
en.wikipedia.org/wiki/Eugene_Gordon_Lee - 29,205
en.wikipedia.org/wiki/Lee_Childs - 29,203
en.wikipedia.org/wiki/Emily_Wilson_(classicist) - 26,864
en.wikipedia.org/wiki/Emily_Wilson_(actress) - 26,862
en.wikipedia.org/wiki/Adhir_Kalyan - 26,862
en.wikipedia.org/wiki/Emily_Wilson_(footballer) - 26,861
en.wikipedia.org/wiki/Emily_Wilson_Walker - 26,861
Q10798782 television actor 24,679
Q185351 jurist 22,130
Q1650915 researcher 21,206
Q2374149 botanist 20,385
Q250867 Catholic priest 20,314
Q10873124 chess player 19,832
Q12299841 cricketer 19,414
Q14373094 rugby league player 19,396
Q509260 John Zimmerman 19,193
Q6499255 Laura Nader 19,135

File:Human uri count all log.png

File:Human uri count.gif

Query time

  • The total query time of human subgraph is 34% of total query time and total query count is ~32% of all queries.
  • Average time per query is 0.3 seconds (300 ms). Most queries in this subgraph are small and simple.
  • The query time distribution is shown in the chart below, both in absolute counts and in percent of queries in human subgraph.

File:Human time class.png

User agent

The top user agents that query the human subgraph are listed below. This helps us view the distribution of usage: whether a few user agents dominate the usage or usage is well distributed across user agents. The top 10 user agents in terms of query count and in terms of query time are shown in the table below.

Top user agents in human subgraph
User agent Query count % query in human subgraph % query overall Query time(hr) % query time in human subgraph % query time overall
mix-n-match 6,960,988 11.436 3.653 79 1.51 0.516
searx1 6,615,319 10.868 3.471 778 14.832 5.072
UA#3 3,491,821 5.737 1.832 75 1.426 0.487
UA#4 3,073,725 5.05 1.613 175 3.327 1.138
UA#5 2,933,240 4.819 1.539 80 1.516 0.518
UA#6 2,488,807 4.089 1.306 19 0.364 0.125
UA#7 2,182,220 3.585 1.145 44 0.841 0.288
WikidataQueryServiceR 2,044,045 3.358 1.073 36 0.68 0.232
UA#9 1,970,264 3.237 1.034 27 0.524 0.179
searx2 1,909,144 3.137 1.002 200 3.808 1.302
UA#11 75,523 0.124 0.04 434 8.271 2.828
UA#12 55,357 0.091 0.029 319 6.083 2.08
searx3 1,428,789 2.347 0.75 151 2.871 0.982
OB-bot 287,534 0.472 0.151 144 2.736 0.935
UA#15 50,915 0.084 0.027 134 2.553 0.873
UA#16 31,298 0.051 0.016 112 2.132 0.729
searx4 771,932 1.268 0.405 92 1.761 0.602

The query time breakdown was plotted for the top 20 user agents (in terms of time). Most queries have query time of 10ms to 1s, as observed earlier. Some user agents have most queries in the range 10ms to 100ms and some others have most queries in the range 100ms to 1s.

File:Human ua query class percent limy15.png

Query types

Query types are grouped by the operations a query uses and also the order of operations used. This groups similar queries together despite different information sought and also separates groups of simple or complicated queries. The human subgraph has ~11,500 different types of queries. Notice that some query groups can be very similar in what they ask for, while most groups differ a lot. The top query groups are listed below. The top 10 types of queries account for 60% of the queries of human subgraph. The rest form a really long tail of small query counts.

Top query types in human subgraph
Query group (operation list) #of queries %of queries in human subgraph
bgp, service, path, bgp, sequence, leftjoin, bgp, join, path, bgp, sequence, leftjoin, bgp, join, path, bgp, sequence, leftjoin, bgp, join, path, bgp, sequence, leftjoin, bgp, join, path, bgp, sequence, leftjoin, bgp, join, path, bgp, sequence, leftjoin, bgp, join, path, bgp, sequence, leftjoin, bgp, join, path, bgp, sequence, leftjoin, bgp, join, bgp, (leftjoin, bgp)x10, path, leftjoin, leftjoin, bgp, path, leftjoin, leftjoin, bgp, leftjoin, bgp, path, leftjoin, (leftjoin, bgp)x15, leftjoin, path, bgp, sequence, leftjoin, bgp, join, bgp, (leftjoin, bgp)x35, leftjoin, path, bgp, sequence, (leftjoin, bgp)x8, service, join, group, (extend)x68,project 9,511,204 15.626
table, bgp, join, filter, project, distinct 6,889,023 11.318
bgp, service, path, bgp, sequence, leftjoin, bgp, join, path, bgp, sequence, leftjoin, bgp, join, path, bgp, sequence, leftjoin, bgp, join, path, bgp, sequence, leftjoin, bgp, join, path, bgp, sequence, leftjoin, bgp, join, path, bgp, sequence, leftjoin, bgp, join, path, bgp, sequence, leftjoin, bgp, join, path, bgp, sequence, leftjoin, bgp, join, bgp, (leftjoin, bgp)x9, leftjoin, bgp, path, leftjoin, leftjoin, bgp, path, leftjoin, leftjoin, bgp, leftjoin, bgp, path, leftjoin, (leftjoin, bgp)x15 leftjoin, path, bgp, sequence, leftjoin, bgp, join, bgp, (leftjoin, bgp)x36, leftjoin, path, bgp, sequence, (leftjoin, bgp)x8, service, join, group, (extend)x68, project 4,298,916 7.063
table, bgp, join, bgp, leftjoin, bgp, leftjoin, bgp, join, filter, project 3,444,363 5.659
bgp, bgp, leftjoin, filter, bgp, extend, filter, union, bgp, extend, filter, union, project 3,073,725 5.05
bgp, project 2,454,919 4.033
table, bgp, leftjoin, bgp, join, filter, project 2,429,518 3.991
bgp, bgp, service, join, filter, project, distinct 1,172,912 1.927
table, bgp, join, bgp, service, join, order, project 1,047,351 1.721
table, extend, bgp, join, bgp, leftjoin, bgp, leftjoin, bgp, leftjoin, path, leftjoin, bgp, leftjoin, bgp, leftjoin, bgp, leftjoin, bgp, leftjoin, bgp, leftjoin, bgp, leftjoin, bgp, leftjoin, bgp, leftjoin, bgp, service, join, filter, project 1,033,220 1.697

File:Human qtype 100.png

File:Human qtype all.png
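
Grouping queries by their ordered list of operations can be sketched in Python as below. The sketch assumes the operation list of each query has already been extracted (the paths section of this page mentions Jena ARQ for parsing; the extraction itself is not shown here). It only collapses runs of a single repeated operation into the "(op)xN" notation, whereas the table above also collapses repeated groups such as "(leftjoin, bgp)x10". The sample queries are hypothetical.

```python
from collections import Counter
from itertools import groupby


def signature(operations):
    """Build a query-type key from an ordered list of operation names.

    Consecutive repeats of one operation are written as "(op)xN".
    """
    parts = []
    for op, run in groupby(operations):
        n = sum(1 for _ in run)
        parts.append(op if n == 1 else f"({op})x{n}")
    return ", ".join(parts)


# Hypothetical operation lists, one per query; not real data.
queries = [
    ["bgp", "project"],
    ["bgp", "project"],
    ["table", "bgp", "join", "filter", "project", "distinct"],
    ["bgp", "extend", "extend", "extend", "project"],
]

type_counts = Counter(signature(ops) for ops in queries)
for sig, n in type_counts.most_common():
    print(f"{n}  {sig}")
```

Keying on the ordered list, rather than on the set of operations, keeps queries with the same operations in a different order in separate groups, which matches the grouping described above.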

Looking at the top 20 query groups (that make up 70% of all human subgraph queries), the following query types were found:

  1. Query lists lots of predicates with times/date precision of some items
  2. Mix n Match: Birth and Death of certain people, with filters
  3. Like 1, slightly different
  4. Name and family information of people in specific languages
  5. All properties of some humans (these queries are generic but here used for humans)
  6. Uses P227 (or some international ID) and retrieves the Wikipedia page for it
  7. Same as 6 but written differently
  8. Just wants labels in specific languages
  9. Same as 6 but written differently
  10. All contact info of people (facebook, instagram, youtube, twitter, etc)
  11. Labels and Wikipedia article of people
  12. Only label and description of people
  13. Search for films by director filter
  14. CEO or high officials of companies searched by name strings
  15. Timeline of people of a particular occupation and particular gender
  16. Occupation, name, birth, death of people
  17. Education institution and its reference of people
  18. Label and Wikipedia page of people in specific languages
  19. List all people, or all people of certain occupation, or entity related to a given Wikipedia page (generic query)
  20. Notable works, labels, and IDs of works like isbns

UA vs query types

Getting the number of query types per user agent informs us of the variety of queries a user agent makes to WDQS. This also breaks down the human subgraph queries into finer groups. The following plot shows the number of query types for each user agent in the human subgraph.

File:Human qtype vs ua.png

This shows us that most user agents make only 1 type of query. Only 8 user agents make queries of >500 types, and ~50 user agents make queries of >100 types. Looking into the query counts in each of these user agent and query type groups, we find that most have few queries (<10,000). Only ~10 groups have >10,000 queries, and all of these are small, simple queries. The figure below shows the number of queries per query type for the top 8 user agents. As we can see, their distributions look alike although their query counts and query types differ.

File:Query vs query type 8ua.png

Query type vs time class

There are close to 11,500 query types, and 20 of these types make up 70% of all queries of the human subgraph (22% of all queries). Not all of them are equally time consuming: some are simple queries, while others are long and complex. The following plot shows these 20 query types with their query time classes. The values above the bars show both the percent in the human subgraph and the overall query percentage. The subplots are titled with the percent of the number of queries in that query type.

File:Top 20 qtype qtime.png

Services

The queries use ~50 unique services. The top 10 services are the most used; the rest are used in fewer than 50 queries, mostly in fewer than 10 queries. 20 of these services are used in only 1 query.

Top 10 used services in queries in human subgraph
Service Query count % query in human subgraph
wikibase:label 29,925,798 49.165
wikibase:mwapi 14,014,458 23.024
gas:service 46,588 0.077
bd:slice 42,764 0.07
http://dbpedia.org/sparql 22,751 0.037
https://query.wikidata.org/sparql 22,751 0.037
wikibase:around 1,733 0.003
https://sophox.org/sparql 628 0.001
wikibase:box 195 0.0
mediawiki:categoryTree 45 0.0

Triples

The query type analysis in the Query types section gives us a good idea of what kind of queries the human subgraph receives. Looking at the triples themselves also helps us see what most of the queries look like and what the most common subjects, objects, and properties are. The tables below list these along with the top Wikidata items and properties used overall. From the numbers, it seems the top items are probably part of the same queries.

Top Subjects
Subject Query count % Query count in human subgraph
bd:serviceParam 30,022,819 49.324
item 23,632,602 38.826
hint:Prior 14,198,844 23.327
P27 13,845,887 22.747
P17 13,810,191 22.689
articleen 13,810,173 22.689
P281Node 13,810,142 22.688
P2048Node 13,810,142 22.688
P2046Node 13,810,142 22.688
P400 13,810,128 22.688
P275 13,810,128 22.688
P1346 13,810,128 22.688
P162 13,810,128 22.688
P35 13,810,128 22.688
P495 13,810,128 22.688
P112 13,810,128 22.688
P277 13,810,128 22.688
P123 13,810,128 22.688
P282 13,810,128 22.688
P58 13,810,128 22.688
Top Predicates
Predicate Query count % Query count in human subgraph
rdfs:label 34,247,253 56.264
wikibase:language 29,925,828 49.165
schema:about 24,610,856 40.433
schema:description 22,181,059 36.441
schema:isPartOf 20,672,719 33.963
schema:inLanguage 19,811,149 32.547
wdt:P27 18,580,218 30.525
wdt:P18 16,407,156 26.955
wdt:P50 15,283,299 25.109
wdt:P856 15,277,657 25.099
wdt:P2002 15,260,957 25.072
wdt:P345 15,175,422 24.931
rdf:type 15,141,091 24.875
wdt:P2013 15,126,300 24.851
wdt:P2003 15,044,080 24.716
wdt:P2397 14,980,826 24.612
wdt:P212 14,433,555 23.713
wikibase:timePrecision 14,224,353 23.369
wikibase:timeValue 14,222,729 23.366
wdt:P17 14,130,982 23.216
Top Objects
Object Query count % Query count in human subgraph
en 23,579,559 38.738
https://en.wikipedia.org/ 16,123,553 26.489
item 15,770,101 25.908
itemLabel 14,812,538 24.335
itemDescription 14,040,107 23.066
www.wikidata.org 14,008,069 23.014
true^^http://www.w3.org/2001/XMLSchema#boolean 13,994,657 22.992
mwapi:item 13,899,357 22.835
EntitySearch 13,897,505 22.832
P27 13,895,366 22.828
P18 13,859,909 22.77
wikibase:BestRank 13,825,048 22.713
P569 13,811,781 22.691
P570 13,811,013 22.69
1^^http://www.w3.org/2001/XMLSchema#integer 13,810,316 22.689
P17 13,810,214 22.689
P856 13,810,151 22.688
P50 13,810,144 22.688
P577timePrecision 13,810,142 22.688
P162 13,810,142 22.688
Top Wikidata items
Item Query count % Query count in human subgraph
P569 24,521,735 40.286
P570 24,463,540 40.191
P27 18,887,725 31.03
P18 16,424,542 26.984
P50 15,795,288 25.95
P856 15,470,890 25.417
P2002 15,288,837 25.118
P345 15,203,948 24.978
P625 15,191,078 24.957
P2013 15,127,905 24.853
P2003 15,045,926 24.719
P2397 14,983,022 24.615
P580 14,978,083 24.607
P582 14,925,914 24.522
P17 14,875,547 24.439
P136 14,736,247 24.21
P169 14,719,164 24.182
P577 14,672,391 24.105
P212 14,624,020 24.026
P112 14,542,333 23.891

Paths

Paths are more complex predicates that chain properties with logic. Complex paths can increase the scope of a query and also increase its runtime. The table below lists the most used paths in human subgraph queries. While most paths are not very complex or long, there is a lot of variety in the ways paths are formed to perform queries. Ordinary properties are not considered paths. The following list contains not only the paths, but also their breakdown into component paths (as done by Jena ARQ while parsing SPARQL queries). For instance, (p:P31/ps:P31)/(wdt:P279)* is recorded as:

  • (p:P31/ps:P31)/(wdt:P279)*
    • (p:P31/ps:P31)
      • p:P31
      • ps:P31
    • (wdt:P279)*
      • wdt:P279

Unit forms, such as wdt:P279, were removed from the path list since they are parts of other paths and not paths themselves. Other paths that were clearly part of a longer path, and not paths themselves, were also removed from the list to better show the distinct paths used in the queries.
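
The breakdown of a path into its component paths can be illustrated with a small Python sketch. This is a simplified string-based decomposition written for this page, not the Jena ARQ implementation. It only handles sequence (/), alternation (|), parentheses, the postfix modifiers *, + and ?, and IRIs in angle brackets. Inverse (^) and negated paths are not handled.

```python
def split_top_level(path, sep):
    """Split on sep where parenthesis depth is zero and outside <...> IRIs."""
    parts, depth, start, in_iri = [], 0, 0, False
    for i, ch in enumerate(path):
        if ch == "<":
            in_iri = True
        elif ch == ">":
            in_iri = False
        elif in_iri:
            continue
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append(path[start:i])
            start = i + 1
    parts.append(path[start:])
    return parts


def strip_group(path):
    """Drop one pair of parentheses if it encloses the whole path."""
    if not (path.startswith("(") and path.endswith(")")):
        return path
    depth = 0
    for i, ch in enumerate(path):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if depth == 0 and i < len(path) - 1:
            return path  # the first "(" closes early, e.g. "(a)/(b)"
    return path[1:-1]


def children(path):
    """Direct component paths of a path (empty for a plain property)."""
    inner = strip_group(path)
    for sep in ("|", "/"):  # alternation binds more loosely than sequence
        parts = split_top_level(inner, sep)
        if len(parts) > 1:
            return parts
    if inner and inner[-1] in "*+?":
        return [strip_group(inner[:-1])]
    return []


def decompose(path, depth=0):
    """Pre-order list of (depth, path) for a path and all its components."""
    rows = [(depth, path)]
    for child in children(path):
        rows.extend(decompose(child, depth + 1))
    return rows


for depth, part in decompose("(p:P31/ps:P31)/(wdt:P279)*"):
    print("  " * depth + part)
```

For the example path, the printed tree has the same six entries and nesting as the list above. The leaves at the bottom of the tree are the unit forms that were removed from the path list.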

Top Paths
Path Query count % Query count in human subgraph
p:P570/psv:P570 13,867,481 22.783
p:P569/psv:P569 13,863,357 22.77
p:P625/psv:P625 13,810,408 22.689
p:P577/psv:P577 13,810,371 22.689
p:P571/psv:P571 13,810,310 22.689
p:P576/psv:P576 13,810,242 22.689
p:P582/psv:P582 13,810,146 22.688
psv:P2046/<http://wikiba.se/ontology#quantityUnit> 13,810,142 22.688
p:P580/psv:P580 13,810,142 22.688
psv:P281/<http://wikiba.se/ontology#quantityUnit> 13,810,142 22.688
p:P619/psv:P619 13,810,138 22.688
p:P620/psv:P620 13,810,138 22.688
wdt:P31/(wdt:P279)* 1,148,817 1.887
wdt:P31|wdt:P279 1,040,140 1.709
p:P169|p:P488 704,491 1.157
ps:P169|ps:P488 704,491 1.157
p:P2572/ps:P2572 501,987 0.825
((((((((((((wdt:P17|wdt:P101)|wdt:P112)|wdt:P135)|wdt:P136)|wdt:P279)|wdt:P361)|wdt:P460)|wdt:P793)|wdt:P800)|wdt:P1269)|wdt:P1344)|wdt:P1830)|(p:P2572/ps:P2572) 501,987 0.825
ps:P106/(wdt:P279)* 429,856 0.706
ps:P31/(wdt:P279)* 429,284 0.705
wdt:P106/(wdt:P279)* 251,007 0.412
p:P569/ps:P569 p:P570/ps:P570 246,947 0.406
p:P569/ps:P569 p:P570/ps:P570 202,546 0.333
wdt:P50|wdt:P2093 197,426 0.324

Taxon subgraph (Q16521) query analysis

The following analysis was done with query data of November, 2021.

The queries estimated to be related to the taxon subgraph accounted for 14.26% of all Wikidata queries. 13.57% of all queries used only the taxon subgraph, and the remaining 0.69% used a mix of the taxon subgraph and various other subgraphs. As described in #What are subgraph related queries, subgraphs are related to queries through Properties, Subject or Object URIs, Subgraph instance items, etc. Here is a breakdown for the taxon subgraph taken from #Query count and time. A query can be related to the taxon subgraph for more than one of the following reasons.

  • Number of queries: 27,172,995 (14.26%)
  • Percent of queries matching subgraph Q-ID, i.e., has Q16521: 12.19%
  • Percent of queries matching instance items: 0.75%
  • Percent of queries matching subject/object URIs: 12.86%
  • Percent of queries matching properties: 0.87%
  • Percent of queries matching literal strings: 0.43%

The percent of queries matching subject/object URIs (12.86%) includes the queries matching the Q-ID (12.19%) and instance items (0.75%). This makes the Q-ID match almost the only reason for queries to be in the taxon subgraph. Therefore we look at the top URIs that cause these matches. Looking at the distribution of such items' usage in queries also shows how narrow or wide the search space is (in this case, quite narrow, as almost all queries match the Q-ID itself).

Here is a detailed breakdown of what kind of match caused a query to be part of the taxon subgraph:

Taxon subgraph query breakdown
item predicate URI taxon Q-id literal # query % all query % taxon query
0 0 1 1 0 22,955,853 12.046 84.48
0 1 0 0 0 1,415,482 0.743 5.209
1 0 0 0 0 638,533 0.335 2.35
0 0 1 0 0 624,147 0.328 2.297
1 0 1 0 0 501,232 0.263 1.845
0 0 0 0 1 443,593 0.233 1.632
0 0 1 1 1 233,109 0.122 0.858
1 1 1 0 0 132,462 0.07 0.487
1 0 0 0 1 66,408 0.035 0.244
0 1 0 0 1 60,111 0.032 0.221
1 1 0 0 0 38,147 0.02 0.14
1 0 1 1 0 30,652 0.016 0.113
0 0 1 0 1 13,104 0.007 0.048
1 0 1 0 1 9,026 0.005 0.033
0 1 1 1 0 5,847 0.003 0.022
1 1 1 1 0 5,248 0.003 0.019
0 1 1 1 1 24 0.0 0.0
0 0 0 1 0 14 0.0 0.0
0 1 1 0 0 3 0.0 0.0
Total 27,172,995 14.26 100

File:Taxon venn.png
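A breakdown like the one in the table above can be produced by counting queries per combination of match flags. The following is a minimal sketch under stated assumptions: the record layout, the flag names in `FLAGS`, and the function `match_breakdown` are hypothetical and are not taken from the original analysis code.

```python
from collections import Counter

# Hypothetical flag names, one per match reason in the table.
FLAGS = ("item", "predicate", "uri", "taxon_qid", "literal")


def match_breakdown(queries, all_query_count):
    """Count queries per flag combination and express each count as a
    percentage of all queries and of the subgraph's queries."""
    counts = Counter(
        tuple(int(bool(query[flag])) for flag in FLAGS) for query in queries
    )
    subgraph_total = sum(counts.values())
    return [
        {
            "flags": combo,
            "queries": n,
            "pct_all": 100 * n / all_query_count,
            "pct_subgraph": 100 * n / subgraph_total,
        }
        for combo, n in counts.most_common()
    ]


# Toy input: three queries match a URI that is the taxon Q-ID,
# one query matches only a property.
qid_match = {"item": False, "predicate": False, "uri": True,
             "taxon_qid": True, "literal": False}
prop_match = {"item": False, "predicate": True, "uri": False,
              "taxon_qid": False, "literal": False}
sample = [qid_match, qid_match, qid_match, prop_match]

for row in match_breakdown(sample, all_query_count=10):
    print(row)
```

Each query contributes to exactly one row, so the rows sum to the subgraph total, as in the table.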

Instance items matched

  • Total items used: 588,668
  • Total queries that use these items: 1,421,708 (0.75% of all queries)
  • The distribution shows there are only 3 high usage(>100k queries) items, and the rest form a long tail of small usage (<1k queries) items in the taxon subgraph.
  • Note that these are for the queries from the month of November 2021. These data change from one month to another.
Top items that cause a query to be related to Taxon subgraph (Q16521)
Instance item Instance item label #of queries
Q15978631 Homo sapiens 148,455
Q83310 house mouse 111,473
Q184224 brown rat 111,397
Q25400 Asteraceae 67,794
Q729 animal 43,360
Q25308 Orchidaceae 22,905
Q756 plant 11,244
Q14560 Cactaceae 7,840
Q173756 Apocynaceae 7,018
Q526228 Acantharchus pomotis 4,373
Q36341 Brown Bear 1,878
Q80174 Pan 1,696
Q19537 bonobo 1,684
Q69581 Siberian tiger 1,664
Q171497 sunflower 1,637
Q504549 Spur-thighed tortoise 1,617
Q8202634 Apis mellifera sahariensis 1,438
Q41960 Ailurus fulgens 1,317
Q2346039 Thunnus 1,285
Q719725 Saccharomyces cerevisiae 1,244

File:Taxon instance count all log.png

File:Taxon instance count.gif

Properties matched

  • Total properties used: 162 (Recall these are properties that occur 99% of the time in the taxon subgraph)
  • Total queries that use these properties: 1,657,324 (0.87% of all queries)
  • Most of these look like external IDs. Only 31 of these properties are not IDs.
  • The distribution shows there is 1 property with >1M queries, 7 properties with >100K queries, 14 properties with 2-8K queries, and the rest of the properties match ~1K queries or fewer. In short, the distribution is extremely skewed by only ~10 properties.
Top properties that cause a query to be related to Taxon subgraph (Q16521)
Property Property label #of queries
P3151 iNaturalist taxon ID 1,346,905
P141 IUCN conservation status 167,512
P183 endemic to 152,618
P961 IPNI plant ID 64,340
P938 FishBase species ID 39,494
P6018 SeaLifeBase ID 25,217
P574 year of taxon publication 12,599
P2040 CITES Species+ ID 12,073
P697 ex taxon author 10,710
P566 basionym 8,469
P5473 The Reptile Database ID 6,357
P5036 AmphibiaWeb Species ID 6,343
P7715 World Flora Online ID 4,868
P5626 Global Invasive Species Database ID 4,509
P5037 Plants of the World online ID 3,629
P6105 Observation.org ID 3,193
P960 Tropicos ID 2,922
P9157 Open Tree of Life ID 2,470
P1070 PlantList-ID 2,181
P1772 USDA PLANTS ID 2,167

File:Taxon pred count all log.png

File:Taxon pred count.gif

Subject/Object URI matched

  • Total URIs used: 651,945 (Recall these are URIs that occur 99% of the time in the taxon subgraph)
  • Total queries that use these URIs: 24,510,707 (12.86% of all queries)
  • The top URI is in fact the Q-ID of the taxon subgraph, Q16521, which matches 12.19% of all queries. We look into the queries directly later in this section.
  • We analyze the top 100K URIs. Of these, 66% are Wikidata items, 31% are Wikipedia links.
  • The distribution shows that the top 2 URIs occur in queries tens of times more often than the other URIs. This data is only for November 2021, but the high usage of the taxon Q-ID was also observed in the October 2021 data.
Top URIs that cause a query to be related to Taxon subgraph (Q16521)
URI URI label #of queries
Q16521 taxon 23,230,733 (12.19%)
Q767728 variety 50,370 (0.03%)
Q279749 form 7,100
https://ja.wikipedia.org/wiki/キーウィ_(鳥) - 4,827
https://en.wikipedia.org/wiki/Tokay_gecko - 2,602
https://ja.wikipedia.org/wiki/ガラパゴスリクイグアナ - 2,309
https://ja.wikipedia.org/wiki/セミ - 2,186
https://ja.wikipedia.org/wiki/ドードー - 1,932
https://ja.wikipedia.org/wiki/ダイオウホウズキイカ - 1,864
https://ja.wikipedia.org/wiki/アオリイカ - 1,843
https://ja.wikipedia.org/wiki/コウイカ目 - 1,835
https://en.wikipedia.org/wiki/Bobtail_squid - 1,834
https://ja.wikipedia.org/wiki/緩歩動物 - 1,821
https://ja.wikipedia.org/wiki/トグロコウイカ - 1,819
https://ja.wikipedia.org/wiki/アメリカオオアカイカ - 1,814
https://ja.wikipedia.org/wiki/ケンサキイカ - 1,812
https://ja.wikipedia.org/wiki/ハモ - 1,547
https://en.wikipedia.org/wiki/Coronavirus - 1,421
https://ja.wikipedia.org/wiki/ダンゴムシ - 1,402
https://ja.wikipedia.org/wiki/ブタ - 1,333

File:Taxon uri count all log.png

File:Taxon uri count all log except4largest.png

The Wikipedia links come from 80 different languages. The tables below list the top languages by number of unique links queried and show the top 5 links for each of these languages.

Top language Wikipedias as URIs in taxon subgraph queries
Language: en, # unique links:13533
URI # Query
en.wikipedia.org/wiki/Tokay_gecko 2,602
en.wikipedia.org/wiki/Bobtail_squid 1,834
en.wikipedia.org/wiki/Coronavirus 1,421
en.wikipedia.org/wiki/Fusarium 1,204
en.wikipedia.org/wiki/Tardigrade 1,202
Language: ja, # unique links:6611
URI # Query
ja.wikipedia.org/wiki/キーウィ_(鳥) 4,827
ja.wikipedia.org/wiki/ガラパゴスリクイグアナ 2,309
ja.wikipedia.org/wiki/セミ 2,186
ja.wikipedia.org/wiki/ドードー 1,932
ja.wikipedia.org/wiki/ダイオウホウズキイカ 1,864
Language: es, # unique links:2908
URI # Query
es.wikipedia.org/wiki/Eudocimus_ruber 977
es.wikipedia.org/wiki/Chelonioidea 610
es.wikipedia.org/wiki/Pelecanus 458
es.wikipedia.org/wiki/Fregata_magnificens 262
es.wikipedia.org/wiki/Ebolavirus 255
Language: de, # unique links:1950
URI # Query
de.wikipedia.org/wiki/Coronaviridae 123
de.wikipedia.org/wiki/Taubenschwänzchen 95
de.wikipedia.org/wiki/Mariendistel 79
de.wikipedia.org/wiki/Wespenspinne 75
de.wikipedia.org/wiki/Schnaken 72
Language: fr, # unique links:1342
URI # Query
fr.wikipedia.org/wiki/Homo_sapiens 201
fr.wikipedia.org/wiki/Coronavirus 107
fr.wikipedia.org/wiki/Tardigrada 80
fr.wikipedia.org/wiki/Scutigère_véloce 76
fr.wikipedia.org/wiki/Muguet_de_mai 55
Language: nl, # unique links:1056
URI # Query
nl.wikipedia.org/wiki/Europese_hoornaar 105
nl.wikipedia.org/wiki/Vuurwants 66
nl.wikipedia.org/wiki/Stinkende_kortschildkever 61
nl.wikipedia.org/wiki/Stadsreus_(zweefvlieg) 48
nl.wikipedia.org/wiki/Coronavirussen 47
Language: ru, # unique links:800
URI # Query
ru.wikipedia.org/wiki/Мимивирус 53
ru.wikipedia.org/wiki/Коронавирусы 50
ru.wikipedia.org/wiki/Обыкновенная_мухоловка 49
ru.wikipedia.org/wiki/Тихоходки 40
ru.wikipedia.org/wiki/Малая_панда 40
Language: pt, # unique links:520
URI # Query
pt.wikipedia.org/wiki/Candiru 39
pt.wikipedia.org/wiki/Tangerina 36
pt.wikipedia.org/wiki/Tardigrada 34
pt.wikipedia.org/wiki/Grelo 33
pt.wikipedia.org/wiki/Panda-vermelho 31
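The per-language grouping shown in the tables above can be sketched by reading the language code from the host name of each Wikipedia URI. This is only an illustration: the helper names `wikipedia_language` and `unique_links_by_language` are hypothetical, and hosts that are not of the form `<lang>.wikipedia.org` (mobile domains, for example) are simply skipped.

```python
from collections import defaultdict
from urllib.parse import urlparse


def wikipedia_language(uri):
    """Return the language code of a <lang>.wikipedia.org URI, else None."""
    host = urlparse(uri).hostname or ""
    labels = host.split(".")
    if len(labels) == 3 and labels[1:] == ["wikipedia", "org"]:
        return labels[0]
    return None


def unique_links_by_language(uris):
    """Count distinct Wikipedia links per language code."""
    links = defaultdict(set)
    for uri in uris:
        lang = wikipedia_language(uri)
        if lang is not None:
            links[lang].add(uri)
    return {lang: len(found) for lang, found in links.items()}


uris = [
    "https://en.wikipedia.org/wiki/Tokay_gecko",
    "https://en.wikipedia.org/wiki/Bobtail_squid",
    "https://en.wikipedia.org/wiki/Tokay_gecko",  # repeated link, counted once
    "https://ja.wikipedia.org/wiki/セミ",
    "http://www.wikidata.org/entity/Q16521",  # not a Wikipedia link
]
print(unique_links_by_language(uris))
```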

Query time

  • The total query time of taxon subgraph is ~3% of total query time and total query count is 14.26% of all queries.
  • Average time per query is 0.064 seconds (64 ms). Almost all queries in this subgraph are small and simple.
  • The query time distribution is shown in the chart below: in absolute counts, in percent of queries in taxon subgraph, and in percent of all queries.

File:Taxon time class.png
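Assigning queries to time classes can be sketched as a simple bucketing step. The analysis uses 5 classes, of which the second is 10ms to 100ms and the third is 100ms to 1s; the remaining boundaries below (under 10ms, 1s to 10s, over 10s) are assumptions made for illustration, as are the names `BOUNDS_MS`, `LABELS`, `time_class`, and `class_distribution`.

```python
from bisect import bisect_right
from collections import Counter

# Upper bounds of the first four classes, in milliseconds.
# Only the 10 ms, 100 ms and 1 s boundaries are stated in the analysis;
# the 10 s boundary is an assumption.
BOUNDS_MS = [10, 100, 1000, 10000]
LABELS = ["<10ms", "10ms-100ms", "100ms-1s", "1s-10s", ">10s"]


def time_class(query_time_ms):
    """Map a query time in milliseconds to its time class label."""
    return LABELS[bisect_right(BOUNDS_MS, query_time_ms)]


def class_distribution(query_times_ms):
    """Percent of queries falling into each time class."""
    counts = Counter(time_class(t) for t in query_times_ms)
    total = sum(counts.values())
    return {label: 100 * counts[label] / total for label in LABELS}


# The average taxon query time of 64 ms falls in the second class.
print(time_class(64))
print(class_distribution([5, 20, 64, 80, 250]))
```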

User agent

The top user agents that query the taxon subgraph are listed below. This shows the distribution of usage: whether a few user agents dominate, or whether usage is well distributed across user agents. The table shows the top 10 user agents by query count and by query time. The query type column is discussed later in the section.

Top user agents in taxon subgraph
User agent Query count % query in taxon subgraph % query overall Query time(hr) % query time in taxon subgraph % query time overall # query type
mix-n-match 22,959,293 84.493 12.048 163 33.949 1.063 5
Hub 1,318,251 4.851 0.692 17 3.455 0.108 1
WikidataQueryServiceR 568,799 2.093 0.298 9 1.837 0.058 34
UA#4 325,563 1.198 0.171 10 2.044 0.064 5
UA#5 265,565 0.977 0.139 2 0.495 0.015 5
UA#6 199,650 0.735 0.105 53 11.133 0.349 24
UA#7 168,536 0.62 0.088 2 0.441 0.014 2
sparqlwrapper 161,781 0.595 0.085 126 26.257 0.822 33
UA#9 107,736 0.396 0.057 3 0.627 0.02 1
EasyContent 103,292 0.38 0.054 2 0.346 0.011 1
UA#11 1,330 0.005 0.001 12 2.481 0.078 8
UA#12 45,065 0.166 0.024 8 1.644 0.051 9
AhrefsBot 56,580 0.208 0.03 8 1.575 0.049 12
Apache-Jena-ARQ 6,654 0.024 0.003 6 1.27 0.04 11

The query time breakdown was plotted for the top 20 user agents (in terms of time).

File:Taxon ua query class percent.png File:Taxon ua query class percent log.png

Query types

Query types are grouped by the operations a query uses and the order in which they are used. This groups similar queries together even when they seek different information, and separates simple queries from complicated ones. The taxon subgraph has ~1100 different types of queries (far less variety than the 11K query types in the human subgraph). Note that some query groups can also be very similar in what they ask for, although most groups differ a lot. The top query groups are listed below. The top 3 types of queries alone account for ~90% of the queries in the taxon subgraph, and the top 40 account for 99%. The rest form a long tail of small query counts.

Top query types in taxon subgraph
Query group (operation list) #of queries %of queries in taxon subgraph
['path', 'table', 'bgp', 'join', 'bgp', 'union', 'join', 'project'] 13,013,162 47.89
['path', 'table', 'bgp', 'join', 'bgp', 'union', 'bgp', 'union', 'join', 'project'] 9,943,644 36.594
['bgp', 'project', 'distinct', 'slice'] 1,318,254 4.851
['table', 'bgp', 'leftjoin', 'bgp', 'join', 'filter', 'project'] 236,468 0.87
['bgp', 'project'] 230,329 0.848
['path', 'bgp', 'path', 'sequence', 'table', 'join', 'filter', 'order', 'project', 'distinct'] 199,441 0.734
['bgp', 'bgp', 'service', 'join', 'project', 'slice'] 169,045 0.622
['table', 'extend', 'extend', 'bgp', 'join', 'project'] 152,885 0.563
['table', 'path', 'bgp', 'sequence', 'path', 'bgp', 'sequence', 'path', 'bgp', 'sequence', 'leftjoin', 'leftjoin', 'leftjoin', 'bgp', 'leftjoin', 'bgp', 'leftjoin', 'bgp', 'leftjoin', 'bgp', 'leftjoin', 'bgp', 'leftjoin', 'bgp', 'service', 'join', 'project', 'distinct'] 152,353 0.561
['bgp', 'extend', 'bgp', 'extend', 'union', 'bgp', 'extend', 'union', 'bgp', 'extend', 'union', 'path', 'extend', 'union', 'path', 'extend', 'union', 'bgp', 'extend', 'union', 'bgp', 'extend', 'union', 'bgp', 'extend', 'union', 'bgp', 'extend', 'union', 'bgp', 'extend', 'union', 'bgp', 'extend', 'union', 'bgp', 'extend', 'union', 'bgp', 'extend', 'union', 'bgp', 'extend', 'union', 'path', 'extend', 'union', 'bgp', 'extend', 'union', 'path', 'extend', 'union', 'bgp', 'extend', 'union', 'path', 'extend', 'union', 'project', 'distinct'] 139,177 0.512

File:Taxon qtype 40.png

File:Taxon qtype all.png
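The grouping into query types and the "top N types cover X% of queries" figures can be sketched as follows. This assumes the operation list of each query has already been extracted (the analysis does this with Jena ARQ, which is not reproduced here); the function names `query_type_shares` and `types_needed` are hypothetical.

```python
from collections import Counter
from itertools import accumulate


def query_type_shares(operation_lists):
    """Group queries by their exact operation list and return, per type,
    (operations, count, percent, cumulative percent), most common first."""
    counts = Counter(tuple(ops) for ops in operation_lists)
    total = sum(counts.values())
    ranked = counts.most_common()
    running = accumulate(n for _, n in ranked)
    return [
        (ops, n, 100 * n / total, 100 * cum / total)
        for (ops, n), cum in zip(ranked, running)
    ]


def types_needed(shares, target_percent):
    """Number of top query types needed to cover target_percent of queries."""
    for rank, (_, _, _, cumulative) in enumerate(shares, start=1):
        if cumulative >= target_percent:
            return rank
    return len(shares)


# Toy input: the order of operations matters, so the last list is its own type.
queries = [
    ["bgp", "project"],
    ["bgp", "project"],
    ["bgp", "project", "distinct", "slice"],
    ["project", "bgp"],
]
shares = query_type_shares(queries)
print(shares[0])
print(types_needed(shares, 75))
```

Applied to the table above, the first three rows sum to 47.89 + 36.594 + 4.851 = 89.335%, which is the ~90% quoted in the text.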

Looking at the top 5 query groups (that make up >90% of all taxon subgraph queries), the following query types were found:

  1. Search with taxon name, synonyms, altlabels etc
  2. Search with taxon ID. E.g. SELECT DISTINCT ?subject WHERE { ?subject wdt:P3151 '47126' .}
  3. Get labels of certain items
  4. Get labels of certain items in specific languages
  5. Get external IDs of items

UA vs query types

Getting the number of query types per user agent informs us of the variety of queries a user agent makes to WDQS. This also breaks down the taxon subgraph queries into finer groups. The following plot shows the number of query types for each user agent in the taxon subgraph.

File:Taxon qtype vs ua.png

This shows us that most user agents make only 1 type of query. Only 3 user agents make queries of >100 types, and 5 user agents make queries of 50-100 types. The figure below shows the number of queries per query type for the top 8 user agents.

** The number of query types for the top user agents in terms of query count and time is listed in the Taxon User agent section.

File:Taxon query vs query type 8ua.png
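Counting the number of query types per user agent amounts to collecting the distinct operation lists seen for each agent. A minimal sketch, assuming hypothetical `(user_agent, operation_list)` records and a hypothetical function name `query_types_per_agent`:

```python
from collections import defaultdict


def query_types_per_agent(records):
    """Return the number of distinct query types made by each user agent."""
    types = defaultdict(set)
    for agent, operations in records:
        types[agent].add(tuple(operations))
    return {agent: len(found) for agent, found in types.items()}


# Toy records; the agent names are placeholders.
records = [
    ("agent-a", ["bgp", "project"]),
    ("agent-a", ["bgp", "project"]),
    ("agent-a", ["bgp", "project", "distinct", "slice"]),
    ("agent-b", ["bgp", "project"]),
]
print(query_types_per_agent(records))
```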

Query type vs time class

While there are close to 1,100 query types, and the top 3 alone account for ~90% of the queries in the taxon subgraph (12.8% of all queries), not all of them are equally time consuming. Some are simple queries, while others are long and complex. The following plot shows the top 10 query types broken down by query time class. The values above the bars show both the percent in the taxon subgraph and the percent of all queries. Each subplot is titled with the percent of queries in that query type.

In sum, most queries in the taxon subgraph are small and simple, take between 10 and 100 ms to run, and come mostly from one or two user agents.

File:Taxon top 10 qtype qtime.png

Services

The queries use 12 unique services. The top 4 services are the most used, although their usage is still quite low; the rest are each used in fewer than 30 queries.

Services used in queries in taxon subgraph
Service Query count % query in taxon subgraph
wikibase:label 853,272 3.14
wikibase:mwapi 8,198 0.03
gas:service 1,206 0.004
bd:sample 512 0.002
https://query.wikidata.org/bigdata/namespace/wdq/sparql 28 0.0
idsm:wikidata 28 0.0
https://query.wikidata.org/sparql 14 0.0
http://sparql.wikipathways.org/sparql 13 0.0
wikibase:around 3 0.0
https://sparql.wikipathways.org/sparql 2 0.0
https://sophox.org/sparql 2 0.0
https://spang.dbcls.jp/sparql 1 0.0

Triples

The query type analysis in the Query types section gives a good idea of what kind of queries the taxon subgraph receives. Looking at the triples themselves also lets us see what most of the queries look like and what the most common subjects, objects, and properties are. The tables below list these along with the top Wikidata items and properties used overall.

Top triples in taxon subgraph queries
Subject Predicate Object # query % query of taxon subgraph
q wdt:P31/(wdt:P279)* wd:Q16521 22,956,806 84.484*
bd:serviceParam wikibase:language en 379,833 1.398
item wdt:P225 taxonName 265,426 0.977
item p:P105 taxonRank1 265,360 0.977
taxonRank1 ps:P105 taxonRank 265,360 0.977
item rdfs:label label 241,505 0.889
item skos:altLabel altLabel 236,495 0.870
bd:serviceParam wikibase:language [AUTO_LANGUAGE],en 205,404 0.756
items rdfs:label itemlabel 199,441 0.734
items (wdt:P279)? types 195,916 0.721

*Coincides with the number of queries from mix-n-match. Almost all mix-n-match queries in the taxon subgraph have this triple, and this triple occurs only in mix-n-match queries.

Top Subjects
Subject Query count % Query in taxon subgraph
q 23,071,376 84.906
subject 1,335,606 4.915
item 965,952 3.555
bd:serviceParam 853,712 3.142
taxonRank1 265,360 0.977
items 199,444 0.734
resource 153,967 0.567
class 153,628 0.565
order 152,354 0.561
family 152,354 0.561
P27 112,441 0.414
wikidata_id 111,123 0.409
https://ja.wikipedia.org/wiki/ 106,127 0.391
wdpage 71,631 0.264
id 39,642 0.146
X 29,097 0.107
Z 28,987 0.107
supertype 23,178 0.085
property 21,799 0.08
location 17,093 0.063
Top Predicates
Predicate Query count % Query in taxon subgraph
skos:altLabel 23,214,129 85.431
wdt:P31/(wdt:P279)* 23,033,714 84.767
prop 22,957,095 84.485
wdt:P3151 1,342,739 4.941
rdfs:label 881,092 3.243
wikibase:language 853,273 3.14
wdt:P225 751,916 2.767
wdt:P31 643,819 2.369
schema:about 611,190 2.249
wdt:P1843 430,635 1.585
wdt:P18 322,440 1.187
p:P105 280,663 1.033
ps:P105 280,547 1.032
wdt:P105 207,658 0.764
(wdt:P279)? 195,916 0.721
((wdt:P31)*/(wdt:P279)*)/(wd:P361)* 195,915 0.721
(wdt:P171)+ 168,128 0.619
wdt:P141 165,257 0.608
wdt:P183 152,479 0.561
wdt:P1889 152,376 0.561
Top Objects
Object Query count % Query in taxon subgraph
wd:Q16521 23,000,295 84.644
taxonName 422,507 1.555
en 387,760 1.427
pic 290,568 1.069
label 277,361 1.021
taxonRank 270,223 0.994
taxonRank1 265,360 0.977
altLabel 237,031 0.872
class 218,616 0.805
[AUTO_LANGUAGE],en 205,404 0.756
itemlabel 199,450 0.734
items 199,441 0.734
types 195,920 0.721
name 181,578 0.668
wd:Q35409 162,940 0.6
wd:Q36602 155,742 0.573
wd:Q37517 153,105 0.563
endemicTo 152,386 0.561
conservationStatus 152,357 0.561
order 152,354 0.561
Top Wikidata items
Item Query count % Query in human subgraph
P31 23,985,585 39.406
P225 23,735,291 38.994
P279 23,529,502 38.656
Q16521 23,230,747 38.165
P1420 22,965,572 37.73
P3151 1,346,905 2.213
P105 490,972 0.807
P1843 433,159 0.712
P18 325,372 0.535
P171 311,272 0.511
P625 273,859 0.45
P361 212,381 0.349
Q379813 200,195 0.329
Q152 199,784 0.328
Q2095 199,645 0.328
Q25403900 199,589 0.328
Q11004 199,561 0.328
Q12117 199,544 0.328
Q10990 199,527 0.328
Q3314483 199,503 0.328

Paths

Paths are more complex predicates that chain properties with logic. Complex paths can increase the scope of a query and also increase its runtime. The table below lists the most used paths in taxon subgraph queries. While most paths are not very complex or long, there is a lot of variety in the ways paths are formed to perform queries. Ordinary properties are not considered paths. The following list contains not only the paths, but also their breakdown into component paths (as done by Jena ARQ while parsing SPARQL queries). For instance, (p:P31/ps:P31)/(wdt:P279)* is recorded as:

  • (p:P31/ps:P31)/(wdt:P279)*
    • (p:P31/ps:P31)
      • p:P31
      • ps:P31
    • (wdt:P279)*
      • wdt:P279

Unit forms, such as wdt:P279, were removed from the path list since they are parts of other paths and not paths themselves. Other entries that were clearly fragments of a longer path, and not paths themselves, were also removed from the list to better show the distinct paths used in the queries.

Top Paths
Path Query count % Query in taxon subgraph
wdt:P31/(wdt:P279)* 23,035,627 84.774
((wdt:P31)*/(wdt:P279)*)/(wd:P361)* 195,916 0.721
(wdt:P171)+ 168,128 0.619
wdt:P1416|wdt:P108 139,191 0.512
(wdt:P159)?/wdt:P625 139,191 0.512
wdt:P50|wdt:P2093 139,191 0.512
^wdt:P31/wdt:P235 139,191 0.512
wdt:P31/(wdt:P279)? 139,184 0.512
(((((((((((((((((((((((((wdt:P171|(wdt:P171/wdt:P171))|((wdt:P171/wdt:P171)/wdt:P171))|(((wdt:P171/wdt:P171)/wdt:P171)/wdt:P171))|(wdt:P171/wdt:P171*4))|(wdt:P171/wdt:P171*5))|(wdt:P171/wdt:P171*6))|(wdt:P171/wdt:P171*7))|(wdt:P171/wdt:P171*8))|(wdt:P171/wdt:P171*9))|(wdt:P171/wdt:P171*10))|(wdt:P171/wdt:P171*11))|(wdt:P171/wdt:P171*12))|(wdt:P171/wdt:P171*13))|(wdt:P171/wdt:P171*14))|(wdt:P171/wdt:P171*15))|(wdt:P171/wdt:P171*16))|(wdt:P171/wdt:P171*17))|(wdt:P171/wdt:P171*18))|(wdt:P171/wdt:P171*19))|(wdt:P171/wdt:P171*20))|(wdt:P171/wdt:P171*21))|(wdt:P171/wdt:P171*22))|(wdt:P171/wdt:P171*23))|(wdt:P171/wdt:P171*24))|(wdt:P171/wdt:P171*25)) 112,475 0.414
wdt:P31|wdt:P279 103,573 0.381
(wdt:P1647)* 11,178 0.041
(((((((((((((((wdt:P17|wdt:P101)|wdt:P112)|wdt:P135)|wdt:P136)|wdt:P279)|wdt:P361)|wdt:P460)|wdt:P793)|wdt:P800)|wdt:P1269)|wdt:P1344)|wdt:P1830)|(p:P2572/ps:P2572))|wdt:P3342)|wdt:P3602)|wdt:P5004 11,002 0.04