User:AKhatun/Wikidata Subgraph Query Analysis
Revision as of 06:42, 9 December 2021
Analysis on Subgraphs in Wikidata showed how large each of the subgraphs in Wikidata is and how connected they are. This page shows the results of analyzing the queries that relate to these subgraphs. The questions that needed to be answered were:
- What percentage of queries access each subgraph?
- How many queries access multiple subgraphs at once? i.e., how much overlap can we expect between subgraphs?
- How long do these queries take?
- How many user-agents access each subgraph? Do they access lots of subgraphs, or are they confined to a small set of subgraphs? Do some of them dominate queries in multiple subgraphs?
- Are there chunks of similar queries in these subgraphs? i.e., how diverse are the queries in each subgraph?
TL;DR
What are subgraph related queries
We define some parameters to identify whether a query touches on a subgraph based on the items and properties a query uses. Some queries may even touch on multiple subgraphs. See more on what a subgraph means here. Note: Subgraphs have overlaps.
The parameters that define which subgraph a query belongs to are:
- If the query uses the subgraph's Qid. Example: queries containing Q5 are part of the Q5 subgraph.
- If the query uses items that are instance of a particular subgraph.
- If the query uses items that occur 99% of the time in a particular subgraph.
- If the query uses properties that occur 99% of the time in a particular subgraph.
- If the query uses literals that occur 99% of the time in a particular subgraph. The literals can occur with or without language tags; both versions are compared to check for a match. Note that only whole literals are matched between queries and Wikidata. Queries that ask for partial matches, using regex for example, are not included. The assumption is that such queries are likely to contain other items from the subgraph and are caught anyway.
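The matching rules above can be sketched roughly as follows. This is a minimal illustration, not the code used for the analysis: the function name and the three lookup structures (`subgraph_qids`, `instance_of`, `term_share`) are hypothetical, and "occurs 99% of the time in a subgraph" is read here as "at least 99% of the term's occurrences in Wikidata fall inside that subgraph", which is an assumption.

```python
THRESHOLD = 0.99  # "occurs 99% of the time in a particular subgraph"


def subgraphs_of_query(terms, subgraph_qids, instance_of, term_share,
                       threshold=THRESHOLD):
    """Return the set of subgraph Qids that a query touches.

    terms         -- items, properties and whole literals used by the query
    subgraph_qids -- set of Qids that define subgraphs (e.g. {"Q5"})
    instance_of   -- hypothetical lookup: item -> set of classes it is an instance of
    term_share    -- hypothetical lookup: term -> {subgraph Qid: fraction of the
                     term's occurrences that fall inside that subgraph}
    """
    matched = set()
    for term in terms:
        # Rule 1: the query mentions the subgraph's own Qid.
        if term in subgraph_qids:
            matched.add(term)
        # Rule 2: the query mentions an item that is an instance of the subgraph.
        matched.update(instance_of.get(term, set()) & subgraph_qids)
        # Rules 3-5: the item, property or literal occurs almost exclusively
        # in one subgraph (several may qualify, since subgraphs overlap).
        for qid, share in term_share.get(term, {}).items():
            if share >= threshold:
                matched.add(qid)
    return matched
```

For example, a query using `Q42` (an instance of `Q5`) and a property whose occurrences are 99.5% inside `Q5` would be assigned to the `Q5` subgraph only, while a query using terms from two subgraphs would be assigned to both.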
The following analysis uses the Wikidata dump of 20211101 and WDQS public SPARQL queries from 10/2021. All query related numbers below are monthly values.
Query count and time
- All queries here refer to queries with status code 200 or 500, i.e. valid queries that either succeeded or timed out.
- WDQS receives ~220M queries a month.
- Total query time for all queries for a month is ~16,000 hours.
The table below lists the top 50 most queried subgraphs with subgraph size and query time information. A breakdown of what caused the match is also present, which corresponds to the parameters mentioned in #What are subgraph related queries. It also ranks the subgraphs by size, query count, and query time consumed.
A more complete list containing 341 subgraphs, that form ~90% of Wikidata triples, is available here: [csvlink|all_subgraph_data.csv]
Subgraph rank by size | Subgraph rank by query count | Subgraph rank by query time | Subgraph | Subgraph label | % of triples | % of entities | Days to recover (4.77M rate) | Query count | % of all queries (count) | Query time (hr) | Fraction of total query time | % of all queries matched by Qid | % of all queries matched by instance items | % of all queries matched by items | % of all queries matched by properties | % of all queries matched by literals |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 1 | 1 | Q5 | human | 7.324 | 9.986 | 203 | 68,659,369 | 31.058 | 6314 | 0.393 | 1.827 | 17.705 | 10.324 | 20.176 | 1.11 |
5 | 2 | 4 | Q16521 | taxon | 2.871 | 3.427 | 79 | 56,437,140 | 25.529 | 495 | 0.031 | 22.986 | 1.251 | 23.665 | 0.965 | 0.496 |
6 | 3 | 3 | Q101352 | family name | 1.546 | 0.509 | 43 | 5,564,173 | 2.517 | 640 | 0.04 | 0.064 | 2.425 | 2.34 | 0.016 | 0.032 |
15 | 4 | 2 | Q11424 | film | 0.364 | 0.281 | 10 | 4,757,084 | 2.152 | 1613 | 0.1 | 0.563 | 1.308 | 1.089 | 0.008 | 0.407 |
34 | 5 | 7 | Q4830453 | business | 0.108 | 0.207 | 3 | 4,041,395 | 1.828 | 343 | 0.021 | 0.953 | 0.788 | 0.416 | 0.0 | 0.101 |
7 | 6 | 9 | Q4167410 | Wikimedia disambiguation page | 1.374 | 1.459 | 38 | 3,737,550 | 1.691 | 223 | 0.014 | 0.195 | 0.484 | 0.554 | 0.0 | 0.938 |
177 | 7 | 20 | Q34770 | language | 0.013 | 0.011 | 0 | 1,713,196 | 0.775 | 73 | 0.005 | 0.008 | 0.757 | 0.009 | 0.0 | 0.005 |
1 | 8 | 13 | Q13442814 | scholarly article | 49.668 | 39.794 | 1375 | 1,649,268 | 0.746 | 142 | 0.009 | 0.005 | 0.261 | 0.278 | 0.124 | 0.386 |
4 | 9 | 17 | Q4167836 | Wikimedia category | 5.85 | 5.165 | 162 | 1,383,343 | 0.626 | 96 | 0.006 | 0.019 | 0.594 | 0.152 | 0.0 | 0.01 |
10 | 10 | 14 | Q11173 | chemical compound | 0.693 | 1.302 | 19 | 1,307,852 | 0.592 | 133 | 0.008 | 0.022 | 0.548 | 0.449 | 0.001 | 0.014 |
20 | 11 | 22 | Q13406463 | Wikimedia list article | 0.252 | 0.352 | 7 | 1,283,160 | 0.58 | 73 | 0.005 | 0.018 | 0.409 | 0.357 | 0.0 | 0.048 |
63 | 12 | 6 | Q5398426 | television series | 0.055 | 0.062 | 2 | 1,206,285 | 0.546 | 366 | 0.023 | 0.05 | 0.332 | 0.252 | 0.0 | 0.128 |
243 | 13 | 24 | Q14204246 | Wikimedia project page | 0.008 | 0.033 | 0 | 1,114,113 | 0.504 | 62 | 0.004 | 0.009 | 0.227 | 0.016 | 0.0 | 0.275 |
92 | 14 | 11 | Q6881511 | enterprise | 0.036 | 0.052 | 1 | 943,613 | 0.427 | 164 | 0.01 | 0.034 | 0.338 | 0.144 | 0.0 | 0.042 |
26 | 15 | 29 | Q484170 | commune of France | 0.18 | 0.043 | 5 | 866,766 | 0.392 | 46 | 0.003 | 0.006 | 0.278 | 0.004 | 0.098 | 0.007 |
165 | 16 | 12 | Q891723 | public company | 0.015 | 0.013 | 0 | 837,595 | 0.379 | 157 | 0.01 | 0.034 | 0.277 | 0.061 | 0.0 | 0.054 |
12 | 17 | 19 | Q3305213 | painting | 0.432 | 0.578 | 12 | 834,752 | 0.378 | 79 | 0.005 | 0.012 | 0.332 | 0.187 | 0.005 | 0.012 |
91 | 18 | 16 | Q43229 | organization | 0.037 | 0.08 | 1 | 806,840 | 0.365 | 123 | 0.008 | 0.128 | 0.213 | 0.097 | 0.0 | 0.012 |
89 | 19 | 8 | Q4164871 | position | 0.037 | 0.128 | 1 | 788,077 | 0.356 | 332 | 0.021 | 0.004 | 0.343 | 0.016 | 0.0 | 0.003 |
28 | 20 | 30 | Q482994 | album | 0.161 | 0.287 | 4 | 776,845 | 0.351 | 37 | 0.002 | 0.012 | 0.287 | 0.209 | 0.0 | 0.016 |
86 | 21 | 23 | Q47461344 | written work | 0.038 | 0.078 | 1 | 774,947 | 0.351 | 67 | 0.004 | 0.244 | 0.085 | 0.039 | 0.0 | 0.003 |
62 | 22 | 35 | Q7889 | video game | 0.056 | 0.047 | 2 | 741,401 | 0.335 | 30 | 0.002 | 0.006 | 0.195 | 0.256 | 0.005 | 0.007 |
16 | 23 | 21 | Q486972 | human settlement | 0.302 | 0.602 | 8 | 721,789 | 0.327 | 73 | 0.005 | 0.095 | 0.22 | 0.107 | 0.0 | 0.006 |
8 | 24 | 18 | Q7187 | gene | 0.927 | 1.273 | 26 | 628,916 | 0.284 | 94 | 0.006 | 0.107 | 0.063 | 0.007 | 0.021 | 0.113 |
25 | 25 | 46 | Q532 | village | 0.201 | 0.292 | 6 | 584,789 | 0.265 | 21 | 0.001 | 0.001 | 0.246 | 0.109 | 0.0 | 0.013 |
70 | 26 | 27 | Q732577 | publication | 0.048 | 0.076 | 1 | 512,416 | 0.232 | 53 | 0.003 | 0.229 | 0.003 | 0.23 | 0.0 | 0.0 |
42 | 27 | 45 | Q7725634 | literary work | 0.077 | 0.176 | 2 | 468,204 | 0.212 | 22 | 0.001 | 0.017 | 0.16 | 0.104 | 0.0 | 0.007 |
138 | 28 | 57 | Q18340514 | events in a specific year or time period | 0.019 | 0.048 | 1 | 463,683 | 0.21 | 17 | 0.001 | 0.0 | 0.2 | 0.056 | 0.0 | 0.005 |
54 | 29 | 60 | Q215380 | musical group | 0.063 | 0.087 | 2 | 461,181 | 0.209 | 17 | 0.001 | 0.009 | 0.164 | 0.073 | 0.0 | 0.008 |
2 | 30 | 28 | Q6999 | astronomical object | 8.75 | 8.942 | 242 | 448,032 | 0.203 | 51 | 0.003 | 0.0 | 0.175 | 0.085 | 0.015 | 0.003 |
41 | 31 | 56 | Q22808320 | Wikimedia human name disambiguation page | 0.078 | 0.075 | 2 | 433,986 | 0.196 | 17 | 0.001 | 0.0 | 0.174 | 0.154 | 0.0 | 0.001 |
53 | 32 | 63 | Q134556 | single | 0.065 | 0.103 | 2 | 431,003 | 0.195 | 16 | 0.001 | 0.001 | 0.167 | 0.138 | 0.0 | 0.004 |
37 | 33 | 32 | Q3331189 | version, edition, or translation | 0.087 | 0.19 | 2 | 410,352 | 0.186 | 34 | 0.002 | 0.103 | 0.053 | 0.118 | 0.004 | 0.028 |
31 | 34 | 41 | Q16970 | church building | 0.129 | 0.226 | 4 | 396,936 | 0.18 | 25 | 0.002 | 0.005 | 0.172 | 0.112 | 0.0 | 0.001 |
71 | 35 | 25 | Q86850539 | Whitaker's Latin frequency type C | 0.048 | 0.011 | 1 | 355,247 | 0.161 | 56 | 0.003 | 0.0 | 0.0 | 0.0 | 0.0 | 0.16 |
11 | 36 | 65 | Q8054 | protein | 0.67 | 1.05 | 19 | 349,573 | 0.158 | 16 | 0.001 | 0.079 | 0.034 | 0.002 | 0.02 | 0.066 |
49 | 37 | 167 | Q2225692 | fourth-level administrative division in Indonesia | 0.07 | 0.088 | 2 | 344,964 | 0.156 | 5 | 0.0 | 0.0 | 0.147 | 0.098 | 0.0 | 0.009 |
223 | 38 | 87 | Q571 | book | 0.009 | 0.022 | 0 | 340,900 | 0.154 | 12 | 0.001 | 0.114 | 0.016 | 0.01 | 0.0 | 0.023 |
112 | 39 | 76 | Q476028 | association football club | 0.026 | 0.038 | 1 | 320,422 | 0.145 | 14 | 0.001 | 0.006 | 0.12 | 0.029 | 0.0 | 0.003 |
21 | 40 | 10 | Q2668072 | collection | 0.248 | 0.534 | 7 | 312,822 | 0.142 | 166 | 0.01 | 0.056 | 0.084 | 0.058 | 0.0 | 0.001 |
113 | 41 | 54 | Q15632617 | fictional human | 0.026 | 0.056 | 1 | 306,319 | 0.139 | 18 | 0.001 | 0.006 | 0.1 | 0.05 | 0.0 | 0.003 |
121 | 42 | 42 | Q3957 | town | 0.023 | 0.015 | 1 | 294,685 | 0.133 | 24 | 0.001 | 0.047 | 0.079 | 0.014 | 0.0 | 0.002 |
133 | 43 | 58 | Q506240 | television film | 0.02 | 0.019 | 1 | 290,899 | 0.132 | 17 | 0.001 | 0.009 | 0.098 | 0.07 | 0.0 | 0.02 |
136 | 44 | 5 | Q15416 | television program | 0.019 | 0.05 | 1 | 286,609 | 0.13 | 386 | 0.024 | 0.024 | 0.084 | 0.072 | 0.0 | 0.01 |
72 | 45 | 79 | Q105543609 | musical work/composition | 0.048 | 0.099 | 1 | 285,889 | 0.129 | 13 | 0.001 | 0.004 | 0.095 | 0.061 | 0.004 | 0.009 |
64 | 46 | 38 | Q811979 | architectural structure | 0.055 | 0.119 | 2 | 282,739 | 0.128 | 28 | 0.002 | 0.09 | 0.035 | 0.024 | 0.0 | 0.001 |
23 | 47 | 51 | Q4022 | river | 0.219 | 0.425 | 6 | 280,190 | 0.127 | 20 | 0.001 | 0.002 | 0.12 | 0.045 | 0.0 | 0.002 |
32 | 48 | 31 | Q41176 | building | 0.125 | 0.287 | 3 | 271,666 | 0.123 | 36 | 0.002 | 0.034 | 0.084 | 0.065 | 0.002 | 0.001 |
45 | 49 | 50 | Q55488 | railway station | 0.075 | 0.104 | 2 | 258,862 | 0.117 | 20 | 0.001 | 0.001 | 0.109 | 0.072 | 0.0 | 0.001 |
192 | 50 | 143 | Q3464665 | television series season | 0.011 | 0.02 | 0 | 254,318 | 0.115 | 6 | 0.0 | 0.031 | 0.077 | 0.009 | 0.0 | 0.0 |
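As a rough sanity check of how the table columns relate to the monthly totals quoted earlier, the arithmetic can be sketched as below. The totals (~220M queries, ~16,000 hours) are rounded in the text, so the results only approximate the table values; the variable and function names are illustrative.

```python
MONTHLY_QUERIES = 220_000_000   # "~220M queries a month" (rounded)
TOTAL_QUERY_HOURS = 16_000      # "~16,000 hours" (rounded)

# Average time per query, in seconds, implied by the two totals.
avg_seconds = TOTAL_QUERY_HOURS * 3600 / MONTHLY_QUERIES


def time_fraction(query_hours, total_hours=TOTAL_QUERY_HOURS):
    """Fraction (not percent) of total monthly query time."""
    return query_hours / total_hours


def count_percent(query_count, total_queries=MONTHLY_QUERIES):
    """Percent of all monthly queries."""
    return 100 * query_count / total_queries


# Q5 (human): 6314 hours and 68,659,369 queries in the table above.
q5_time_fraction = time_fraction(6314)          # about 0.39
q5_count_percent = count_percent(68_659_369)    # about 31
```

The query-time column therefore reads as a fraction of total query time (0.393 corresponds to roughly 39%), whereas the query-count columns are percentages.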
More on query time
The query time can be broken down into classes for better visualization. Below is a figure with the query class distribution (number of queries per query time class per subgraph) for the top 50 subgraphs. Some of the takeaways are:
- Most subgraphs have most of their queries in the range of 10-100ms.
- The second most common class is 100ms to 1s.
- collection and photograph have most of their queries (~150k) timed at 1-10s. Around 10 more subgraphs have a small number of queries (~10-20k) in this time range.
File:Top 50 query time class.png
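The bucketing behind such a figure could look like the sketch below. The classes 10-100ms, 100ms-1s and 1-10s are named in the text; the outer classes (`<10ms`, `>=10s`) and the choice to put each boundary value into the higher class are assumptions made for illustration.

```python
from bisect import bisect_right
from collections import Counter

# Class boundaries in milliseconds; a query time equal to a boundary
# falls into the higher class.
BOUNDS_MS = [10, 100, 1_000, 10_000]
LABELS = ["<10ms", "10-100ms", "100ms-1s", "1-10s", ">=10s"]


def time_class(query_time_ms):
    """Map a query time in milliseconds to its time class label."""
    return LABELS[bisect_right(BOUNDS_MS, query_time_ms)]


def class_distribution(records):
    """Count queries per (subgraph, time class).

    records -- iterable of (subgraph Qid, query time in ms) pairs
    """
    return Counter((subgraph, time_class(ms)) for subgraph, ms in records)
```

Calling `class_distribution` on per-query records yields the counts that a plot of queries per time class per subgraph would display.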
User agent
Analysis on user-agents is an approximation because they don't completely represent distinct users. For example, many people use the same bot or script without changing the user-agent, or the same person or bot uses multiple user-agent strings. Nevertheless, the available data still gives a useful estimate.
User agent count
- Total number of unique user agents across all subgraphs: 981,180
- First, the subgraphs with the most and the fewest distinct user-agents are listed. It seems that every subgraph has at least 10 user-agents, so the large subgraphs are used by multiple users.
- The largest numbers of user-agents are found in a variety of subgraph types. Among them, gene, protein, biological process, and molecular function appear to be similar to one another; it is possible that the same queries match several of these subgraphs. More on subgraph connectivity in #Subgraph Connectivity.
- There are 50 subgraphs with more than 1000 user agents, and 300 subgraphs with fewer than 1000 user agents. Most subgraphs are therefore not queried by a very wide range of user agents. The distribution of user-agent counts below 1000 is shown in the figure below. It clearly shows how small the user-agent counts are in most subgraphs.
User agent distribution in subgraphs
- Next, the user agent vs query count distribution was analyzed for some of the top subgraphs. While user agent count gives us an idea of how many users may be using a subgraph, it is not clear whether all of them query the subgraph equally, or very few user agents perform most of the queries.
- ~30 out of 341 subgraphs have a single user agent that issues >=50% of all queries in that particular subgraph.
- 6 subgraphs have a single user agent that issues around 80-90% of the queries.
- So the trend of a single dominating query source is not widespread among subgraphs, but it is present in a few subgraphs nonetheless.
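One way to compute the share of the top user agent per subgraph is sketched below. The input layout (one row per subgraph and user agent with its monthly query count) and the function names are assumptions for illustration, not the structure used in the original analysis.

```python
from collections import defaultdict


def top_user_agent_share(rows):
    """Return {subgraph: (top user agent, its percent of the subgraph's queries)}.

    rows -- iterable of (subgraph Qid, user agent, query count) tuples
    """
    per_subgraph = defaultdict(lambda: defaultdict(int))
    for subgraph, user_agent, count in rows:
        per_subgraph[subgraph][user_agent] += count

    result = {}
    for subgraph, by_ua in per_subgraph.items():
        total = sum(by_ua.values())
        user_agent, count = max(by_ua.items(), key=lambda kv: kv[1])
        result[subgraph] = (user_agent, 100 * count / total)
    return result


def dominated_subgraphs(rows, min_percent=50.0):
    """Subgraphs whose top user agent issues at least `min_percent` of the queries."""
    return {sg for sg, (_, pct) in top_user_agent_share(rows).items()
            if pct >= min_percent}
```

Counting the members of `dominated_subgraphs(rows, 50.0)` over all 341 subgraphs would give the kind of figure quoted in the list above (~30 subgraphs).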
The figure below shows the query percentages of the top 2 user agents for each of the 341 subgraphs. It shows whether the top user agents form a dominating pattern within a subgraph.
The figure below shows 100 subgraphs with their user agent query usage distribution in percent. Usage greater than 50% is marked in red. A bird's-eye view of the plots shows that some subgraphs have a dominating user agent, while most other subgraphs have at least 1 or 2 user agents that query the most. The rest of the user agents form the long tail of the distribution.
Top user agents in subgraphs
- The top user agents in various subgraphs are listed below. More analysis on Q5 (human) and Q16521 (taxon) is done at the end of the page, as they are the most queried subgraphs.
Subgraph | Subgraph label | User agent | Query count (in subgraph) | Query percent (within subgraph) | Query percent overall |
---|---|---|---|---|---|
Q16521 | taxon | mix-n-match | 50622670 | 89.697 | 22.899 |
Q5 | human | UA # 2 | 9017930 | 13.134 | 4.079 |
Q5 | human | mix-n-match | 8548335 | 12.45 | 3.867 |
Q5 | human | UA # 3 | 5059258 | 7.369 | 2.289 |
Q5 | human | UA # 4 | 4020496 | 5.856 | 1.819 |
Q5 | human | UA # 5 | 3828747 | 5.576 | 1.732 |
Q101352 | family name | UA # 5 | 3828747 | 68.811 | 1.732 |
Q5 | human | UA # 6 | 2685807 | 3.912 | 1.215 |
Q5 | human | UA # 7 | 2434486 | 3.546 | 1.101 |
Q4830453 | business | UA # 8 | 2403677 | 59.476 | 1.087 |
Q5 | human | UA # 9 | 2020598 | 2.943 | 0.914 |
Q16521 | taxon | Hub | 1984437 | 3.516 | 0.898 |
Q5 | human | UA # 11 | 1877700 | 2.735 | 0.849 |
Q5 | human | UA # 12 | 1781161 | 2.594 | 0.806 |
Q16521 | taxon | UA # 13 | 1294113 | 2.293 | 0.585 |
User agent vs Subgraph
So far we have explored the user-agent count and distribution per subgraph. It is also important to look at how each user agent's queries spread across subgraphs. In other words,
- Do users have a very specific use case, so that their queries span only a few subgraphs? Or are their queries spread across a lot of subgraphs?
- Are there some user agents that query the most in multiple subgraphs? This could be due to the nature of the use case or simply because some subgraphs overlap a lot.
We start by looking at how many user agents access how many subgraphs. From the table below, we see that most user agents (89% of them) query only one subgraph. Some user agents query a lot of subgraphs as well. A clearer picture can be seen in the plot below.
File:Ua vs subgraph.png
Next we isolate user agents from each subgraph that query drastically more (>=10% difference) than other user agents in the same subgraph, and perform at least 100k queries (0.05% of all queries) a month. A list of ~30 such user agents was found. The subgraph distributions of all these user agents were plotted to find the large buckets where they tend to query. The plot is shown below, followed by some explicit observations.
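The selection step described above might be sketched as follows. The exact criterion is not spelled out in the text, so this sketch makes two assumptions: ">=10% difference" is read as the top user agent's share exceeding the runner-up's share by at least 10 percentage points, and the 100k threshold is applied to the user agent's query count within the subgraph. All names are hypothetical.

```python
def dominant_user_agents(counts, min_gap_points=10.0, min_queries=100_000):
    """Return {(subgraph, user agent)} pairs that stand out in their subgraph.

    counts -- {subgraph Qid: {user agent: monthly query count}}
    """
    selected = set()
    for subgraph, by_ua in counts.items():
        total = sum(by_ua.values())
        if total == 0:
            continue
        ranked = sorted(by_ua.items(), key=lambda kv: kv[1], reverse=True)
        top_ua, top_count = ranked[0]
        runner_up_count = ranked[1][1] if len(ranked) > 1 else 0
        # Gap between the top two user agents, in percentage points.
        gap = 100 * (top_count - runner_up_count) / total
        if gap >= min_gap_points and top_count >= min_queries:
            selected.add((subgraph, top_ua))
    return selected
```

A different reading of the criterion (for example, comparing against the mean of all other user agents) would only change the `gap` computation.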
Percentages below are percent of all monthly queries.
For reference:
Subgraph connectivity through queries
Subgraph connectivity was explored to some extent using only Wikidata in Wikidata_Subgraph_Analysis. This was based on what items or properties were common between subgraphs and how many direct connections were present between them. A visualization was created to show the strength of this connectivity between subgraphs here: wikidata_graph. This section aims to analyze the connectivity of subgraphs through the queries, i.e., how often subgraphs are queried together.
- Subgraph queries: The total number of queries that touch on at least one of the top 341 subgraphs is 72% of all queries.
- First we look at how many subgraphs most queries access. The tables below show the smallest and largest query groups by number of subgraphs accessed.
- 70% of all queries (97% of subgraph queries) touch on 1 or 2 subgraphs. 64% of all queries (90% of subgraph queries) touch on only 1 subgraph.
File:NumQuery vs numSubgraph.png
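The two kinds of percentages quoted above (percent of all queries versus percent of subgraph queries) can be derived from a per-query list of matched subgraphs, as in this minimal sketch. The input format, one set of subgraph Qids per query with an empty set for queries that match no subgraph, is an assumption for illustration.

```python
from collections import Counter


def subgraph_count_distribution(query_subgraphs):
    """Count queries by the number of subgraphs they touch."""
    return Counter(len(subgraphs) for subgraphs in query_subgraphs)


def percent_touching_up_to(distribution, k):
    """Percent of queries touching between 1 and k subgraphs.

    Returns (percent of all queries, percent of subgraph queries), where
    subgraph queries are those touching at least one subgraph.
    """
    total = sum(distribution.values())
    subgraph_total = total - distribution.get(0, 0)
    hits = sum(n for size, n in distribution.items() if 1 <= size <= k)
    return 100 * hits / total, 100 * hits / subgraph_total
```

With the figures in the text, 64% of all queries out of the 72% that touch any subgraph is roughly 89% of subgraph queries, and 70% out of 72% is roughly 97%.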
- It is hard to see which subgraphs occur together from the data above. So the subgraphs that occurred together were broken into pairs, and the pairs of subgraphs that occur together the most were listed.
- There are 57,970 subgraph pairs that occur together in queries. The total possible subgraph pair count is (340*341)/2 = 57,970. This shows that every subgraph is connected to every other subgraph through queries! Of course, the number of queries per pair varies widely.
- A list of some of the most queried subgraph pairs is shown below.
Subgraph 1 | Subgraph 1 label | Subgraph 2 | Subgraph 2 label | # of queries | % of queries |
---|---|---|---|---|---|
Q101352 | family name | Q5 | human | 4935675 | 2.233 |
Q4830453 | business | Q6881511 | enterprise | 883757 | 0.4 |
Q11424 | film | Q5 | human | 771698 | 0.349 |
Q4830453 | business | Q891723 | public company | 735902 | 0.333 |
Q3305213 | painting | Q4167410 | Wikimedia disambiguation page | 629633 | 0.285 |
Q4164871 | position | Q5 | human | 541257 | 0.245 |
Q47461344 | written work | Q732577 | publication | 493402 | 0.223 |
Q11424 | film | Q14204246 | Wikimedia project page | 483338 | 0.219 |
Q6881511 | enterprise | Q891723 | public company | 480426 | 0.217 |
Q4167410 | Wikimedia disambiguation page | Q5 | human | 466217 | 0.211 |
Q14204246 | Wikimedia project page | Q4167410 | Wikimedia disambiguation page | 436192 | 0.197 |
Q13406463 | Wikimedia list article | Q5 | human | 394815 | 0.179 |
Q4830453 | business | Q5 | human | 354945 | 0.161 |
Q13442814 | scholarly article | Q4167410 | Wikimedia disambiguation page | 316720 | 0.143 |
Q13442814 | scholarly article | Q5 | human | 282237 | 0.128 |
Q13406463 | Wikimedia list article | Q18340514 | events in a specific year or time period | 274841 | 0.124 |
Q3331189 | version, edition, or translation | Q5 | human | 273761 | 0.124 |
Q571 | book | Q5 | human | 259234 | 0.117 |
Q16521 | taxon | Q5 | human | 222118 | 0.1 |
Q4167410 | Wikimedia disambiguation page | Q811979 | architectural structure | 204572 | 0.093 |
Q4167410 | Wikimedia disambiguation page | Q838948 | work of art | 200810 | 0.091 |
Q5398426 | television series | Q5 | human | 197997 | 0.09 |
Q47461344 | written work | Q5 | human | 194750 | 0.088 |
Q43229 | organization | Q4830453 | business | 179640 | 0.081 |
Q5 | human | Q6881511 | enterprise | 172486 | 0.078 |
Q43229 | organization | Q5 | human | 171567 | 0.078 |
Q2225692 | fourth-level administrative division in Indonesia | Q532 | village | 171086 | 0.077 |
Q215380 | musical group | Q5 | human | 168318 | 0.076 |
Q15632617 | fictional human | Q5 | human | 163992 | 0.074 |
Q3305213 | painting | Q838948 | work of art | 161979 | 0.073 |
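Breaking co-occurring subgraphs into unordered pairs, and checking the number of possible pairs, can be sketched as below. The input format (one set of subgraph Qids per query) and the function name are assumptions; pairs are stored in sorted order so that (A, B) and (B, A) are counted as the same pair.

```python
from collections import Counter
from itertools import combinations
from math import comb


def pair_counts(query_subgraphs):
    """Count how many queries touch each unordered pair of subgraphs."""
    counts = Counter()
    for subgraphs in query_subgraphs:
        # sorted() gives each pair a canonical order, so (A, B) == (B, A).
        for pair in combinations(sorted(subgraphs), 2):
            counts[pair] += 1
    return counts


# Number of distinct unordered pairs among 341 subgraphs: 341 * 340 / 2.
possible_pairs = comb(341, 2)
```

A query touching three subgraphs contributes one count to each of its three pairs, so the pair counts can add up to more than the number of queries.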
- The distribution of the number of times each subgraph pair in Wikidata occurs in queries is shown below. Note that the (A,B) pair is the same as the (B,A) pair, so there is no duplication in the plots. Since the plot is extremely skewed, three plots with various limits on the number of occurrences are shown. Only a small number of pairs occur together frequently (these can be seen in the table above), whereas a huge number of pairs occur together only a few times.
- Below is a heatmap of the number of queries, where both the x and y axes represent subgraph indices (names of subgraphs are not shown due to space).
- The diagonal shows queries that use only 1 subgraph; these are represented as Q5-Q5 or Q42-Q42, for example. Others are represented as Q5-Q42 or Q42-Q5.
- The plot is symmetric.
- The many vertical and horizontal lines indicate that there are lots of subgraphs that pair with many other subgraphs.
File:Subgraph pair heatmap.png
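The matrix behind such a heatmap could be assembled as in the sketch below, following the conventions described above: single-subgraph queries go on the diagonal, and every co-occurring pair is written to both (i, j) and (j, i), which makes the matrix symmetric. The input format and the `index` mapping from subgraph Qid to row/column position are assumptions for illustration.

```python
from itertools import combinations


def pair_matrix(query_subgraphs, index):
    """Build a symmetric query-count matrix over subgraphs.

    query_subgraphs -- iterable of sets of subgraph Qids, one set per query
    index           -- {subgraph Qid: row/column position}
    """
    n = len(index)
    matrix = [[0] * n for _ in range(n)]
    for subgraphs in query_subgraphs:
        positions = sorted(index[s] for s in subgraphs if s in index)
        if len(positions) == 1:
            # Queries using only one subgraph land on the diagonal.
            i = positions[0]
            matrix[i][i] += 1
        else:
            for i, j in combinations(positions, 2):
                matrix[i][j] += 1
                matrix[j][i] += 1
    return matrix
```

The resulting list of lists can be passed to any plotting library that renders a 2D array as a heatmap.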