User:AKhatun/Wikidata Subgraph Query Analysis
Revision as of 15:52, 2 December 2021
Analysis on Subgraphs in Wikidata showed how large each of the subgraphs in Wikidata is and how connected they are. This page shows the results of an analysis of the queries that relate to these subgraphs. The questions that needed to be answered were:
- How many queries (and what percentage of all queries) access each subgraph?
- How many queries access multiple subgraphs at once? I.e., how much overlap can we expect between subgraphs?
- How long do these queries take?
- How many user agents access each subgraph? Do they access many subgraphs, or are they confined to a small set of subgraphs? Do some of them dominate the queries in multiple subgraphs?
- Are there chunks of similar queries in these subgraphs? I.e., how diverse are the queries in each subgraph?
We define some parameters to identify whether a query touches on a subgraph, based on the items and properties the query uses. Some queries may touch on multiple subgraphs. See more on what a subgraph means here. Note: Subgraphs have overlaps.
The parameters that define which subgraph a query belongs to are:
- If the query uses the subgraph's Qid. Example: queries containing Q5 are part of the Q5 subgraph.
- If the query uses items that are instance of a particular subgraph.
- If the query uses items that occur 99% of the time in a particular subgraph.
- If the query uses properties that occur 99% of the time in a particular subgraph.
- If the query uses literals that occur 99% of the time in a particular subgraph. The literals can occur with or without language tags; both versions are compared to check for a match. Note that whole literals are matched between queries and Wikidata. Queries that ask for partial matches, using regex for example, are not included. The assumption is that such queries are more likely to contain other items from a subgraph and are caught anyway. This of course excludes generic queries that search the entirety of Wikidata for a certain label or other string.
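The matching rules above can be sketched in code. This is a minimal illustration, not the analysis's actual implementation: the `SubgraphProfile` container, the regular expressions, and the `match_reasons` function are all hypothetical, and the sketch assumes the per-subgraph sets (instance items, and the items, properties, and literals that occur 99% of the time in the subgraph) have already been computed from the Wikidata dump.

```python
import re
from dataclasses import dataclass, field

@dataclass
class SubgraphProfile:
    """Hypothetical container holding the precomputed match sets of one subgraph."""
    qid: str
    instance_items: set = field(default_factory=set)       # items that are instance of the subgraph
    frequent_items: set = field(default_factory=set)       # items occurring 99% of the time in it
    frequent_properties: set = field(default_factory=set)  # properties occurring 99% of the time in it
    frequent_literals: set = field(default_factory=set)    # whole literals, e.g. 'text' or 'text@en'

# Simplified patterns (assumptions): prefixed items, prefixed properties, quoted literals.
ITEM_RE = re.compile(r"\bwd:(Q\d+)")
PROP_RE = re.compile(r"\b(?:wdt|p|ps|pq):(P\d+)")
LITERAL_RE = re.compile(r'"([^"]*)"(?:@([A-Za-z-]+))?')

def match_reasons(query: str, profile: SubgraphProfile) -> set:
    """Return which of the five parameters tie this query to the subgraph."""
    items = set(ITEM_RE.findall(query))
    props = set(PROP_RE.findall(query))
    literals = set()
    for text, lang in LITERAL_RE.findall(query):
        literals.add(text)                  # version without language tag
        if lang:
            literals.add(f"{text}@{lang}")  # version with language tag

    reasons = set()
    if profile.qid in items:
        reasons.add("qid")
    if items & profile.instance_items:
        reasons.add("instance_items")
    if items & profile.frequent_items:
        reasons.add("items")
    if props & profile.frequent_properties:
        reasons.add("properties")
    if literals & profile.frequent_literals:
        reasons.add("literals")
    return reasons

# Made-up profile and queries, for illustration only.
human = SubgraphProfile(
    qid="Q5",
    frequent_properties={"P569"},
    frequent_literals={"Douglas Adams"},
)
q1 = 'SELECT ?h WHERE { ?h wdt:P31 wd:Q5 ; rdfs:label "Douglas Adams"@en }'
q2 = "SELECT ?x WHERE { ?x wdt:P569 ?d }"
print(match_reasons(q1, human))
print(match_reasons(q2, human))
```

A query belongs to a subgraph when the returned set is non-empty. Running the function against every subgraph profile also shows which queries touch several subgraphs at once.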
The following analysis uses the Wikidata dump of 20211101 and the WDQS public SPARQL queries of 10/2021. All query-related values below are monthly counts.
Query count and time
- All queries here refer to queries with status code 200 or 500, i.e., valid queries that either succeeded or timed out.
- WDQS receives ~220M queries a month.
- Total query time for all queries for a month is ~16,000 hours.
The table below lists the top 50 most queried subgraphs with subgraph size and query time information. A breakdown of what caused the match is also present, which corresponds to the parameters mentioned in #What are subgraph related queries. It also ranks the subgraphs by size, query count, and query time consumed.
A more complete list of 341 subgraphs, which together form ~90% of Wikidata triples, is available here: [csvlink|all_subgraph_data.csv]
|Subgraph rank by size||Subgraph rank by query count||Subgraph rank by query time||Subgraph||Subgraph label||% of triples||% of entities||Query count||% count of all queries||Query time (hr)||Fraction of total query time||% count of queries from Qid||% count of queries from instance items||% count of queries from items||% count of queries from properties||% count of queries from literals|
|7||6||9||Q4167410||Wikimedia disambiguation page||1.374||1.459||3,737,550||1.691||223||0.014||0.195||0.484||0.554||0.0||0.938|
|20||11||22||Q13406463||Wikimedia list article||0.252||0.352||1,283,160||0.58||73||0.005||0.018||0.409||0.357||0.0||0.048|
|243||13||24||Q14204246||Wikimedia project page||0.008||0.033||1,114,113||0.504||62||0.004||0.009||0.227||0.016||0.0||0.275|
|26||15||29||Q484170||commune of France||0.18||0.043||866,766||0.392||46||0.003||0.006||0.278||0.004||0.098||0.007|
|138||28||57||Q18340514||events in a specific year or time period||0.019||0.048||463,683||0.21||17||0.001||0.0||0.2||0.056||0.0||0.005|
|41||31||56||Q22808320||Wikimedia human name disambiguation page||0.078||0.075||433,986||0.196||17||0.001||0.0||0.174||0.154||0.0||0.001|
|37||33||32||Q3331189||version, edition, or translation||0.087||0.19||410,352||0.186||34||0.002||0.103||0.053||0.118||0.004||0.028|
|71||35||25||Q86850539||Whitaker's Latin frequency type C||0.048||0.011||355,247||0.161||56||0.003||0.0||0.0||0.0||0.0||0.16|
|49||37||167||Q2225692||fourth-level administrative division in Indonesia||0.07||0.088||344,964||0.156||5||0.0||0.0||0.147||0.098||0.0||0.009|
|112||39||76||Q476028||association football club||0.026||0.038||320,422||0.145||14||0.001||0.006||0.12||0.029||0.0||0.003|
|192||50||143||Q3464665||television series season||0.011||0.02||254,318||0.115||6||0.0||0.031||0.077||0.009||0.0||0.0|
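As a rough cross-check, the share columns can be recomputed from the monthly totals quoted earlier. The sketch below uses the approximate totals (~220M queries, ~16,000 hours), so results differ slightly from the table, which was presumably computed from exact totals. The function name `shares` is illustrative. Note that the table's time-share column matches these values when read as a fraction (0.014 is about 1.4%).

```python
# Approximate monthly totals quoted on this page (not exact values).
TOTAL_QUERIES = 220_000_000
TOTAL_HOURS = 16_000

# (subgraph, query count, query time in hours), copied from the table rows above.
rows = [
    ("Q4167410", 3_737_550, 223),
    ("Q13406463", 1_283_160, 73),
    ("Q484170", 866_766, 46),
]

def shares(count: int, hours: float) -> tuple:
    """Return (percent of all queries, percent of total query time)."""
    return 100 * count / TOTAL_QUERIES, 100 * hours / TOTAL_HOURS

for qid, count, hours in rows:
    pct_count, pct_time = shares(count, hours)
    print(f"{qid}: {pct_count:.3f}% of queries, {pct_time:.3f}% of query time")
```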
More on query time
The query time can be broken down into classes for better visualization. Below is a figure with the query class distribution (number of queries per query time class per subgraph) for the top 50 subgraphs. Some of the takeaways are:
- Most subgraphs have most of their queries in the range of 10-100ms.
- The second most common class is 100ms to 1s.
- The photograph subgraph has most of its queries (~150k) timed at 1-10s. Around 10 more subgraphs have a small number of queries (~10-20k) in this time range.
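The class breakdown described above can be sketched as a simple binning step. The classes 10-100ms, 100ms-1s, and 1-10s come from the text; the lowest and highest classes, the names `time_class` and `class_distribution`, and the sample records are assumptions made for illustration.

```python
from collections import Counter

# Exclusive upper bounds in milliseconds. The "<10ms" class and the
# ">=10s" catch-all are assumptions; the others are named in the text.
BOUNDS = [
    (10, "<10ms"),
    (100, "10-100ms"),
    (1_000, "100ms-1s"),
    (10_000, "1-10s"),
]

def time_class(ms: float) -> str:
    """Map a query time in milliseconds to its class label."""
    for upper, label in BOUNDS:
        if ms < upper:
            return label
    return ">=10s"

def class_distribution(records):
    """records: iterable of (subgraph, query_time_ms). Returns {subgraph: Counter}."""
    dist = {}
    for subgraph, ms in records:
        dist.setdefault(subgraph, Counter())[time_class(ms)] += 1
    return dist

# Made-up sample records, for illustration only.
sample = [("Q5", 42), ("Q5", 55), ("Q5", 310), ("Q125191", 2_500), ("Q125191", 4_000)]
for subgraph, counts in class_distribution(sample).items():
    print(subgraph, dict(counts))
```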
User agent
Analysis of user agents is an approximation because user-agent strings do not exactly represent distinct users. For example, many people use the same bot or script without changing the user agent, or the same person or bot uses multiple user-agent strings. Nevertheless, the available data gives a rough idea.
- Total number of unique user agents across all subgraphs: 981,180
- First, the subgraphs with the most and the fewest distinct user agents were listed. It seems that every subgraph has at least 10 user agents, so the large subgraphs are used by multiple users.
- The largest numbers of user agents are found in a variety of subgraph types; among them, the gene, protein, biological process, and molecular function subgraphs appear to be similar. It is possible that the same queries count toward several of these subgraphs. More on subgraph connectivity in #Subgraph Connectivity.
- There are 50 subgraphs with more than 1000 user agents, and 300 subgraphs with fewer than 1000 user agents. Most subgraphs are therefore not queried very widely. The distribution of user-agent counts below 1000 is shown in the figure below. It clearly shows the small number of user agents in most subgraphs.
- Next, the user agent vs query count distribution was analyzed for some of the top subgraphs. While user agent count gives us an idea of how many users may be using a subgraph, it is not clear whether all of them query the subgraph equally, or very few user agents perform most of the queries.
- ~30 out of 341 subgraphs have a single user agent that issues >=50% of all queries to that subgraph.
- 6 subgraphs have a user agent that issues around 80-90% of the queries.
- So the pattern of a single dominating query source is not widespread among subgraphs, but it is nonetheless present in a few subgraphs.
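The dominance check described in the list above can be sketched as follows. This is an illustrative approach rather than the code used for the analysis: it assumes one `(subgraph, user_agent)` record per query, and the function names and sample data are made up.

```python
from collections import Counter, defaultdict

def top_agent_shares(records, top_n=2):
    """records: iterable of (subgraph, user_agent), one pair per query.

    Returns {subgraph: [(user_agent, percent of that subgraph's queries), ...]}
    for the top_n most active user agents of each subgraph.
    """
    per_subgraph = defaultdict(Counter)
    for subgraph, agent in records:
        per_subgraph[subgraph][agent] += 1

    result = {}
    for subgraph, counts in per_subgraph.items():
        total = sum(counts.values())
        result[subgraph] = [
            (agent, 100 * n / total) for agent, n in counts.most_common(top_n)
        ]
    return result

def dominated(shares, threshold=50.0):
    """Subgraphs whose single most active user agent reaches the threshold."""
    return sorted(s for s, top in shares.items() if top and top[0][1] >= threshold)

# Made-up sample: one subgraph dominated by a bot, one evenly shared.
records = (
    [("Q5", "bot-a")] * 8
    + [("Q5", "bot-b")] * 2
    + [("Q11424", "x"), ("Q11424", "y"), ("Q11424", "z")]
)
shares_by_subgraph = top_agent_shares(records)
print(shares_by_subgraph["Q5"])
print(dominated(shares_by_subgraph))
```

The threshold of 50% mirrors the cut-off used in the list above. Raising it to 80 would isolate the few subgraphs where one user agent issues nearly all queries.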
The figure below shows the query percentages of the top 2 user agents for each of the 341 subgraphs. It indicates whether the top user agents dominate the queries of a subgraph.
The figure below shows 100 subgraphs with their user agent query usage distribution in percent. Usage greater than 50% is marked in red. A bird's-eye view of the plots shows that some subgraphs have a dominating user agent, while in most subgraphs 1 or 2 user agents issue the largest share of queries. The rest of the user agents form the long tail 10% of the distribution.