This page outlines a design and specific suggestions for testing Wikidata SPARQL queries. The tests will help evaluate alternatives to the Blazegraph backend and may establish a Wikidata SPARQL benchmark for the industry.
- Definition of multiple data sets exercising the SPARQL functions and complexities seen in actual Wikidata queries, as well as extensions, federated queries, and workloads
- Definition of specific INSERT, DELETE, CONSTRUCT and SELECT queries for performance and capabilities analysis
- Definition of read/write workloads for stress testing
- Tests of system characteristics and SPARQL compliance, and of system behavior under load
The design is based largely on insights from the following papers:
- An Analytical Study of Large SPARQL Query Logs
- Getting the Most out of Wikidata: Semantic Technology Usage in Wikipedia’s Knowledge Graph
- Navigating the Maze of Wikidata Query Logs
Also, the following analyses (conducted by members of the WDQS team) examined more recent data:
Testing SPARQL 1.1 and GeoSPARQL Compliance
Testing compliance to the SPARQL 1.1 specification (using the W3C test suite) will be accomplished using a modified form of the Tests for Triplestore (TFT) codebase. Details are provided on the Running TFT page.
GeoSPARQL testing will be accomplished similarly, and is also described on the Running TFT page.
Testing Wikidata-Specific Updates and Queries
This section describes the specific SPARQL language constructs (such as FILTER, OPTIONAL and GROUP BY) and the query and update patterns that will be tested. Testing will include federated and geospatial queries, as well as support for the label, GAS and MediaWiki local SERVICEs (or their evolution).
For SPARQL, tests will be defined to exercise:
- SELECT, ASK and CONSTRUCT queries, as well as INSERT and DELETE updates
- Language keywords
- Solution modifiers - Distinct, Limit, Offset, Order By
- Assignment operators - Bind, Values
- Algebraic operators - Filter, Union, Optional, Exists, Not Exists, Minus
- Aggregation operators - Count, Min/Max, Avg, Sum, Group By, Group_Concat, Sample, Having
- With both constants and variables in the triples
- With varying numbers of triples (from 1 to 50+)
- With combinations (co-occurrences) of the above language constructs
- Utilizing different property path lengths and structures
- For example, property paths of the form a*, ab*, ab*c, abc*, a|b, a*|b*, etc.
- Using different graph patterns
- From simple chains of nodes (such as a 'connected to' b 'connected to' c, e.g., a - b - c) to
- "Stars" (consisting of a set of nodes where there is only 1 path between any 2 nodes and at most one node can have more than 2 neighbors - for example, a - b and c - b together with b - d - e - f, where only b has more than 2 neighbors) to
- "Trees" (consisting of a set of nodes where there is only 1 path between any 2 nodes, e.g., a collection of stars) to
- "Petals" (where there may be multiple paths between 2 nodes - for example, a - b - c or a - z - c defines two paths from a to c) to
- "Flowers" (which have chains + trees + petals) to
- "Bouquets" (which have component flowers)
- Using the terminology of Navigating the Maze of Wikidata Query Logs
- Cold-start and warm-start scenarios (to understand the effects of caching)
- Mixes of highly selective, equally selective and non-selective triples (to understand optimization)
- Small and large result sets, some with the potential for large intermediate result sets
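As an illustration of how the property path forms listed above could be turned into concrete test queries, here is a minimal Python sketch. The mapping of the letters a, b and c to specific Wikidata properties, the template text, and the function names are assumptions made for this example only; they are not part of the test design.

```python
from string import Template

# Hypothetical mapping from the abstract path letters to Wikidata properties.
PATH_LETTERS = {
    "a": "wdt:P31",   # instance of
    "b": "wdt:P279",  # subclass of
    "c": "wdt:P361",  # part of
}

# Hypothetical query template; $entity, $path and $limit vary per test.
QUERY_TEMPLATE = Template(
    "PREFIX wd: <http://www.wikidata.org/entity/>\n"
    "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
    "SELECT ?target WHERE { wd:$entity $path ?target } LIMIT $limit"
)


def to_sparql_path(abstract: str) -> str:
    """Translate a form such as 'ab*c' or 'a*|b*' into SPARQL 1.1 path syntax."""
    alternatives = []
    for branch in abstract.split("|"):
        steps = []
        for ch in branch:
            if ch == "*":
                steps[-1] += "*"  # zero-or-more applies to the preceding step
            else:
                steps.append(PATH_LETTERS[ch])
        alternatives.append("/".join(steps))  # '/' is the sequence operator
    return "|".join(alternatives)  # '|' is the alternative operator


def build_query(abstract_path: str, entity: str, limit: int = 100) -> str:
    """Fill the template for one entity and one abstract path form."""
    return QUERY_TEMPLATE.substitute(
        entity=entity, path=to_sparql_path(abstract_path), limit=limit
    )


if __name__ == "__main__":
    for form in ["a*", "ab*", "ab*c", "abc*", "a|b", "a*|b*"]:
        print(build_query(form, "Q42", limit=10))
        print()
```

Varying the entity argument gives different selectivity for the same path shape, which is one way the same template could serve both small and large result set tests.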
The tests will be defined both as static queries and as query templates (the latter allowing the selected entities to vary). They will be executed in batches and the following statistics collected per query:
- Execution time (longest, shortest, average) or timeout
- Execution time standard deviation (to understand variability)
- Correctness and completeness of response/update
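The timing statistics above could be computed per query along the following lines. This is only a sketch: the `QueryStats` structure, the `summarize` function, and the convention of recording a timed-out run as `None` are assumptions for illustration. Checking correctness and completeness would need expected results per query and is not shown.

```python
import statistics
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class QueryStats:
    runs: int
    timeouts: int
    shortest: Optional[float]
    longest: Optional[float]
    average: Optional[float]
    stdev: Optional[float]


def summarize(timings: Sequence[Optional[float]]) -> QueryStats:
    """Summarize one query's batch of runs; None marks a run that timed out."""
    completed = [t for t in timings if t is not None]
    timeouts = len(timings) - len(completed)
    if not completed:
        return QueryStats(len(timings), timeouts, None, None, None, None)
    # Sample standard deviation needs at least two data points.
    stdev = statistics.stdev(completed) if len(completed) > 1 else 0.0
    return QueryStats(
        runs=len(timings),
        timeouts=timeouts,
        shortest=min(completed),
        longest=max(completed),
        average=statistics.mean(completed),
        stdev=stdev,
    )


if __name__ == "__main__":
    # Execution times in seconds for five runs of one query, one of them a timeout.
    print(summarize([0.42, 0.38, None, 0.51, 0.40]))
```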
Specific test details TBD
The stress-test evaluation will use combinations of the above queries and updates, with the proportions of different query complexities based on these investigations:
The load levels will be based on the:
- Highest (+ some configurable percentage) and lowest number of "queries per second" (for a single server)
- As captured on the WDQS queries dashboard
- Highest (+ some configurable percentage) and lowest "triples ingestion rate" for added and deleted triples (for a single server)
- As captured on the Streaming Updater dashboard
Note that these workloads reflect both user and bot queries.
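A minimal sketch of how the low and high load targets might be derived from observed rates follows. The function name, the sample figures, and the choice to apply the configurable percentage only to the observed peak are assumptions; real values would come from the dashboards mentioned above.

```python
from typing import Sequence, Tuple


def load_targets(observed_rates: Sequence[float], headroom_pct: float = 0.0) -> Tuple[float, float]:
    """Return (low, high) target rates.

    low  = lowest observed rate
    high = highest observed rate plus a configurable percentage of headroom
    """
    if not observed_rates:
        raise ValueError("at least one observed rate is required")
    low = min(observed_rates)
    high = max(observed_rates) * (1 + headroom_pct / 100)
    return low, high


if __name__ == "__main__":
    # Hypothetical queries-per-second samples for a single server.
    qps_samples = [10.0, 40.0, 25.0]
    print(load_targets(qps_samples, headroom_pct=25))
```

The same function could be applied separately to the added-triples and deleted-triples ingestion rates.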
For each stress test iteration, the following will be reported:
- Total execution time
- Mean, geometric mean and standard deviation across the individual queries
- Number of queries that executed and completed, and their times
- Number of queries that timed out
- Number of results for queries that completed
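The per-iteration report could be assembled roughly as follows. This sketch assumes each query outcome is recorded as an `(elapsed_seconds, result_count)` pair with `None` elapsed time for a timeout; that record shape, the function name, and the dictionary keys are illustrative. It relies on `statistics.geometric_mean`, which is available from Python 3.8 and requires positive inputs.

```python
import statistics
from typing import Dict, Optional, Sequence, Tuple

Outcome = Tuple[Optional[float], Optional[int]]  # (elapsed seconds, result count)


def iteration_report(outcomes: Sequence[Outcome], total_seconds: float) -> Dict[str, object]:
    """Summarize one stress-test iteration; elapsed None means the query timed out."""
    times = [t for t, _ in outcomes if t is not None]
    counts = [n for t, n in outcomes if t is not None]
    report: Dict[str, object] = {
        "total_execution_time": total_seconds,
        "completed": len(times),
        "timed_out": len(outcomes) - len(times),
        "completed_times": times,
        "result_counts": counts,
        "mean": None,
        "geometric_mean": None,
        "stdev": None,
    }
    if times:
        report["mean"] = statistics.mean(times)
        # The geometric mean is less sensitive to a few very slow queries.
        report["geometric_mean"] = statistics.geometric_mean(times)
        report["stdev"] = statistics.stdev(times) if len(times) > 1 else 0.0
    return report


if __name__ == "__main__":
    sample = [(1.0, 5), (4.0, 10), (None, None)]
    print(iteration_report(sample, total_seconds=6.0))
```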
At least 10 iterations, each lasting 30 minutes to 1 hour, are recommended.
Specific workload details TBD
TBD ... The test infrastructure will utilize one or more of the existing frameworks.
TBD - Full Wikidata dump + subsets with determination of query loads/second and adds/deletes per second
Background on SPARQL Benchmarks
See this page.