
User:AndreaWest/WDQS Testing: Difference between revisions

imported>AndreaWest
imported>AndreaWest
(Minor reorganization of the text)
Line 20: Line 20:
Testing compliance to the [https://www.w3.org/TR/sparql11-query/ SPARQL 1.1 specification] (using the [https://www.w3.org/2009/sparql/docs/tests/ W3C test suite]) will be accomplished using a modified form of the Tests for Triplestore (TFT) codebase. Details are provided on the [https://wikitech.wikimedia.org/w/index.php?title=User:AndreaWest/WDQS_Testing/Running_TFT Running TFT] page.
Testing compliance to the [https://www.w3.org/TR/sparql11-query/ SPARQL 1.1 specification] (using the [https://www.w3.org/2009/sparql/docs/tests/ W3C test suite]) will be accomplished using a modified form of the Tests for Triplestore (TFT) codebase. Details are provided on the [https://wikitech.wikimedia.org/w/index.php?title=User:AndreaWest/WDQS_Testing/Running_TFT Running TFT] page.


GeoSPARQL testing will be accomplished similarly, and is also described on the Running TFT page.
GeoSPARQL testing will be accomplished similarly, and is also described on that same page.


== Testing Wikidata-Specific Updates and Queries ==
== Testing Wikidata-Specific Updates and Queries ==
Line 72: Line 72:
For each stress test iteration, the following will be reported:  
For each stress test iteration, the following will be reported:  
* Total execution time  
* Total execution time  
* Mean, geometric mean and standard deviation across the individual queries
* Minimum and maximum execution times for queries that successfully executed
* Mean and geometric mean of the queries that successfully executed
* Mean and geometric mean of all queries using their execution times if success or a penalized amount (= timeout by default) for failed queries
** To adjust for queries completing quickly due to errors (e.g., they will have a low execution time but not produce results)
* Number of queries that executed and completed, and their times
* Number of queries that executed and completed, and their times
* Number of queries that timed out
* Number of queries that timed out
* Number of results for queries that completed
* Largest number of results for queries
Recommend at least 10 iterations of 30 minutes to 1 hour duration.
Recommend at least 10 iterations of 30 minutes to 1 hour duration.


Specific workload details TBD
Specific workload details TBD


== Test Infrastructure ==  
== Stress Test Infrastructure ==  
TBD ... The test infrastructure will utilize one or more of the existing [https://wikitech.wikimedia.org/wiki/User:AndreaWest/Background_on_SPARQL_Benchmarks#Test_Frameworks frameworks].
The stress test infrastructure will extend the existing Iguana framework. See the [https://wikitech.wikimedia.org/wiki/User:AndreaWest/WDQS_Testing/Running_Iguana/ Running Iguana page] for more details.
 
=== Background on SPARQL Benchmarks ===
For background on existing SPARQL benchmarks and test frameworks, see [https://wikitech.wikimedia.org/wiki/User:AndreaWest/Background_on_SPARQL_Benchmarks#Test_Frameworks_and_Tools this page].


== Test Data ==
== Test Data ==
TBD - Full Wikidata dump + subsets with determination of query loads/second and adds/deletes per second
TBD - Full Wikidata dump + subsets with determination of query loads/second and adds/deletes per second
== Background on SPARQL Benchmarks ==
See [https://wikitech.wikimedia.org/wiki/User:AndreaWest/Background_on_SPARQL_Benchmarks#Test_Frameworks_and_Tools this page].

Revision as of 14:49, 1 June 2022

This page outlines a design and specific suggestions for Wikidata SPARQL query testing. These tests will be useful for evaluating Blazegraph backend alternatives and (possibly) for establishing a Wikidata SPARQL benchmark for the industry.

Goals

  • Definition of multiple data sets exercising the SPARQL functions and complexities seen in actual Wikidata queries, as well as extensions, federated queries, and workloads
    • Definition of specific INSERT, DELETE, CONSTRUCT and SELECT queries for performance and capabilities analysis
    • Definition of read/write workloads for stress testing
    • Tests of system characteristics and SPARQL compliance, and evaluation of system behavior under load

Test Design

Design based on insights gathered (largely) from the following papers:

Also, the following analyses (conducted by members of the WDQS team) examined more recent data:

Testing SPARQL 1.1 and GeoSPARQL Compliance

Compliance with the SPARQL 1.1 specification will be tested (using the W3C test suite) with a modified form of the Tests for Triplestore (TFT) codebase. Details are provided on the Running TFT page.

GeoSPARQL testing will be accomplished similarly, and is also described on that same page.

Testing Wikidata-Specific Updates and Queries

This section expands on the specific SPARQL language constructs (such as FILTER, OPTIONAL, GROUP BY, ...), and query and update patterns that will be tested. Testing will include federated and geospatial queries, and support for the (evolution of the) label, GAS and MediaWiki local SERVICEs.

With regard to SPARQL, tests will be defined to exercise:

  • SELECT, ASK and CONSTRUCT queries, as well as INSERT and DELETE updates
  • Language keywords
    • Solution modifiers - Distinct, Limit, Offset, Order By
    • Assignment operators - Bind, Values
    • Algebraic operators - Filter, Union, Optional, Exists, Not Exists, Minus
    • Aggregation operators - Count, Min/Max, Avg, Sum, Group By, Group_Concat, Sample, Having
  • Subqueries
  • Triples with both constants and variables
  • Varying numbers of triples (from 1 to 50+)
  • Combinations (co-occurrences) of the above language constructs
  • Different property path lengths and structures
    • For example, property paths of the form a*, ab*, ab*c, abc*, a|b, a*|b*, etc.
  • Different graph patterns
    • From simple chains of nodes (such as a 'connected to' b 'connected to' c, e.g., a - b - c) to
    • "Stars" (consisting of a set of nodes where there is only 1 path between any 2 nodes and at most one node can have more than 2 neighbors - for example, a 'connected to' b + c also 'connected to' b + b - d - e - f)
    • "Trees" (consisting of a set of nodes where there is only 1 path between any 2 nodes, e.g., a collection of stars) to
    • "Petals" (where there may be multiple paths between 2 nodes - for example, a - b - c or a - z - c defines two paths from a to c) to
    • "Flowers" (which have chains + trees + petals) to
    • "Bouquets" (which have component flowers)
    • These shape names follow the terminology of "Navigating the Maze of Wikidata Query Logs"
  • Cold-start and warm-start scenarios (to understand the effects of caching)
  • Mixes of highly selective, equally selective and non-selective triples (to understand optimization)
  • Small and large result sets, some with the potential for large intermediate result sets
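
The graph pattern definitions above can be turned into a simple check, which may help when tagging test queries by shape. The following is a minimal sketch, not part of the planned test tooling: the function name `classify_shape` and the edge-list input are assumptions made for illustration, each triple is treated as an undirected edge between its subject and object, and anything containing multiple paths between two nodes is reported only as "cyclic" (petals, flowers and bouquets are not told apart).

```python
from collections import defaultdict


def classify_shape(edges):
    """Classify an undirected query graph given as (node, node) pairs.

    Returns "chain", "star", "tree", "cyclic", "disconnected" or "empty".
    """
    edges = list(edges)
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    nodes = list(neighbors)
    if not nodes:
        return "empty"

    # Depth-first search from an arbitrary node to test connectivity.
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        current = stack.pop()
        for nxt in neighbors[current]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    if len(seen) != len(nodes):
        return "disconnected"

    # A connected graph has exactly one path between any two nodes
    # only if it has (number of nodes - 1) edges. Parallel edges and
    # self-loops are counted, so they show up as "cyclic".
    if len(edges) != len(nodes) - 1:
        return "cyclic"

    # Count nodes with more than 2 neighbors.
    branching = sum(1 for adjacent in neighbors.values() if len(adjacent) > 2)
    if branching == 0:
        return "chain"
    if branching == 1:
        return "star"
    return "tree"
```

For the star example given above, `classify_shape([("a", "b"), ("c", "b"), ("b", "d"), ("d", "e"), ("e", "f")])` returns "star", because only b has more than 2 neighbors.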

The tests will be defined using both static queries and query templates (the latter allowing varying entity selections). They will be executed in batches and the following statistics collected per query:

  • Execution time (longest, shortest, average) or timeout
  • Execution time standard deviation (to understand variability)
  • Correctness and completeness of response/update
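
As an illustration of how a query template and the per-query timing statistics could fit together, here is a minimal sketch using only the Python standard library. The template text, the `instantiate` and `summarize` helpers, and the sample entity IDs are all hypothetical; the actual test definitions are still to be determined.

```python
import statistics
from string import Template

# Hypothetical template: $entity is replaced by a Wikidata item ID.
# The wd: and wdt: prefixes are assumed to be predefined by the endpoint.
TEMPLATE = Template("SELECT ?item WHERE { ?item wdt:P31 wd:$entity } LIMIT 10")


def instantiate(template, entities):
    """Produce one concrete query per entity ID."""
    return [template.substitute(entity=entity) for entity in entities]


def summarize(times_s, timeouts=0):
    """Summarize execution times (in seconds) of the runs that completed."""
    if not times_s:
        return {"runs": 0, "timeouts": timeouts}
    return {
        "runs": len(times_s),
        "timeouts": timeouts,
        "longest": max(times_s),
        "shortest": min(times_s),
        "average": statistics.mean(times_s),
        # Sample standard deviation needs at least two measurements.
        "stdev": statistics.stdev(times_s) if len(times_s) > 1 else 0.0,
    }
```

Calling `instantiate(TEMPLATE, ["Q5", "Q146"])` yields two queries that differ only in the selected entity, and `summarize` can then be applied to the measured times of each one.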

Specific test details TBD

Workload Testing

This evaluation will utilize combinations of the above queries/updates with the proportions of different query complexities defined based on these investigations:

The load will be based on:

  • The highest (plus a configurable percentage) and lowest number of "queries per second" (for a single server)
  • The highest (plus a configurable percentage) and lowest "triples ingestion rate", for both added and deleted triples (for a single server)
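
The "highest plus a configurable percentage" rule is simple arithmetic; a minimal sketch is shown below. The function name `load_levels`, the `headroom_pct` parameter and its default of 10 are placeholders chosen for illustration, not values taken from the WDQS analyses.

```python
def load_levels(observed_rates, headroom_pct=10.0):
    """Return (low, high) target rates from observed per-second rates.

    low  = the lowest observed rate
    high = the highest observed rate plus headroom_pct percent
    """
    if not observed_rates:
        raise ValueError("at least one observed rate is required")
    low = min(observed_rates)
    high = max(observed_rates) * (1 + headroom_pct / 100.0)
    return low, high
```

The same helper could be applied separately to queries per second, added triples per second and deleted triples per second.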

Note that these workloads reflect both user and bot queries.

For each stress test iteration, the following will be reported:

  • Total execution time
  • Minimum and maximum execution times for queries that successfully executed
  • Mean and geometric mean of the queries that successfully executed
  • Mean and geometric mean across all queries, using the actual execution time for each successful query and a penalized amount (by default, the timeout) for each failed query
    • This adjusts for queries that complete quickly because of errors (i.e., they have a low execution time but do not produce results)
  • Number of queries that executed and completed, and their times
  • Number of queries that timed out
  • Largest number of results for queries
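
The timing statistics listed above, including the penalized means, could be computed roughly as follows. This is a sketch under stated assumptions rather than the Iguana implementation: the `iteration_summary` name, the `(elapsed_seconds, succeeded)` input shape and the 60-second default timeout are placeholders, and all elapsed times are assumed to be strictly positive (a requirement of the geometric mean).

```python
import statistics


def iteration_summary(results, timeout_s=60.0, penalty_s=None):
    """Summarize one stress test iteration.

    results   -- list of (elapsed_seconds, succeeded) tuples
    timeout_s -- query timeout; placeholder default
    penalty_s -- time charged to a failed query (defaults to timeout_s)
    """
    penalty = timeout_s if penalty_s is None else penalty_s
    ok = [elapsed for elapsed, succeeded in results if succeeded]
    # Failed queries are charged the penalty instead of their (often
    # misleadingly short) measured time.
    penalized = [elapsed if succeeded else penalty
                 for elapsed, succeeded in results]

    summary = {"succeeded": len(ok), "failed": len(results) - len(ok)}
    if ok:
        summary["min_s"] = min(ok)
        summary["max_s"] = max(ok)
        summary["mean_s"] = statistics.mean(ok)
        summary["geomean_s"] = statistics.geometric_mean(ok)
    if penalized:
        summary["penalized_mean_s"] = statistics.mean(penalized)
        summary["penalized_geomean_s"] = statistics.geometric_mean(penalized)
    return summary
```

The geometric mean is less sensitive than the arithmetic mean to a few very slow queries, which is one reason benchmark reports often give both.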

At least 10 iterations, each lasting 30 minutes to 1 hour, are recommended.

Specific workload details TBD

Stress Test Infrastructure

The stress test infrastructure will extend the existing Iguana framework. See the Running Iguana page for more details.

Background on SPARQL Benchmarks

For background on existing SPARQL benchmarks and test frameworks, see this page.

Test Data

TBD - Full Wikidata dump + subsets with determination of query loads/second and adds/deletes per second