This page presents a design and specific suggestions for Wikidata SPARQL query testing. These tests will be useful for evaluating Blazegraph backend alternatives and (possibly) for establishing an industry-wide Wikidata SPARQL benchmark.

Goals

  • Definition of multiple data sets exercising the SPARQL functions and complexities seen in actual Wikidata queries, as well as extensions, federated query, and workloads
    • Definition of specific INSERT, DELETE, CONSTRUCT and SELECT queries for performance and capabilities analysis
    • Definition of read/write workloads for stress testing
    • Tests of system characteristics and SPARQL compliance, and evaluation of system behavior under load

Test Design

The design is based on insights gathered (largely) from the following papers:

Also, the following analyses (conducted by members of the WDQS team) examined more recent data:

Testing Wikidata-Specific Updates and Queries

Testing must address a wide variety of SPARQL language constructs (such as FILTER, OPTIONAL, GROUP BY, ...) as well as query and update patterns. It will include federated and geospatial queries, and support for the label, GAS and MediaWiki services (and their evolution).
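As one concrete illustration (a minimal sketch only; the identifiers wdt:P31 "instance of" and wd:Q146 "house cat" are real Wikidata IDs chosen purely as examples), a query exercising the label service might look like:

  # Sketch: a simple SELECT that exercises the WDQS label service
  SELECT ?item ?itemLabel WHERE {
    ?item wdt:P31 wd:Q146 .                                # instance of: house cat
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  }
  LIMIT 10

Analogous probes would target the geospatial services (SERVICE wikibase:around and wikibase:box) and the MediaWiki API service (SERVICE wikibase:mwapi).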

As regards SPARQL, tests will be defined to exercise the following (an illustrative query combining several of these constructs appears after the list):

  • SELECT, ASK and CONSTRUCT queries, as well as INSERT and DELETE updates
  • Language keywords
    • Solution modifiers - Distinct, Limit, Offset, Order By
    • Assignment operators - Bind, Values
    • Algebraic operators - Filter, Union, Optional, Exists, Not Exists, Minus
    • Aggregation operators - Count, Min/Max, Avg, Sum, Group By, Group_Concat, Sample, Having
  • Subqueries
  • With both constants and variables in the triples
  • With varying numbers of triples (from 1 to 50+)
  • With combinations (co-occurrences) of the above language constructs
  • Utilizing different property path lengths and structures
    • For example, property paths of the form, a*, ab*, ab*c, abc*, a|b, a*|b*, etc.
  • Using different graph patterns
    • From simple chains of nodes (such as a 'connected to' b 'connected to' c, e.g., a - b - c) to
    • "Stars" (consisting of a set of nodes where there is only 1 path between any 2 nodes and at most one node can have more than 2 neighbors - for example, a 'connected to' b + c also 'connected to' b + b - d - e - f)
    • "Trees" (consisting of a set of nodes where there is only 1 path between any 2 nodes, e.g., a collection of stars) to
    • "Petals" (where there may be multiple paths between 2 nodes - for example, a - b - c or a - z - c defines two paths from a to c) to
    • "Flowers" (which have chains + trees + petals) to
    • "Bouquets" (which have component flowers)
    • Using the terminology of Navigating the Maze of Wikidata Query Logs (https://hal.inria.fr/hal-02096714/document)
  • Cold-start and warm-start scenarios (to understand the effects of caching)
  • Mixes of highly selective, equally selective and non-selective triples (to understand optimization)
  • Small and large result sets, some with the potential for large intermediate result sets
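
To make these combinations concrete, one hypothetical template instance might pair a property path with OPTIONAL, FILTER EXISTS and aggregation (a sketch only; the entities here, e.g. wd:Q729 "animal", are placeholders the harness would vary):

  # Sketch: combines a property path (of the form ab*), OPTIONAL, FILTER EXISTS and aggregation
  SELECT ?class (COUNT(?item) AS ?count) WHERE {
    ?item wdt:P31/wdt:P279* ?class .               # path: instance of / subclass of*
    OPTIONAL { ?item wdt:P18 ?image . }            # image, if present
    FILTER EXISTS { ?class wdt:P279 wd:Q729 . }    # class is a direct subclass of 'animal'
  }
  GROUP BY ?class
  HAVING (COUNT(?item) > 10)
  ORDER BY DESC(?count)
  LIMIT 20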


The tests will be defined using query templates with varying entity selections, to avoid pre-defined, static queries (which are known in advance and for which a platform can be tuned); a minimal template sketch follows the list below. Queries will be executed in batches and the following statistics collected per query:

  • Execution time (longest, shortest, average) or time out
  • Execution time standard deviation (to understand variability)
  • Correctness and completeness of response/update
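
As a sketch of the templating idea, a template could carry a placeholder that the harness replaces with a different entity on each instantiation (the %ENTITY% marker is an assumption for illustration, not a defined format):

  # Sketch: %ENTITY% is substituted with a different item identifier per batch run
  SELECT ?prop ?value WHERE {
    wd:%ENTITY% ?prop ?value .
  }
  LIMIT 100

Instantiating the same template with entities of very different degrees (a few statements versus tens of thousands) also yields the desired mix of small and large result sets.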

Workload Testing

This evaluation will utilize combinations of the above queries/updates (TBD), as characterized by the actual workloads captured on the WDQS queries and Streaming Updater dashboards. These workloads reflect both user and bot queries.

For each test iteration, the following will be reported:

  • Total execution time
  • Mean, geometric mean and standard deviation across the individual queries (the geometric mean is defined after this list)
  • Number of queries that executed and completed, and their times
  • Number of queries that timed out
  • Number of results for queries that completed
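
For reference, the geometric mean of the individual query times t_1, ..., t_n is reported alongside the arithmetic mean because it dampens the influence of a few very slow queries:

  \bar{t}_{\mathrm{geo}} = \left( \prod_{i=1}^{n} t_i \right)^{1/n} = \exp\left( \frac{1}{n} \sum_{i=1}^{n} \ln t_i \right)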

Test Infrastructure

TBD ... The test infrastructure will likely utilize one or more of the existing frameworks (see https://wikitech.wikimedia.org/wiki/User:AndreaWest/Background_on_SPARQL_Benchmarks#Test_Frameworks).

Background on SPARQL Benchmarks

See Background on SPARQL Benchmarks (https://wikitech.wikimedia.org/wiki/User:AndreaWest/Background_on_SPARQL_Benchmarks).