User:AndreaWest/WDQS Testing

From Wikitech-static
Jump to navigation Jump to search
imported>AndreaWest
imported>AndreaWest
Line 50: Line 50:

This page presents a design and specific suggestions for Wikidata SPARQL query testing. These tests will be useful for evaluating Blazegraph backend alternatives and, possibly, for establishing an industry-wide Wikidata SPARQL benchmark.

Goals

  • Definition of multiple test sets exercising the SPARQL functions and complexities seen in actual Wikidata queries, as well as extensions, federated query, and workloads
    • Definition of specific INSERT, DELETE, CONSTRUCT and SELECT queries for performance and capabilities analysis
    • Definition of read/write workloads for stress testing
    • Tests of system characteristics and SPARQL compliance, and evaluation of system behavior under load

Test Design

Design based on insights gathered (largely) from the following papers:

Also, the following analyses (conducted by members of the WDQS team) examined more recent data:

Testing SPARQL 1.1 and GeoSPARQL Compliance

Testing compliance to the SPARQL 1.1 specification (using the W3C test suite) will be accomplished using a modified form of the Tests for Triplestore (TFT) codebase. Details are provided on the Running TFT page.

GeoSPARQL testing will be accomplished similarly, and is also described on that same page.

Testing Wikidata-Specific Updates and Queries

This section expands on the specific SPARQL language constructs (such as FILTER, OPTIONAL, GROUP BY, ...), and query and update patterns that will be tested. Testing includes federated and geospatial queries, and support for the (evolution of the) label, GAS and MediaWiki local SERVICEs.

As regards SPARQL, tests are defined to exercise the following (an illustrative query combining several of these constructs is sketched after the list):

  • SELECT, ASK and CONSTRUCT queries, as well as INSERT and DELETE updates
  • Language keywords
    • Solution modifiers - Distinct, Limit, Offset, Order By
    • Assignment operators - Bind, Values
    • Algebraic operators - Filter, Union, Optional, Exists, Not Exists, Minus
    • Aggregation operators - Count, Min/Max, Avg, Sum, Group By, Group_Concat, Sample, Having
  • Subqueries
  • With both constants and variables in the triples
  • With varying numbers of triples (from 1 to 50+)
  • With combinations (co-occurrences) of the above language constructs
  • Utilizing different property path lengths and structures
    • For example, property paths of the form a*, ab*, ab*c, abc*, a|b, a*|b*, etc.
  • Using different graph patterns
    • From simple chains of nodes (such as a 'connected to' b 'connected to' c, i.e., a - b - c) to
    • "Stars" (a set of nodes with only one path between any two nodes, where at most one node has more than two neighbors - for example, b connected to a, to c, and to the chain d - e - f) to
    • "Trees" (a set of nodes with only one path between any two nodes, e.g., a collection of stars) to
    • "Petals" (where there may be multiple paths between two nodes - for example, a - b - c and a - z - c define two paths from a to c) to
    • "Flowers" (which combine chains, trees and petals) to
    • "Bouquets" (which are composed of multiple flowers)
    • Using the terminology of Navigating the Maze of Wikidata Query Logs
  • Cold-start and warm-start scenarios (to understand the effects of caching)
  • Mixes of highly selective, equally selective and non-selective triples (to understand optimization)
  • Small and large result sets, some with the potential for large intermediate result sets
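To make these constructs concrete, here is an illustrative query (not one of the defined test queries) that combines a property path, OPTIONAL, FILTER NOT EXISTS, aggregation, HAVING and solution modifiers, run against the public WDQS endpoint. The sketch assumes only the standard https://query.wikidata.org/sparql endpoint and the Python requests library; the query itself is made up for illustration.

  # Illustrative sketch only - not one of the defined test queries.
  import requests

  WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

  QUERY = """
  SELECT ?country (COUNT(DISTINCT ?capital) AS ?capitals) WHERE {
    ?country wdt:P31/wdt:P279* wd:Q6256 .        # property path of the form ab*
    ?country wdt:P36 ?capital .
    OPTIONAL { ?capital wdt:P1082 ?population }  # OPTIONAL
    FILTER NOT EXISTS { ?country wdt:P576 ?end } # NOT EXISTS: skip dissolved countries
  }
  GROUP BY ?country                              # aggregation
  HAVING (COUNT(DISTINCT ?capital) > 1)          # HAVING over an aggregate
  ORDER BY DESC(?capitals)                       # solution modifiers
  LIMIT 10
  """

  response = requests.get(
      WDQS_ENDPOINT,
      params={"query": QUERY, "format": "json"},
      headers={"User-Agent": "WDQS-testing-sketch/0.1 (example)"},
      timeout=60,
  )
  response.raise_for_status()
  for row in response.json()["results"]["bindings"]:
      print(row["country"]["value"], row["capitals"]["value"])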

These tests for capabilities are defined using static queries. They will be executed using the updated TFT framework to evaluate a triple store's/endpoint's support (or lack of support) for each of the Wikidata requirements, as well as the correctness and completeness of the response. In addition, the tests will be executed using the modified Iguana framework to obtain an estimate of execution times.

The TFT compliance test definitions are stored in the xxx repository, which is included in the TFT code repository as another submodule. The corresponding test definitions for use in the Iguana framework are defined at yyy. Details coming.

Wikidata Triples for Compliance Testing

Since TFT (re)loads test data for every query, a small data set is used in this environment. It is available as the file wikidata-subset.nt, located in the data directory of the wikidata-tests repository.

Initially, a "small" set of Wikidata triples was created, based on subgraphs-5.csv, to sample triples for entities from every major Wikidata sub-graph. This was created as a test set for the work on Phabricator ticket T303831 (subgraph analysis). The information in the CSV file was processed using the Jupyter notebook Create_Wikidata_Sample.ipynb to create the query-triples.nt file. However, this data set alone was insufficient, since INSERT/DELETE DATA requests would also be processed during stress testing, and almost all of the deleted triples did not exist in query-triples.nt. This was not necessarily a problem, since deleting non-existent triples does not result in an error, but there was concern that the time to process a non-existent triple might differ from that for an existing one.

To address this, a 15-minute capture of the WDQS Streaming Updater JSON output was created. That output is captured in the file wikidata_update_stream_6k_edits_20220531.ndjson, in the wikidata-tests/notebooks directory. From the JSON, a sequence of added/deleted RDF triples was extracted and transformed into a series of SPARQL INSERT/DELETE DATA requests, using the same Jupyter notebook as noted above. The resulting SPARQL requests can be found in the file sparql-update.ru, also in the wikidata-tests/data directory.
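The transformation just described is implemented in the Jupyter notebook; purely as a hedged illustration of its shape, the sketch below turns a newline-delimited JSON capture into a sequence of INSERT DATA / DELETE DATA requests. The event field names (rdf_added_data, rdf_deleted_data) and the assumption that each holds a block of N-Triples text are hypothetical and should be checked against the actual capture file.

  # Hedged sketch: convert a Streaming Updater capture (.ndjson) into SPARQL
  # INSERT DATA / DELETE DATA requests. The field names below are assumptions,
  # not a documented schema; inspect the capture file and adjust before use.
  import json

  def stream_to_sparql(ndjson_path, output_path):
      operations = []
      with open(ndjson_path, encoding="utf-8") as stream:
          for line in stream:
              line = line.strip()
              if not line:
                  continue
              event = json.loads(line)
              deleted = event.get("rdf_deleted_data", "").strip()  # assumed field name
              added = event.get("rdf_added_data", "").strip()      # assumed field name
              if deleted:
                  operations.append("DELETE DATA {\n" + deleted + "\n};")
              if added:
                  operations.append("INSERT DATA {\n" + added + "\n};")
      with open(output_path, "w", encoding="utf-8") as out:
          out.write("\n".join(operations) + "\n")

  # Example, using the file names referenced on this page:
  # stream_to_sparql("wikidata_update_stream_6k_edits_20220531.ndjson", "sparql-update.ru")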

Although sparql-update.ru is not needed for compliance testing, it was used to create additional Wikidata triples to add to the "small" query test set. sparql-update.ru was parsed (again using the Jupyter notebook Create_Wikidata_Sample.ipynb) to extract the first occurrence of each deleted subject-predicate pair. In this way, new triples were created and then added to the "small" set from above, in order to populate the store with the actual triples that would be deleted when mimicking the Streaming Updater processing.
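As a rough picture of that extraction (the authoritative logic is in Create_Wikidata_Sample.ipynb), the sketch below records the first deleted triple per subject/predicate pair. It assumes sparql-update.ru lays out one N-Triples-style triple per line inside each DELETE DATA { ... } block, which is an assumption about the file layout, not a documented format.

  # Hedged sketch: keep the FIRST deleted triple for each subject/predicate pair
  # so that those triples can be added to the small test data set beforehand.
  def first_deleted_triples(update_path, nt_path):
      first_seen = {}          # (subject, predicate) -> full triple line
      in_delete = False
      with open(update_path, encoding="utf-8") as updates:
          for line in updates:
              stripped = line.strip()
              if stripped.upper().startswith("DELETE DATA"):
                  in_delete = True
              elif stripped.startswith("}"):
                  in_delete = False
              elif in_delete and stripped.endswith("."):
                  parts = stripped.split(None, 2)   # subject, predicate, rest
                  if len(parts) == 3:
                      first_seen.setdefault((parts[0], parts[1]), stripped)
      with open(nt_path, "w", encoding="utf-8") as out:
          out.write("\n".join(first_seen.values()) + "\n")

  # Example: first_deleted_triples("sparql-update.ru", "first-deleted.nt")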

The resulting data set is found in the file wikidata-subset.nt.

Workload Testing

This evaluation utilizes combinations of the above queries/updates (and others) with the proportions of different query complexities defined based on these investigations:

The loading will be based on the following (a small example of the headroom computation appears after this list):

  • Highest (+ some configurable percentage) and lowest number of "queries per second" (for a single server)
  • Highest (+ some configurable percentage) and lowest added and deleted "triples ingestion rates" (for a single server)
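As a small, purely numerical illustration of the "+ some configurable percentage" headroom (the 25% figure below is an arbitrary example, not a project decision):

  # Hedged sketch: derive high/low target load levels from observed rates.
  def target_loads(observed_qps, headroom_pct=25.0):
      high = max(observed_qps) * (1.0 + headroom_pct / 100.0)  # highest rate plus headroom
      low = min(observed_qps)                                  # lowest observed rate
      return high, low

  # e.g., target_loads([35.0, 80.0, 120.0]) returns (150.0, 35.0)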

Note that these workloads reflect both user and bot queries.

The tests are defined using query patterns based on the compliance queries from above, with additional updates and queries informed by the analyses. They will be executed using a modified version of the Iguana Framework on a single machine, using multiple "client" threads/workers. The following statistics will be reported:

  • Total execution time for each query overall and by worker
  • Minimum and maximum execution times for each query (if successful) by worker and across all workers
  • Mean and geometric mean of each query (that successfully executed) by worker and across all workers
  • Mean and geometric mean of each query using its execution time (if successful) or a penalized amount (= timeout by default) for failed queries (see the sketch after this list)
    • By query overall and by worker
    • This adjusts for queries completing quickly due to errors (e.g., they will have a low execution time but not produce results)
  • Number of queries overall that executed and completed by worker and dataset
  • Number of queries that timed out by worker and dataset
  • Average number of queries per second (across all queries) that can be processed by a triple store for the data set
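The penalized statistics above can be pictured with the short sketch below; the runtimes, the timeout value and the function names are made-up illustrations, and Iguana's own implementation remains the reference.

  # Hedged sketch: failed queries are charged a penalty (the timeout, by default)
  # instead of their raw runtime, so fast failures do not flatter the results.
  import math

  def penalized_runtimes(results, timeout_ms):
      """results: list of (runtime_ms, succeeded) pairs for one query."""
      return [runtime if ok else timeout_ms for runtime, ok in results]

  def mean(values):
      return sum(values) / len(values)

  def geometric_mean(values):
      return math.exp(sum(math.log(v) for v in values) / len(values))

  # Made-up example: three successful executions and one failure, 60-second timeout
  runs = [(120.0, True), (95.0, True), (15.0, False), (200.0, True)]
  penalized = penalized_runtimes(runs, timeout_ms=60000.0)
  print(mean(penalized), geometric_mean(penalized))        # penalized statistics
  print(mean([rt for rt, ok in runs if ok]))                # mean over successful runs only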

As above, the test details are defined at yyy. The stress test/workload environment assumes that the complete Wikidata RDF is loaded, and will be executed using the modified Iguana framework. More details coming.

Wikidata Triples for Stress Testing

Stress testing requires loading the complete set of Wikidata triples and then capturing the streaming updates that will be applied to it. Processing of the WDQS Streaming Updater JSON output can be handled using the functionality in the Jupyter notebook discussed above, Create_Wikidata_Sample.ipynb. (See the processing in the second code block of the notebook, which produces the sparql-update.txt file.)

However, there is still a problem scenario to address. There is no need to add triples to a complete Wikidata dump before the first execution of the stress tests (the dump is, after all, "complete"). A workload test, however, modifies the store by applying the Streaming Updater INSERT/DELETE data; those modifications must be reversed (some triples deleted, others reloaded) before another execution can be run against the same data store. Although the small Wikidata data set (wikidata-subset.nt) can simply be reloaded for TFT and Iguana compliance testing, the full data set cannot, due to its size.

Taking an approach similar to the processing for the Wikidata subset, it will be necessary to use the Streaming Updater INSERT/DELETE DATA specifics to do the following (a sketch of this bookkeeping follows the list):

  • Capture the last INSERTed triple for each subject/predicate pair - Which can then be removed from the data store after a test run
    • See the fifth code block of the Create_Wikidata_Sample.ipynb notebook
  • Capture the first set of DELETEd triples for each subject/predicate pair - Which can then be restored to the database after a test run
    • See the third code block of the Create_Wikidata_Sample.ipynb notebook
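One possible shape for that bookkeeping is sketched below; as with the earlier sketches, the authoritative logic is in the notebook's code blocks, and the one-triple-per-line layout of the update file is an assumption. The sketch writes a single hypothetical "undo" update that removes the last-inserted triples and restores the first-deleted ones after a run:

  # Hedged sketch: scan a sequence of INSERT DATA / DELETE DATA requests, keep the
  # LAST inserted and FIRST deleted triple per subject/predicate pair, and emit a
  # SPARQL update that undoes the net effect of a stress-test run.
  def build_undo_update(update_path, undo_path):
      last_inserted = {}   # (subject, predicate) -> triple line; last occurrence wins
      first_deleted = {}   # (subject, predicate) -> triple line; first occurrence wins
      mode = None
      with open(update_path, encoding="utf-8") as updates:
          for line in updates:
              stripped = line.strip()
              upper = stripped.upper()
              if upper.startswith("INSERT DATA"):
                  mode = "INSERT"
              elif upper.startswith("DELETE DATA"):
                  mode = "DELETE"
              elif stripped.startswith("}"):
                  mode = None
              elif mode and stripped.endswith("."):
                  parts = stripped.split(None, 2)   # subject, predicate, rest
                  if len(parts) == 3:
                      key = (parts[0], parts[1])
                      if mode == "INSERT":
                          last_inserted[key] = stripped
                      else:
                          first_deleted.setdefault(key, stripped)
      with open(undo_path, "w", encoding="utf-8") as out:
          out.write("DELETE DATA {\n" + "\n".join(last_inserted.values()) + "\n};\n")
          out.write("INSERT DATA {\n" + "\n".join(first_deleted.values()) + "\n};\n")

  # Example: build_undo_update("sparql-update.ru", "sparql-undo.ru")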

Testing the Evaluation Infrastructure

The full Wikidata dump will be used for evaluation testing. However, a small subset of Wikidata has been created as a test data set to evaluate the testing infrastructure. The details of that data set are described above, and its evaluation using a local Stardog installation is shown on the Testing the Testing subpage.