This page overviews a design and specific suggestions for Wikidata SPARQL query testing. These tests will be useful to evaluate Blazegraph backend alternatives and to (possibly) establish a Wikidata SPARQL benchmark for the industry.

== Goals ==
* Definition of multiple test sets exercising the SPARQL functions and complexities seen in actual Wikidata queries, as well as extensions, federated query, and workloads
** Definition of specific INSERT, DELETE, CONSTRUCT and SELECT queries for performance and capabilities analysis
** Definition of read/write workloads for stress testing
** Tests of system characteristics and SPARQL compliance, and evaluation of system behavior under load


== Test Design ==
The design is based on insights gathered (largely) from the following papers:
* [https://arxiv.org/abs/1708.00363 An Analytical Study of Large SPARQL Query Logs]
* [https://iccl.inf.tu-dresden.de/w/images/5/5a/Malyshev-et-al-Wikidata-SPARQL-ISWC-2018.pdf Getting the Most out of Wikidata: Semantic Technology Usage in Wikipedia’s Knowledge Graph]
* [https://hal.inria.fr/hal-02096714/document Navigating the Maze of Wikidata Query Logs]
 
Also, the following analyses (conducted by members of the WDQS team) examined more recent data:
* [https://wikitech.wikimedia.org/wiki/User:Joal/WDQS_Queries_Analysis WDQS Queries Analysis]
* Subpages linked from https://wikitech.wikimedia.org/wiki/User:AKhatun
 
== Testing SPARQL 1.1 and GeoSPARQL Compliance ==
Testing compliance with the [https://www.w3.org/TR/sparql11-query/ SPARQL 1.1 specification] (using the [https://www.w3.org/2009/sparql/docs/tests/ W3C test suite]) will be accomplished using a modified form of the Tests for Triplestore (TFT) codebase. Details are provided on the [https://wikitech.wikimedia.org/w/index.php?title=User:AndreaWest/WDQS_Testing/Running_TFT Running TFT] page.
 
GeoSPARQL testing will be accomplished similarly, and is also described on that same page.
 
== Testing Wikidata-Specific Updates and Queries ==
This section expands on the specific SPARQL language constructs (such as FILTER, OPTIONAL, GROUP BY, ...) and the query and update patterns that will be tested. Testing includes federated and geospatial queries, and direct comparisons using the rdfs:label, skos:altLabel and schema:description predicates.
 
'''NOTE''': Until specific support for the evolution of the wikibase:label, GAS and wikibase:mwapi (MediaWiki API) SERVICEs is added to a backend, these capabilities '''cannot''' be tested; they are non-standard SPARQL extensions. The goal of this testing is to perform an overall evaluation of the various backend alternatives in order to select one or two for additional investigation and development.
 
As regards SPARQL, tests are defined to exercise:
* SELECT, ASK, DESCRIBE and CONSTRUCT queries, as well as INSERT and DELETE updates
** Note that the INSERT/DELETE requests are defined from the Streaming Updater output, as discussed below
* Language keywords
** Solution modifiers - Distinct, Limit, Offset, Order By, Reduced
** Assignment operators - Bind, Values
** Algebraic operators - Filter, Union, Optional, Exists, Not Exists, Minus
** Aggregation operators - Count, Min/Max, Avg, Sum, Group By, Group_Concat, Sample, Having
* Subqueries
* Triples containing both constants and variables
* Varying numbers of triples (from 1 to 50+)
* Combinations (co-occurrences) of the above language constructs
* Different property path lengths and structures
** For example, property paths of the form a*, ab*, ab*c, abc*, a|b, a*|b*, etc.
* Using different graph patterns distinguished by the number of triples, variables and joined nodes (URIs and variables that are used in multiple triples as the subject or object) in the query, plus the largest number of joins for any of the joined entities (the "join degree") and the longest chain of triples from any subject to an object (including counting the number of sequential properties in a property path)
** For example, [https://wikitech.wikimedia.org/wiki/User:Joal/WDQS_Queries_Analysis#Query_class_1 this query] has three triples, two variables, one join node (?q) with the longest chain = 2 (due to the property path), and the largest join degree = 3 (related to ?q)
** As another example, [https://wikitech.wikimedia.org/wiki/User:Joal/WDQS_Queries_Analysis#Query_class_7 this query] has eleven triples (not counting the comments), ten variables, five join nodes (?article, ?person, ?country, ?givenName and ?familyName) with the longest chain = 3 (from ?article to ?person, ?person to ?country, and ?country to its ?countrylabel; similar result ending with ?givenName or ?familyName), and the largest join degree = 6 (related to ?person)
* Mixes of highly selective, equally selective and non-selective triples (to understand optimization)
* Small and large result sets, some with the potential for large intermediate result sets
 
These tests for capabilities are defined using static queries. They will be executed using the [https://wikitech.wikimedia.org/w/index.php?title=User:AndreaWest/WDQS_Testing/Running_TFT updated TFT framework] to evaluate a triple store's/endpoint's support (or lack of support) for each of the Wikidata requirements, as well as the correctness and completeness of the response. In addition, the tests will be run using the [https://wikitech.wikimedia.org/wiki/User:AndreaWest/WDQS_Testing/Running_Iguana modified Iguana framework] to obtain an estimate of execution times.
 
The TFT compliance test definitions are stored in the [https://github.com/AndreaWesterinen/wikidata-tests/tree/main/tft tft directory of the wikidata-tests GitHub repository], which is included in the [https://github.com/AndreaWesterinen/TFT TFT code repository] as another submodule. The corresponding test definitions for use in the Iguana framework are defined at yyy. Details coming.
 
Note that this set of queries is specifically designed to '''not cause timeouts'''. These tests are not stress tests but tests of functionality.
 
=== Wikidata Triples for Compliance Testing ===
Since TFT (re)loads test data for every query, a relatively small data set is used in this environment (~17M). It is available from the N-Triples files in the [https://github.com/AndreaWesterinen/wikidata-tests/blob/main/data data directory in the wikidata-tests repository].
 
Initially, several "small" sets of Wikidata triples were created (see the .csv files in the [https://github.com/AndreaWesterinen/wikidata-tests/blob/main/notebooks notebooks directory of the wikidata-tests repository]). Items were selected at random representing entities from the human, film, gene and scholarly articles Wikidata sub-graphs. For these entities, all triples were captured, including statements and metadata triples (such as site links). Some of these files were created as a test set for the work on [https://phabricator.wikimedia.org/T303831 Phabricator ticket T303831], subgraph analysis.
 
The information in the CSV files was manipulated using the Jupyter notebook, [https://github.com/AndreaWesterinen/wikidata-tests/blob/main/notebooks/Create_Wikidata_Sample.ipynb Create_Wikidata_Sample.ipynb], to create N-Triples outputs. (The CSV processing is defined in the sixth code block in the notebook.) However, this data set alone was insufficient, since INSERT/DELETE data requests also would be processed during testing. Almost all of the deleted triples did not exist in query-triples.nt. This is not necessarily a problem, since deleting non-existent triples does not result in an error, but there was concern that the time to process a non-existent (versus existing) triple might be different.
 
To address this, a 15-minute capture of the [https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater WDQS Streaming Updater] JSON output was created. That output is captured in [https://github.com/AndreaWesterinen/wikidata-tests/blob/main/notebooks/wikidata_update_stream_6k_edits_20220531.ndjson this file]. From the JSON, a sequence of RDF added/deleted triples was extracted and transformed into a series of SPARQL INSERT/DELETE DATA requests, using the same Jupyter notebook as noted above. The resulting SPARQL requests can be found in the file, [https://github.com/AndreaWesterinen/wikidata-tests/blob/main/data/sparql-update.txt sparql-update.txt], also in the wikidata-tests/data directory.
 
Although sparql-update.txt is not needed for compliance testing, it was used to create additional N-Triples to add to the "small" data set. sparql-update.txt was parsed (again using the Jupyter notebook, Create_Wikidata_Sample.ipynb) to extract the first occurrence of each deleted subject-predicate pair. In this way, new triples were created. This was done in order to populate the store with the actual triples that would be deleted when mimicking the Streaming Updater processing while testing the evaluation infrastructure. The file holding these new triples is named [https://github.com/AndreaWesterinen/wikidata-tests/blob/main/data/deleted_triples.nt deleted_triples.nt].
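For illustration, that extraction might look roughly like the sketch below. This is not the notebook's actual code; it assumes the DELETE DATA payloads are simple N-Triples statements and uses deliberately naive parsing.

<nowiki>import re

DELETE_BLOCK = re.compile(r"DELETE DATA\s*\{(.*?)\}", re.DOTALL)

def first_deleted_triples(update_file, out_file):
    # Keep only the first deleted triple seen for each subject-predicate pair.
    # Naive parsing: splitting on " ." would mishandle literals that contain " ."
    seen = set()
    with open(update_file) as updates, open(out_file, "w") as out:
        for line in updates:
            for block in DELETE_BLOCK.findall(line):
                for stmt in (s.strip() for s in block.split(" .") if s.strip()):
                    subject, predicate = stmt.split()[:2]
                    if (subject, predicate) not in seen:
                        seen.add((subject, predicate))
                        out.write(stmt + " .\n")

first_deleted_triples("sparql-update.txt", "deleted_triples_sketch.nt")</nowiki>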
 
Note, however, that the Jupyter notebook's approach to creating additional triples (that would later be deleted) does not capture ALL triples that are removed. If a specific IRI and predicate pair can be multi-valued, then some triples may be missed. This is because only the first triple of each IRI-predicate pair is captured. Since the goal was only to create a rough approximation of a Wikidata data set, and deletion of non-existent triples is not an error, this inadequacy was not addressed further.
 
In addition to all the new data discussed above, each evaluation query was executed. If no triples existed in the test data set that might satisfy the query, triples were CONSTRUCTed using WDQS and added to the test data set. The resulting files are found in the [https://github.com/AndreaWesterinen/wikidata-tests/blob/main/notebooks/test-triples directory] of wikidata-tests. The new triples are in the file, [https://github.com/AndreaWesterinen/wikidata-tests/blob/main/data/new_triples.nt new_triples.nt]. They were created using the last block of code in the Jupyter notebook. Note that some additional cleanup of the resulting triples was needed. For example, 3 triples with invalid dates (in the far distant past) were removed, as were 2 triples with language tags of "he-x-q21283070".
 
The final data set is created by loading all the [https://github.com/AndreaWesterinen/wikidata-tests/blob/main/data .nt files in the wikidata-tests/data directory] (92,108 triples).


== Workload Testing ==
This evaluation utilizes combinations of the above queries/updates with the proportions of different query complexities defined based on these investigations:
* [https://wikitech.wikimedia.org/wiki/User:Joal/WDQS_Queries_Analysis WDQS Queries Analysis]
* [https://wikitech.wikimedia.org/wiki/User:AKhatun/WDQS_Triples_Analysis WDQS Triples Analysis]
 
The loading is based on the following:
* Highest and lowest number of "queries per second" (for a single server) as of mid-June 2022
** As captured on the [https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=now-6M&to=now&refresh=1m WDQS queries dashboard]
* Highest and lowest added and deleted "triples ingestion rate" (for a single server) as of mid-June 2022
** As captured on the [https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater?orgId=1 Streaming Updater dashboard]
Note that these workloads reflect both user and bot queries.
 
The tests are defined using query patterns based on '''some''' of the static (compliance) queries from above, supplemented by other queries that should cause timeouts on a full Wikidata load, and by updates generated by the WDQS Streaming Updater. They are executed using a modified version of the Iguana Framework across multiple systems, with multiple "client" threads/workers. The following statistics are reported:
* Total execution time for each query overall and by worker
* Minimum and maximum execution times for each query (if successful) by worker and across all workers
* Mean and geometric mean of each query (that successfully executed) by worker and across all workers
* Mean and geometric mean of each query using its execution time (if successful) or a penalized amount (= timeout by default) for failed queries, as shown in the sketch after this list
** By query overall and by worker
** This adjusts for queries completing quickly due to errors (e.g., they will have a low execution time but not produce results)
* Number of queries overall that executed and completed by worker and dataset
* Number of queries that timed out by worker and dataset
* Average number of queries per second (across all queries) that can be processed by a triple store for the data set
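For clarity, the penalized means mentioned above can be computed as in the following sketch (illustrative only; Iguana reports these metrics itself, and the 60-second timeout value is just an example):

<nowiki>from math import exp, log

def penalized_means(execution_times, timeout=60.0):
    # A failed or timed-out query is recorded as None and penalized with the timeout value.
    times = [t if t is not None else timeout for t in execution_times]
    mean = sum(times) / len(times)
    geometric_mean = exp(sum(log(t) for t in times) / len(times))
    return mean, geometric_mean

# Three successful executions (in seconds) and one failure penalized at 60 seconds
print(penalized_means([0.05, 0.2, 1.5, None], timeout=60.0))</nowiki>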
 
As above, the test details are defined at yyy (details coming). The stress test/workload environment assumes that the complete Wikidata RDF data set is loaded, and will be executed using [https://wikitech.wikimedia.org/wiki/User:AndreaWest/WDQS_Testing/Running_Iguana the modified Iguana framework].
 
As mentioned, the current set of workload/stress query definitions is derived from the compliance queries, with additional queries and updates added. The queries are defined in the Stress worksheet of the file, [https://github.com/AndreaWesterinen/wikidata-tests/blob/main/Wikidata-Benchmark-Queries.xlsx Wikidata-Benchmark-Queries.xlsx in the wikidata-tests repository]. (For a detailed discussion of the spreadsheet, see the page, [https://wikitech.wikimedia.org/w/index.php?title=User:AndreaWest/WDQS_Testing/Explaining_the_Benchmark_Queries_Spreadsheet Explaining the Benchmark Query Spreadsheet].)
 
All workload queries utilize Iguana's [http://iguana-benchmark.eu/docs/3.3/usage/queries/#sparql-pattern-queries SPARQL query patterns]. The pattern is created by examining a query, and identifying and modifying ''some or all'' of its static items and properties that should be exchanged for different Wikidata entities by the Iguana framework. Iguana uses the pattern by first parsing it to determine the "replaceable" pieces, then querying a reference endpoint for possible substitutions for the "replaceable" pieces, and then substituting actual, valid values into the pattern before issuing the query.
 
As an example, consider the static query ([https://github.com/AndreaWesterinen/wikidata-tests/blob/main/tft/query-joal11.rq query-joal11.rq]):
<nowiki>SELECT ?q ?x {
    wd:Q37938621 wdt:P131* ?q .
    ?q wdt:P300 ?x }</nowiki>
 
This is transformed to the pattern:
<nowiki>SELECT ?q ?x {
    %%var0%% wdt:P131* ?q .
    ?q %%var1%% ?x }</nowiki>
 
Iguana will query for up to 2000 possible item/property pairs to substitute for var0 and var1. When executing the tests, a var0 and var1 pair will be chosen at random and substituted into the query before issuing it to the SPARQL endpoint being tested.
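The substitution mechanism can be sketched as follows. This is an illustration of the approach described above, not Iguana's implementation; it assumes the Python requests library and uses the public WDQS endpoint as the reference endpoint.

<nowiki>import random
import re
import requests

PATTERN = """SELECT ?q ?x {
    %%var0%% wdt:P131* ?q .
    ?q %%var1%% ?x }"""

def candidate_bindings(pattern, endpoint, limit=2000):
    # Turn every %%varN%% placeholder into a variable and ask a reference endpoint
    # for values that yield results. Probes with fully unbound variables can be
    # slow against a full endpoint, so this is illustrative only.
    variables = sorted(set(re.findall(r"%%(var\d+)%%", pattern)))
    probe = re.sub(r"%%(var\d+)%%", r"?\1", pattern)
    projection = "SELECT DISTINCT " + " ".join("?" + v for v in variables) + " WHERE {"
    probe = re.sub(r"^SELECT .*?\{", projection, probe, count=1, flags=re.DOTALL)
    probe += "\nLIMIT " + str(limit)
    response = requests.get(endpoint, params={"query": probe},
                            headers={"Accept": "application/sparql-results+json"})
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    return [{v: b[v]["value"] for v in variables} for b in bindings]

def instantiate(pattern, binding):
    # Substitute one set of values into the pattern (assumes the values are IRIs).
    for variable, value in binding.items():
        pattern = pattern.replace("%%" + variable + "%%", "<" + value + ">")
    return pattern

bindings = candidate_bindings(PATTERN, "https://query.wikidata.org/sparql")
print(instantiate(PATTERN, random.choice(bindings)))</nowiki>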
 
Over time, additional queries will be added to the workload testing (to better mimic the query characteristics shown below). The queries defined in the spreadsheet are meant to be an '''initial''' set.
{| class="wikitable"
|+
!Time of Execution
!Percent of Queries
|-
|< 10ms
|5.63%
|-
|10-100ms
|74.07%
|-
|100ms-1sec
|18.2%
|-
|1sec - 10sec
|1.43%
|-
|10-30sec
|0.5%
|-
|30-60sec
|0.118%
|-
|Timeout
|0.05%
|}
 
Note that there is no specific information captured for the number of query timeouts. This value is determined by examining the number of queries that failed and that had an execution time >55 seconds.
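In other words, something like the following check over the per-query results (a sketch with toy data; the exact shape of the reported results depends on the Iguana output):

<nowiki># Count timeouts as failed queries whose recorded execution time exceeds 55 seconds
results = [(0.2, True), (57.3, False), (12.0, True), (58.1, False)]  # (seconds, succeeded) - toy data
timeouts = sum(1 for seconds, succeeded in results if not succeeded and seconds > 55)
print(timeouts)  # 2</nowiki>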
 
=== Wikidata Triples for Stress Testing ===
Stress testing requires a load of the complete set of Wikidata triples, and then a capture of the streaming updates that will be applied to it. Processing of the [https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater WDQS Streaming Updater] JSON output can be handled using the functionality in the Jupyter notebook discussed above, [https://github.com/AndreaWesterinen/wikidata-tests/blob/main/notebooks/Create_Wikidata_Sample.ipynb Create_Wikidata_Sample.ipynb]. (See the processing in the seventh and eighth code blocks of the notebook, which produces the [https://github.com/AndreaWesterinen/wikidata-tests/blob/main/notebooks/sparql-update-batches.txt sparql-update-batches.txt file].)
 
Note that:
* WDQS updates are not applied incrementally. They are batched at approximately 1 update/second, which is mimicked by the functionality in the notebook.
* If a 1 second update includes both added and deleted triples, these are captured on a single line.
** E.g., "INSERT DATA { ... } DELETE DATA { ... }" followed by a new line character
** This is done because the Iguana test infrastructure transmits one "line" at a time to the SPARQL server (a "line" is delimited by the new line character), and the data needs to be batched as it would be from the WDQS Streaming Updater (see the sketch below)
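A rough sketch of that batching is shown below. The timestamp and data field names assumed for the event JSON ("meta.dt", "rdf_added_data", "rdf_deleted_data") are hypothetical and should be checked against the captured .ndjson file; the real logic lives in the notebook's seventh and eighth code blocks.

<nowiki>import json
from itertools import groupby

def by_second(event):
    # ISO timestamp truncated to whole seconds (assumed field name)
    return event["meta"]["dt"][:19]

def batch_updates(ndjson_path, out_path):
    # Group events into ~1-second batches (events in the capture are assumed to be
    # chronological) and write one combined SPARQL request per line.
    with open(ndjson_path) as f:
        events = [json.loads(line) for line in f]
    with open(out_path, "w") as out:
        for _, batch in groupby(events, key=by_second):
            added, deleted = [], []
            for event in batch:
                added.append(event.get("rdf_added_data", {}).get("data", ""))      # assumed field
                deleted.append(event.get("rdf_deleted_data", {}).get("data", ""))  # assumed field
            parts = []
            if any(added):
                parts.append("INSERT DATA { " + " ".join(" ".join(a.split()) for a in added if a) + " }")
            if any(deleted):
                parts.append("DELETE DATA { " + " ".join(" ".join(d.split()) for d in deleted if d) + " }")
            if parts:
                out.write(" ; ".join(parts) + "\n")

batch_updates("wikidata_update_stream_6k_edits_20220531.ndjson", "sparql-update-batches-sketch.txt")</nowiki>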
 
However, there is still a problem scenario to address. There is no need to add triples to a complete Wikidata dump before the first execution of the stress tests (the dump is, after all, "complete"). But deletion and reload of some triples will be necessary after executing a workload test, if another execution is to be run against the same data store. The Wikidata triples are modified by the inclusion of Streaming Updater INSERT/DELETE data, and those modifications need to be reversed if further tests are to be run. Although the small Wikidata data set ([https://github.com/AndreaWesterinen/wikidata-tests/blob/main/data/wikidata-subset.nt wikidata-subset.nt]) can be reloaded for TFT and Iguana compliance testing, the full data set cannot, due to its size.
 
Referring to the Create_Wikidata_Sample.ipynb notebook, see the fifth code block for one way to reverse the Streaming Updater changes.
 
=== Query Characteristics/Frequencies ===
Along with using a complete Wikidata dump, the query load on a server is best tested using the actual queries that are issued to a Blazegraph server in one of the backend clusters (e.g., one of the wdqs100x servers, as shown on the [https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=now-6M&to=now&refresh=1m WDQS Grafana dashboard]). This approach is equivalent to capturing the Streaming Updater output exactly as it is sent to a server, and has the advantage of also completely capturing the real-time user and bot (versus simulated) queries.
 
This approach would work for WDQS internal testing, but presents privacy problems for a generic SPARQL benchmark or external testing. For this reason, a set of general query patterns is defined as discussed above.
 
Given that the specific queries are defined, the next issue is the frequency or mix of the queries to be sent. The mix is specified using the characteristics of the two investigations noted at the beginning of [https://wikitech.wikimedia.org/wiki/User:AndreaWest/WDQS_Testing#Workload_Testing this section]. These frequencies are documented on the Stress worksheet of the file, [https://github.com/AndreaWesterinen/wikidata-tests/blob/main/Wikidata-Benchmark-Queries.xlsx Wikidata-Benchmark-Queries.xlsx in the wikidata-tests repository], and explained [https://wikitech.wikimedia.org/w/index.php?title=User:AndreaWest/WDQS_Testing/Explaining_the_Benchmark_Queries_Spreadsheet here].
 
Lastly, the total number of queries over specific times must be defined. It is recommended that at least 10 sets of queries be issued. Each set should last for at least 3-6 minutes, and include a 30-second warm-up period. The query rates after the warm-up should vary and reflect a range of high and low usage patterns. For example:
* 30 secs of high usage, 120 secs of average usage, 60 secs of low usage, 30 secs of medium usage (5 mins total)
* 60 secs of average usage, 30 secs of very high usage, 30 secs of medium usage, 90 secs of average usage, 30 secs of medium usage, 60 secs of very low usage (5 mins total)
* 60 secs of low usage, 60 secs of medium usage, 60 secs of high usage, 60 secs of low usage, 60 secs of average usage (5 mins total)
* Etc.
 
In addition, tests should also be run whose rates average around the highest QPS rate observed (186 QPS). For example, a mix of 60 seconds of 200 QPS, followed by 60 seconds of 160 QPS, followed by 30 seconds of 210 QPS, followed by 30 seconds of 190 QPS, followed by 60 seconds of 180 QPS will execute for 4 minutes total (after warm-up) with an average of 185 QPS.
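The arithmetic behind that example schedule:

<nowiki># (seconds, QPS) segments of the example schedule above
schedule = [(60, 200), (60, 160), (30, 210), (30, 190), (60, 180)]

total_seconds = sum(duration for duration, _ in schedule)
average_qps = sum(duration * qps for duration, qps in schedule) / total_seconds
print(total_seconds / 60, "minutes,", average_qps, "QPS")  # 4.0 minutes, 185.0 QPS</nowiki>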
 
The recommended values for very low to very high usage are:
{| class="wikitable"
|+
!Usage
!Queries/Sec
|-
|Very Low
|2
|-
|Low
|10
|-
|Average
|25
|-
|Medium
|45
|-
|High
|70
|-
|Very High
|95
|}
 
Rates are defined based on the following data from Grafana (February to June 2022):
* For the eqiad servers (wdqs1004, wdqs1005, wdqs1006, wdqs1007, wdqs1012 and wdqs1013):
** Max - 186 QPS
** Min - 0.233 QPS
** Average - 27.56 QPS
** Geometric mean - 19.22 QPS
* For the codfw servers (wdqs2001, wdqs2002, wdqs2003, wdqs2004 and wdqs2007):
** Max - 146 QPS
** Min - 0.35 QPS
** Average -  9.14 QPS
** Geometric mean - 5.64 QPS
* As would be expected, the distribution of QPS is weighted toward the average and low end. Examining the eqiad servers (since they are the more heavily used):
** 27.14% of the 12-hour time slices reported <20 QPS on a server
** 57.53% reported 20-40 QPS
** 12.02% reported 40-60 QPS
** 2.21% reported 60-80 QPS
** 1.1% reported >80 QPS
 
Finally, achieving the query mix comes down to sending each of the defined queries (query patterns) with a certain delay in order to achieve the desired QPS. This is accomplished on Iguana by using 1 worker for each query pattern, with a delay between queries for that worker defined by the formula:
 
<math>workerDelay_i = \frac{1}{(desiredQPS * patternPercent_i)}</math>
 
where ''patternPercent'' is the expected percentage of the query pattern, as documented in the [https://github.com/AndreaWesterinen/wikidata-tests/blob/main/Wikidata-Benchmark-Queries.xlsx Wikidata-Benchmark-Queries spreadsheet], and where:
 
<math>desiredQPS = \sum_{i=1}^n\frac{1}{workerDelay_i}</math>
 
where ''n'' is the number of query patterns.
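Expressed as a small helper function (a sketch of the same formula; the numbers correspond to the worked example below):

<nowiki>def worker_delays(desired_qps, pattern_percents):
    # Seconds between queries for each pattern's worker, per the formula above;
    # the percentages are fractions that sum to 1.
    return [1.0 / (desired_qps * percent) for percent in pattern_percents]

# 25 QPS split 40% / 25% / 20% / 15% across four query patterns (the example below)
print([round(delay, 3) for delay in worker_delays(25, [0.40, 0.25, 0.20, 0.15])])
# [0.1, 0.16, 0.2, 0.267]</nowiki>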
 
Also note that updates will be sent at a minimum frequency of 1 per second, and may use multiple workers.


As an example, assume that you want to achieve 25 QPS and have 4 patterns with expected percentages of 40%, 25%, 20% and 15% respectively. Then:
* The first worker would issue a query every .1 second (or 10 queries in 1 sec)
* The second would issue a query every .16 seconds (or 6.25 queries in 1 sec)
* The third would issue a query every .2 seconds (or 5 queries in 1 sec)
* The fourth worker would issue a query every .267 seconds (or ~3.745 queries in 1 sec)
for a total of ~25 QPS.

Lastly, note that there is nothing hard-coded in any of the testing. It is straightforward to define different queries, test characteristics and frequencies as WDQS evolves, or to also test the Commons Query Service.

== Testing the Evaluation Infrastructure ==
The full Wikidata dump will be used for evaluation testing. However, a small subset of Wikidata has been created as a test data set, to evaluate the testing infrastructure. The details of that data set are [https://wikitech.wikimedia.org/wiki/User:AndreaWest/WDQS_Testing#Wikidata_Triples_for_Compliance_Testing described above] and the data set's evaluation using a local Stardog installation is shown on [https://wikitech.wikimedia.org/wiki/User:AndreaWest/WDQS_Testing/Testing_the_Testing this page].

== Background on SPARQL Benchmarks ==
The W3C maintains a web page on [https://www.w3.org/wiki/RdfStoreBenchmarking RDF Store Benchmarks]. Here is background on a few of those, as well as several geospatial benchmarks (listed in alphabetical order).
* [http://wbsg.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/ BSBM] (Berlin SPARQL Benchmark)
** Dataset is based on an e-commerce use case with eight classes (Product, ProductType, ProductFeature, Producer, Vendor, Offer, Review, and Person) and 51 properties
** Synthetically-generated data scaled in size based on the number of products
*** For example, a 100M triple dataset has approximately 9M instances across the various classes
*** Both an RDF and relational representation created to allow comparison of backing storage technologies
** Benchmark utilizes a mix of 12 distinct queries (1 CONSTRUCT, 1 DESCRIBE and 10 SELECT) intended to test combinations of moderately complex queries in concurrent loads from multiple clients
*** Queries vary with respect to parameterized (but randomized) properties, differing across the various test runs
*** SPARQL language features exercised are: Filtering, 9+ graph patterns, unbound predicates, negation, OPTIONAL, LIMIT, ORDER BY, DISTINCT, REGEX and UNION
** Performance metrics include dataset load time, query mixes per hour (QMpH), and queries per second (QpS, determined by taking the number of queries of a specific type in a test run and dividing by their total execution time)
*** All performance metrics require reporting of the size of the dataset, and the first and second metrics also should report the number of concurrent query clients
* [https://aksw.org/Projects/DBPSB.html DBPedia Benchmark] (deprecated, but the approach to query definition is informative)
** Dataset uses one or more [https://www.dbpedia.org/resources/ DBPedia resources], with possibility to create larger sets by changing namespace names, and to create smaller subsets by selecting a random fraction of triples or by sampling triples across classes
*** The sampling approach attempts to preserve data characteristics of indegree/outdegree (min/max/avg number of edges into/out of a vertex)
** Queries defined by analyzing requests made against the DBPedia SPARQL endpoint, coupled with specifying SPARQL features to test
*** Analysis process involved 4 steps: Query selection from the SPARQL endpoint log; stripping syntactic constructs (such as namespace prefix definitions); calculation of similarity measures (e.g., Levenshtein string similarity); and, clustering based on the similarity measures (as documented in [https://link.springer.com/content/pdf/10.1007%2F978-3-642-25073-6_29.pdf DBPedia SPARQL Benchmark])
*** SPARQL features to test: Number of triple patterns (to exercise JOIN operations, from 1 to 25), plus the inclusion of UNION and OPTIONAL constructors, the DISTINCT solution modifier, and FILTER, LANG, REGEX and STR operators
*** Result was 25 SPARQL SELECT templates with different variable components (usually an IRI, a literal or a filter condition), with goal of 1000+ different possible values per component
** Benchmark tests utilize variable DBPedia dataset sizes (10% to 200%) and query mixes based on the 25 templates and parameterized values
** Performance metrics include query mixes per hour (QMpH), number and type of queries which timed out, and queries per second (QpS, calculated as a mean and geometric mean)
*** Performance is reported relative to the dataset size
* [https://code.google.com/archive/p/fbench/wikis/Queries.wiki FedBench] (Evaluating federated query)
** Three, interlinked data collections defined that differ in size, coverage, types of links and types of data (actual vs synthetic)
*** First is cross-domain, holding data from DBpedia, GeoNames, Jamendo, Linked-MDB, New York Times and Semantic Web Dog Food (approximately 160M triples)
*** Second is targeted at Life Sciences, holding data from DBPedia, KEGG, DrugBank and ChEBI (approx 53M triples)
*** Last is the SP2Bench dataset (10M triples)
** 36 total, fixed SELECT queries specified that exercise both SPARQL language and use-case scenarios
*** 7 cross-domain, 7 life-science, 11 SP2Bench and 11 linked-data queries
*** Cross-domain and life-science queries test "federation-specific aspects, in particular (1) number of data sources involved, (2) join complexity, (3) types of links used to join sources, and (4) varying query (and intermediate) result size"
*** SP2Bench queries are discussed below and included in FedBench to exercise SPARQL language features (only the SELECT queries are used)
*** Linked-data queries focused on basic graph patterns (e.g., conjunctive query)
** Performance metrics based mainly on query execution time
* Geospatial/GeoSPARQL benchmarks
** [https://github.com/RightBank/Benchmarking-spatially-enabled-RDF-stores EuroSDR geospatial benchmark]
*** Tests performed in two scenarios:
**** Linked data environment integrating geospatial and other data, based on the [https://www.icos-cp.eu/data-services/about-data-portal ICOS Data Portal]
***** ICOS uses several backing [https://github.com/ICOS-Carbon-Portal/meta/tree/master/src/main/resources/owl ontologies] but they are not GeoSPARQL compliant
***** The EuroSDR work redesigned the ontologies for compliance and transformed the ICOS geometry data from GeoJSON to WKT ([https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry Well-Known Text])
***** Resulting dataset generated from the ICOS data in March 2019 and has over 2M RDF statements
**** Using the Geographica dataset (discussed below)
*** 25 fixed queries in both scenarios, which were selected from/modifications of the Geographica micro-benchmark discussed below (5 queries test non-topological construct functions, 10 queries evaluate spatial selection, and 10 queries test spatial join)
*** Performance metrics include load time, query execution time in each test iteration, and result correctness related to # of results and the reported geometries
** [https://github.com/semagrow/benchmark-geofedbench GeoFedBench]
** [http://geographica.di.uoa.gr/ Geographica]
*** Two datasets defined - one based on publicly available linked data and the other based on synthetic data
**** Publicly available data focused on Greece and used information from DBpedia, GeoNames, LinkedGeoData (related to road networks and rivers in Greece), Greek Administrative Geography, CORINE Land Use/Land Cover, and wildfire hotspots from the National Observatory of Athens' TELEIOS project
***** Complete dataset contains more than 30K points, 12K polylines and 82K polygons
**** Synthetic data produces different sized datasets with different thematic and spatial selectivity
*** Two benchmarks defined to exercise the publicly available data - a micro and a macro benchmark
**** Micro benchmark focused on evaluation of primitive spatial functions testing "non-topological functions, spatial selections, spatial joins and spatial aggregate functions"
***** 29 fixed queries - 6 non-topological queries, 11 spatial selection queries, 10 spatial join queries (joining across different named graphs) and 2 aggregate function queries (one is specific to the stSPARQL language developed for [https://www.semanticscholar.org/paper/Modeling-and-Querying-Metadata-in-the-Semantic-Web%3A-Koubarakis-Kyzirakos/545146d9672a646e5dd32e69c59431146e6baaf1 Strabon])
**** Macro benchmark focused on performance in different use cases/application scenarios
***** 16 fixed queries - 4 geocoding queries (related to finding the name of a location, given certain criteria, or finding a city or street closest to a specified point), 3 map queries (related to finding a point of interest given some criteria and then roads or buildings around it), 6 "wildfire" use case queries (related to finding land cover area, primary roads, cities and municipalities within a bounding box, as well as forests on fire or roads which may be damaged) and 3 aggregation/counting of location (CLC) queries
*** For the synthetic data, various queries are generated from two templates using different properties and criteria
**** One template selects a location based on criteria + within or intersecting a bounding box, and the other template selects 2 locations based on criteria + within or intersecting or touching each other
*** Performance metrics for both datasets include statistics on load time, the overall time to execute a test run, and the execution times of individual queries
** [https://github.com/OpenLinkSoftware/GeoSPARQLBenchmark GeoSPARQL Benchmark] (Evaluating ''compliance'' to the GeoSPARQL specification)
* [http://swat.cse.lehigh.edu/projects/lubm/ LUBM] (Lehigh University Benchmark)
** Dataset based on a "university" ontology (Universities, Professors, Students, Courses, etc.) with 43 classes and 32 properties
** Synthetically-generated data scaled in size
*** Defined datasets range from 1 to 8000 universities, the largest one having approximately 1B triples
** 14 fixed queries defined, focused on instance retrieval (SELECT queries) and limited inference (based on subsumption/subclassing, owl:TransitiveProperty and owl:inverseOf)
*** Factors of importance: Proportion of the instances involved (size and selectivity); Complexity of the query; Requirement for traversal of class/property hierarchies; and Requirement for inference
*** Queries do not include language features such as OPTIONAL, UNION, DESCRIBE, etc.
** Performance metrics include load time, query response time, answer completeness and correctness, and a combined metric (similar to F-Measure) based on completeness/correctness
* [https://arxiv.org/pdf/0806.4627.pdf SP2Bench] (SPARQL Performance Benchmark)
** Dataset based on the structure of the [https://dblp.org/ DBLP Computer Science Bibliography] with 8 classes and 22 properties
** Synthetic data (of different sizes) generated based on the characteristics of the underlying DBLP information
** 17 different, fixed queries exercising SPARQL language features and JOIN operations, as well as SPARQL complexity and result size
*** 3 ASK queries and 14 SELECT queries defined that test JOINs, FILTER, UNION, OPTIONAL, DISTINCT, ORDER BY, LIMIT, OFFSET, and blank node and container processing
*** Evaluating "long path chains (i.e. nodes linked to ... other nodes via a long path), bushy patterns (i.e. single nodes that are linked to a multitude of other nodes), and combinations of these two"
** Performance metrics include the load time, success rate (separately reporting success rates for all document sizes, and distinguishing between success, timeout, memory issues and other errors), global and per-query performance (where the latter combines the per-query results and produces both the mean and geometric mean), and memory consumption (reporting both the maximum consumption and the average across all queries)
* [https://www.researchgate.net/publication/226235487_Towards_a_Complete_OWL_Ontology_Benchmark UOBM] (University Ontology Benchmark, very similar to but extending LUBM)
** Two ontologies defined with different inferencing requirements (OWL Lite and OWL DL)
** Ontology classes and properties added (69 total classes and 43 properties in the OWL DL ontology)
** Generation of synthetic data to include links between universities' and departments' data
** 15 fixed SELECT queries defined


In addition, the following papers and the [https://jimgray.azurewebsites.net/BenchmarkHandbook/TOC.htm 1993 Benchmark Handbook] also provided important background information:
* [https://dl.acm.org/doi/10.1145/1989323.1989340 Apples and Oranges: A Comparison of RDF Benchmarks and Real RDF Datasets]
* [https://link.springer.com/content/pdf/10.1007/978-3-319-11964-9_13.pdf Diversified Stress Testing of RDF Data Management Systems]
* [https://svn.aksw.org/papers/2017/ISWC_Iguana/public.pdf Iguana: A Generic Framework for Benchmarking the Read-Write Performance of Triple Stores] with an [https://github.com/dice-group/IGUANA implementation]
* [https://openreview.net/pdf?id=RSTtlH8Ycol KOBE: Cloud-native Open Benchmarking Engine for Federated Query Processors]
* [https://ieeexplore.ieee.org/document/4039291 A Requirements Driven Framework for Benchmarking Semantic Web Knowledge Base Systems]
* [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.139.6934&rep=rep1&type=pdf What’s Wrong with OWL Benchmarks?]
