
User:AndreaWest/WDQS Testing: Difference between revisions

(→‎Background on SPARQL Benchmarks: Moved details to a separate page)
Line 2: Line 2:

== Goals ==
* Definition of one or more data sets
* Definition of multiple data sets exercising the SPARQL functions and complexities seen in actual Wikidata queries, as well as extensions, federated query, and workloads
* Definition of specific INSERT, DELETE, CONSTRUCT and SELECT queries for performance and capabilities analysis
** Definition of specific INSERT, DELETE, CONSTRUCT and SELECT queries for performance and capabilities analysis
* Definition of read/write workloads for stress testing
** Definition of read/write workloads for stress testing
** Goal to test both system characteristics and SPARQL compliance, and behavior in real-world scenarios

== Testing Specific Updates and Queries ==
Line 13: Line 14:

== Background on SPARQL Benchmarks ==
The W3C maintains a web page on [ RDF Store Benchmarks]. Here is background on a few of those as well as several geospatial benchmarks (listed in alphabetical order).
See [ Background on SPARQL Benchmarks].
* [ BSBM] (Berlin SPARQL Benchmark)
** Dataset is based on an e-commerce use case with eight classes (Product, ProductType, ProductFeature, Producer, Vendor, Offer, Review, and Person) and 51 properties
** Synthetically-generated data scaled in size based on the number of products
*** For example, a 100M triple dataset has approximately 9M instances across the various classes
*** Both an RDF and relational representation created to allow comparison of backing storage technologies
** Benchmark utilizes a mix of 12 distinct queries (1 CONSTRUCT, 1 DESCRIBE and 10 SELECT) intended to test combinations of moderately complex queries in concurrent loads from multiple clients
*** Queries vary with respect to parameterized (but randomized) properties, differing across the various test runs
*** SPARQL language features exercised are: Filtering, 9+ graph patterns, unbound predicates, negation, OPTIONAL, LIMIT, ORDER BY, DISTINCT, REGEX and UNION
** Performance metrics include dataset load time, query mixes per hour (QMpH), and queries per second (QpS, determined by taking the number of queries of a specific type in a test run and dividing by their total execution time)
*** All performance metrics require reporting of the size of the dataset, and the first and second metrics also should report the number of concurrent query clients
* [ DBpedia Benchmark] (deprecated, but the approach to query definition is informative)
** Dataset uses one or more [ DBpedia resources], with the possibility to create larger sets by changing namespace names, and to create smaller subsets by selecting a random fraction of triples or by sampling triples across classes
*** The sampling approach attempts to preserve data characteristics of indegree/outdegree (min/max/avg number of edges into/out of a vertex)
** Queries defined by analyzing requests made against the DBpedia SPARQL endpoint, coupled with specifying SPARQL features to test
*** Analysis process involved 4 steps: query selection from the SPARQL endpoint log; stripping syntactic constructs (such as namespace prefix definitions); calculation of similarity measures (e.g., Levenshtein string similarity); and clustering based on the similarity measures (as documented in [ DBpedia SPARQL Benchmark])
*** SPARQL features to test: Number of triple patterns (to exercise JOIN operations, from 1 to 25), plus the inclusion of UNION and OPTIONAL constructors, the DISTINCT solution modifier, and FILTER, LANG, REGEX and STR operators
*** Result was 25 SPARQL SELECT templates with different variable components (usually an IRI, a literal or a filter condition), with goal of 1000+ different possible values per component
** Benchmark tests utilize variable DBpedia dataset sizes (10% to 200%) and query mixes based on the 25 templates and parameterized values
** Performance metrics include query mixes per hour (QMpH), number and type of queries which timed out, and queries per second (QpS, calculated as a mean and geometric mean)
*** Performance is reported relative to the dataset size
* [ FedBench] (Evaluating federated query)
** Three interlinked data collections are defined that differ in size, coverage, types of links and types of data (actual vs synthetic)
*** First is cross-domain, holding data from DBpedia, GeoNames, Jamendo, Linked-MDB, New York Times and Semantic Web Dog Food (approximately 160M triples)
*** Second is targeted at Life Sciences, holding data from DBpedia, KEGG, DrugBank and ChEBI (approximately 53M triples)
*** Last is the SP2Bench dataset (10M triples)
** 36 total, fixed SELECT queries specified that exercise both SPARQL language and use-case scenarios
*** 7 cross-domain, 7 life-science, 11 SP2Bench and 11 linked-data queries
*** Cross-domain and life-science queries test "federation-specific aspects, in particular (1) number of data sources involved, (2) join complexity, (3) types of links used to join sources, and (4) varying query (and intermediate) result size"
*** SP2Bench queries are discussed below and included in FedBench to exercise SPARQL language features (only the SELECT queries are used)
*** Linked-data queries focused on basic graph patterns (e.g., conjunctive query)
** Performance metrics based mainly on query execution time
* Geospatial/GeoSPARQL benchmarks
** [ EuroSDR geospatial benchmark]
*** Tests performed in two scenarios:
**** Linked data environment integrating geospatial and other data, based on the [ ICOS Data Portal]
***** ICOS uses several backing [ ontologies], but they are not GeoSPARQL compliant
***** The EuroSDR work redesigned the ontologies for compliance and transformed the ICOS geometry data from GeoJSON to WKT ([ Well-Known Text])
***** Resulting dataset generated from the ICOS data in March 2019 and has over 2M RDF statements
**** Using the Geographica dataset (discussed below)
*** 25 fixed queries in both scenarios, either selected from or modified versions of the Geographica micro-benchmark queries discussed below (5 queries test non-topological construct functions, 10 queries evaluate spatial selection, and 10 queries test spatial join)
*** Performance metrics include load time, query execution time in each test iteration, and result correctness with respect to the number of results and the reported geometries
** [ GeoFedBench]
** [ Geographica]
*** Two datasets defined - one based on publicly available linked data and the other based on synthetic data
**** Publicly available data focused on Greece and used information from DBpedia, GeoNames, LinkedGeoData (related to road networks and rivers in Greece), Greek Administrative Geography, CORINE Land Use/Land Cover, and wildfire hotspots from the National Observatory of Athens' TELEIOS project
***** Complete dataset contains more than 30K points, 12K polylines and 82K polygons
**** Synthetic data produces different sized datasets with different thematic and spatial selectivity
*** Two benchmarks defined to exercise the publicly available data - a micro and a macro benchmark
**** Micro benchmark focused on evaluation of primitive spatial functions testing "non-topological functions, spatial selections, spatial joins and spatial aggregate functions"
***** 29 fixed queries - 6 non-topological queries, 11 spatial selection queries, 10 spatial join queries (joining across different named graphs) and 2 aggregate function queries (one is specific to the stSPARQL language developed for [ Strabon])
**** Macro benchmark focused on performance in different use cases/application scenarios
***** 16 fixed queries - 4 geocoding queries (related to finding the name of a location, given certain criteria, or finding a city or street closest to a specified point), 3 map queries (related to finding a point of interest given some criteria and then roads or buildings around it), 6 "wildfire" use case queries (related to finding land cover area, primary roads, cities and municipalities within a bounding box, as well as forests on fire or roads which may be damaged) and 3 aggregation/counting of location (CLC) queries
*** For the synthetic data, various queries are generated from two templates using different properties and criteria
**** One template selects a location based on criteria + within or intersecting a bounding box, and the other template selects 2 locations based on criteria + within or intersecting or touching each other
*** Performance metrics for both datasets include statistics on load time, the overall time to execute a test run, and the execution times of individual queries
** [ GeoSPARQL Benchmark] (Evaluating ''compliance'' to the GeoSPARQL specification)
* [ LUBM] (Lehigh University Benchmark)
** Dataset based on a "university" ontology (Universities, Professors, Students, Courses, etc.) with 43 classes and 32 properties
** Synthetically-generated data scaled in size
*** Defined datasets range from 1 to 8000 universities, the largest one having approximately 1B triples
** 14 fixed queries defined focused on instance retrieval (SELECT queries) and limited inference (based on subsumption/subclassing, owl:TransitiveProperty and owl:inverseOf)
*** Factors of importance: Proportion of the instances involved (size and selectivity); Complexity of the query; Requirement for traversal of class/property hierarchies; and Requirement for inference
*** Queries do not include language features such as OPTIONAL, UNION, DESCRIBE, etc.
** Performance metrics include load time, query response time, answer completeness and correctness, and a combined metric (similar to F-Measure) based on completeness/correctness
* [ SP2Bench] (SPARQL Performance Benchmark)
** Dataset based on the structure of the [ DBLP Computer Science Bibliography] with 8 classes and 22 properties
** Synthetic data (of different sizes) generated based on the characteristics of the underlying DBLP information
** 17 different, fixed queries exercising SPARQL language features and JOIN operations, as well as SPARQL complexity and result size
*** 3 ASK queries and 14 SELECT queries defined that test JOINs, FILTER, UNION, OPTIONAL, DISTINCT, ORDER BY, LIMIT, OFFSET, and blank node and container processing
*** Evaluating "long path chains (i.e. nodes linked to ... other nodes via a long path), bushy patterns (i.e. single nodes that are linked to a multitude of other nodes), and combinations of these two"
** Performance metrics include the load time, success rate (separately reporting success rates for all document sizes, and distinguishing between success, timeout, memory issues and other errors), global and per-query performance (where the latter combines the per-query results and produces both the mean and geometric mean), and memory consumption (reporting both the maximum consumption and the average across all queries)
* [ UOBM] (University Ontology Benchmark, very similar to but extending LUBM)
** Two ontologies defined with different inferencing requirements (OWL Lite and OWL DL)
** Ontology classes and properties added (69 total classes and 43 properties in the OWL DL ontology)
** Generation of synthetic data to include links between universities' and departments' data
** 15 fixed SELECT queries defined
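Several of the benchmarks above report query mixes per hour (QMpH) and queries per second (QpS), and some summarize per-query results with both an arithmetic and a geometric mean. The following Python sketch shows one way these numbers could be computed from raw timings. The function names, the timing tuples and the sample values are hypothetical and are not taken from any benchmark's tooling; QpS follows the definition given for BSBM (number of queries of a type divided by their total execution time).

```python
import math
from collections import defaultdict


def query_mixes_per_hour(num_mixes, wall_clock_seconds):
    """Completed query mixes, scaled to a one-hour window."""
    return num_mixes * 3600.0 / wall_clock_seconds


def queries_per_second(timings):
    """Per query type: number of executions divided by their total time.

    `timings` is an iterable of (query_type, execution_seconds) pairs.
    """
    counts = defaultdict(int)
    totals = defaultdict(float)
    for query_type, seconds in timings:
        counts[query_type] += 1
        totals[query_type] += seconds
    return {q: counts[q] / totals[q] for q in counts}


def arithmetic_mean(values):
    values = list(values)
    return sum(values) / len(values)


def geometric_mean(values):
    """Geometric mean via logs; less dominated by a few slow outliers."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical timings for two query types (illustrative values only).
timings = [("Q1", 0.5), ("Q1", 0.5), ("Q2", 2.0), ("Q2", 2.0)]
qps = queries_per_second(timings)
qmph = query_mixes_per_hour(num_mixes=10, wall_clock_seconds=1800)
mean_qps = arithmetic_mean(qps.values())
geo_mean_qps = geometric_mean(qps.values())
```

As the benchmark descriptions note, such figures are only comparable when reported together with the dataset size and, for the throughput metrics, the number of concurrent query clients.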
In addition, the following papers and the [ 1993 Benchmark Handbook] also provided important background information:
* [ Apples and Oranges: A Comparison of RDF Benchmarks and Real RDF Datasets]
* [ Diversified Stress Testing of RDF Data Management Systems]
* [ Iguana: A Generic Framework for Benchmarking the Read-Write Performance of Triple Stores] with an [ implementation]
* [ KOBE: Cloud-native Open Benchmarking Engine for Federated Query Processors]
* [ A Requirements Driven Framework for Benchmarking Semantic Web Knowledge Base Systems]
* [ What’s Wrong with OWL Benchmarks?]

Revision as of 18:13, 3 April 2022

This page outlines a design and specific suggestions for Wikidata SPARQL query testing. These tests will be useful for evaluating alternatives to the Blazegraph backend and (possibly) for establishing a Wikidata SPARQL benchmark for the industry.


  • Definition of multiple data sets exercising the SPARQL functions and complexities seen in actual Wikidata queries, as well as extensions, federated query, and workloads
    • Definition of specific INSERT, DELETE, CONSTRUCT and SELECT queries for performance and capabilities analysis
    • Definition of read/write workloads for stress testing
    • Goal to test both system characteristics and SPARQL compliance, and behavior in real-world scenarios

Testing Specific Updates and Queries

Address different query and update patterns, including a variety of SPARQL features (such as FILTER, OPTIONAL, GROUP BY, ...), federation, geospatial analysis, support for the label, GAS, sampling and MediaWiki "services", and more.
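To track which of these features a candidate test query exercises, a simple tagging step could be applied to the query text. The sketch below is illustrative only: the example query, the pattern table and the `features_used` helper are assumptions made for this illustration, not part of an existing test suite, and the matching is naive text search rather than real SPARQL parsing (it would also match keywords inside string literals or comments).

```python
import re

# Hypothetical example query combining FILTER, OPTIONAL, GROUP BY and the
# Wikidata label service; prefixes are assumed to be predefined by the endpoint.
QUERY = """
SELECT ?item ?itemLabel (COUNT(?award) AS ?awards) WHERE {
  ?item wdt:P31 wd:Q5 ;
        wdt:P569 ?born .
  FILTER(YEAR(?born) >= 1900)
  OPTIONAL { ?item wdt:P166 ?award . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?item ?itemLabel
LIMIT 10
"""

# Illustrative feature table: name -> regular expression over the query text.
FEATURE_PATTERNS = {
    "FILTER": r"\bFILTER\b",
    "OPTIONAL": r"\bOPTIONAL\b",
    "GROUP BY": r"\bGROUP\s+BY\b",
    "UNION": r"\bUNION\b",
    "LABEL SERVICE": r"\bSERVICE\s+wikibase:label\b",
    "FEDERATION": r"\bSERVICE\s+(?:SILENT\s+)?<",
}


def features_used(query):
    """Return the names of all features whose pattern occurs in the query."""
    return {
        name
        for name, pattern in FEATURE_PATTERNS.items()
        if re.search(pattern, query, re.IGNORECASE)
    }


detected = features_used(QUERY)
```

Aggregating the detected sets over a whole collection of test queries would show which features are still uncovered, for example UNION and federated SERVICE calls in the single query above.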

Workload Testing


Background on SPARQL Benchmarks

See Background on SPARQL Benchmarks.