User:AndreaWest/WDQS Testing

Revision as of 18:30, 30 March 2022 by imported>AndreaWest (→‎Background on SPARQL Benchmarks)

This page presents a design and specific suggestions for Wikidata SPARQL query testing. The tests will be useful for evaluating alternatives to the Blazegraph backend and (possibly) for establishing an industry Wikidata SPARQL benchmark.


The testing design includes:

  • Definition of one or more data sets
  • Definition of specific INSERT, DELETE, CONSTRUCT and SELECT queries for performance and capabilities analysis
  • Definition of read/write workloads for stress testing

Testing Specific Updates and Queries

These tests address different query and update patterns, including a variety of SPARQL features (such as FILTER, OPTIONAL, GROUP BY, ...), federation, geospatial analysis, support for the label, GAS, sampling and MediaWiki "services", and more.
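As a concrete illustration of the kind of parameterized query such tests could use, the sketch below builds a SELECT query exercising OPTIONAL and FILTER. The template shape, the properties (P31 instance-of, P1082 population) and the class Q515 (city) are illustrative choices only, not queries from the actual test set:

```python
# Sketch of a parameterized SPARQL SELECT template exercising OPTIONAL
# and FILTER. The property/class choices are illustrative only.
TEMPLATE = """SELECT ?item ?label ?population WHERE {{
  ?item wdt:P31 wd:{instance_of} .            # instance-of class (parameter)
  OPTIONAL {{ ?item wdt:P1082 ?population }}  # population, when present
  ?item rdfs:label ?label .
  FILTER(LANG(?label) = "{lang}")             # label language (parameter)
}}
LIMIT {limit}"""

def build_query(instance_of: str, lang: str = "en", limit: int = 10) -> str:
    """Fill in the template's parameters to produce one concrete query."""
    return TEMPLATE.format(instance_of=instance_of, lang=lang, limit=limit)

query = build_query("Q515")  # Q515 = city (illustrative)
```

Varying the parameter values across test runs, as BSBM and the DBpedia benchmark do, avoids repeatedly measuring a warm cache for a single fixed query.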

Workload Testing


Background on SPARQL Benchmarks

The W3C maintains a web page on RDF Store Benchmarks. Background is provided below on a few of these (listed in alphabetical order) whose designs provided insights used in the work above.

  • BSBM (Berlin SPARQL Benchmark)
    • Dataset is based on an e-commerce use case with eight classes (Product, ProductType, ProductFeature, Producer, Vendor, Offer, Review, and Person) and 51 properties
    • Synthetically-generated data scaled in size based on the number of products
      • For example, a 100M triple dataset has approximately 9M instances across the various classes
      • Both RDF and relational representations are created to allow comparison of backing storage technologies
    • Benchmark utilizes a mix of 12 distinct queries (1 CONSTRUCT, 1 DESCRIBE and 10 SELECT) intended to test combinations of moderately complex queries in concurrent loads from multiple clients
      • Queries vary with respect to parameterized (but randomized) properties, differing across the various test runs
      • SPARQL language features exercised are: Filtering, 9+ graph patterns, unbound predicates, negation, OPTIONAL, LIMIT, ORDER BY, DISTINCT, REGEX and UNION
    • Performance metrics include dataset load time, query mixes per hour (QMpH), and queries per second (QpS, determined by taking the number of queries of a specific type in a test run and dividing by their total execution time)
      • All performance metrics require reporting of the size of the dataset, and the first two metrics should also report the number of concurrent query clients
  • DBpedia Benchmark (deprecated, but the approach to query definition is informative)
    • Dataset uses one or more DBpedia resources, with the possibility of creating larger sets by changing namespace names, or smaller subsets by selecting a random fraction of triples or by sampling triples across classes
      • The sampling approach attempts to preserve data characteristics of indegree/outdegree (min/max/avg number of edges into/out of a vertex)
    • Queries defined by analyzing requests made against the DBpedia SPARQL endpoint, coupled with specifying SPARQL features to test
      • Analysis process involved four steps: Query selection from the SPARQL endpoint log; stripping syntactic constructs (such as namespace prefix definitions); calculation of similarity measures (e.g., Levenshtein string similarity); and clustering based on the similarity measures (as documented in the DBpedia SPARQL Benchmark)
      • SPARQL features to test: Number of triple patterns (to exercise JOIN operations, from 1 to 25), plus the inclusion of the UNION and OPTIONAL constructs, the DISTINCT solution modifier, and the FILTER, LANG, REGEX and STR operators
      • Result was 25 SPARQL SELECT templates with different variable components (usually an IRI, a literal or a filter condition), with goal of 1000+ different possible values per component
    • Benchmark tests utilize variable DBpedia dataset sizes (10% to 200%) and query mixes based on the 25 templates and parameterized values
    • Performance metrics include query mixes per hour (QMpH), number and type of queries which timed out, and queries per second (QpS, calculated as a mean and geometric mean)
      • Performance is reported relative to the dataset size
  • FedBench (Evaluating federated query)
    • Three interlinked data collections defined that differ in size, coverage, types of links and types of data (actual vs. synthetic)
      • First is cross-domain, holding data from DBpedia, GeoNames, Jamendo, Linked-MDB, New York Times and Semantic Web Dog Food (approximately 160M triples)
      • Second is targeted at Life Sciences, holding data from DBpedia, KEGG, DrugBank and ChEBI (approximately 53M triples)
      • Last is the SP2Bench dataset (10M triples)
    • 36 fixed SELECT queries specified that exercise both SPARQL language features and use-case scenarios
      • 7 cross-domain, 7 life-science, 11 SP2Bench and 11 linked-data queries
      • Cross-domain and life-science queries test "federation-specific aspects, in particular (1) number of data sources involved, (2) join complexity, (3) types of links used to join sources, and (4) varying query (and intermediate) result size"
      • SP2Bench queries are discussed below and included in FedBench to exercise SPARQL language features (only the SELECT queries are used)
      • Linked-data queries focused on basic graph patterns (i.e., conjunctive queries)
    • Performance metrics based mainly on query execution time
  • LUBM (Lehigh University Benchmark)
    • Dataset based on a "university" ontology (Universities, Professors, Students, Courses, etc.) with 43 classes and 32 properties
    • Synthetically-generated data scaled in size
      • Defined datasets range from 1 to 8000 universities, the largest one having approximately 1B triples
    • 14 fixed queries defined, focused on instance retrieval (SELECT queries) and limited inference (based on subsumption/subclassing, owl:TransitiveProperty and owl:inverseOf)
      • Factors of importance: Proportion of the instances involved (size and selectivity); Complexity of the query; Requirement for traversal of class/property hierarchies; and Requirement for inference
      • Queries do not include language features such as OPTIONAL, UNION, DESCRIBE, etc.
    • Performance metrics include load time, query response time, answer completeness and correctness, and a combined metric (similar to F-Measure) based on completeness/correctness
  • SP2Bench (SPARQL Performance Benchmark)
    • Dataset based on the structure of the DBLP Computer Science Bibliography with 8 classes and 22 properties
    • Synthetic data (of different sizes) generated based on the characteristics of the underlying DBLP information
    • 12 different queries exercising SPARQL language features and JOIN operations, as well as SPARQL complexity and result size
      • ASK and SELECT queries defined that test JOINs, FILTER, UNION, OPTIONAL, DISTINCT, ORDER BY, LIMIT, OFFSET, and blank node and container processing
      • Evaluating "long path chains (i.e. nodes linked to ... other nodes via a long path), bushy patterns (i.e. single nodes that are linked to a multitude of other nodes), and combinations of these two"
    • It is unclear whether the benchmark queries are parameterized
    • Performance metrics include the load time, success rate (reported separately for each document size, distinguishing between success, timeout, memory issues and other errors), global and per-query performance (where the latter combines the per-query results and reports both the mean and geometric mean), and memory consumption (reporting both the maximum consumption and the average across all queries)
  • UOBM (University Ontology Benchmark, very similar to but extending LUBM)
    • Two ontologies defined with different inferencing requirements (OWL Lite and OWL DL)
    • Ontology classes and properties added (69 total classes and 43 properties in the OWL DL ontology)
    • Generation of synthetic data to include links between universities' and departments' data
    • 15 fixed SELECT queries defined
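The indegree/outdegree characteristics that the DBpedia benchmark's triple sampling tries to preserve can be measured as below; the three-triple dataset is invented purely for illustration:

```python
from collections import Counter

def degree_stats(triples):
    """Min/max/average out-degree (edges leaving a subject) and
    in-degree (edges entering an object) for a set of triples."""
    out_deg = Counter(s for s, p, o in triples)
    in_deg = Counter(o for s, p, o in triples)

    def summarize(counter):
        degrees = list(counter.values())
        return {"min": min(degrees), "max": max(degrees),
                "avg": sum(degrees) / len(degrees)}

    return summarize(out_deg), summarize(in_deg)

# Tiny illustrative triple set: a -> b, a -> c, b -> c
triples = [("a", "p1", "b"), ("a", "p2", "c"), ("b", "p1", "c")]
out_stats, in_stats = degree_stats(triples)
```

Comparing these statistics before and after sampling gives a quick check of whether a subset retains the shape of the full graph.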
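The DBpedia benchmark's four-step query analysis (log selection, syntactic stripping, similarity measurement, clustering) can be sketched as follows. The example queries, the normalized similarity measure, the greedy single-link clustering and the 0.8 threshold are all assumptions for illustration; the actual benchmark used more elaborate machinery:

```python
import re

def strip_syntax(query: str) -> str:
    """Drop PREFIX declarations and collapse whitespace (step 2)."""
    no_prefixes = re.sub(r"(?im)^\s*PREFIX\s+\S+\s+<[^>]*>\s*", "", query)
    return " ".join(no_prefixes.split())

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (step 3)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1]."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest

def cluster(queries, threshold=0.8):
    """Greedy single-link clustering on the similarity measure (step 4)."""
    clusters = []
    for q in map(strip_syntax, queries):
        for c in clusters:
            if any(similarity(q, member) >= threshold for member in c):
                c.append(q)
                break
        else:
            clusters.append([q])
    return clusters
```

Each resulting cluster would then be generalized into one query template by replacing its varying components (IRIs, literals, filter conditions) with parameters.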
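The throughput metrics that BSBM, the DBpedia benchmark and SP2Bench report (queries per second, query mixes per hour, and arithmetic versus geometric means of per-query results) follow directly from their definitions above; the timing values here are invented:

```python
import math

def qps(execution_times):
    """Queries per second: the number of queries of a given type
    divided by their total execution time (in seconds)."""
    return len(execution_times) / sum(execution_times)

def qmph(mixes_completed, elapsed_seconds):
    """Query mixes per hour for one test run."""
    return mixes_completed * 3600.0 / elapsed_seconds

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    """Geometric mean; damps the influence of outlier queries
    compared with the arithmetic mean."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

times = [0.2, 0.25, 0.15, 0.4]  # illustrative per-query execution times (s)
```

As the benchmarks require, both throughput numbers are only meaningful when reported alongside the dataset size and the number of concurrent clients.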
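LUBM's combined metric is described above as similar to F-measure over answer completeness and correctness. A sketch of that style of combination is below; the exact weighting is an assumption for illustration, not LUBM's published formula:

```python
def completeness(returned, expected):
    """Fraction of the expected answers that were returned (recall)."""
    returned, expected = set(returned), set(expected)
    return len(returned & expected) / len(expected)

def soundness(returned, expected):
    """Fraction of the returned answers that are correct (precision)."""
    returned, expected = set(returned), set(expected)
    return len(returned & expected) / len(returned)

def combined(returned, expected, beta=1.0):
    """F-measure-style harmonic combination of completeness and
    soundness (beta weighting is an illustrative assumption)."""
    c = completeness(returned, expected)
    s = soundness(returned, expected)
    if c + s == 0:
        return 0.0
    return (1 + beta**2) * s * c / (beta**2 * s + c)
```

A combined score like this penalizes stores that inflate completeness by returning many incorrect answers, or that return only a few safe answers.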

In addition, the following papers and the 1993 Benchmark Handbook also provided important background information: