User:AndreaWest/WDQS Testing

This page gives an overview of a design and specific suggestions for Wikidata SPARQL query testing. The tests will be useful for evaluating Blazegraph backend alternatives and, possibly, for establishing an industry Wikidata SPARQL benchmark.

Goals

  • Definition of one or more data sets
  • Definition of specific INSERT, DELETE, CONSTRUCT and SELECT queries for performance and capabilities analysis (a sketch of one such update operation appears after this list)
  • Definition of read/write workloads for stress testing
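
As a rough illustration of the update side of these goals, the sketch below shows the shape of a single SPARQL 1.1 Update operation that a write workload might contain. The item (wd:Q42), property (wdt:P1971) and value here are placeholders chosen for illustration only, not part of any defined test set.

  PREFIX wd:  <http://www.wikidata.org/entity/>
  PREFIX wdt: <http://www.wikidata.org/prop/direct/>
  PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

  # Replace any existing value of the (illustrative) claim with a new one;
  # if no prior value exists, only the INSERT takes effect
  DELETE { wd:Q42 wdt:P1971 ?old }
  INSERT { wd:Q42 wdt:P1971 "4"^^xsd:decimal }
  WHERE  { OPTIONAL { wd:Q42 wdt:P1971 ?old } }

In practice, test updates could be drawn from or modeled on actual Wikidata edits rather than hand-written operations like this one.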

Testing Specific Updates and Queries

The tests should address different query and update patterns, including a variety of SPARQL features (such as FILTER, OPTIONAL, GROUP BY, ...), federation, geospatial analysis, support for the label, GAS, sampling and MediaWiki "services", and more.
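
As one illustration, a query along the following lines exercises several of the features named above (OPTIONAL, FILTER and the label service). The specific items and properties used (house cats and their dates of birth) are purely illustrative and are not proposed test queries.

  PREFIX wd:       <http://www.wikidata.org/entity/>
  PREFIX wdt:      <http://www.wikidata.org/prop/direct/>
  PREFIX wikibase: <http://wikiba.se/ontology#>
  PREFIX bd:       <http://www.bigdata.com/rdf#>
  PREFIX xsd:      <http://www.w3.org/2001/XMLSchema#>

  # Instances of house cat (wd:Q146), with an optional date of birth (wdt:P569),
  # restricted to recent births, and English labels supplied by the label service
  SELECT ?cat ?catLabel ?birth WHERE {
    ?cat wdt:P31 wd:Q146 .
    OPTIONAL { ?cat wdt:P569 ?birth . }
    FILTER(!BOUND(?birth) || ?birth >= "2010-01-01T00:00:00Z"^^xsd:dateTime)
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  }
  LIMIT 100

Comparable queries would be defined to exercise GROUP BY and other aggregates, federation to remote SPARQL endpoints, geospatial functions, and the GAS, sampling and MediaWiki services.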

Workload Testing

TBD

Background on SPARQL Benchmarks

The W3C maintains a web page on RDF Store Benchmarks. Here is background on a few of these (listed in alphabetical order) whose designs provided insights used in the work above.

  • BSBM (Berlin SPARQL Benchmark)
    • Dataset is based on an e-commerce use case with eight classes (Product, ProductType, ProductFeature, Producer, Vendor, Offer, Review, and Person) and 51 properties
    • Synthetically-generated data scaled in size based on the number of products
      • For example, a 100M triple dataset has approximately 9M instances across the various classes
      • Both an RDF and a relational representation were created to allow comparison of backing storage technologies
    • Benchmark utilizes a mix of 12 distinct queries (1 CONSTRUCT, 1 DESCRIBE and 10 SELECT) intended to test combinations of moderately complex queries in concurrent loads from multiple clients
      • Queries vary with respect to parameterized (but randomized) properties, differing across the various test runs
      • SPARQL language features exercised are: Filtering, 9+ graph patterns, unbound predicates, negation, OPTIONAL, LIMIT, ORDER BY, DISTINCT, REGEX and UNION
    • Performance metrics include query mixes per hour (QMpH), queries per second (QpS, determined by taking the number of queries of a specific type in a test run and dividing by their total execution time), and load time
      • All performance metrics require reporting of the size of the dataset, and the first and second metrics also should report the number of concurrent query clients
  • DBPedia Benchmark
    • Dataset uses one or more DBPedia resources, with the possibility of creating larger sets by changing namespace names, and smaller subsets by selecting a random fraction of triples or by sampling triples across classes
      • The sampling approach attempts to preserve data characteristics of indegree/outdegree (min/max/avg number of edges into/out of a vertex)
    • Queries defined by analyzing requests made against the DBPedia SPARQL endpoint, coupled with specifying SPARQL features to test
      • Analysis process involved 4 steps: Query selection from the SPARQL endpoint log; stripping syntactic constructs (such as namespace prefix definitions); calculation of similarity measures (e.g., Levenshtein string similarity); and clustering based on the similarity measures (as documented in DBPedia SPARQL Benchmark)
      • SPARQL features to test: Number of triple patterns (to exercise JOIN operations, from 1 to 25), plus the inclusion of the UNION and OPTIONAL constructs, the DISTINCT solution modifier, and the FILTER, LANG, REGEX and STR operators
      • Result was 25 SPARQL SELECT templates with different variable components (usually an IRI, a literal or a filter condition), with a goal of 1000+ different possible values per component (see the example template after this list)
    • Benchmark tests utilize variable DBPedia dataset sizes (10% to 200%) and query mixes based on the 25 templates and parameterized values
    • Performance metrics include query mixes per hour (QMpH), number and type of queries which timed out, and queries per second (QpS, calculated as a mean and geometric mean)
      • Performance varies and is reported based on the dataset size
  • FedBench (evaluates federated query)
  • Geographica (tests geospatial query)
  • GeoFedBench (tests GeoSPARQL federated query)
  • LUBM (Lehigh University Benchmark)
    • Dataset based on a "university" ontology (Universities, Professors, Students, Courses, etc.) with 43 classes and 32 properties
    • Synthetically-generated data scaled in size
      • Defined datasets range from 1 to 8000 universities, the largest one having approximately 1B triples
    • 14 fixed queries are defined, focused on instance retrieval (SELECT queries) and limited inference (based on subsumption/subclassing, owl:TransitiveProperty and owl:inverseOf)
      • Factors of importance: Proportion of the instances involved (size and selectivity); Complexity of the query; Requirement for traversal of class/property hierarchies; and Requirement for inference
      • Queries do not include language features such as OPTIONAL, UNION, DESCRIBE, etc.
    • Performance metrics include load time, query response time, answer completeness and correctness, and a combined metric (similar to F-Measure) based on completeness/correctness
  • SP2Bench (evaluates SPARQL operators and RDF access patterns)
    • Dataset based on the structure of the DBLP Computer Science Bibliography - 8 classes and 22 properties
    • Synthetic data
    • 12 different queries
    • Benchmark queries may be parameterized (unclear); the focus is on the "basic performance of the approaches (rather than caching or learning strategies of the systems)"
  • UOBM (University Ontology Benchmark, extending LUBM)
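
To make the DBPedia Benchmark's template approach (referenced above) concrete, the sketch below shows what one parameterized SELECT template might look like. The property used, the %%place%% placeholder token and the overall shape are assumptions made for illustration; this is not one of the 25 published templates.

  PREFIX dbo:  <http://dbpedia.org/ontology/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

  # %%place%% is the variable component; the benchmark driver substitutes a
  # concrete IRI drawn from the dataset (ideally from 1000+ candidate values)
  SELECT DISTINCT ?person ?label WHERE {
    ?person dbo:birthPlace %%place%% .
    OPTIONAL {
      ?person rdfs:label ?label .
      FILTER(LANG(?label) = "en")
    }
  }

A query mix is then produced by instantiating each template many times with different substituted values, so that repeated runs do not simply replay identical queries.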


In addition, several papers were informative: