User:AndreaWest/WDQS Testing

Revision as of 18:30, 30 March 2022

This page provides an overview of a design and specific suggestions for Wikidata SPARQL query testing. These tests will be useful for evaluating Blazegraph backend alternatives and, possibly, for establishing an industry-wide Wikidata SPARQL benchmark.

Goals

  • Definition of one or more data sets
  • Definition of specific INSERT, DELETE, CONSTRUCT and SELECT queries for performance and capabilities analysis
  • Definition of read/write workloads for stress testing

Testing Specific Updates and Queries

Address different query and update patterns, including a variety of SPARQL features (such as FILTER, OPTIONAL, GROUP BY, ...), federation, geospatial analysis, support for label, GAS, sampling and MediaWiki "services", and more
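
For illustration, a read query of the kind these tests might include could combine OPTIONAL, FILTER and the label service as sketched below (the items, properties and thresholds, e.g. wd:Q515 and wdt:P1082, are only examples and not the actual test queries):

    # German cities with a recorded population, optionally with coordinates
    # (standard WDQS prefixes such as wd:, wdt:, wikibase: and bd: are assumed)
    SELECT ?city ?cityLabel ?population ?coord WHERE {
      ?city wdt:P31 wd:Q515 ;                  # instance of: city
            wdt:P17 wd:Q183 ;                  # country: Germany
            wdt:P1082 ?population .            # population
      OPTIONAL { ?city wdt:P625 ?coord . }     # coordinate location, if present
      FILTER (?population > 100000)
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    ORDER BY DESC(?population)
    LIMIT 100

An update pattern could be sketched, under the same caveat, as a DELETE/INSERT pair that rewrites a single statement:

    # Replace an illustrative population value on one test item (wd:Q64, Berlin)
    DELETE { wd:Q64 wdt:P1082 ?old . }
    INSERT { wd:Q64 wdt:P1082 3769495 . }
    WHERE  { OPTIONAL { wd:Q64 wdt:P1082 ?old . } }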

Workload Testing

TBD

Background on SPARQL Benchmarks

The W3C maintains a web page on RDF Store Benchmarks. Here is background on a few of these (listed in alphabetical order) whose designs provided insights used in the work above.

  • BSBM (Berlin SPARQL Benchmark)
    • Dataset is based on an e-commerce use case with eight classes (Product, ProductType, ProductFeature, Producer, Vendor, Offer, Review, and Person) and 51 properties
    • Synthetically-generated data scaled in size based on the number of products
      • For example, a 100M triple dataset has approximately 9M instances across the various classes
      • Both an RDF and relational representation created to allow comparison of backing storage technologies
    • Benchmark utilizes a mix of 12 distinct queries (1 CONSTRUCT, 1 DESCRIBE and 10 SELECT) intended to test combinations of moderately complex queries in concurrent loads from multiple clients
      • Queries vary with respect to parameterized (but randomized) properties, differing across the various test runs
      • SPARQL language features exercised are: Filtering, 9+ graph patterns, unbound predicates, negation, OPTIONAL, LIMIT, ORDER BY, DISTINCT, REGEX and UNION
    • Performance metrics include dataset load time, query mixes per hour (QMpH), and queries per second (QpS, determined by taking the number of queries of a specific type in a test run and dividing by their total execution time)
      • All performance metrics require reporting of the size of the dataset, and the first and second metrics also should report the number of concurrent query clients
  • DBPedia Benchmark (deprecated, but the approach to query definition is informative)
    • Dataset uses one or more DBPedia resources, with the possibility of creating larger sets by changing namespace names, and of creating smaller subsets by selecting a random fraction of triples or by sampling triples across classes
      • The sampling approach attempts to preserve data characteristics of indegree/outdegree (min/max/avg number of edges into/out of a vertex)
    • Queries defined by analyzing requests made against the DBPedia SPARQL endpoint, coupled with specifying SPARQL features to test
      • The analysis process involved four steps: query selection from the SPARQL endpoint log; stripping of syntactic constructs (such as namespace prefix definitions); calculation of similarity measures (e.g., Levenshtein string similarity); and clustering based on those similarity measures (as documented in DBPedia SPARQL Benchmark)
      • SPARQL features to test: Number of triple patterns (to exercise JOIN operations, from 1 to 25), plus the inclusion of UNION and OPTIONAL constructors, the DISTINCT solution modifier, and FILTER, LANG, REGEX and STR operators
      • The result was 25 SPARQL SELECT templates with different variable components (usually an IRI, a literal or a filter condition), with a goal of 1000+ different possible values per component (a Wikidata-flavored sketch of such a template appears after this list)
    • Benchmark tests utilize variable DBPedia dataset sizes (10% to 200%) and query mixes based on the 25 templates and parameterized values
    • Performance metrics include query mixes per hour (QMpH), number and type of queries which timed out, and queries per second (QpS, calculated as a mean and geometric mean)
      • Performance is reported relative to the dataset size
  • FedBench (Evaluating federated query; a federated SERVICE query sketch appears after this list)
    • Three interlinked data collections are defined that differ in size, coverage, types of links and types of data (actual vs. synthetic)
      • First is cross-domain, holding data from DBpedia, GeoNames, Jamendo, Linked-MDB, New York Times and Semantic Web Dog Food (approximately 160M triples)
      • Second is targeted at Life Sciences, holding data from DBPedia, KEGG, DrugBank and ChEBI (approx 53M triples)
      • Last is the SP2Bench dataset (10M triples)
    • 36 fixed SELECT queries are specified in total, exercising both SPARQL language features and use-case scenarios
      • 7 cross-domain, 7 life-science, 11 SP2Bench and 11 linked-data queries
      • Cross-domain and life-science queries test "federation-specific aspects, in particular (1) number of data sources involved, (2) join complexity, (3) types of links used to join sources, and (4) varying query (and intermediate) result size"
      • SP2Bench queries are discussed below and included in FedBench to exercise SPARQL language features (only the SELECT queries are used)
      • Linked-data queries focused on basic graph patterns (e.g., conjunctive query)
    • Performance metrics are based mainly on query execution time
  • Geospatial/GeoSPARQL benchmarks
    • EuroSDR geospatial benchmark (Testing geospatial query performance and compliance)
    • GeoFedBench (Testing geospatial federated query)
    • Geographica (Testing geospatial query performance)
    • GeoSPARQL Benchmark (Evaluating compliance to the GeoSPARQL specification)
  • LUBM (Lehigh University Benchmark)
    • Dataset based on a "university" ontology (Universities, Professors, Students, Courses, etc.) with 43 classes and 32 properties
    • Synthetically-generated data scaled in size
      • Defined datasets range from 1 to 8000 universities, the largest one having approximately 1B triples
    • 14 fixed queries are defined, focused on instance retrieval (SELECT queries) and limited inference (based on subsumption/subclassing, owl:TransitiveProperty and owl:inverseOf)
      • Factors of importance: Proportion of the instances involved (size and selectivity); Complexity of the query; Requirement for traversal of class/property hierarchies; and Requirement for inference
      • Queries do not include language features such as OPTIONAL, UNION, DESCRIBE, etc.
    • Performance metrics include load time, query response time, answer completeness and correctness, and a combined metric (similar to F-Measure) based on completeness/correctness
  • SP2Bench (SPARQL Performance Benchmark)
    • Dataset based on the structure of the DBLP Computer Science Bibliography with 8 classes and 22 properties
    • Synthetic data (of different sizes) generated based on the characteristics of the underlying DBLP information
    • 12 different queries exercising SPARQL language features and JOIN operations, as well as SPARQL complexity and result size
      • ASK and SELECT queries defined that test JOINs, FILTER, UNION, OPTIONAL, DISTINCT, ORDER BY, LIMIT, OFFSET, and blank node and container processing
      • Evaluating "long path chains (i.e. nodes linked to ... other nodes via a long path), bushy patterns (i.e. single nodes that are linked to a multitude of other nodes), and combinations of these two"
    • Benchmark queries parameterized?
    • Performance metrics include the load time, success rate (reported separately for each document size, and distinguishing between success, timeout, memory issues and other errors), global and per-query performance (where the former combines the per-query results and produces both the mean and geometric mean), and memory consumption (reporting both the maximum consumption and the average across all queries)
  • UOBM (University Ontology Benchmark, very similar to but extending LUBM)
    • Two ontologies defined with different inferencing requirements (OWL Lite and OWL DL)
    • Ontology classes and properties added (69 total classes and 43 properties in the OWL DL ontology)
    • Generation of synthetic data to include links between universities' and departments' data
    • 15 fixed SELECT queries defined
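
As referenced above under the DBPedia Benchmark, its template approach could be adapted to Wikidata roughly as sketched below (the %%occupation%% placeholder convention and the chosen properties are illustrative assumptions, not part of any existing benchmark tooling):

    # Template: people with a given occupation, optionally with a date of birth
    # %%occupation%% is substituted per run from a pool of 1000+ item identifiers
    SELECT DISTINCT ?person ?birth WHERE {
      ?person wdt:P106 %%occupation%% .        # occupation
      OPTIONAL { ?person wdt:P569 ?birth . }   # date of birth, when present
      FILTER ( !BOUND(?birth) || YEAR(?birth) > 1900 )
    }
    LIMIT 1000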
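
Similarly, as noted for FedBench, federated access could be exercised from WDQS with a SERVICE clause along these lines (the remote endpoint and the join shape are illustrative only):

    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    # Join Wikidata items to an external endpoint via owl:sameAs links
    SELECT ?item ?external WHERE {
      ?item wdt:P31 wd:Q11424 .                  # instance of: film
      SERVICE <https://dbpedia.org/sparql> {     # remote endpoint named only for illustration
        ?external owl:sameAs ?item .
      }
    }
    LIMIT 10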


In addition, the following papers and the 1993 Benchmark Handbook also provided important background information:

  • Apples and Oranges: A Comparison of RDF Benchmarks and Real RDF Datasets
  • Diversified Stress Testing of RDF Data Management Systems
  • Iguana: A Generic Framework for Benchmarking the Read-Write Performance of Triple Stores (with an implementation)
  • KOBE: Cloud-native Open Benchmarking Engine for Federated Query Processors
  • A Requirements Driven Framework for Benchmarking Semantic Web Knowledge Base Systems
  • What’s Wrong with OWL Benchmarks?