User:AndreaWest/Blazegraph Features and Capabilities: Difference between revisions

Revision as of 18:23, 28 April 2022

The following is a list of Blazegraph-specific features and capabilities used by WDQS and its community. Defining alternative implementations that minimize the user impact is of critical importance.

Overview of Blazegraph-Specific Features and Capabilities

  • SPARQL functionality extensions
    • Typically, SPARQL is extended by new datatypes and functions
      • The current Blazegraph implementation has a mix of function extensions (geof:distance, geof:globe, geof:latitude, geof:longitude and wikibase:decodeUri) and SERVICE extensions (such as wikibase:label, wikibase:around or wikibase:mwapi)
      • Whereas functions return a single value, the WDQS SERVICES provide multiple outputs
      • Each of the current datatypes/functions and SERVICEs is discussed below
  • Named subqueries
    • Documentation
    • Note that although support for subqueries is required for SPARQL compliance, naming is not a compliant feature
    • It is likely that subqueries will not be name-able
    • Based on the placement of the subquery in the overall SPARQL and the use of query hints, a subquery's order of execution can be controlled

SPARQL Functional Extensions

The current Blazegraph functional extensions require the creation of custom SPARQL functions (which are supported by all the Blazegraph alternative backends):

  • The functions, geof:globe, geof:latitude and geof:longitude, are simple decompositions of the geometry data of a POINT
    • In both Wikidata and GeoSPARQL, a geometric POINT utilizes a WKT (well-known text) representation, and is specified by a coordinate system, followed by a longitude/latitude
    • The coordinate system (also known as the spatial reference system) is defined either by WGS84 on Earth or identified by an item ID (within angle brackets, '<' and '>') which specifies a non-Earth/planetary body
    • The geof:globe function retrieves the coordinate system; if a coordinate system is not specified, then the default is <http://www.opengis.net/def/crs/OGC/1.3/CRS84> (Earth)
    • geof:latitude and geof:longitude split the POINT data with longitude specified first and latitude second
    • The wikibase:geoGlobe, wikibase:geoLatitude and wikibase:geoLongitude value properties correspond to the geof:globe, geof:latitude and geof:longitude functions
      • The properties and functions can be used to construct/destruct a geospatial POINT
      • For example, assuming ?globe, ?long and ?lat have been bound from the wikibase:geoGlobe, wikibase:geoLongitude and wikibase:geoLatitude value properties, a POINT location can be constructed using the following graph pattern: BIND (STRDT(CONCAT("<", STR(?globe), "> POINT(", STR(?long), " ", STR(?lat), ")"), <http://www.opengis.net/ont/geosparql#wktLiteral>) as ?location)
  • The function, wikibase:decodeUri, will be defined using the logic at https://github.com/wikimedia/wikidata-query-rdf/blob/master/blazegraph/src/main/java/org/wikidata/query/rdf/blazegraph/constraints/DecodeUriBOp.java
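
As an illustration of the decomposition performed by geof:globe, geof:latitude and geof:longitude, the following is a minimal Python sketch. It is not the Blazegraph implementation: the function name decompose_point and the regular expression are assumptions made for this example, and only plain decimal coordinates are handled.

```python
import re

# Default coordinate system (Earth), as stated in the text above.
DEFAULT_GLOBE = "http://www.opengis.net/def/crs/OGC/1.3/CRS84"

# Optional "<coordinate-system> " prefix, then POINT(longitude latitude).
# Hypothetical pattern for this sketch; exponent notation is not handled.
_POINT_RE = re.compile(
    r"^\s*(?:<(?P<globe>[^>]+)>\s*)?"
    r"point\s*\(\s*(?P<lon>[-+]?\d+(?:\.\d+)?)\s+(?P<lat>[-+]?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE,
)


def decompose_point(wkt):
    """Return (globe, latitude, longitude) for a WKT POINT string."""
    match = _POINT_RE.match(wkt)
    if match is None:
        raise ValueError(f"not a WKT POINT: {wkt!r}")
    globe = match.group("globe") or DEFAULT_GLOBE
    # Longitude is listed first in the WKT text, latitude second.
    return globe, float(match.group("lat")), float(match.group("lon"))
```

For example, decompose_point("Point(13.383333 52.516667)") yields the default Earth coordinate system, latitude 52.516667 and longitude 13.383333.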

Geospatial Support Using GeoSPARQL

The last Blazegraph function extension is geof:distance. It is supported directly by GeoSPARQL, under the same name. Blazegraph's geof:distance takes two POINTs as input and returns the distance between them in kilometers. The GeoSPARQL function, geof:distance, also supports the input of two POINTs and adds a third parameter, units (which could be defaulted in the code base to kilometers). Note that GeoSPARQL 1.0 defines only a few basic units of measure. However, the proposed GeoSPARQL 1.1 specification indicates the use of the Quantities, Units, Dimensions and Types ontology (QUDT), which is much broader.
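
To make the "two POINTs plus a defaulted unit" signature concrete, here is a rough Python sketch of a distance function with kilometers as the default unit. The haversine formula, the spherical Earth radius of 6371 km and the unit names are assumptions chosen for illustration; the source does not state which formula Blazegraph or a GeoSPARQL backend actually uses.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical model)

# Hypothetical unit table for this sketch; a backend would map unit IRIs instead.
UNIT_FACTORS = {"kilometre": 1.0, "metre": 1000.0}


def distance(lat1, lon1, lat2, lon2, unit="kilometre"):
    """Great-circle distance between two points, in the requested unit."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    km = 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
    return km * UNIT_FACTORS[unit]
```

Calling distance(lat1, lon1, lat2, lon2) without a unit mirrors the current two-argument behavior, while passing unit="metre" corresponds to supplying the third GeoSPARQL parameter.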

Beyond the distance function, there are other valuable GeoSPARQL features and functions which could be used in Wikidata queries. These include:

  • Specification of geometries/locations beyond POINTs, such as POLYGONs (which are specified as a group of POINTs that define the geometry's boundary)
  • geof:buffer function, which conceptualizes the space around a geometry (such as a POINT), where the space is defined by a radius given by some units
  • geof:envelope function, which returns the minimal bounding box for an input geometry
    • Given a complex POLYGON, the function would return another POLYGON defining the 4 corners of the minimal bounding box
  • Specification of topology relation functions which compare two geometries and return a boolean indicating if they meet the criteria of the function:
    • geof:sfEquals, returns true if the 2 geometries are equal
    • geof:sfDisjoint, returns true if the 2 geometries are disjoint/separate (inverse of geof:sfIntersects)
    • geof:sfIntersects, returns true if any part of the first geometry overlaps with any part of the second
    • geof:sfTouches, returns true if a boundary of the first geometry comes into contact with the boundary of the second (but the interiors of the geometries do NOT intersect)
    • geof:sfCrosses, returns true if the interior of the first geometry comes into contact with the interior or boundary of the second
    • geof:sfWithin, returns true if the second geometry completely encloses the first
    • geof:sfContains, returns true if the first geometry completely encloses the second

Note that some of the above will be used to address the Blazegraph geospatial SERVICEs (wikibase:around and wikibase:box), as explained below.

SERVICE Extensions

This section describes how the WDQS- and Blazegraph-specific SERVICEs (wikibase:label, wikibase:mwapi, wikibase:around, wikibase:box, gas:service, bd:sample and bd:slice) could be supported moving forward.

The geospatial SERVICES, wikibase:around and :box, can (and should) be provided by the use of GeoSPARQL. Details and examples are discussed below.

Unfortunately, there is no straightforward, functional approach for supporting wikibase:mwapi, the GAS service and bd:sample. The problem is that these SERVICEs return multiple (possibly many) results, and some execute based on complex parameters that are defined using unique triple patterns. That combination of requirements does not translate into the standard SPARQL function extensions, which take a set of predefined parameters and return a single result. In order to support these SERVICEs, modifications to the backend code bases will be required - to distinguish local SERVICE IRIs from HTTP federated requests, and then invoke appropriate "handlers".

Note that this discussion did not reference the bd:slice and wikibase:label SERVICEs. bd:slice functionality can be provided by a judicious use of sub-queries. On the other hand, label details can be provided using a SPARQL function extension, although that function will be less convenient than (but with equivalent capabilities to) the existing SERVICE approach. The inconvenience will be due to the need to repeat language preferences. The alternatives for bd:slice and wikibase:label are described in more detail below.

wikibase:around and wikibase:box

It is reasonable to replace the wikibase:around and :box SERVICEs with graph patterns that utilize the GeoSPARQL topology relation functions discussed above. This approach might be most easily explained by using examples.

Let us first examine a query using the wikibase:around SERVICE, which finds airports within 100km of Berlin:

SELECT ?place ?location ?dist WHERE {
  wd:Q64 wdt:P625 ?berlinLoc .       # Berlin coordinates
  SERVICE wikibase:around { 
      ?place wdt:P625 ?location . 
      bd:serviceParam wikibase:center ?berlinLoc . 
      bd:serviceParam wikibase:radius "100" . 
      bd:serviceParam wikibase:distance ?dist.
  } 
  FILTER EXISTS { ?place wdt:P31/wdt:P279* wd:Q1248784 }    # Is an airport
} ORDER BY ASC(?dist)

This could be written as:

prefix uom: <http://www.opengis.net/def/uom/OGC/1.0/>
SELECT ?place ?location ?dist WHERE {
  wd:Q64 wdt:P625 ?berlinLoc .             # Berlin location
  ?place wdt:P31/wdt:P279* wd:Q1248784 ;   # Get airports
         wdt:P625 ?location .              # And their coordinates
  BIND (geof:distance(?berlinLoc, ?location, uom:metre) as ?dist) .
  FILTER (?dist <= 100000)
} ORDER BY ASC(?dist)

Alternately, the check could be accomplished by the following query:

prefix uom: <http://www.opengis.net/def/uom/OGC/1.0/>
SELECT ?place ?location ?dist WHERE {
  { 
     SELECT ?berlinLoc ?aroundBerlinLoc WHERE {
        wd:Q64 wdt:P625 ?berlinLoc .       # Berlin location
        BIND (geof:buffer(?berlinLoc, 100000, uom:metre) as ?aroundBerlinLoc) }   # Geometry surrounding Berlin
  }
  ?place wdt:P31/wdt:P279* wd:Q1248784 ;   # Get airports
         wdt:P625 ?location .              # And their coordinates
  # Filter if the airport location is within the specified geometry
  FILTER (geof:sfWithin(?location, ?aroundBerlinLoc)) .
  BIND (geof:distance(?berlinLoc, ?location, uom:metre) as ?dist) .    # Get the actual distance after filtering
} ORDER BY ASC(?dist)

In order to support the wikibase:box functionality, a similar approach is taken - although geof:buffer is replaced by a custom geof:box function. For example, this query using wikibase:box finds all schools between San Jose and San Francisco CA:

SELECT ?place ?location WHERE {
  wd:Q62 wdt:P625 ?point1 .     # San Francisco location
  wd:Q16553 wdt:P625 ?point2 .  # San Jose location
  SERVICE wikibase:box {
    ?place wdt:P625 ?location .
    bd:serviceParam wikibase:cornerWest ?point1 .
    bd:serviceParam wikibase:cornerEast ?point2 .
  }
  FILTER EXISTS { ?place wdt:P31/wdt:P279* wd:Q3914 }   # Get schools
}

It becomes:

prefix uom: <http://www.opengis.net/def/uom/OGC/1.0/>
SELECT ?place ?location WHERE {
  { 
     SELECT ?boundingBox WHERE {
        wd:Q62 wdt:P625 ?westPoint .        # San Francisco location
        wd:Q16553 wdt:P625 ?eastPoint .     # San Jose location
        BIND (geof:box(?westPoint, ?eastPoint) as ?boundingBox) }  
  }
  ?place wdt:P31/wdt:P279* wd:Q3914 ;   # Get schools
         wdt:P625 ?location .           # And their coordinates
  # Filter if the school location is within the specified bounding box
  FILTER (geof:sfWithin(?location, ?boundingBox))   
}

Note that the above proposes a new function (geof:box) that constructs a bounding polygon based on two POINT locations, where the first parameter is the western-most point and the second parameter is the eastern-most point. The function simply decomposes the two POINTs into their latitudes and longitudes, and then creates a POLYGON using the SPARQL STRDT function. It could be provided for convenience. If it is not provided, the same functionality can be implemented using the following graph patterns:

BIND (geof:latitude(?westPoint) as ?westLat) .
BIND (geof:longitude(?westPoint) as ?westLong).
BIND (geof:latitude(?eastPoint) as ?eastLat) .
BIND (geof:longitude(?eastPoint) as ?eastLong) .
# Note that a POLYGON must be closed (i.e., begin and end at the same POINT) and that WKT wraps the ring in double parentheses
BIND (CONCAT("POLYGON((", STR(?westLong), " ", STR(?westLat), ", ", STR(?eastLong), " ", STR(?westLat), ", ",
             STR(?eastLong), " ", STR(?eastLat), ", ", STR(?westLong), " ", STR(?eastLat), ", ",
             STR(?westLong), " ", STR(?westLat), "))") as ?polygonString ) .
BIND (STRDT(?polygonString, geo:wktLiteral) as ?boundingBox) .
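
The logic that a built-in geof:box function would need is small. The Python sketch below is one possible rendering, under the assumption that each corner is supplied as a (longitude, latitude) pair; the names box_wkt and within_box are hypothetical and only illustrate the construction and the containment test. It emits the WKT polygon with the ring wrapped in two sets of parentheses, as WKT requires.

```python
def box_wkt(west, east):
    """Build a closed WKT POLYGON from a western and an eastern corner.

    Each corner is a (longitude, latitude) tuple.
    """
    west_lon, west_lat = west
    east_lon, east_lat = east
    # The ring starts and ends at the same corner so that it is closed.
    ring = [
        (west_lon, west_lat),
        (east_lon, west_lat),
        (east_lon, east_lat),
        (west_lon, east_lat),
        (west_lon, west_lat),
    ]
    coords = ", ".join(f"{lon} {lat}" for lon, lat in ring)
    return f"POLYGON(({coords}))"


def within_box(point, west, east):
    """Simplified stand-in for geof:sfWithin against the bounding box."""
    lon, lat = point
    lo_lat, hi_lat = sorted((west[1], east[1]))
    return west[0] <= lon <= east[0] and lo_lat <= lat <= hi_lat
```

This sketch ignores boxes that cross the antimeridian, which a production function would have to handle.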

wikibase:label

The wikibase:label SERVICE provides an easy means to retrieve rdfs:label, skos:altLabel and schema:description values for an entity. Its main uses are to simplify the SPARQL query and to provide language preferences for the text that is returned. The latter is the more significant aspect of the SERVICE and is the main focus of the functions defined here.

The label SERVICE could be implemented as three new SPARQL functions, each returning a string literal:

string literal wikibase:label (variable var, "string_of_language_codes")
string literal wikibase:altLabel (variable var, "string_of_language_codes")
string literal wikibase:description (variable var, "string_of_language_codes")

These functions would be used in BIND statements to associate specific variable names to the returned texts, which would then be referenced in the query's SELECT clause or used later in the query, for example in a FILTER statement.

As an example of the use of the current label SERVICE, the following query lists the US presidents and their spouses:

SELECT ?p ?pLabel ?w ?wLabel WHERE {
   wd:Q30 p:P6/ps:P6 ?p .
   ?p wdt:P26 ?w .
   SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
   }
}

This would be (re)written as:

SELECT ?p ?pLabel ?w ?wLabel WHERE {
   wd:Q30 p:P6/ps:P6 ?p .
   ?p wdt:P26 ?w .
   BIND (wikibase:label(?p, "en") as ?pLabel)
   BIND (wikibase:label(?w, "en") as ?wLabel)
}

As another example, consider this query which uses the manual mode of the label SERVICE:

SELECT * WHERE {
     SERVICE wikibase:label {
       bd:serviceParam wikibase:language "fr,de,en" .
       wd:Q123 rdfs:label ?q123Label .
       wd:Q123 skos:altLabel ?q123Alt .
       wd:Q123 schema:description ?q123Desc .
       wd:Q321 rdfs:label ?q321Label .
    }
}

This would be written as:

SELECT * WHERE {
     BIND (wikibase:label(wd:Q123, "fr,de,en") as ?q123Label) .
     BIND (wikibase:altLabel(wd:Q123, "fr,de,en") as ?q123Alt) .
     BIND (wikibase:description(wd:Q123, "fr,de,en") as ?q123Desc) .
     BIND (wikibase:label(wd:Q321, "fr,de,en") as ?q321Label) .
}

The downside of this approach is the need to repeat the language preferences in each function call.
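
The language-preference handling at the core of these functions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the name pick_label is hypothetical, the labels are passed in as a plain dictionary keyed by language code, special values such as [AUTO_LANGUAGE] are not handled, and the sketch returns None when nothing matches (what a real function should return in that case is left open here).

```python
from functools import partial


def pick_label(labels, languages):
    """Return the first label whose language appears in the preference string.

    labels: dict mapping language code to text, e.g. {"en": "Berlin"}.
    languages: comma-separated language codes in order of preference.
    """
    for code in languages.split(","):
        code = code.strip()
        if code in labels:
            return labels[code]
    return None  # no label in any preferred language


# Binding the preferences once avoids repeating them at every call site.
label_fr_de_en = partial(pick_label, languages="fr,de,en")
```

The partial application on the last line hints at one way a client library could hide the repetition, even though the SPARQL function itself would still receive the language string on each call.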

bd:slice

The functionality of bd:slice is discussed in the code, Slice Service Factory documentation. In its simplest form, it provides a means to get a subset of results. However, the same functionality can be provided by using a sub-query with a limit/offset.

Let's illustrate this with an example. The query below returned 3743 results in 37074 ms. (The query without bd:slice, with the WHERE clause, "?item wdt:P31 wd:Q13442814. MINUS {?item wdt:P577 ?date}", timed out.)

# Work-around for query for scholarly articles with no date of publication (which times out without bd:slice)
SELECT ?item ?itemLabel 
WHERE 
{
  SERVICE bd:slice {
    ?item wdt:P31 wd:Q13442814.
    bd:serviceParam bd:slice.limit 1000000   # 1M items returned
  }
  minus {
    ?item wdt:P577 ?date.
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

The same functionality can be achieved by using a sub-query, as follows:

SELECT ?item ?itemLabel 
WHERE 
{
  { 
    SELECT ?item WHERE { ?item wdt:P31 wd:Q13442814 } LIMIT 1000000
  }
  minus {
    ?item wdt:P577 ?date.
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Running this query returned 3743 results but took 49671 ms.

The other use of bd:slice is to return a count (using the bd:slice.range predicate). There were no queries in the February 2022 set that used this predicate, and one query in March (shown below). It returned 1 result (the count of triples = 151295) in 111 ms.

SELECT ?range WHERE
{
  SERVICE bd:slice
  {
    ?item wdt:P6039 ?o .
    bd:serviceParam bd:slice.range ?range .
  }
}

This query can be rewritten using a simple SPARQL COUNT feature. It also returns 1 result (151295) but in 205 ms.

SELECT (COUNT(*) as ?range) WHERE
{ 
  ?item wdt:P6039 ?o .  
}

Note that the timings above do vary based on caching of results.
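
Conceptually, both bd:slice and the LIMIT/OFFSET sub-query cap the number of solutions before the rest of the query (here, the MINUS) is evaluated. The Python sketch below models that idea over a lazy stream of solutions; the function name slice_solutions and the toy data are assumptions for illustration and say nothing about how either backend evaluates the query internally.

```python
from itertools import count, islice


def slice_solutions(solutions, offset=0, limit=None):
    """Lazily yield at most `limit` solutions after skipping `offset`."""
    stop = None if limit is None else offset + limit
    return islice(solutions, offset, stop)


# Toy stand-ins: an unbounded stream of items and the set of items with a date.
items = count(1)
items_with_date = {2, 4}

# Slice first, then apply the MINUS-like filter to the bounded subset only.
undated = [i for i in slice_solutions(items, limit=5) if i not in items_with_date]
```

Because the slice is applied first, the filter only ever sees the first five items, even though the underlying stream is unbounded.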

wikibase:mwapi, gas:service and bd:sample

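The remainder of the Blazegraph SERVICEs are each described on the following pages:

  • MediaWiki API (https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#MediaWiki_API), wikibase:mwapi
  • GAS (gather, apply, scatter) (https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_optimization#GAS_Service), gas:service
  • Sample Service Factory documentation (https://github.com/blazegraph/database/blob/3127706f0b6504838daae226b9158840d2df1744/bigdata-core/bigdata-rdf/src/java/com/bigdata/rdf/sparql/ast/eval/SampleServiceFactory.java), bd:sample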

In order to provide similar functionality, each of the backend code bases would have to be modified to distinguish a SERVICE invocation addressed to a local IRI (e.g., with the prefix, "urn:", "wikibase:" or similar) and an actual, external HTTP endpoint. That checking could occur:

  1. When the SPARQL is being parsed (its algebra/semantics are being defined)
  2. While iterating through/executing the component clauses of the query
  3. By modifying the SERVICE processing itself

The latter two options are likely preferable - since the backend infrastructure would already account for variable bindings and combining results into the final solution.

When executing a local IRI/SERVICE, it is most logical to check a registry of possible "handlers" and then invoke the appropriate code or return an error. The graph patterns of the SERVICE clause and current variable bindings would be passed to the "handler" code, as is done for all SERVICEs. Results would have to be returned consistent with the SPARQL 1.1 Federated Query specification, meaning that they would be an array of variable-RDF term bindings.
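
The registry-and-dispatch idea can be sketched as follows. This is a minimal Python illustration, not code from any of the candidate backends: the names HANDLERS, register and execute_service, the "urn:example:echo" IRI and the remote_call callback are all hypothetical. The registry is consulted first, since a prefixed name such as wikibase:mwapi may itself expand to an http IRI.

```python
HANDLERS = {}  # maps a local SERVICE IRI to its handler function


def register(service_iri):
    """Decorator that records a handler for a local SERVICE IRI."""
    def decorator(handler):
        HANDLERS[service_iri] = handler
        return handler
    return decorator


def execute_service(service_iri, patterns, bindings, remote_call):
    """Dispatch a SERVICE clause to a local handler or a federated endpoint.

    patterns: the graph patterns of the SERVICE clause (opaque here).
    bindings: list of dicts mapping variable names to RDF terms.
    remote_call: callback used for ordinary HTTP federation.
    """
    handler = HANDLERS.get(service_iri)
    if handler is not None:
        return handler(patterns, bindings)
    if service_iri.startswith(("http://", "https://")):
        return remote_call(service_iri, patterns, bindings)
    raise LookupError(f"no handler registered for SERVICE <{service_iri}>")


@register("urn:example:echo")
def echo_handler(patterns, bindings):
    # Returns one solution per incoming binding, extended with the patterns.
    return [dict(binding, pattern=patterns) for binding in bindings]
```

Each handler returns a list of variable-to-term bindings, matching the result shape described above, so the calling code can join local and federated results in the same way.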

It is likely that the current SERVICE implementations would need to be adapted to the design points of the specific backends, but the majority of the processing logic should be able to be reused.

Note that one of the backend alternatives (Apache Jena) already has hooks for providing custom SERVICEs. This implementation takes the approach of invoking the custom SERVICE while iterating through the query processing (option 2, above). Unfortunately, at the time of writing (late April 2022), there is no documentation related to its use. There is, however, a simple test scenario defined.

Frequency of Use of the Blazegraph SERVICE Extensions

Modifying the existing implementations to support local SERVICE extensions and adjusting the logic of those extensions to execute in the particular backend environment may be costly and/or introduce errors. In addition, the Wikidata documentation related to the Blazegraph-specific extensions (gas:service and bd:sample) states that support may be discontinued at some time in the future. Making such a call will require discussion with the community.

To inform discussion, the following shows the usage statistics of these extensions (across all queries issued in February 2022):

Usage of Custom SERVICE Extensions

SERVICE           Percentage of queries
wikibase:mwapi    11.88%
gas:service       0.024%
bd:sample         0.002%

As an aside, the bd:slice SERVICE (discussed above) is used in 0.04% of the February 2022 queries (significantly higher than gas:service or bd:sample).

Of these extensions, the MWAPI SERVICE is the most critical to support.