User:AndreaWest/Blazegraph Features and Capabilities: Difference between revisions

From Wikitech-static

Revision as of 00:25, 26 April 2022

The following is a list of Blazegraph-specific features and capabilities used by WDQS and its community. Defining alternative implementations that minimize the user impact is of critical importance.

Overview of Blazegraph-Specific Features and Capabilities

  • SPARQL functionality extensions
    • Typically, SPARQL is extended by new datatypes and functions
      • The current Blazegraph implementation has a mix of function extensions (geof:distance, geof:globe, geof:latitude, geof:longitude and wikibase:decodeUri) and SERVICE extensions (such as wikibase:label, wikibase:around or wikibase:mwapi)
      • Whereas functions return a single value, the WDQS SERVICES provide multiple outputs
      • Each of the current functions and SERVICEs is discussed below
  • Named subqueries
    • Documentation: https://github.com/blazegraph/database/wiki/NamedSubquery
    • Note that although support for subqueries is required for SPARQL compliance, naming subqueries is not part of the SPARQL standard
    • It is likely that subqueries will not be name-able
    • However, a subquery's order of execution can be controlled through its placement in the overall SPARQL query and the use of query hints

SPARQL Functional Extensions

The current Blazegraph functional extensions require the creation of custom SPARQL functions (which are supported by all the Blazegraph alternative backends):

  • The functions, geof:globe, geof:latitude and geof:longitude, are simple decompositions of the geometry data of a POINT
    • In both Wikidata and GeoSPARQL, a geometric POINT utilizes a WKT (well-known text) representation, and is specified by a coordinate system, followed by a longitude/latitude
    • The coordinate system (also known as the spatial reference system) is either WGS84 on Earth or is identified by an item ID (within angle brackets, '<' and '>') which specifies a non-Earth planetary body
    • The geof:globe function retrieves the coordinate system; if a coordinate system is not specified, then the default is <http://www.opengis.net/def/crs/OGC/1.3/CRS84> (Earth)
      • An example of a non-Earth coordinate system is the volcano, Olympus Mons on Mars (which is located at "<http://www.wikidata.org/entity/Q111> Point(226.2 18.65)"); in this case, <http://www.wikidata.org/entity/Q111> identifies Mars as the globe/coordinate system
    • geof:latitude and geof:longitude split the POINT data with longitude specified first and latitude second
    • The wikibase:geoGlobe, wikibase:geoLatitude and wikibase:geoLongitude value properties correspond to the geof:globe, geof:latitude and geof:longitude functions
      • The properties and functions can be used to construct/deconstruct a geospatial POINT
      • For example, with ?globe, ?longitude and ?latitude bound to the wikibase:geoGlobe, wikibase:geoLongitude and wikibase:geoLatitude values of a coordinate value node, a POINT location can be defined using the following graph pattern: BIND (STRDT(CONCAT("<", STR(?globe), "> Point(", STR(?longitude), " ", STR(?latitude), ")"), <http://www.opengis.net/ont/geosparql#wktLiteral>) as ?location)
  • The function, wikibase:decodeUri, will be defined using the logic at https://github.com/wikimedia/wikidata-query-rdf/blob/master/blazegraph/src/main/java/org/wikidata/query/rdf/blazegraph/constraints/DecodeUriBOp.java
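
The decomposition performed by geof:globe, geof:latitude and geof:longitude can be sketched outside of SPARQL. The following Python is only an illustration of the logic described above, not the backend implementation; the names parse_point, Point and DEFAULT_GLOBE are hypothetical, and the sketch assumes that the literal has the form of an optional "<IRI>" prefix followed by "Point(longitude latitude)".

```python
import re
from typing import NamedTuple

# Default globe when the literal carries no coordinate system IRI (Earth, CRS84).
DEFAULT_GLOBE = "http://www.opengis.net/def/crs/OGC/1.3/CRS84"

# Optional "<IRI>" prefix, then "Point(longitude latitude)" (case-insensitive).
_NUMBER = r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"
_POINT_RE = re.compile(
    r"^\s*(?:<(?P<globe>[^>]+)>\s*)?point\s*\(\s*"
    r"(?P<lon>" + _NUMBER + r")\s+(?P<lat>" + _NUMBER + r")\s*\)\s*$",
    re.IGNORECASE,
)


class Point(NamedTuple):
    globe: str        # what geof:globe would return
    longitude: float  # what geof:longitude would return
    latitude: float   # what geof:latitude would return


def parse_point(wkt: str) -> Point:
    """Split a WKT POINT literal into globe, longitude and latitude."""
    match = _POINT_RE.match(wkt)
    if match is None:
        raise ValueError(f"not a WKT POINT literal: {wkt!r}")
    return Point(
        globe=match.group("globe") or DEFAULT_GLOBE,
        longitude=float(match.group("lon")),  # longitude comes first
        latitude=float(match.group("lat")),   # latitude comes second
    )
```

For instance, parse_point("<http://www.wikidata.org/entity/Q111> Point(226.2 18.65)") yields the Mars item as the globe, 226.2 as the longitude and 18.65 as the latitude, while a literal without a prefix falls back to the CRS84 default.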

Geospatial Support Using GeoSPARQL

The remaining Blazegraph function extension is geof:distance, which GeoSPARQL supports directly under the same name. Blazegraph's geof:distance takes two POINTs as input and returns the distance between them in kilometers. The GeoSPARQL geof:distance function also takes two POINTs, and adds a third parameter, units (which could be defaulted in the code base to kilometers). Note that GeoSPARQL 1.0 defines only a few basic units of measure (http://defs.opengis.net/vocprez/object?uri=http://www.opengis.net/def/uom/OGC/1.0/), whereas the proposed GeoSPARQL 1.1 specification indicates the use of the much broader Quantities, Units, Dimensions and Types ontology (QUDT, http://qudt.org/2.1/vocab/unit).
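
To make the role of the units parameter concrete, here is a minimal Python sketch of a distance function that defaults to kilometers. It is an assumption-laden illustration: the haversine formula, the 6371 km mean Earth radius, and the names distance_km, distance and UNITS_PER_KM are all choices made for this sketch, and neither Blazegraph nor a GeoSPARQL backend is claimed to compute distances this way.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)

# Conversion factors from kilometers; the unit names are hypothetical keys.
UNITS_PER_KM = {"kilometre": 1.0, "metre": 1000.0}


def distance_km(lon1, lat1, lon2, lat2, radius_km=EARTH_RADIUS_KM):
    """Great-circle (haversine) distance between two lon/lat pairs, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


def distance(lon1, lat1, lon2, lat2, unit="kilometre"):
    """Distance in the requested unit; kilometers when no unit is given."""
    return distance_km(lon1, lat1, lon2, lat2) * UNITS_PER_KM[unit]
```

Defaulting the unit keeps two-argument calls returning kilometers, which matches the behavior that existing queries expect, while still allowing callers to request another unit explicitly.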

Beyond the distance function, there are other valuable GeoSPARQL features and functions which could be used in Wikidata queries. These include:

  • Specification of geometries/locations beyond POINTs, such as POLYGONs (which are specified as a group of POINTs that define the geometry's boundary)
  • geof:buffer function, which conceptualizes the space around a geometry (such as a POINT), where the space is defined by a radius given by some units
  • geof:envelope function, which returns the minimal bounding box for an input geometry
    • Given a complex POLYGON, the function would return another POLYGON defining the 4 corners of the minimal bounding box
  • Specification of topology relation functions which compare two geometries and return a boolean indicating if they meet the criteria of the function:
    • geof:sfEquals, returns true if the 2 geometries are equal
    • geof:sfDisjoint, returns true if the 2 geometries are disjoint/separate (inverse of geof:sfIntersects)
    • geof:sfIntersects, returns true if any part of the first geometry overlaps with any part of the second
    • geof:sfTouches, returns true if a boundary of the first geometry comes into contact with the boundary of the second (but the interiors of the geometries do NOT intersect)
    • geof:sfCrosses, returns true if the geometries have some, but not all, interior points in common (for example, a line that passes through a polygon)
    • geof:sfWithin, returns true if the second geometry completely encloses the first
    • geof:sfContains, returns true if the first geometry completely encloses the second

Note that some of the above will be used to address the Blazegraph geospatial SERVICEs (wikibase:around and wikibase:box), as explained below.
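
The envelope and topology relation functions can be illustrated with a simplified Python sketch. It assumes planar longitude/latitude coordinates, ignores boxes that cross the antimeridian, and only handles the point-versus-box case; the names envelope, point_within_box and box_contains_point are hypothetical and do not correspond to any GeoSPARQL library API.

```python
from typing import List, Tuple

Coord = Tuple[float, float]  # (longitude, latitude)


def envelope(points: List[Coord]) -> List[Coord]:
    """Closed ring of the minimal axis-aligned bounding box of the points."""
    lons = [p[0] for p in points]
    lats = [p[1] for p in points]
    west, east = min(lons), max(lons)
    south, north = min(lats), max(lats)
    # Four corners, with the first corner repeated to close the ring.
    return [(west, south), (east, south), (east, north),
            (west, north), (west, south)]


def point_within_box(point: Coord, box_ring: List[Coord]) -> bool:
    """sfWithin-style check: the point lies strictly inside the box."""
    west, south = box_ring[0]
    east, north = box_ring[2]
    return west < point[0] < east and south < point[1] < north


def box_contains_point(box_ring: List[Coord], point: Coord) -> bool:
    """sfContains-style check: the same test with the arguments swapped."""
    return point_within_box(point, box_ring)
```

The strict inequalities reflect that a point lying exactly on the boundary is not within the box; the last function shows that sfContains is sfWithin with its arguments reversed.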

SERVICE Extensions

This section describes how the Blazegraph-specific SERVICEs (wikibase:label, wikibase:mwapi, wikibase:around, wikibase:box, gas:service, bd:sample and bd:slice) will be supported moving forward.

Problem: It is possible to implement similar SERVICE functionality at an addressable IRI (and still use federation), but the issue is providing the necessary context/query results for manipulation (for example, what items need retrieval of their labels/alt labels/descriptions). Or, if not implemented as an independent process (at an addressable IRI) but as a functional extension, the issue is how to return multiple results.

wikibase:around and wikibase:box

It is reasonable to replace the around and box SERVICEs with graph patterns that utilize the GeoSPARQL topology relation functions discussed above. This approach might be most easily explained by using examples.

Let us first examine a query using the wikibase:around SERVICE, which finds airports within 100 km of Berlin:

SELECT ?place ?location ?dist WHERE {
  wd:Q64 wdt:P625 ?berlinLoc .       # Berlin coordinates
  SERVICE wikibase:around { 
      ?place wdt:P625 ?location . 
      bd:serviceParam wikibase:center ?berlinLoc . 
      bd:serviceParam wikibase:radius "100" . 
      bd:serviceParam wikibase:distance ?dist.
  } 
  FILTER EXISTS { ?place wdt:P31/wdt:P279* wd:Q1248784 }    # Is an airport
} ORDER BY ASC(?dist)

This could be written as:

SELECT ?place ?location ?dist WHERE {
  wd:Q64 wdt:P625 ?berlinLoc .             # Berlin location
  ?place wdt:P31/wdt:P279* wd:Q1248784 ;   # Get airports
         wdt:P625 ?location .              # And their coordinates
  BIND (geof:distance(?berlinLoc, ?location, uom:metre) as ?dist) .
  FILTER (?dist <= 100000)
} ORDER BY ASC(?dist)

Alternatively, the check could be accomplished by the following query:

SELECT ?place ?location ?dist WHERE {
  { 
     SELECT ?berlinLoc ?aroundBerlinLoc WHERE {
        wd:Q64 wdt:P625 ?berlinLoc .       # Berlin location
        BIND (geof:buffer(?berlinLoc, 100000, uom:metre) as ?aroundBerlinLoc) }   # Geometry surrounding Berlin
  }
  ?place wdt:P31/wdt:P279* wd:Q1248784 ;   # Get airports
         wdt:P625 ?location .              # And their coordinates
  # Filter if the airport location is within the specified geometry
  FILTER (geof:sfWithin(?location, ?aroundBerlinLoc)) .
  BIND (geof:distance(?berlinLoc, ?location, uom:metre) as ?dist) .    # Get the actual distance after filtering
} ORDER BY ASC(?dist)

To support the wikibase:box functionality, a similar approach is taken, although geof:buffer is replaced by a custom geof:box function. For example, this query using wikibase:box finds all schools between San Jose and San Francisco, CA:

SELECT ?place ?location WHERE {
  wd:Q62 wdt:P625 ?point1 .     # San Francisco location
  wd:Q16553 wdt:P625 ?point2 .  # San Jose location
  SERVICE wikibase:box {
    ?place wdt:P625 ?location .
    bd:serviceParam wikibase:cornerWest ?point1 .
    bd:serviceParam wikibase:cornerEast ?point2 .
  }
  FILTER EXISTS { ?place wdt:P31/wdt:P279* wd:Q3914 }   # Get schools
}

It becomes:

SELECT ?place ?location WHERE {
  { 
     SELECT ?boundingBox WHERE {
        wd:Q62 wdt:P625 ?westPoint .        # San Francisco location
        wd:Q16553 wdt:P625 ?eastPoint .     # San Jose location
        BIND (geof:box(?westPoint, ?eastPoint) as ?boundingBox) }  
  }
  ?place wdt:P31/wdt:P279* wd:Q3914 ;   # Get schools
         wdt:P625 ?location .           # And their coordinates
  # Filter if the school location is within the specified bounding box
  FILTER (geof:sfWithin(?location, ?boundingBox))   
}

Note that the above proposes a new function (geof:box) that constructs a bounding polygon from two POINT locations, where the first parameter is the western-most point and the second parameter is the eastern-most point. The function simply decomposes the two POINTs into their latitudes and longitudes, and then creates a POLYGON using the SPARQL STRDT function. It could be provided for convenience. If not provided, the functionality is easily implemented using the following graph patterns:

BIND (geof:latitude(?westPoint) as ?westLat) .
BIND (geof:longitude(?westPoint) as ?westLong).
BIND (geof:latitude(?eastPoint) as ?eastLat) .
BIND (geof:longitude(?eastPoint) as ?eastLong) .
# Note that a POLYGON ring must be closed (i.e., begin and end at the same POINT)
# and that WKT wraps the ring in a second pair of parentheses: POLYGON((...))
BIND (STRDT(CONCAT("POLYGON((", STR(?westLong), " ", STR(?westLat), ", ", STR(?eastLong), " ", STR(?westLat), ", ",
                   STR(?eastLong), " ", STR(?eastLat), ", ", STR(?westLong), " ", STR(?eastLat), ", ",
                   STR(?westLong), " ", STR(?westLat), "))"),
      <http://www.opengis.net/ont/geosparql#wktLiteral>) as ?boundingBox) .
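
The same construction can be sketched in Python to show the WKT text that a geof:box function would need to produce. This is only an illustration of the string assembly: box_wkt is a hypothetical name, and the coordinates in the usage note are made-up values rather than data retrieved from Wikidata.

```python
def box_wkt(west_lon: float, west_lat: float,
            east_lon: float, east_lat: float) -> str:
    """Build a closed WKT POLYGON for the box spanned by two corner points."""
    corners = [
        (west_lon, west_lat),
        (east_lon, west_lat),
        (east_lon, east_lat),
        (west_lon, east_lat),
        (west_lon, west_lat),  # repeat the first corner to close the ring
    ]
    ring = ", ".join(f"{lon} {lat}" for lon, lat in corners)
    # WKT polygons wrap each ring in its own parentheses: POLYGON((...)).
    return f"POLYGON(({ring}))"
```

For example, box_wkt(-122.5, 37.8, -121.9, 37.3) produces a five-point ring that starts and ends at "-122.5 37.8". Each coordinate pair is written longitude first, matching the POINT convention used throughout.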

wikibase:label

xxx

wikibase:mwapi

xxx

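GAS Service (Gather Apply Scatter)

  • Documentation: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_optimization#GAS_Service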

bd:sample and bd:slice

xxx