User talk:AndreaWest/Blazegraph Features and Capabilities: Difference between revisions

Latest revision as of 21:35, 27 June 2022

wikibase:around and wikibase:box

@AndreaWest: The crucial feature about wikibase:around and wikibase:box is that Blazegraph implemented a geo-spatial index for P625 coordinate location values. (I believe based on a simple w:Z-order curve representation, reducing 2D points to 1D strings). As a result it is very fast for Blazegraph to look up items close to a specific geographical point. This is the key underlying requirement to have an effective wikibase:around and wikibase:box capability.

Using an index is very different to the first approach selected on the page (retrieve everything and then filter) -- because the set of 'everything' (eg all buildings / everything with a heritage designation / everything with a wikidata item) may be very big indeed. Without an index, such queries rapidly become prohibitive. Jheald (talk) 10:30, 25 May 2022 (UTC)
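As an illustration of the Z-order curve idea mentioned above, here is a minimal Python sketch of how a 2D coordinate can be reduced to a single sortable key. It is a generic textbook construction, not Blazegraph's actual implementation, which is not described here; the function names, the 16-bit grid resolution, and the choice to put longitude on the even bits are all assumptions made for the example.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order (Morton) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return code


def quantize(value: float, lo: float, hi: float, bits: int = 16) -> int:
    """Map a value in [lo, hi] to an integer grid cell in [0, 2**bits - 1]."""
    cells = (1 << bits) - 1
    clamped = min(max(value, lo), hi)
    return int((clamped - lo) / (hi - lo) * cells)


def z_order_key(lat: float, lon: float, bits: int = 16) -> int:
    """Hypothetical helper: turn a latitude/longitude pair into one integer key."""
    return interleave_bits(
        quantize(lon, -180.0, 180.0, bits),
        quantize(lat, -90.0, 90.0, bits),
        bits,
    )
```

Because points that are close together tend (though not always) to share the high-order bits of their keys, a box or radius lookup can be served by scanning a small number of key ranges in an ordinary one-dimensional index, instead of retrieving every coordinate and filtering afterwards.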

Excellent points! I was thinking of basic GeoSPARQL examples and not performance. Your feedback made me rethink and improve things. I updated the "User page" with information on converting the current Wikidata to be compliant with GeoSPARQL, and have updated the queries to address performance. Performance will be enhanced by geospatial indexes (which endpoints like Jena have). In addition, the queries will be simpler if GeoSPARQL query rewriting is supported. I will make sure to test these aspects in the GeoSPARQL compliance tests.
Hopefully, this addresses your concern. Andrea Westerinen (talk) 15:42, 26 May 2022 (UTC)

Named Subqueries

Named subqueries have become popular for query readability (by breaking a query into intelligible chunks), and as a usefully intuitive way to steer execution sequence inside a query (to indicate that a particular group of statements needs to be executed first). It may be possible to accommodate these with a preprocessor that replaces the INCLUDE directive with the relevant subquery text as a conventional inline subquery. (Noting that a subquery can itself INCLUDE further subqueries). This would at least allow existing queries to run, if alternate engines did not recognise the Blazegraph named subquery syntax.

They may still be less efficient than Blazegraph, however, if the alternate engine cannot recognise the same subquery being invoked for its results more than once -- eg https://w.wiki/H6b as a simple example, where some counts are calculated, and then expressed as a percentage of their total, where the counts are reused to calculate the total, rather than the total being calculated separately from scratch. (This is only a very simple example. In other cases the subquery results being reused may be rather more involved and time-consuming to determine; and their re-use may be part of a longer, more involved chain of stages). Jheald (talk) 11:16, 25 May 2022 (UTC)
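The preprocessor idea described above could be sketched roughly as follows. This is a minimal illustration in Python, assuming the usual Blazegraph form `WITH { ... } AS %name` for the declaration and `INCLUDE %name` for the reference; the function names are hypothetical, and the brace matching is deliberately naive (it would be confused by braces inside string literals or comments).

```python
import re

WITH_RE = re.compile(r"\bWITH\s*\{", re.IGNORECASE)
AS_RE = re.compile(r"\s*AS\s+(%\w+)", re.IGNORECASE)
INCLUDE_RE = re.compile(r"\bINCLUDE\s+(%\w+)", re.IGNORECASE)


def _matching_brace(text: str, open_pos: int) -> int:
    """Return the index of the '}' that closes the '{' at open_pos."""
    depth = 0
    for i in range(open_pos, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced braces")


def inline_named_subqueries(query: str) -> str:
    """Rewrite WITH {...} AS %name / INCLUDE %name into plain inline subqueries."""
    subqueries = {}
    # Step 1: cut out every WITH { ... } AS %name block and remember its body.
    while True:
        m = WITH_RE.search(query)
        if m is None:
            break
        open_pos = m.end() - 1
        close_pos = _matching_brace(query, open_pos)
        as_m = AS_RE.match(query, close_pos + 1)
        if as_m is None:
            raise ValueError("WITH block without AS %name")
        subqueries[as_m.group(1)] = query[open_pos + 1:close_pos].strip()
        query = query[:m.start()] + query[as_m.end():]
    # Step 2: replace each INCLUDE with the subquery text, repeating so that
    # subqueries which themselves INCLUDE other subqueries are expanded too.
    for _ in range(len(subqueries) + 1):
        expanded = INCLUDE_RE.sub(
            lambda m: "{ " + subqueries[m.group(1)] + " }", query
        )
        if expanded == query:
            return query
        query = expanded
    raise ValueError("cyclic INCLUDE detected")
```

Note that when a named subquery is INCLUDEd twice, this rewrite simply pastes its text in twice, so whether the work is done once or twice then depends entirely on the target engine's optimiser, which is the efficiency concern raised above.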

I agree that named subqueries improve readability, but this is not a SPARQL-compliant feature and equivalent functionality is easily achieved (even for sub-queries in sub-queries). Yes, it MAY involve some cut and paste ugliness. I will add an example of your H6b query to the user page.
Please send me an example of a named subquery that you consider "more involved", so that I can validate that it can be accomplished without naming but maybe with shortcuts. Thanks! Andrea Westerinen (talk) 15:48, 26 May 2022 (UTC)
Good to see you on the chat this afternoon. Here are a couple more examples of queries that are using named subqueries, for something other than just execution sequence control: https://w.wiki/5HFo and tinyurl.com/35frc9cp
The first query tries to find manorial estates where we have an external ID for one or more estates in a particular location, but are also missing the external ID for one or more other estates in the same location (which therefore might be expected to occur elsewhere in the same book-chapter, perhaps under a slightly variant name). The query thus helps with hunting down IDs for this group. Using named subqueries allows the logic to be set out straightforwardly; and also the set of manor estates to be extracted only once, rather than separately for the group with IDs and the group without. (Though as at the moment there are fewer d:P3268 IDs than manors, it could be more efficient to search on this first, limit to manors, identify locations, find the whole set of manors in that location, filter for manors without an ID, and require that for each location returned there is at least one manor without an ID. But it would be quite a lot more involved to write and to follow the logic of. And the number of P3268 IDs will get larger).
The second query tries to find examples of classes that are in the d:P279 tree of both d:Q223557 and d:Q7184903, returning only the highest classes up the trees where this intersection occurs (ie not a class that has a parent class for which this is true). Being able to re-use the already-computed intersection set makes this last condition easy and efficient to implement, without having to trace up each subclass tree again. Jheald (talk) 21:06, 27 June 2022 (UTC)
One other thing I wonder about, after this afternoon's meeting. I think there is at least one of the candidates which does try to identify repeated subqueries as an optimisation -- so may already have some capability like this. And others might be able to add it without too much difficulty. When it came to the original Blazegraph selection, Wikidata was an important enough account for Blazegraph to secure that the BG developers then prioritised things that Wikidata needed, and worked closely to deliver them (in particular adding a 2D geospatial index to power the Box and Around services). (This may well have paid off for them, if it was the successful demonstration of the deployment of Blazegraph by Wikidata that piqued Amazon's interest in acquiring Blazegraph).
So I am wondering this time round, is there any possibility that being chosen by Wikidata might be a sufficiently tempting prospect for one or more of the candidates' development teams that they might consider it worthwhile putting some of their own development effort into things that might ease the transition -- eg by implementing a named subquery extension, or a Blazegraph 'compatibility' mode or pre-processor, or a geospatial z-curve index, or some other of the pinch points? Or would imagining anything like that just be wishful thinking? Jheald (talk) 21:35, 27 June 2022 (UTC)