
Tool:Wikibase Unicorn

Revision as of 00:21, 22 September 2020 by imported>Ebernhardson
Toolforge tools
Website: https://wikibase-unicorn.toolforge.org/
Description: Unicorn graph query language for wikibase
Keywords: graph, search, wikibase, wikidata
Maintainer(s): Ebernhardson
License: MIT License

Wikibase Unicorn is a minimal implementation of the Unicorn graph query language (pdf). It provides graph search over Wikibase-enabled wikis by performing recursive queries against the CloudElastic replicas.

It is nowhere near as expressive as SPARQL, but it is significantly easier to read and write. It also scales horizontally: more servers allow a bigger graph.

Examples

Hospital owners:

(extract P127=
         (or P31=Q16917
             (apply P31= P279=Q16917)))
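
To make the structure of such a query concrete, here is a minimal sketch of reading the S-expression into a nested Python list. It is a generic S-expression reader written for illustration; the function names `tokenize` and `parse` are made up here, and the tool's real parser and grammar are not described on this page and may differ.

```python
from typing import List, Union

# A parsed query is either a bare token such as "P31=Q16917"
# or a list whose first element is the operator name.
Node = Union[str, List["Node"]]


def tokenize(text: str) -> List[str]:
    """Split the query into parentheses and whitespace-separated tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(text: str) -> Node:
    """Parse exactly one S-expression and return it as nested lists."""
    tokens = tokenize(text)
    pos = 0

    def read() -> Node:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[pos]
        pos += 1
        if tok == ")":
            raise ValueError("unexpected ')'")
        if tok != "(":
            return tok  # bare token, e.g. "P127=" or "P31=Q16917"
        items: List[Node] = []
        while True:
            if pos >= len(tokens):
                raise ValueError("missing ')'")
            if tokens[pos] == ")":
                pos += 1
                return items
            items.append(read())

    result = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after query")
    return result


if __name__ == "__main__":
    query = """
    (extract P127=
             (or P31=Q16917
                 (apply P31= P279=Q16917)))
    """
    print(parse(query))
```

The nested list mirrors the nesting of the query: the outer `extract` wraps an `or`, which in turn wraps a plain term and an `apply`.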

Approximate SPARQL equivalent:

SELECT ?owner ?ownerLabel ?sitelinks
WHERE 
{
  {
    SELECT ?owner (count(distinct ?sitelink) as ?sitelinks)
    WHERE {
      ?hospital wdt:P31/wdt:P279* wd:Q16917 .
      ?hospital wdt:P127 ?owner .
      ?sitelink schema:about ?owner
    }
    GROUP BY ?owner
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY DESC(?sitelinks)

Implemented Operators

Operator    Example S-expression                  Description
term        (term P31=Q16917)                     Instances of hospitals
and         (and P31=Q16917 P31=Q1774898)         Instances of hospitals and clinics
or          (or P31=Q16917 P31=Q1774898)          Instances of hospitals or clinics
difference  (difference P31=Q16917 P31=Q1774898)  Instances of hospitals that are not clinics
apply       (apply P31= P279=Q16917)              Instances of subclasses of hospital
extract     (extract P127= P31=Q16917)            Owners of instances of hospitals

Instance-of is P31, subclass-of is P279, owned-by is P127. Hospital is Q16917, clinic is Q1774898.
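
The operator descriptions can be read as set operations over items and their edges. The following is a rough sketch of those semantics over a tiny in-memory graph, not the tool's implementation (which runs against Elasticsearch). The item ids `Q_SUB`, `Q_HOSP_A`, `Q_HOSP_B`, `Q_BOTH`, `Q_OWNER_1` and `Q_OWNER_2` are invented for the example, and the interpretation of `apply` and `extract` is inferred from the descriptions in the table.

```python
from typing import Dict, Set, Tuple, Union

Query = Union[str, Tuple]

# Hypothetical toy graph: item id -> set of edges in "P..=Q.." form.
# All item ids below are made up for illustration.
GRAPH: Dict[str, Set[str]] = {
    "Q_SUB": {"P279=Q16917"},                     # a subclass of hospital
    "Q_HOSP_A": {"P31=Q16917", "P127=Q_OWNER_1"},  # a hospital with an owner
    "Q_HOSP_B": {"P31=Q_SUB", "P127=Q_OWNER_2"},   # instance of the subclass
    "Q_BOTH": {"P31=Q16917", "P31=Q1774898"},      # hospital and clinic
}


def evaluate(query: Query, graph: Dict[str, Set[str]]) -> Set[str]:
    """Return the set of item ids matched by a query given as nested tuples."""
    if isinstance(query, str):
        # A bare edge such as "P31=Q16917" matches items carrying that edge.
        return {item for item, edges in graph.items() if query in edges}

    op, *args = query
    if op == "term":
        return evaluate(args[0], graph)
    if op in ("and", "or", "difference"):
        results = [evaluate(arg, graph) for arg in args]
        if op == "and":
            return set.intersection(*results)
        if op == "or":
            return set().union(*results)
        return results[0] - set().union(*results[1:])
    if op == "apply":
        # Run the inner query, then look up prefix + id for every result.
        prefix, inner = args
        found: Set[str] = set()
        for item in evaluate(inner, graph):
            found |= evaluate(prefix + item, graph)
        return found
    if op == "extract":
        # Run the inner query, then collect the targets of matching edges.
        prefix, inner = args
        targets: Set[str] = set()
        for item in evaluate(inner, graph):
            for edge in graph.get(item, set()):
                if edge.startswith(prefix):
                    targets.add(edge[len(prefix):])
        return targets
    raise ValueError(f"unknown operator: {op}")


if __name__ == "__main__":
    owners = ("extract", "P127=",
              ("or", "P31=Q16917", ("apply", "P31=", "P279=Q16917")))
    print(sorted(evaluate(owners, GRAPH)))
```

In this reading, `apply` moves from a result set to the items that point at it, while `extract` moves from a result set to the items it points at.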

How does it work?

For Wikibase-enabled wikis, CirrusSearch maintains a field called statement_keywords, which contains a filtered set of the graph edges, each in the form P1=Q1. The provided Unicorn query is transformed into equivalent Elasticsearch queries, and edges in the graph are followed by performing sequential Elasticsearch queries.

Because there are execution boundaries between stages, and Elasticsearch can only accept 1024 conditions in a single search request, intermediate results may be truncated. Wikibase Unicorn can therefore only guarantee complete results when the truncated metric reported after all search results is zero. Results are truncated based on the number of sitelinks: the pages with the fewest sitelinks are removed. Per the linked paper, truncation is typically not an issue for user-facing (top N) queries as long as inner-query sorting is doing a good job. Inner-query sorting in this implementation is likely sub-par (it is also by sitelink_count).
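
The staged execution and truncation described above can be sketched as follows. This is an illustration under assumptions, not the tool's code: the helper names `truncate` and `apply_stage_query` are invented, the request bodies are only built and never sent, and the real tool may shape its Elasticsearch requests differently. The field names statement_keywords and sitelink_count and the 1024 limit come from the text; the `bool`/`should`/`term` structure is standard Elasticsearch query DSL.

```python
from typing import Dict, List, Tuple

MAX_CLAUSES = 1024  # per-request condition limit cited in the text


def truncate(hits: List[Tuple[str, int]],
             limit: int = MAX_CLAUSES) -> Tuple[List[str], int]:
    """Keep the ids with the most sitelinks.

    hits is a list of (page id, sitelink count) pairs from one stage.
    Returns the kept ids and the number of ids that were dropped.
    """
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    kept = [page_id for page_id, _ in ranked[:limit]]
    return kept, max(0, len(ranked) - limit)


def apply_stage_query(prefix: str, ids: List[str]) -> Dict:
    """Build the request body for the next stage of an apply.

    Each id from the previous stage becomes one term condition,
    e.g. prefix "P31=" and id "Q16917" become "P31=Q16917".
    """
    clauses = [{"term": {"statement_keywords": prefix + page_id}}
               for page_id in ids]
    return {
        "query": {"bool": {"should": clauses, "minimum_should_match": 1}},
        "sort": [{"sitelink_count": "desc"}],
    }


if __name__ == "__main__":
    # Pretend the inner query of (apply P31= P279=Q16917) returned 1030
    # subclasses; the ids and sitelink counts are fabricated placeholders.
    inner_hits = [(f"Q{n}", n) for n in range(1030)]
    ids, truncated = truncate(inner_hits)
    body = apply_stage_query("P31=", ids)
    print(len(body["query"]["bool"]["should"]), "conditions,",
          truncated, "ids truncated")
```

A non-zero truncated count at any stage means later stages never saw some intermediate ids, which is why completeness can only be claimed when the count is zero.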