
Wikidata Query Service/Migration/Development Infrastructure


This page describes the development infrastructure we are putting in place to support the WDQS backend migration. It contains a brief tutorial on setting up triple stores on a local workstation and on the eqiad test instances.

Local development

This section illustrates how to bootstrap a triple store locally, on a Linux workstation, and ingest data from the RDF mutation streams.

We'll use QLever as an example, but the same steps apply to Virtuoso (modulo the appropriate binary names and config changes).

1. Build and start QLever

QLever distributes prebuilt binaries and a Python CLI as a wheel available from PyPI. For testing and development we bootstrap the database from source instead.

We'll be using the build toolchain and configs provided in https://gitlab.wikimedia.org/repos/wikidata-platform/triplestores . As a prerequisite, make sure that the Nix build system is installed (see the docs in the linked repo).
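
If Nix is not installed yet, one common setup looks like the following (a sketch only; the repo's docs are authoritative, and flakes support must be enabled):

$ sh <(curl -L https://nixos.org/nix/install) --daemon
$ mkdir -p ~/.config/nix
$ echo "experimental-features = nix-command flakes" >> ~/.config/nix/nix.conf
$ nix --version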

The following will set up an x86_64 C++ toolchain, fetch QLever and related dependencies, and build the database:

$ nix flake new qlever -t git+https://gitlab.wikimedia.org/repos/wikidata-platform/triplestores#qlever
$ cd qlever
$ nix build

The indexing and server binaries will be available under ./result/bin.
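
A quick sanity check that the build produced the binaries used in the next steps (you should see IndexBuilderMain and ServerMain, among others):

$ ls ./result/bin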


2. Index Wikidata and start QLever

A small sample of Wikidata, extracted with wdumper, is available at https://people.wikimedia.org/~gmodena/wikidata/data/sample.nt . That's enough to set up a local QLever instance to be updated via RDF streams. On the eqiad nodes, the full dump as well as the main and scholarly splits are available.

Index Wikidata with:

$ wget https://people.wikimedia.org/~gmodena/wikidata/data/sample.nt
$ mkdir index && cd index
$ cat ../sample.nt | ../result/bin/IndexBuilderMain -m 1G -F ttl -f - -i wikidata -s ../conf/wikidata.settings.json

Where wikidata.settings.json is the default QLever settings file provided in the repo's conf directory.
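
You can inspect it before indexing; a typical QLever settings file for Wikidata contains entries along these lines (illustrative keys and values from upstream QLever; the repo's copy is authoritative):

$ cat ../conf/wikidata.settings.json
{
  "ascii-prefixes-only": false,
  "num-triples-per-batch": 1000000
}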

Start and test QLever with:

$ ../result/bin/ServerMain --index-basename wikidata --port 7001 --memory-max-size 32G --cache-max-size 16G --default-query-timeout 2000s -a localhost
$ curl -X POST "http://localhost:7001/sparql?access-token=localhost" -H "Content-Type: application/sparql-query" --data-binary "SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }" | jq
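
A healthy server replies with a standard SPARQL 1.1 JSON result, roughly of this shape (simplified; the actual binding also carries an xsd datatype):

{
  "head": { "vars": ["count"] },
  "results": {
    "bindings": [
      { "count": { "type": "literal", "value": "..." } }
    ]
  }
}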

3. Populate the index from rdf mutation streams

Local workstations can't access the WMF production network or the RDF mutation streams from Kafka. Instead of the QLever wikidata update utility, we'll use the proof-of-concept Go client to consume the public event streams and update the triple store.

The project requires Go. If you installed Nix in step 1, helpers are available in the repo to bootstrap a Go toolchain.

$ git clone git@gitlab.wikimedia.org:repos/wikidata-platform/go-wikidata-updater.git
$ cd go-wikidata-updater
$ nix develop
$ go build ./cmd/go-wikidata-updater

This will compile a go-wikidata-updater binary. To update the QLever index from the main graph RDF mutation stream, run:

$ ./go-wikidata-updater -sparql-endpoint "http://localhost:7001/sparql?access-token=localhost" -stream-url "https://stream.wikimedia.org/v2/stream/rdf-streaming-updater.mutation-main.v2"
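
The mutation stream is served over the public EventStreams API (server-sent events); if the updater can't connect, you can inspect the stream directly with curl:

$ curl -s -N https://stream.wikimedia.org/v2/stream/rdf-streaming-updater.mutation-main.v2 | head -n 20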

eqiad test nodes

We currently have the following test nodes in eqiad. Wikidata Platform engineers have root access. The servers are not yet managed via gitops, which allows us to deploy and experiment with software that is not Debian-packaged (including builds from source). This is considered development infrastructure, not production: there is no dedicated system user, so to keep things simple you can impersonate `gmodena` to access its paths and active tmux sessions. Everyone on the team is empowered to make changes; ping #wikidata-platform-engineering if in doubt.
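
For example, to pick up an existing session (session names will vary):

$ sudo -iu gmodena
$ tmux ls
$ tmux attach -t <session-name>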

The same toolchain discussed in Local development can be used to build the databases on eqiad. The hosts have access to Kafka, and can use the production streaming-updater-consumer Java tooling for real-time updates.

Wikidata entity dumps are available as NFS shares at /mnt/nfs/dumps-clouddumps1001.wikimedia.org/wikidatawiki/entities/
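
Each snapshot lives in a dated subdirectory (e.g. 20251208, matching the Snapshot fields below); to see what's available:

$ ls /mnt/nfs/dumps-clouddumps1001.wikimedia.org/wikidatawiki/entities/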

Wikidata splits need to be manually transferred from HDFS (this requires an SRE to run a cookbook). See https://phabricator.wikimedia.org/T415492 .

wdqs1028

Currently hosts QLever and Virtuoso indexes at /srv/wdqs/qlever and /srv/wdqs/virtuoso respectively.

Host: wdqs1028.eqiad.wmnet
Graph: full + lexemes
Snapshot: 20251208
Database: QLever and Virtuoso (multi-tenant)
QLever endpoint: http://localhost:7001/sparql?access-token=wdqs1028
QLever index: /srv/wdqs/qlever/index
Virtuoso SPARQL endpoint: http://localhost:8890/sparql
Virtuoso index: /srv/wdqs/virtuoso/virtuoso.db
Real-time updates: No

wdqs1029

Host: wdqs1029.eqiad.wmnet
Graph: main
Snapshot: 20260209
Database: QLever
wdqs-proxy endpoint: http://localhost:8080/sparql
SPARQL endpoint: http://localhost:7001/sparql
Index: /srv/wdqs/qlever/index
Real-time updates: nix run nixpkgs#jdk17 -- -cp /srv/wdqs/tools/streaming-updater-consumer-0.3.162-SNAPSHOT-jar-with-dependencies.jar org.wikidata.query.rdf.updater.consumer.StreamingUpdate --updateResponseHandler DUMMY --sparqlUrl http://localhost:7001/sparql?access-token=wdqs1029 --brokers kafka-main1006.eqiad.wmnet:9092,kafka-main1007.eqiad.wmnet:9092,kafka-main1008.eqiad.wmnet:9092,kafka-main1009.eqiad.wmnet:9092,kafka-main1010.eqiad.wmnet:9092 --consumerGroup wdqs1029-qlever-dev --topic eqiad.rdf-streaming-updater.mutation-main --batchSize 250
Real-time updates enabled: No

wdqs1030

Host: wdqs1030.eqiad.wmnet
Graph: scholarly
Snapshot: 20260209
Database: QLever
wdqs-proxy endpoint: http://localhost:8080/sparql
SPARQL endpoint: http://localhost:7001/sparql
Index: /srv/wdqs/qlever/index
Real-time updates: nix run nixpkgs#jdk17 -- -cp /srv/wdqs/tools/streaming-updater-consumer-0.3.162-SNAPSHOT-jar-with-dependencies.jar org.wikidata.query.rdf.updater.consumer.StreamingUpdate --updateResponseHandler DUMMY --sparqlUrl http://localhost:7001/sparql?access-token=wdqs1030 --brokers kafka-main1006.eqiad.wmnet:9092,kafka-main1007.eqiad.wmnet:9092,kafka-main1008.eqiad.wmnet:9092,kafka-main1009.eqiad.wmnet:9092,kafka-main1010.eqiad.wmnet:9092 --consumerGroup wdqs1030-qlever-dev --topic eqiad.rdf-streaming-updater.mutation-scholarly --batchSize 250
Real-time updates enabled: No

wdqs1031

Host: wdqs1031.eqiad.wmnet
Graph: main
Snapshot: 20260209
Database: Virtuoso
wdqs-proxy endpoint: http://localhost:8080/sparql
SPARQL endpoint: http://localhost:8890/sparql
Index: /srv/wdqs/virtuoso/virtuoso.db
Real-time updates: nix run nixpkgs#jdk17 -- -cp /srv/wdqs/tools/streaming-updater-consumer-0.3.162-SNAPSHOT-jar-with-dependencies.jar org.wikidata.query.rdf.updater.consumer.StreamingUpdate --updateResponseHandler DUMMY --sparqlUrl http://localhost:8890/sparql --brokers kafka-main1006.eqiad.wmnet:9092,kafka-main1007.eqiad.wmnet:9092,kafka-main1008.eqiad.wmnet:9092,kafka-main1009.eqiad.wmnet:9092,kafka-main1010.eqiad.wmnet:9092 --consumerGroup wdqs1031-virtuoso-dev --topic eqiad.rdf-streaming-updater.mutation-main --batchSize 250
Real-time updates enabled: No

wdqs1032

Host: wdqs1032.eqiad.wmnet
Graph: scholarly
Snapshot: 20260209
Database: Virtuoso
wdqs-proxy endpoint: http://localhost:8080/sparql
SPARQL endpoint: http://localhost:8890/sparql
Index: /srv/wdqs/virtuoso/virtuoso.db
Real-time updates: nix run nixpkgs#jdk17 -- -cp /srv/wdqs/tools/streaming-updater-consumer-0.3.162-SNAPSHOT-jar-with-dependencies.jar org.wikidata.query.rdf.updater.consumer.StreamingUpdate --updateResponseHandler DUMMY --sparqlUrl http://localhost:8890/sparql --brokers kafka-main1006.eqiad.wmnet:9092,kafka-main1007.eqiad.wmnet:9092,kafka-main1008.eqiad.wmnet:9092,kafka-main1009.eqiad.wmnet:9092,kafka-main1010.eqiad.wmnet:9092 --consumerGroup wdqs1032-virtuoso-dev --topic eqiad.rdf-streaming-updater.mutation-scholarly --batchSize 250
Real-time updates enabled: No

wdqs-proxy

A clone of the wdqs-proxy repo is located at `/srv/wdqs/wdqs-proxy`.
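
The docker run command below assumes a locally built image named wdqs-proxy. Assuming the repo ships a Dockerfile at its root (not verified here), it can be built with:

$ cd /srv/wdqs/wdqs-proxy
$ sudo docker build -t wdqs-proxy .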

The wdqs-proxy config is located at `/srv/wdqs/deployments/config/application.yaml` on each eqiad node.

Right now, the only difference between nodes is the proxied endpoint URL (localhost:7001 for QLever, localhost:8890 for Virtuoso).

$ sudo docker run -d --network=host  -v /srv/wdqs/deployments/config/application.yaml:/srv/app/config/application.yaml wdqs-proxy

Query UI

The application that powers https://query.wikidata.org can easily be run locally and can POST queries to the eqiad test hosts over an SSH tunnel.
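
For example, to forward the wdqs-proxy port from one of the test hosts (the host name is an example; any of the nodes above works, and production SSH access is required):

$ ssh -N -L 8080:localhost:8080 wdqs1029.eqiad.wmnet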

Tooling to automate setup is available at https://gitlab.wikimedia.org/gmodena/wdqs-local-env .