Wikidata Query Service/Migration/Development Infrastructure
This page describes the development infrastructure we are putting in place to support the WDQS backend migration. It contains a brief tutorial on setting up triple stores on a local workstation and on the eqiad test instances.
Local development
This section illustrates how to bootstrap a triple store locally, on a Linux workstation, and ingest data from RDF mutation streams.
We'll use QLever as an example, but the same steps apply to Virtuoso (modulo appropriate binary names and config changes).
1. Build and start QLever
QLever ships prebuilt binaries and a Python CLI as a wheel available from PyPI. For testing and development we bootstrap the database from source instead.
We'll use the build toolchain and configs provided in https://gitlab.wikimedia.org/repos/wikidata-platform/triplestores . As a prerequisite, make sure the Nix build system is installed (see the docs in the linked repo).
The following will set up an x86_64 C++ toolchain, fetch QLever and its dependencies, and build the database:
$ nix flake new qlever -t git+https://gitlab.wikimedia.org/repos/wikidata-platform/triplestores#qlever
$ cd qlever
$ nix build
The indexing and server binaries will be available under ./result/bin.
2. Index Wikidata and start QLever
A small sample of Wikidata, extracted with wdumper, is available at https://people.wikimedia.org/~gmodena/wikidata/data/sample.nt . It is enough to set up a local QLever instance that can then be updated via RDF streams. On the eqiad nodes, the full dump as well as the main and scholarly splits are available.
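Before building the index, it can help to sanity-check the dump. A minimal sketch in Python (naive whitespace splitting, which is safe for subject and predicate because those positions in N-Triples hold IRIs or blank node labels and never contain spaces; a real parser such as rdflib would do full validation):

```python
def inspect_nt(lines):
    """Return (triple_count, set of predicate IRIs) for N-Triples lines."""
    count = 0
    predicates = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split off subject and predicate; the rest (object + " .") stays intact,
        # so literals containing spaces are not broken up.
        parts = line.split(None, 2)
        if len(parts) == 3:
            count += 1
            predicates.add(parts[1])
    return count, predicates

# Two example triples in the shape found in the sample dump:
sample = [
    '<http://www.wikidata.org/entity/Q42> <http://www.w3.org/2000/01/rdf-schema#label> "Douglas Adams"@en .',
    '<http://www.wikidata.org/entity/Q42> <http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q5> .',
]
count, preds = inspect_nt(sample)
```

Running this over sample.nt (e.g. `inspect_nt(open("sample.nt"))`) gives a quick triple count to compare against the COUNT(*) query after indexing.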
Index Wikidata with:
$ wget https://people.wikimedia.org/~gmodena/wikidata/data/sample.nt
$ mkdir index && cd index
$ cat ../sample.nt | ../result/bin/IndexBuilderMain -m 1G -F ttl -f - -i wikidata -s ../conf/wikidata.settings.json
where wikidata.settings.json is the default QLever settings file.
Start and test QLever with:
$ ../result/bin/ServerMain --index-basename wikidata --port 7001 --memory-max-size 32G --cache-max-size 16G --default-query-timeout 2000s -a localhost
$ curl -X POST "http://localhost:7001/sparql?access-token=localhost" -H "Content-Type: application/sparql-query" --data-binary "SELECT (COUNT(*) AS ?count) WHERE {?s ?p ?o}" | jq
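The same smoke test can be scripted. A minimal sketch in Python, assuming the endpoint speaks the standard SPARQL 1.1 Protocol and can return application/sparql-results+json; the endpoint URL and access token are the local defaults from above:

```python
import json
import urllib.request

def sparql_count(endpoint, query, timeout=30):
    """POST a SPARQL query and return the first ?count binding as an int.

    Assumes the server accepts application/sparql-query request bodies and
    honors an application/sparql-results+json Accept header.
    """
    req = urllib.request.Request(
        endpoint,
        data=query.encode("utf-8"),
        headers={
            "Content-Type": "application/sparql-query",
            "Accept": "application/sparql-results+json",
        },
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return parse_count(json.load(resp))

def parse_count(results):
    """Extract ?count from a SPARQL JSON results document."""
    bindings = results["results"]["bindings"]
    return int(bindings[0]["count"]["value"])

# Shape of the JSON response for the COUNT query above (illustrative values):
example = {
    "head": {"vars": ["count"]},
    "results": {"bindings": [{"count": {
        "type": "literal",
        "datatype": "http://www.w3.org/2001/XMLSchema#int",
        "value": "1234"}}]},
}
```

Usage against a running local instance would be `sparql_count("http://localhost:7001/sparql?access-token=localhost", "SELECT (COUNT(*) AS ?count) WHERE {?s ?p ?o}")`.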
3. Populate the index from rdf mutation streams
Local workstations can't access the WMF production network or the RDF mutation streams from Kafka. We'll use the PoC Golang client, instead of the QLever Wikidata update utility, to consume public event streams and update the triple store.
The project requires Go. If you installed Nix in step 1, helpers are available in the repo to bootstrap a Go toolchain.
$ git clone git@gitlab.wikimedia.org:repos/wikidata-platform/go-wikidata-updater.git
$ cd go-wikidata-updater
$ nix develop
$ go build ./cmd/go-wikidata-updater
This will compile a go-wikidata-updater binary. To update the QLever index from the main-graph RDF mutation stream, run:
$ ./go-wikidata-updater -sparql-endpoint "http://localhost:7001/sparql?access-token=localhost" -stream-url "https://stream.wikimedia.org/v2/stream/rdf-streaming-updater.mutation-main.v2"
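Conceptually, the updater turns each mutation event into SPARQL UPDATE operations against the triple store. The event shape below is hypothetical (the real rdf-streaming-updater schema differs and carries RDF chunks plus metadata), but it illustrates the core delete-then-insert pattern:

```python
def mutation_to_sparql(rdf_added, rdf_deleted):
    """Turn lists of N-Triples statements into a SPARQL UPDATE request.

    Hypothetical event shape for illustration only: deletions are applied
    before insertions so that an entity edit which rewrites a triple does
    not remove the freshly inserted value.
    """
    updates = []
    if rdf_deleted:
        updates.append("DELETE DATA {\n%s\n}" % "\n".join(rdf_deleted))
    if rdf_added:
        updates.append("INSERT DATA {\n%s\n}" % "\n".join(rdf_added))
    # Multiple operations in one request body, separated by ';'
    return ";\n".join(updates)

# Example: an edit that replaces one statement on Q42 (hypothetical data).
update = mutation_to_sparql(
    rdf_added=['<http://www.wikidata.org/entity/Q42> <http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q5> .'],
    rdf_deleted=['<http://www.wikidata.org/entity/Q42> <http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q4> .'],
)
```

The resulting string would be POSTed to the SPARQL endpoint with Content-Type application/sparql-update.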
eqiad test nodes
We currently have the following test nodes in eqiad. Wikidata Platform engineers have root access. The servers are not yet managed via GitOps, which lets us deploy and experiment with software that is not Debian-packaged (including building from source). This is considered development infrastructure, not production: there is no dedicated system user, so to keep things simple you can impersonate `gmodena` to access paths and active tmux sessions. Everyone on the team has root access and is empowered to make changes (ping #wikidata-platform-engineering if in doubt).
The same toolchain discussed in Local development can be used to build the databases on eqiad. The hosts have access to Kafka and can use the production streaming-updater-consumer Java tooling for real-time updates.
Wikidata entity dumps are available as NFS shares at /mnt/nfs/dumps-clouddumps1001.wikimedia.org/wikidatawiki/entities/
The Wikidata splits need to be manually transferred from HDFS (this requires SRE to run a cookbook). See https://phabricator.wikimedia.org/T415492 .
wdqs1028
Currently hosts QLever and Virtuoso indexes at /srv/wdqs/qlever and /srv/wdqs/virtuoso respectively.
| Host | wdqs1028.eqiad.wmnet |
| Graph | full + lexemes |
| Snapshot |
20251208
|
| Database | QLever and Virtusos multi-tenant |
| QLever endpoint | http://localhost:7001/sparql?access-token=wdqs1028 |
| QLever index | /srv/wdqs/qlever/index |
| Virtuoso SPARQL endpoint | http://localhost:8890/sparql |
| Virtuoso index | /srv/wdqs/virtuoso/virtuoso.db |
| Real-time updates | No |
wdqs1029
| Host | wdqs1029.eqiad.wmnet |
| Graph | main |
| Snapshot | 20260209 |
| Database | QLever |
| wdqs-proxy endpoint | http://localhost:8080/sparql |
| SPARQL endpoint | http://localhost:7001/sparql |
| Index | /srv/wdqs/qlever/index |
| Real-time updates | nix run nixpkgs#jdk17 -- -cp /srv/wdqs/tools/streaming-updater-consumer-0.3.162-SNAPSHOT-jar-with-dependencies.jar org.wikidata.query.rdf.updater.consumer.StreamingUpdate --updateResponseHandler DUMMY --sparqlUrl http://localhost:7001/sparql?access-token=wdqs1029 --brokers kafka-main1006.eqiad.wmnet:9092,kafka-main1007.eqiad.wmnet:9092,kafka-main1008.eqiad.wmnet:9092,kafka-main1009.eqiad.wmnet:9092,kafka-main1010.eqiad.wmnet:9092 --consumerGroup wdqs1029-qlever-dev --topic eqiad.rdf-streaming-updater.mutation-main --batchSize 250 |
| Real-time updates enabled | No |
wdqs1030
| Host | wdqs1030.eqiad.wmnet |
| Graph | scholarly |
| Snapshot | 20260209 |
| Database | QLever |
| wdqs-proxy endpoint | http://localhost:8080/sparql |
| SPARQL endpoint | http://localhost:7001/sparql |
| Index | /srv/wdqs/qlever/index |
| Real-time updates | nix run nixpkgs#jdk17 -- -cp /srv/wdqs/tools/streaming-updater-consumer-0.3.162-SNAPSHOT-jar-with-dependencies.jar org.wikidata.query.rdf.updater.consumer.StreamingUpdate --updateResponseHandler DUMMY --sparqlUrl http://localhost:7001/sparql?access-token=wdqs1030 --brokers kafka-main1006.eqiad.wmnet:9092,kafka-main1007.eqiad.wmnet:9092,kafka-main1008.eqiad.wmnet:9092,kafka-main1009.eqiad.wmnet:9092,kafka-main1010.eqiad.wmnet:9092 --consumerGroup wdqs1030-qlever-dev --topic eqiad.rdf-streaming-updater.mutation-scholarly --batchSize 250 |
| Real-time updates enabled | No |
wdqs1031
| Host | wdqs1031.eqiad.wmnet |
| Graph | main |
| Snapshot | 20260209 |
| Database | Virtuoso |
| wdqs-proxy endpoint | http://localhost:8080/sparql |
| SPARQL endpoint | http://localhost:8890/sparql |
| Index | /srv/wdqs/virtuoso/virtuoso.db |
| Real-time updates | nix run nixpkgs#jdk17 -- -cp /srv/wdqs/tools/streaming-updater-consumer-0.3.162-SNAPSHOT-jar-with-dependencies.jar org.wikidata.query.rdf.updater.consumer.StreamingUpdate --updateResponseHandler DUMMY --sparqlUrl http://localhost:8890/sparql --brokers kafka-main1006.eqiad.wmnet:9092,kafka-main1007.eqiad.wmnet:9092,kafka-main1008.eqiad.wmnet:9092,kafka-main1009.eqiad.wmnet:9092,kafka-main1010.eqiad.wmnet:9092 --consumerGroup wdqs1031-virtuoso-dev --topic eqiad.rdf-streaming-updater.mutation-main --batchSize 250 |
| Real-time updates enabled | No |
wdqs1032
| Host | wdqs1032.eqiad.wmnet |
| Graph | scholarly |
| Snapshot | 20260209 |
| Database | Virtuoso |
| wdqs-proxy endpoint | http://localhost:8080/sparql |
| SPARQL endpoint | http://localhost:8890/sparql |
| Index | /srv/wdqs/virtuoso/virtuoso.db |
| Real-time updates | nix run nixpkgs#jdk17 -- -cp /srv/wdqs/tools/streaming-updater-consumer-0.3.162-SNAPSHOT-jar-with-dependencies.jar org.wikidata.query.rdf.updater.consumer.StreamingUpdate --updateResponseHandler DUMMY --sparqlUrl http://localhost:8890/sparql --brokers kafka-main1006.eqiad.wmnet:9092,kafka-main1007.eqiad.wmnet:9092,kafka-main1008.eqiad.wmnet:9092,kafka-main1009.eqiad.wmnet:9092,kafka-main1010.eqiad.wmnet:9092 --consumerGroup wdqs1032-virtuoso-dev --topic eqiad.rdf-streaming-updater.mutation-scholarly --batchSize 250 |
| Real-time updates enabled | No |
wdqs-proxy
A clone of the wdqs-proxy repo is located at `/srv/wdqs/wdqs-proxy`.
The wdqs-proxy config is located at `/srv/wdqs/deployments/application.yaml` on each eqiad node.
Right now, the only difference between nodes is the proxied endpoint URL (localhost:7001 for QLever, localhost:8890 for Virtuoso).
$ sudo docker run -d --network=host -v /srv/wdqs/deployments/config/application.yaml:/srv/app/config/application.yaml wdqs-proxy
Query UI
The application that powers https://query.wikidata.org can easily be run locally and POST queries to the eqiad test hosts over an SSH tunnel.
Tooling to automate setup is available at https://gitlab.wikimedia.org/gmodena/wdqs-local-env .