
User:AndreaWest/WDQS Blazegraph Analysis

Revision as of 01:39, 25 January 2022 by imported>AndreaWest (Initial writeup of several sections)

The following are my learnings and thoughts related to replacing Blazegraph as the RDF store/query engine behind the Wikidata Query Service (WDQS).

Phabricator Ticket: Epic, T206560

Results

TBD

Problem Description

The Wikidata Query Service is part of the overall data access strategy for Wikidata. Currently, the service is hosted on two private and two internal load-balanced clusters, where each server in a cluster runs an (open-source) Blazegraph image and provides a SPARQL endpoint for query access. The Blazegraph infrastructure presents a problem going forward because:

  • There is no short- or long-term maintenance/support strategy, given the acquisition of the Blazegraph personnel by Amazon (for their Neptune product)
  • Blazegraph is a single-server implementation (with replication for availability) that is suffering performance problems at the current size of ~13B triples
  • Queries are answered synchronously, on technology that is several years old, and users are experiencing timeouts

Requirements

There are many moving parts and possible alternatives (and combinations of alternatives) for hosting the triple storage and query engine of the Wikidata Query Service. Here are some initial observations on the requirements for an acceptable solution.

A solution MUST:

  • Support SPARQL 1.1
  • Improve the ability to scale both the backing storage (for the total graph) and the memory available for query execution (ideally without a fixed upper bound)
  • Reduce (or at least not increase) query times and the number of timeouts
  • Support a high volume of both reads and writes, with ACID transactions
  • Support monitoring the server and databases via instrumentation and reported metrics
  • Provide the ability to obtain and tune query plans
  • Utilize an indexing scheme that aligns/can be aligned with the query scenarios of Wikidata
    • Indexing strings, numeric values, geospatial values, ...
  • Support stored queries
  • Support various output formats natively (e.g., JSON, CSV/TSV)
  • Support all SPARQL query functions and allow for the addition of custom functions
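
As an illustration of the "output formats" requirement, the following is a minimal sketch of how a client could ask a SPARQL 1.1 endpoint for JSON, CSV, or TSV results through content negotiation. The helper name `build_select_request` and the sample query are hypothetical; the media types are the ones defined by the W3C SPARQL 1.1 result-format specifications, and the endpoint URL is assumed to be the public WDQS endpoint. The request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Media types from the W3C SPARQL 1.1 result-format specifications.
RESULT_FORMATS = {
    "json": "application/sparql-results+json",
    "csv": "text/csv",
    "tsv": "text/tab-separated-values",
}


def build_select_request(endpoint: str, query: str, fmt: str = "json") -> Request:
    """Build (but do not send) a SPARQL 1.1 Protocol GET request.

    `build_select_request` is a hypothetical helper, not part of WDQS.
    """
    url = endpoint + "?" + urlencode({"query": query})
    # The Accept header tells the endpoint which serialization to return.
    return Request(url, headers={"Accept": RESULT_FORMATS[fmt]})


# Assumed endpoint URL and an arbitrary sample query.
req = build_select_request(
    "https://query.wikidata.org/sparql",
    "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1",
    fmt="csv",
)
```

A candidate store that honors the `Accept` header in this way would satisfy the requirement without any client-side conversion; passing `req` to `urllib.request.urlopen` would actually execute the query.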

A solution SHOULD:

  • Be licensed as open source
    • Be actively maintained with a community of users
    • Be well-commented
  • Support SPARQL Federated Query
  • Support paged output/query continuation
  • Provide data integrity/validation support
  • Provide a CLI for DevOps and scripting
  • Have minimal impact on current users and their queries/implementations
    • Need to understand what changes to existing queries would be required
    • This is especially relevant for non-standard, Blazegraph SPARQL syntax, such as query hints, named subqueries, ...
    • How onerous do the queries (and debugging them) become with changes such as adding federation or splitting services across different interfaces?
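
One simple way to approximate "paged output/query continuation" on any SPARQL 1.1 store is client-side paging with `LIMIT`/`OFFSET`. The sketch below is only an assumption about how this could be done, not a description of an existing WDQS feature; `paged_results` and the `run_query` callable are hypothetical names, and `run_query` stands in for whatever function sends a query to the endpoint and returns a list of rows.

```python
from typing import Callable, Iterator, List


def paged_results(
    base_query: str,
    run_query: Callable[[str], List[dict]],
    page_size: int = 1000,
) -> Iterator[dict]:
    """Yield all rows of `base_query`, fetching `page_size` rows at a time.

    `base_query` should end with an ORDER BY clause; without a stable
    ordering, LIMIT/OFFSET pages are not guaranteed to be consistent.
    """
    offset = 0
    while True:
        page_query = f"{base_query}\nLIMIT {page_size} OFFSET {offset}"
        rows = run_query(page_query)
        yield from rows
        # A short (or empty) page means the result set is exhausted.
        if len(rows) < page_size:
            return
        offset += page_size
```

Note that this approach re-evaluates the query for every page, so it trades server work for smaller responses; native continuation support in the store would avoid that cost.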

A solution MAY:

  • Support other query languages such as GraphQL, Gremlin, ... for improved programmatic and human ease of use
  • Support geospatial query/GeoSPARQL
  • Support user roles and/or authentication
  • Allow inference and reasoning, including the definition of user rules

Requirement Prioritization

TBD

Candidate Alternatives

This section will be expanded with more details. For now, this is a simple list of ideas that could be bundled together in a final solution:

  • Move off Blazegraph to a different backend store
  • Maintain SPARQL but also add support for other query languages for ease of use
  • Split the Wikidata "knowledge graph" into two or more databases (e.g., a database holding the RDF for scholarly articles vs another holding all the remaining RDF data, or host the truthy data in a separate db) and either require post-query joining of results or federation
  • Add support for (potentially long-running) queued queries with asynchronous reporting of results
  • Execute RDF data cleanup to remove redundancies and segregate "unused" items (items that are not referenced as the object of any triple)
  • Improve RDF indices and their use in queries involving literal/string/numeric data matching, to increase performance
  • Utilize user roles (different roles automatically execute on different dbs with different loads/performance) and/or authentication for throttling
  • Support saving of both queries and results with the ability to re-execute a query if the results are considered "stale"
  • Move users to other services such as Wikidata Linked Data Fragments (LDF), but how can this be done in a way that is easy to use, explain and understand?
  • Establish cloud-deployable containers for the different cloud environments (AWS, Azure, ...) to increase the feasibility of local deployments
  • Evaluate if query performance can be improved by the use of named graphs
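
To make the "saved queries and results" idea above more concrete, here is a minimal in-memory sketch. It assumes that "stale" simply means "older than a fixed maximum age", which is only one possible definition; the class name `SavedQueryCache`, the `run_query` callable and the `max_age_seconds` parameter are all hypothetical.

```python
import time
from typing import Callable, Dict, List, Tuple


class SavedQueryCache:
    """Keep query results and re-execute a query once its results are stale."""

    def __init__(
        self,
        run_query: Callable[[str], List[dict]],
        max_age_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._run_query = run_query      # hypothetical: sends the query to the store
        self._max_age = max_age_seconds  # assumption: staleness is age-based
        self._clock = clock              # injectable so the behaviour can be tested
        self._saved: Dict[str, Tuple[float, List[dict]]] = {}

    def results(self, query: str) -> List[dict]:
        now = self._clock()
        entry = self._saved.get(query)
        if entry is not None and now - entry[0] <= self._max_age:
            return entry[1]  # saved results are still fresh
        rows = self._run_query(query)  # missing or stale: re-execute
        self._saved[query] = (now, rows)
        return rows
```

A production version would more likely tie staleness to updates of the underlying data rather than to elapsed time, and would persist the saved results outside the process.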

Alternative Data Stores

Note that these stores and query engines are listed for future investigation:

  • GraphDB
  • AllegroGraph
  • Jena
  • RDF4J
  • Virtuoso
  • Stardog
  • ...