User:AndreaWest/WDQS Blazegraph Analysis

Revision as of 17:04, 28 January 2022

The following are my learnings and thoughts related to replacing Blazegraph as the RDF store/query engine behind the Wikidata Query Service (WDQS).

Phabricator Ticket: Epic, T206560

Results

TBD

Problem Description

The Wikidata Query Service is part of the overall data access strategy for Wikidata. Currently, the service is hosted on two public and two internal load-balanced clusters, where each server in a cluster runs an (open-source) Blazegraph image and provides a SPARQL endpoint for query access. The Blazegraph infrastructure presents a problem going forward because:

  • There is no short- or long-term maintenance/support strategy, given the acquisition of the Blazegraph personnel by Amazon (for their Neptune product)
  • Blazegraph is a single-server implementation (with replication for availability) that is suffering performance problems at the current database size of ~13B triples
  • Query execution relies on synchronous responses and several-year-old technology, and is experiencing timeouts

Requirements

There are many moving parts and possible alternatives (and combinations of alternatives) for hosting the triple storage and query engine of the Wikidata Query Service. Here are some initial observations on the requirements for an acceptable solution.

A solution MUST:

  • Support database sizes of 20B+ triples
  • Support SPARQL 1.1 and the "standard" output formats (JSON, CSV/TSV, XML)
  • Support SPARQL Federated Query
    • Current list of federated endpoints
    • This also allows for the possibility of splitting Wikidata across multiple databases
  • Improve the ability to scale the available backing storage (for the total graph) and the in-memory requirements for queries (ideally with no fixed upper bound)
  • Support high volumes of both reads and writes
  • Support monitoring the server and databases via instrumentation and reported metrics
  • Provide the ability to obtain and tune query plans
  • Utilize an indexing scheme that aligns/can be aligned with the query scenarios of Wikidata
    • Indexing subjects/objects (items), properties/paths, query/join patterns, ...
    • Indexing strings, numeric values, geospatial values, ...
  • Support all SPARQL query functions and allow for the addition of custom functions
  • Support extension points via new SERVICEs
    • Current list of SERVICEs
    • Can investigate if these might be more performant as SPARQL functions
  • Be licensed as open source
    • Be well-commented and actively maintained with a community of users
  • Allow disabling user authentication for query
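
As a rough illustration of the SPARQL 1.1, output format, and federation requirements above, the following sketch builds (but does not send) a SPARQL Protocol request for a federated query and selects the result format through the Accept header. The endpoint URL, the `https://example.org/sparql` SERVICE target, and the helper name `build_request` are assumptions made for this example only; a candidate store would need to accept the same kind of request.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the public WDQS endpoint; any SPARQL 1.1 Protocol endpoint would do.
ENDPOINT = "https://query.wikidata.org/sparql"

# Standard media types for SPARQL SELECT results.
RESULT_FORMATS = {
    "json": "application/sparql-results+json",
    "xml": "application/sparql-results+xml",
    "csv": "text/csv",
    "tsv": "text/tab-separated-values",
}

# Hypothetical federated query: the SERVICE URL is a placeholder, not a real endpoint.
FEDERATED_QUERY = """
SELECT ?item ?label WHERE {
  ?item <http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q5> .
  SERVICE <https://example.org/sparql> {
    ?item <http://www.w3.org/2000/01/rdf-schema#label> ?label .
  }
}
LIMIT 10
"""


def build_request(query, fmt="json", endpoint=ENDPOINT):
    """Build (but do not send) a SPARQL Protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": RESULT_FORMATS[fmt]})


request = build_request(FEDERATED_QUERY, fmt="csv")
```

Passing `request` to `urllib.request.urlopen` would execute the query; that step is left out so the sketch runs without network access.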

A solution SHOULD:

  • Reduce (or at least maintain) query time and timeouts
  • Allow database reload to occur in days (not weeks)
    • Initially, data is loaded from dumps and then updated from the RDF stream updater
    • In case of data drift (there are issues in the update process), the data is re-initialized from dumps a few times per year
  • Support ACID transactions for updates
    • Writes are done from our RDF stream updater, "accidentally using transactions to write batches of data" (for performance reasons and not following any kind of semantic coherence)
    • The updates align with edits on Wikidata which are done atomically (one property at a time)
    • Transaction boundaries do not have to be conversational with a client
  • Support named/stored queries for re-execution and re-use
  • Support geospatial query/GeoSPARQL
  • Provide data integrity/validation support
  • Provide a CLI for DevOps and scripting capabilities
  • Have minimal impact on current users and their queries/implementations
    • Need to understand what changes to existing queries would be required
    • This is especially relevant for non-standard, Blazegraph SPARQL syntax, such as query hints, named subqueries, ...
    • How onerous do the queries (and debugging them) become with changes such as adding federation or splitting services across different interfaces?
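
One way to start estimating the impact on existing queries is to scan them for Blazegraph-specific syntax. The sketch below is a rough heuristic, not a parser: it assumes that query hints appear under the `hint:` prefix, that named subqueries use the `WITH { ... } AS %name` / `INCLUDE %name` form, and that other Blazegraph vocabulary appears under the `bd:` prefix. The pattern list and function name are illustrative and certainly not exhaustive.

```python
import re

# Heuristic patterns for Blazegraph-specific constructs (assumed, not exhaustive).
BLAZEGRAPH_PATTERNS = {
    "query hint": re.compile(r"\bhint:\w+"),
    "named subquery": re.compile(r"\bWITH\s*\{|\bINCLUDE\s+%\w+", re.IGNORECASE),
    "bd: vocabulary": re.compile(r"\bbd:\w+"),
}


def find_blazegraph_syntax(query):
    """Return the sorted names of Blazegraph-specific constructs found in a query."""
    return sorted(
        name for name, pattern in BLAZEGRAPH_PATTERNS.items() if pattern.search(query)
    )


EXAMPLE_QUERY = """
SELECT ?item ?itemLabel WHERE {
  hint:Query hint:optimizer "None" .
  ?item wdt:P31 wd:Q5 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

found = find_blazegraph_syntax(EXAMPLE_QUERY)
```

Running such a scan over logged queries would give only a first approximation of how many depend on non-standard syntax; because it is regex-based, it can also match text inside comments or string literals.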

A solution MAY:

  • Support paged output/query continuation
  • Support other query languages such as GraphQL, Gremlin, ... for improved programmatic and human ease of use
  • Support user roles and/or authentication for throttling
    • This could also be provided in the UI or by load-balanced pre-processing
  • Allow inference and reasoning, including the definition of user rules
  • Provide a browser-based DevOps and query interface for maintenance and debugging
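
Where a store offers no native query continuation, paged output can be approximated on the client with standard LIMIT/OFFSET, as in the minimal sketch below. It assumes the query has a stable ORDER BY and no LIMIT of its own, and `fetch_page` is a hypothetical callable that executes a SPARQL query and returns a list of rows. Each page re-executes the query, so this is not a substitute for true server-side continuation.

```python
import re


def paged_results(query, fetch_page, page_size=1000):
    """Yield rows page by page by appending LIMIT/OFFSET to the query.

    fetch_page is a hypothetical callable: it takes a SPARQL query string
    and returns a list of result rows.
    """
    offset = 0
    while True:
        page_query = f"{query}\nLIMIT {page_size} OFFSET {offset}"
        rows = fetch_page(page_query)
        yield from rows
        if len(rows) < page_size:  # a short (or empty) page means we are done
            break
        offset += page_size


# Stand-in for a real endpoint: serves slices of an in-memory list.
DATA = [f"row{i}" for i in range(25)]


def fake_fetch(page_query):
    match = re.search(r"LIMIT (\d+) OFFSET (\d+)", page_query)
    limit, offset = int(match.group(1)), int(match.group(2))
    return DATA[offset:offset + limit]


rows = list(
    paged_results("SELECT ?s WHERE { ?s ?p ?o } ORDER BY ?s", fake_fetch, page_size=10)
)
```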

Candidate Alternatives

This section will be expanded with more details. For now, this is a simple list of ideas that could be bundled together in a final solution:

  • Move off Blazegraph to a different backend store
  • Maintain SPARQL but also add support for other query languages for ease of use
  • Split the Wikidata "knowledge graph" into two or more databases AND/OR two or more named graphs (e.g., a database/graph holding the RDF for scholarly articles vs another holding all the remaining RDF data, or host the truthy data in a separate db/graph)
    • For separate databases, the solution would require either post-query joining of results or federation
  • Add support for (potentially long-running) queued queries with asynchronous reporting of results
  • Execute RDF data cleanup to remove redundancies and segregate "unused" items (items that are not referenced as the object of any triple)
  • Improve RDF indices and their use in queries to increase performance
    • There are various types of indices ... triple patterns and paths, item, property, joins (for frequently encountered joins), ...
  • Utilize user roles (different roles automatically execute on different dbs with different loads/performance) and/or authentication for throttling
  • Support saving of both queries and results with the ability to re-execute a query if the results are considered "stale"
  • Better incorporate (via federation?) ElasticSearch with query AND/OR move users to other services such as Wikidata ElasticSearch
    • How to do this in an easy to use/explain/understand fashion?
  • Establish cloud-deployable containers for the different cloud environments (AWS, Azure, ...) to increase the feasibility of local deployments
  • Move custom SERVICES (which are federated queries) to SPARQL functions
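
To make the "post-query joining of results" option for separate databases more concrete, here is a minimal sketch of a client-side hash join over two result sets in the SPARQL JSON bindings shape. The function name, the sample data, and the entity IDs are hypothetical, and the join compares only the `value` field of the shared variable (ignoring term type and datatype), which is a simplification.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Join two lists of SPARQL JSON bindings on a shared variable.

    Simplification: terms are compared by their "value" field only.
    """
    index = defaultdict(list)
    for row in right:
        if key in row:
            index[row[key]["value"]].append(row)

    joined = []
    for row in left:
        if key not in row:
            continue
        for match in index.get(row[key]["value"], []):
            joined.append({**match, **row})
    return joined


# Hypothetical results from a "scholarly articles" database ...
articles = [
    {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1"},
     "title": {"type": "literal", "value": "Paper A"}},
    {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q2"},
     "title": {"type": "literal", "value": "Paper B"}},
]

# ... and from a database holding the remaining data.
authors = [
    {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1"},
     "author": {"type": "literal", "value": "Author X"}},
]

joined = hash_join(articles, authors, key="item")
```

Federation would push this join into the query engine instead, at the cost of cross-endpoint round trips; the client-side variant keeps the stores independent but moves the work (and the memory for the index) to the caller.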

Alternative Data Stores

Note that these stores and query engines are ordered by number of GitHub stars, and many will be investigated further. Some background implementation details and statistics are also provided.

Open-source (note that last updated times are based on examining the GitHub pages on 25 January 2022):

  • LevelGraph (last updated 5 months ago, 1400 stars)
    • LevelDB-backed RDF graph database for Node.js and the browser
    • Written in JavaScript
    • 38 open issues (78 closed)
  • Virtuoso Open-Source (last updated 1 day ago, 730 stars)
    • Note that the paid version has much more functionality and scalability than the open-source version
    • Written in C
    • 555 open issues (377 closed)
  • gStore (last updated 15 days ago, 470 stars)
    • Written in C++
    • 4 open issues (64 closed)
  • CM-Well (last updated 4 months ago, 168 stars)
    • Written in Scala, accessed by REST APIs
    • Developed by Thomson Reuters & Refinitiv
    • 231 open issues (239 closed)
  • quadstore (last updated 3 months ago, 124 stars)
    • LevelDB-backed RDF graph database for Node.js and the browser
    • Written in TypeScript
    • 7 open issues (89 closed)
  • SANSA-Stack (last updated 4 days ago, 118 stars)
    • Requires Spark 3.x.x with Scala 2.12 setup
    • Written in Java and Scala
    • 14 open issues (69 closed)
  • Apache Rya (last updated 14 months ago, 102 stars)
    • Built on top of Accumulo, and implemented as an extension to RDF4J
    • Written in Java
    • ~200 open issues (issues in JIRA)
    • Note that there is no recent activity for this project and JIRA issues appear to be languishing

Open-source but no SPARQL support:

  • TerminusDB (last updated today, 1700+ stars)

Open-source but unlikely to scale to billions of triples:

  • RDF4J (last updated 17 days ago, 267 stars)
    • Written in Java
    • 247 open issues (1621 closed)
    • "RDF4J Native Store is ... currently aimed at medium-sized datasets in the order of 100 million triples"
  • Apache Jena (last updated hours ago, 801 stars) and Jena Fuseki SPARQL server (last updated 3 days ago)
    • Written in Java
    • ~150 open issues (issues in JIRA)
    • Described as "suitable as a backend for any kind of RDF datasets, from personal dataspaces to moderately-sized enterprise knowledge graphs"
  • Parliament (last updated 26 days ago, 35 stars)
    • Written in Java, unlikely to scale to billions of triples
    • Developed by Raytheon BBN
    • 3 open issues (25 closed)
  • LUPOSDATE (last updated 17 months ago, 18 stars)
    • Academic implementation developed by IFIS at the University of Lübeck
    • Written in Java
    • No issues ever reported

Open-source but early in development:

  • OxiGraph (last updated 3 days ago, 427 stars)
    • Database based on the RocksDB key-value store, written in Rust
    • 29 open issues (39 closed)
  • Atomic Data Rust (last updated hours ago, 39 stars)
    • Graph database server for storing and sharing typed, linked, atomic data (strict subset of RDF), written in Rust
    • 97 open issues (148 closed)

Proprietary:

  • AllegroGraph
  • AnzoGraph
  • DGraph (also no support for SPARQL)
  • GraphDB
  • MarkLogic
  • Oracle
  • Neo4J (also no support for SPARQL)
  • Neptune
  • RDFox
  • Stardog
  • TriplyDB

Dead (no development within the last 2 years or more):

  • Halyard, 4Store, Redland/RedStore, Mulgara, Jena-HBase, HBase-RDF, H2RDF, CumulusRDF, AdaptRDF, CliqueSquare, RDFDB, Akutan
  • And many more

Other Questions

  • How is the query UI (as seen on https://query.wikidata.org/) impacted by a move off Blazegraph?
    • Note that the UI is a standalone static JavaScript application, NOT hosted on the Blazegraph server
    • It MAY have accidental strong coupling with Blazegraph (TBD)
  • What are the current features and capabilities of Blazegraph that are non-standard (such as named queries)? How might they be supported?