Wikidata Query Service/Runbook
Wikidata Query Service is the Wikimedia implementation of a SPARQL server, based on the Blazegraph engine, serving queries for Wikidata and other data sets. The goal of this runbook is to provide short instructions on what to do in case of issues with the service. Please see the more detailed descriptions in the User Manual, Wikidata Query Service and Wikidata Query Service/Implementation.
Clusters
We operate 3 separate WDQS clusters, with different expectations in terms of availability and stability of response times. All clusters expose a read-only interface only. Writes are managed exclusively internally, with WDQS pulling data from different sources (mainly from Wikidata):
- wdqs-internal: The cluster used by other internal services, mainly MediaWiki. This cluster should be highly available, with predictable response times.
- wdqs: The public facing SPARQL endpoint. This cluster is used by external applications and researchers. While it does see "real user traffic", due to the nature of a SPARQL endpoint we do expect lower availability and more variable response times on this cluster. The users of this cluster should be aware of those limitations.
- wdqs-test: These 2 servers are used for testing and should see no "real user traffic".
Main Services
- wdqs-blazegraph: main blazegraph instance
- wdqs-categories: blazegraph instance dedicated to categories
- wdqs-updater: updates the main blazegraph instance from wikidata
Main dashboards
Known limitations
Exposing a SPARQL endpoint to the internet is similar to exposing an SQL endpoint. The user generated load can vary widely in complexity per request and cause widely varying performance. Throttling is applied based on overall computing time and error rate to protect the service from overly eager clients. The throttling is bucketed by user agent and IP address. An abusive client which does not back off when receiving HTTP 429 responses will be banned for a longer duration, during which all its requests will be denied with an HTTP 403.
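If you need to check whether a particular client identity is currently being throttled or banned, a quick probe from the command line can help. This is a sketch, not an official tool; replace the user agent with the one under investigation, and note that a Retry-After header is only relevant if the server sends one.

# 200 = OK, 429 = throttled (the client should back off, honouring Retry-After
# if present), 403 = banned.
curl -s -o /dev/null -D - \
  -H 'User-Agent: my-bot/1.0 (someone@example.org)' \
  -G 'https://query.wikidata.org/sparql' \
  --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 1'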
Further analysis
Before restarting wdqs-blazegraph or wdqs-updater, please collect multiple stack traces for further analysis:

for i in $(seq 1 10); do sudo -u blazegraph jcmd <pid> Thread.print > thread.${i}; sleep 5; done
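To find the <pid> of the Blazegraph instance, one option (a sketch; the user and service names are the ones used elsewhere in this runbook) is to list the JVMs visible to the blazegraph user with jcmd, then run the loop with that pid:

# jcmd -l prints "<pid> <main class / jar>" for each JVM the user can see;
# pick the Blazegraph one (not the updater).
sudo -u blazegraph jcmd -l

# Then capture 10 thread dumps, 5 seconds apart, replacing <pid> accordingly.
for i in $(seq 1 10); do
  sudo -u blazegraph jcmd <pid> Thread.print > thread.${i}
  sleep 5
done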
Known issues
Update lag
Synchronisation from Wikidata (and other data sources) can be viewed on Grafana. Common causes of lag are:
- Too much write load (too many edits on Wikidata)
- Too much read load on the WDQS systems, competing for I/O with the writes. You can check the queries currently running by going to http://localhost:9999/bigdata/status?showQueries=details (it's an HTML page, so ssh port forwarding to a local browser may be a good idea).
- Updater process crashed or is stuck. Check the update progress logs in /var/log/wdqs/wdqs-updater.log. It should report progress regularly, about once a minute or so. The progress report looks like this:

Jun 25 22:15:20 wdqs1003 bash[15223]: 22:15:20.654 [main] INFO org.wikidata.query.rdf.tool.Updater - Polled up to 2018-06-25T22:15:16Z at (7.1, 5.4, 5.3) updates per second and (1453.4, 1087.6, 1018.1) milliseconds per second
Watch the timestamp and the rates: the "Polled up to" time should be reasonably close (within a minute) to the current time, unless the Updater is catching up from lag, and the milliseconds-per-second number should be close to 1000. If it is significantly less, the server is overloaded and can't catch up with updates. High updates-per-second numbers can indicate high write load.
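A quick way to keep an eye on those progress lines from the host itself (a sketch using the log path above):

# Follow only the Updater progress lines; once caught up, the "Polled up to"
# timestamp should stay within about a minute of the current time.
tail -f /var/log/wdqs/wdqs-updater.log | grep --line-buffered 'Polled up to'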
Remediation
- Restart the updater process: service wdqs-updater restart
- Restart the blazegraph service: service wdqs-blazegraph restart. Note that the Updater can recover from a Blazegraph restart automatically, but if it's stuck, restarting it after the Blazegraph restart is recommended.
- Depool the problematic server until it catches up on updates, thus reducing the competition for I/O (see the sketch below).
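For the depooling step, a minimal sketch, assuming the standard conftool wrapper scripts are available on the wdqs host (adjust to your usual depooling procedure if not):

# On the affected wdqs host: take it out of the load balancer pools.
sudo depool

# ... wait for the update lag to recover (watch the Grafana dashboard) ...

# Put the host back in rotation once it has caught up.
sudo pool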
Blazegraph deadlock
Occasionally a query will cause the blazegraph process to become deadlocked, rendering it unresponsive. As a result, blazegraph - which functions as the backend for wdqs - will fail to respond, leading to nginx (the frontend) reporting a timeout (as its attempt to talk to blazegraph on the other end has failed).
The cause of deadlocked queries is unknown; it seems to be more likely with queries of higher complexity, but there is no single type of query that can cause the issue.
Symptoms:
- The graphs Triples, Lag, Queries per second and some others stop reporting data for the affected node in the wikidata-query-service Grafana dashboard.
- Blazegraph is unreachable and might die with an OOM at some point.
Remediation
If it affects a single node:
- Restart blazegraph on this node, and depool it if the lag is too high (> 1 hour).

If it affects multiple nodes: The main strategy is to take short-term steps to eliminate the pain while working to locate any offenders that are slamming WDQS with large volumes of high-complexity (read: lots of computation required) queries.
The short-term steps are to restart blazegraph on the affected node(s). This will unstick the process, but it can become deadlocked again by subsequent queries, which is why it is necessary to identify the offender(s). Sometimes the nodes fall back into deadlock so soon after a restart that the restarts stop being worth it; in those scenarios, focus all attention on finding whose queries are causing the deadlock.
Below we describe in more detail what to do when presented with a "query of death" scenario, which is ultimately what triggers the Blazegraph deadlock.
Identifying which nodes are operational
For Blazegraph deadlock issues (which is the result of a "query of death"), degree of outage can be viewed in the RDF Triples graph: https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=7&orgId=1&refresh=1m&var-cluster_name=wdqs

Nodes that are reporting null values for Triples are locked up and unable to respond to queries.
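To confirm from a given host whether its Blazegraph instance still answers at all, a trivial probe against the local SPARQL endpoint (the same endpoint used elsewhere in this page) can help; this is a sketch, not an official health check:

# A deadlocked instance typically hangs until the timeout instead of answering.
curl -s -o /dev/null --max-time 10 \
  --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 1' \
  http://localhost:9999/bigdata/namespace/wdq/sparql \
  && echo "blazegraph responded" || echo "no response within 10s"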
Identifying the user agent
In a "query of death" scenario, there's generally a certain query type that reliably correlates with the problem.
For example, in the Sep 2 2020 incident, the root cause appeared to be a malformed query that triggered (unsurprisingly) a MalformedQueryException.
In the July 23 2020 incident, the root cause was a poorly performing (large) query.
Kibana (the web frontend for Elasticsearch) is ideally suited to parsing logs and performing basic operations like summing, visualization, etc.
TIP: When adding filters (to pare the log messages down to just the relevant ones), Kibana can sometimes get very laggy, especially if there are a lot of filter options. To work around this, after you click Add a filter +, click Edit Query DSL to edit the DSL directly rather than using the graphical interface.
Here's an example of a simple filter that just makes sure the field stack_trace contains the string cause=java.util.concurrent.ExecutionException: org.openrdf.query.MalformedQueryException.
(See https://www.elastic.co/guide/en/elasticsearch/reference/7.x/query-filter-context.html#query-filter-context-ex for crafting more advanced query filters manually)
{
"query": {
"bool": {
"must": [
{ "match": { "stack_trace": "cause=java.util.concurrent.ExecutionException: org.openrdf.query.MalformedQueryException" }}
]
}
}
}
See https://logstash.wikimedia.org/goto/d386fbdc24d9f51c902b1bf50077558a for an example dashboard which filters only for MalformedQueryException (note: requires access to production logs).
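If you prefer the command line over Kibana, roughly the same question can be asked of the Elasticsearch API directly. The sketch below counts matching log entries per client user agent; the endpoint placeholder, the logstash-* index pattern and the user_agent.keyword field name are all assumptions, so check the actual index mapping first:

# Bucket matching log entries by client user agent to spot the main offender.
curl -s -XPOST 'https://<logstash-es-endpoint>/logstash-*/_search' \
  -H 'Content-Type: application/json' -d '
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "match": { "stack_trace": "cause=java.util.concurrent.ExecutionException: org.openrdf.query.MalformedQueryException" }}
      ]
    }
  },
  "aggs": {
    "by_user_agent": { "terms": { "field": "user_agent.keyword", "size": 20 } }
  }
}'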
Banning the user agent
Once you've isolated the queries causing blazegraph to lock up in the first place (as opposed to queries which are innocently timing out *because* blazegraph has already locked up), you should ban either the user agent or the source IP address in question. Generally a user agent ban is better, because the queries might be coming from an application which relies on, say, cloud EC2 instances, in which case the IP address itself is subject to change. Use your judgement.
User agents can be banned at the nginx level like so: https://github.com/wikimedia/puppet/commit/d683630d32855fbe6b167d4d215d87d5aca61366
Add the ban to the nginx config, and run Puppet across the relevant wdqs instances, which should automatically reload nginx on completion (make sure to check via sudo systemctl status nginx.service).
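A sketch of the rollout on a single wdqs host once the patch is merged (run-puppet-agent is the usual production wrapper; a plain puppet agent run works too):

# Apply the merged nginx ban on this host.
sudo run-puppet-agent

# Confirm nginx reloaded cleanly and is still serving.
sudo systemctl status nginx.service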
Cleanup
Once the correct user agent(s) have been banned, restart blazegraph on all affected nodes.
Take a mental note to circle back in the future to unblock any users whose user agents were banned. (Do this at a time when you'll be around to monitor and make sure the deadlock condition does not re-surface.)
Issues with Kafka / Recent Change pollers
WDQS supports 2 mechanisms for getting updates from Wikidata: one based on Kafka, one based on the Recent Changes API. In case of issues with one of those backends, switching to the other might help.
Procedure
The updater used is configured in Puppet, via the profile::wdqs::use_kafka_for_updates hiera variable (see 466722 for an example). The updater needs to be restarted for the change to take effect (service wdqs-updater restart). The Kafka based poller stores Kafka offsets in Blazegraph; those need to be reset when switching from Kafka to Recent Changes, to ensure that the Kafka poller will start from the right place when re-enabled (see Resetting Kafka Offsets for details). The following command needs to be run on each server where the poller configuration is changed:
curl -X POST --data-urlencode update@reset_kafka.sparql http://localhost:9999/bigdata/namespace/wdq/sparql
Where reset_kafka.sparql contains this query:
DELETE { ?z rdf:first ?head ; rdf:rest ?tail . } WHERE { [] wikibase:kafka ?list . ?list rdf:rest* ?z . ?z rdf:first ?head ; rdf:rest ?tail . } ; DELETE WHERE { <https://www.wikidata.org> wikibase:kafka ?o . };
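As an optional sanity check (not part of the documented procedure), you can confirm the offsets are gone by asking whether any wikibase:kafka triple is still attached to the Wikidata node; a boolean result of false means the reset worked:

# Returns a boolean; "false" means no Kafka offsets remain in Blazegraph.
curl -s --data-urlencode 'query=ASK { <https://www.wikidata.org> wikibase:kafka ?o }' \
  http://localhost:9999/bigdata/namespace/wdq/sparql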
Timeouts
Sadly, the public WDQS cluster is subject to varying performance depending on user generated load. Throttling should protect it, but this is an imperfect solution. Logstash provides a view of the different clients being throttled. So far, we have seen this issue only on the public WDQS cluster. The wdqs-categories service is usually not timing out. Overload is due to a combination of read and write load exceeding the current capacity.
Remediation
- Restart the blazegraph main service: systemctl restart wdqs-blazegraph
- Restart the blazegraph categories service: systemctl restart wdqs-categories (usually not necessary)
- Find the most offending client and ban its user agent (see 518691 as an example).
- Shut down wdqs-updater: systemctl stop wdqs-updater (this will obviously increase replication lag, but will protect read traffic).
Server de-synchronization
Ideally, all servers should have the same data. However, due to some glitches, it can happen that the data on one server differs from the data on another server. This can happen for a brief period due to update lag; however, if the data is old and still not in sync, this can lead to query results being inconsistent between servers, and thus the same query returning different results at different times. You can see which server returned the result by inspecting the x-served-by header in the response.
Remediation
In the case of servers being out of sync, the best way is to update the affected entities manually. Of course the exact entities depend on the nature of the query, so it can be tricky to find out. But if the entities are known, run this on each server:
cd /srv/deployment/wdqs/wdqs; bash runUpdate.sh -n wdq -N -- --ids Q1234,Q345,Q45567
where the argument to --ids is the list of IDs of the affected entities. This can be done at any time without affecting the function of the server.
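If you first need to confirm which server holds stale data for a suspect entity, one (hypothetical) approach is to ask each server directly when it last saw the entity change, using the schema:dateModified statement present in the Wikidata RDF data; diverging timestamps point at the stale host. Q1234 below is a placeholder:

# Run on each suspect wdqs host against the local endpoint.
curl -s --data-urlencode \
  'query=SELECT ?d WHERE { <http://www.wikidata.org/entity/Q1234> <http://schema.org/dateModified> ?d }' \
  http://localhost:9999/bigdata/namespace/wdq/sparql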
Logstash Spamming
Due to improper sanitization of user requests, logstash can become overwhelmed with loads of logging errors from logback. An example of this error is: Invalid format for the WKT value: value=11.576,48.216805555556. These errors can be seen here: https://logstash.wikimedia.org/app/kibana#/dashboard/default?_g=h@89f0c75&_a=h@8f80f8c
Remediation
Once the specific logger that is generating the spam is identified, it can be disabled by adding a line like <logger name="<logger_name>" level="off"/> to the logback config for wdqs and submitting the patch for review. The logback configuration is reloaded automatically; no service restart is required. An example of a submitted patch is: https://gerrit.wikimedia.org/r/c/operations/puppet/+/468958.
WDQS Updater service crashing due to memory error
Larger than usual updates can make the wdqs-updater service run out of memory. An example can be seen here: https://phabricator.wikimedia.org/T210235. OutOfMemoryErrors are only sent to journald, not to the logback log files. UpdateBuilder.bindValue() running out of memory is a good indication of this issue:
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]: at java.util.Arrays.copyOf(Arrays.java:3332)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]: at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]: at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:472)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]: at java.lang.StringBuffer.append(StringBuffer.java:310)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]: at java.lang.StringBuffer.append(StringBuffer.java:97)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]: at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:486)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]: at java.lang.StringBuffer.append(StringBuffer.java:338)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]: at java.util.regex.Matcher.appendReplacement(Matcher.java:890)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]: at java.util.regex.Matcher.replaceAll(Matcher.java:955)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]: at java.lang.String.replace(String.java:2240)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]: at org.wikidata.query.rdf.tool.rdf.UpdateBuilder.bindValue(UpdateBuilder.java:40)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]: at org.wikidata.query.rdf.tool.rdf.RdfRepository.syncFromChanges(RdfRepository.java:308)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]: at org.wikidata.query.rdf.tool.Updater.handleChanges(Updater.java:214)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]: at org.wikidata.query.rdf.tool.Updater.run(Updater.java:129)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]: at org.wikidata.query.rdf.tool.Update.run(Update.java:146)
Nov 23 11:28:14 wdqs1009 wdqs-updater[28004]: at org.wikidata.query.rdf.tool.Update.main(Update.java:81)
Nov 23 11:28:14 wdqs1009 systemd[1]: wdqs-updater.service: Main process exited, code=exited, status=1/FAILURE
Remediation
Temporarily reducing the batch size (the -b parameter) should allow the large updates to go through. Check task T210235 for more details.
Overload due to high edit rate
In some cases, a high edit rate will push the load on the service to a point where it can't keep up with both the read and the write load. In that case, shutting down the wdqs-updater service will provide some breathing room, at the cost of no updates being applied to that cluster.
Categories update lag
Categories are updated weekly from MediaWiki dumps. Due to the delay between dumps being created and categories being reloaded, the lag can be larger than a week. Updates are done through the loadCategoriesDaily.sh script, run as a cron job. Logs are in /var/log/wdqs/reloadCategories.log. Icinga alerts might be too aggressive; in case of an alert, re-scheduling the check might be enough to fix it.
Note that the script used for reloads, forAllCategoryWikis.sh, will skip wikis that are failing and move on to the next ones - you can find which wikis failed by looking for logs similar to:

ERROR:loadCategoriesDaily.sh - Failed for wowiktionary, moving on to the next entry...
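A quick way to list every wiki that failed in the last reload, using the log path and error format shown above (a sketch):

# List wikis that the reload script skipped due to failures.
grep 'Failed for' /var/log/wdqs/reloadCategories.log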