Talk:Incidents/2022-03-27 wdqs outage


Rough notes around the incident from the Search team.

Bking: I see Icinga alerts matching "PROBLEM - WDQS SPARQL on wdqs2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook" in the #wikimedia-operations IRC room, but I can't find any emails for them. Action: Ensure that Search team SREs get emails for these failures.

As discussed here, the command-line utility jstack can detect deadlocks, and is installed on all wdqs hosts. Perhaps we can use it to monitor for these deadlocks.
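A rough sketch of what such a check might look like, assuming we scan the thread dump for the "Java-level deadlock" text that HotSpot's jstack prints when it finds one. The function names and the timeout are made up for illustration; how the Blazegraph PID gets looked up and how the result feeds into Icinga is left open. Note that jstack generally has to run as the same user as the JVM it attaches to.

```python
import subprocess

# Text HotSpot's jstack prints when it detects a deadlock,
# e.g. "Found one Java-level deadlock:".
DEADLOCK_MARKER = "Java-level deadlock"


def has_deadlock(thread_dump: str) -> bool:
    """Return True if a jstack thread dump reports a deadlock."""
    return DEADLOCK_MARKER in thread_dump


def dump_threads(pid: int, timeout_s: float = 30.0) -> str:
    """Run jstack against a JVM pid and return its stdout.

    Hypothetical helper: the timeout value is an assumption, and the
    caller is expected to supply the pid of the Blazegraph JVM.
    """
    result = subprocess.run(
        ["jstack", str(pid)],
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,
    )
    return result.stdout
```

Something like has_deadlock(dump_threads(pid)) could then be wrapped in a periodic check, with a timeout or non-zero exit from jstack treated as "unknown" rather than "healthy".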

We should also update https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Blazegraph_deadlock with the exact verbiage from the alerts and examples of what Grafana looks like during these outages.

Update the alert verbiage itself to say "restart blazegraph service on X".


Potential things to alert on

Thread count plateau: see 2002 and 2007 at https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=1648382400000&to=1648396800000

Sustained load avg 15 leads
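For the thread count plateau idea, a minimal sketch of one possible detection rule: flag a host when its most recent thread-count samples are all at or above some floor and barely move. The function name, window size, tolerance and floor are all placeholders, not values taken from the dashboards; real thresholds would need to come from looking at the graphs linked above.

```python
from typing import Sequence


def is_plateau(
    samples: Sequence[float],
    window: int = 10,
    tolerance: float = 1.0,
    floor: float = 0.0,
) -> bool:
    """Return True if the last `window` samples look like a flat plateau.

    A plateau here means: every recent sample is >= `floor`, and the
    spread (max - min) of the recent samples is <= `tolerance`.
    All parameter defaults are illustrative assumptions.
    """
    if window < 2 or len(samples) < window:
        return False  # not enough data to say anything
    recent = samples[-window:]
    return max(recent) - min(recent) <= tolerance and min(recent) >= floor
```

For example, is_plateau(thread_counts, window=10, floor=500) would only fire once ten consecutive samples sit flat at 500 threads or more, which avoids alerting on a host that is simply idle and flat at a low count.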

Performance improvements

We have NUMA enabled on these nodes. Is that a good idea?

Do we have a perf testing environment? There may be other tunables we should look into.