Incident documentation/2019-10-31 wikidata

Revision as of 19:37, 31 March 2021 by imported>Krinkle (Krinkle moved page Incident documentation/20191031-wikidata to Incident documentation/2019-10-31 wikidata)

document status: final


Wikidata API calls timed out instead of returning responses because of high DB read load. The load was caused by backported changes intended to reduce deadlocks when writing to the new terms store for Wikibase.


Wikidata editors received timeouts on API requests, and API response times for writes rose sharply. It seems that most edits made through API calls were actually saved, but the clients did not get a response confirming that.


Reported by humans in the Telegram chat. Confirmed by Addshore.


All times in UTC.

  • 16:00 - Maintenance script migrating wb_terms data restarted
  • 16:02 <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.3/extensions/Wikibase: Wikibase deadlock reduction, Stop locking and use DISTINCT when finding used terms to delete (T236466) (duration: 01m 05s)
  • 16:05 <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.4/extensions/Wikibase: Wikibase deadlock reduction, Stop locking and use DISTINCT when finding used terms to delete (T234948) (duration: 01m 04s)
  • 16:12 - Read rows on 1 db slave shot up
  • ~16:18 - Edit rate on wikidata really started dropping
  • 16:30 - Maintenance script migrating wb_terms data restarted (picking up code changes)?
  • 16:38 - Timeout when editing in the UI reported in Telegram chat
  • 16:54 - Phabricator task created after Addshore spotted this message -
  • 16:59 <jynus> killed rebuildItemTerms on mwmaint1002
  • ~17:00 Edit rate on wikidata recovering, but drops again -
  • 17:26 <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.3/extensions/Wikibase: Revert 16:02 UTC T236928 (duration: 01m 04s)
  • ~17:26 Edit rate on wikidata recovering again
  • 17:29 <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.4/extensions/Wikibase: Revert 16:05 UTC T236928 (duration: 01m 05s)


  • We could do with more alarms on things that often indicate a problem
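
As an illustration of the kind of alarm meant here, the following is a minimal sketch of an edit-rate drop check. It is not an existing Wikimedia alert: the function name `edit_rate_dropped`, the per-minute sampling, the 10-sample window, the 50% threshold, and the sample numbers are all assumptions chosen for the example.

```python
from statistics import median


def edit_rate_dropped(samples, window=10, threshold=0.5):
    """Return True if the newest sample is far below the recent baseline.

    samples:   edit counts per interval, oldest first (hypothetical data).
    window:    number of preceding samples used as the baseline (assumption).
    threshold: fraction of the baseline below which we alert (assumption).
    """
    if len(samples) < window + 1:
        return False  # not enough history to form a baseline
    # Baseline is the median of the `window` samples before the newest one.
    baseline = median(samples[-(window + 1):-1])
    if baseline <= 0:
        return False  # no meaningful baseline to compare against
    return samples[-1] < threshold * baseline


# Hypothetical per-minute edit counts, not measurements from this incident.
healthy = [600, 610, 590, 605, 600, 595, 610, 600, 598, 602, 601]
degraded = healthy[:-1] + [200]

print(edit_rate_dropped(healthy))   # False
print(edit_rate_dropped(degraded))  # True
```

Using a median baseline rather than a mean keeps a single noisy sample from shifting the comparison point. A real alert would likely also require the condition to hold for several intervals before firing.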

What went well?

  • Was not a total outage, just severe slowness

What went poorly?

  • No alarms went off
  • Only a message in Telegram alerted us to an issue (not even a phab task)

Where did we get lucky?

  • People who knew what the problem was were on hand (as the issue did not coincide with deployment time)

How many people were involved in the remediation?

  • 2 SREs?
  • 2 Wikidata Devs