You are browsing a read-only backup copy of Wikitech. The live site can be found at

Incidents/2022-08-16 x2 databases replication breakage

From Wikitech-static
< Incidents
Revision as of 22:47, 18 August 2022 by imported>Cwhite (→‎Scorecard)
Jump to navigation Jump to search

document status: draft


Incident metadata (see Incident Scorecard)
Incident ID 2022-08-16 x2 databases replication breakage Start 04:07:58
Task T315274 End 04:43:22
People paged 9 Responder count 7
Coordinators CWhite Affected metrics/SLOs
Impact For approximately 36 minutes, errors messages were noticeable for editors. While edit count didn't seem to be affected, the errors impacted or made it look like it impacted edit-related actions by showing error messages. ~1 hour of volatile cache data was lost.

Summary: An operator mistake while trying to fix a minor cross-datacenter replication issue caused replication to break within the primary datacenter, which made mediawiki output user-noticeable exceptions, mainly on edit-related activities.

Cross-dc replication for x2 section had broken earlier in the day, mainly caused by the application requiring STATEMENT based replication and the databases using ROW based replication. While this caused a split brain state- where eqiad and codfw now drifting its cache data state, this had low to no impact- as codfw traffic at the time was restricted to testwiki, test2wiki and, in a very test-like environement. Service itself would have not been impacted even if codfw and eqiad shared 50-50 the user load.

After paging, operators tried to fix the ongoing issue by running SET GLOBAL sql_slave_skip_counter = X, for values of 1 and 2 a few times. This caused an actual user-noticeable outage, by breaking replication within the datacenter- so MediaWiki started generating exceptions at "edit speed rate":

Explicit transaction still active; a caller might have failed to call endAtomic() or cancelAtomic().

While a few other things were attempted, such as wiping out all its table data and resetting replication, the way replication was restored was by:

  • Switching active write hosts to STATEMENT-based replication
  • Disabling GTID (this fixed the cross-dc replication)
  • Skipping statements on the replicas with CHANGE MASTER until they started replicating STATEMENT binlog positions

Approximately 1 hour of data (cache) from x2 was lost and the servers ended up replicating but with different data each. Because both data and logs were purged at earlier states of debugging, later root cause findings were much harder to find and required .


All times in UTC.

File:Atomic MediaWiki errors.png
Exception error log during the incident
File:Atomic edit rate.png
Edit rate during the incident
  • 01:38:00 moved to multi-dc
  • 03:08:12 Delete query sent to codfw x2 primary: #220816 3:08:12 server id 180363291 end_log_pos 1038450247 CRC32 0xcc3de9f1 Annotate_rows: #Q> DELETE /* SqlBagOStuff::deleteServerObjectsExpiringBefore */ FROM `objectstash` WHERE (exptime < '20220816020812') AND keyname IN ('azwikibooks:captcha:1072943127',...) (full query on internal doc)

Note this was the query that blew up, not necessarily the one that triggered the issue- logs and data were deleted so research is hard- the main thing is both primary servers had, even if for some time, different data when executing this.

  • 03:08:23 SERVICE ALERT: db2144;MariaDB Replica SQL: x2 #page;CRITICAL;SOFT;1;CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table mainstash.objectstash: Can't find record in 'objectstash', Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the event's master log db1151-bin.000810, end_log_pos 5460559
  • 03:10:03 SERVICE ALERT: db1151;MariaDB Replica SQL: x2 #page;CRITICAL;SOFT;1;CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table mainstash.objectstash: Can't find record in 'objectstash', Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the event's master log db2144-bin.000813, end_log_pos 1038458517 - At this moment, there is a split brain, but given the, at the time, load on codfw, no big user impact.
  • 03:42 Cwhite escalated by page to Amir, Manuel, Jaime. Amir was waken up by this. (Manuel was on holidays at the time)
  • 03:54 Reverted multi-dc patch
  • 04:07:58 Amir tries a few things on db1151: START/STOP slave, sql_skip_slave_counter several times (OUTAGE STARTS HERE), then TRUNCATE TABLE, then RESET SLAVE, unsuccessfully.
  • 04:08:07 SERVICE ALERT: db2143;MariaDB Replica IO: x2 #page;CRITICAL;SOFT;1;CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1236, Errmsg: Got fatal error 1236 from master when reading data from binary log: 'could not find next log: the first event 'db2144-bin.000670' at 517376598, the last event read from 'db2144-bin.000813' at 1044345737, the last byte read from 'db2144-bin.000813' at 1044345768.'
  • 04:20 Amir called in Jaime
  • 04:26 Incident opened. cwhite becomes IC.
  • 04:37 Switched MainStash back to Redis patch is prepared (never deployed)
  • 04:39 Jaime switched master to STATEMENT, flushed logs to apply it
  • 04:39 Switched codfw master to disable GTID, that made it work (this was done on codfw first as a test)
  • 04:39 Switched eqiad replicas to both disable GTID and jump to latest STATEMENT coord, skipping ~1h of data. Untouched codfw replicas for debugging later. Replicas start caching up.
  • 04:43:22 “Explicit transaction still active” errors stopped. OUTAGE ENDS HERE
  • 04:44 Reverted switching MainStash back to Redis
  • 04:45:40 SERVICE ALERT: db1152;MariaDB Replica Lag: x2 #page;OK;HARD;10;OK slave_sql_lag Replication lag: 0.00 seconds
  • 04:48 eqiad X2 replication restored, users report able to save edits again
  • 05:00 Decision is made to leave replication broken on db2142 and db2143 for forensics during working hours.


2 pages were sent as soon as replication broke on the primary servers, which were promptly attended. The user-facing outage created more pages on the replicas- although unsure if sufficiently different from the first 2 pages to indicate the severity. User reports reached SRE at this point too, but people were already working on a resolution by then.


What went well?

  • Coordination of responders and decisions went as well as it can be organized- pages were attended, people escalated the issue appropriately and promptly
  • This didn't happen on a regular metadata or content section db, which would have required mass, multi-terabyte-data backup recovery
  • Decision to depool codfw for mediawiki was taken quickly

What went poorly?

  • Unsafe, discouraged operational procedures were run on a section's primary server (sql_slave_skip_counter)
  • Configuration on servers did not match expectations by the application (ROW vs STATEMENT)
  • Application was not tested with the same configuration as production beforehand
  • Exceptions should have not been generated for errors that were not fatal
  • Production testers were not immediately aware of production errors/pages triggered by multi-dc work.
  • Page text doesn't make immediately clear what is the potential impact (the same text for db1151 and db1152 can mean user-impacting outage or minor issue depending on the context)
  • Application purges may need tuning to be fully IDEMPOTENT (e.g. DELETE IGNORE?)
  • "Doing nothing" would have been way less impacting than trying to fix issues in 2 ways- errors would have not been produced to users at all, and even fixing it could have been possible in a non-dba way (fixing the split brain by depooling codfw, fixing the exceptions by leaving pooled a single "good" server)
  • Radical actions, such as deleting all data and deleting all logs made debugging during the issue and after it much harder- generating a copy or stopping replication on one dc or host would have helped. Also testing those on passive hosts to verify they work would have been wiser, rather than applying it on all at once.
  • Service details (peculiarities) of the service were not known for some of the outage responders, important actions were not documented or promptly shared. Information didn't flow smoothly and proactively, even after the outage end - delaying the fix and root cause research.

How many people were involved in the remediation?

  • 7

Links to relevant documentation


  • task T315427 Switch x2 to statement-based replication, and in general re-review db configuration (disable GTID, read-write, slave_exec_mode)
  • Review runbooks/procedures for x2 (and parsercache), specifically regarding replication issues
    • E.g. Sometimes doing nothing and having a split brain is better than trying to fix stuff manually (e.g. waiting & depooling a dc, and cloning afterwards)
  • Restore replication on codfw replicas, repool them, remove downtimes/silences, recheck read-only mode Yes Done
  • Validate MW’s concept of multi-master conflict resolution by performing simultaneous writes and simultaneous purges on both DCs
    • As far as I was told, UPSERTs seemed to work but purges may need review (DELETE IGNORE?)
  • Fix uncaught exception from LoadBalancer::approvePrimaryChanges() which caused total failure rather than graceful failure
  • Re-enable multi-DC mode on testwiki, test2wiki and mediawikiwiki
  • See tasks task T315271 (initial replication breakage/multi-dc implications) and task T315274 (user visible outage, mw errors, within-dc replication issues) for discussion. See for full logs and other data.


Incident Engagement ScoreCard
Question Answer


People Were the people responding to this incident sufficiently different than the previous five incidents? yes overlap of 4
Were the people who responded prepared enough to respond effectively yes
Were fewer than five people paged? no
Were pages routed to the correct sub-team(s)? yes
Were pages routed to online (business hours) engineers?  Answer “no” if engineers were paged after business hours. no
Process Was the incident status section actively updated during the incident? yes
Was the public status page updated? no
Is there a phabricator task for the incident? yes
Are the documented action items assigned? yes* only 1 unassigned
Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? yes
Tooling To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are

open tasks that would prevent this incident or make mitigation easier if implemented.

Were the people responding able to communicate effectively during the incident with the existing tooling? yes
Did existing monitoring notify the initial responders? yes
Were the engineering tools that were to be used during the incident, available and in service? yes
Were the steps taken to mitigate guided by an existing runbook? no
Total score (count of all “yes” answers above) 11