You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org
Incidents/2022-08-16 x2 databases replication breakage
document status: draft
|Incident ID||2022-08-16 x2 databases replication breakage||Start||04:07:58|
|People paged||9||Responder count||7|
|Impact||For approximately 36 minutes, errors messages were noticeable for editors. While edit count didn't seem to be affected, the errors impacted or made it look like it impacted edit-related actions by showing error messages. ~1 hour of volatile cache data was lost.|
Edit: Root cause was initially set to an operator error- however, upon further research, the operator was told by the service owners that such operations would be safe to perform (with the underlying bug being unknown to all), so IC decided to give more weight to the bug as the real cause of the incident.
Summary: An unexpected MediaWiki behavior when replication broke between local Mainstash cache databases produced user-noticeable exceptions, mainly on edit-related activities. Unlike thought at the time, Chronology protector was setup on this section (x2), and so replicas were required to be up and not lagged in order to allow normal writes to the primary- creating a single point of failure on the replication setup.
Cross-dc replication for x2 section had broken earlier in the day, mainly caused by the application requiring
STATEMENT based replication and the databases using
ROW based replication. While this caused a split brain state- where eqiad and codfw now drifting its cache data state, this had low to no impact- as codfw traffic at the time was restricted to testwiki, test2wiki and mediawiki.org, in a very test-like environment. Service itself would have not been impacted even if codfw and eqiad shared 50-50 the user load.
After paging, operators tried to fix the ongoing issue by running
SET GLOBAL sql_slave_skip_counter = X, for values of 1 and 2 a few times. This caused replication to break also on the local replicas. Documentation/tickets said that breaking replicas within a datacenter shouldn't cause an outage, as they were supposed to be fully passive/failover hosts, however, because chronology protector was still enabled, this caused an actual user-noticeable outage- so MediaWiki started generating exceptions at "edit speed rate" (even if edits went through):
Explicit transaction still active; a caller might have failed to call endAtomic() or cancelAtomic().
While a few other things were attempted, such as wiping out all its table data and resetting replication, the way replication was restored was by:
- Switching active write hosts to STATEMENT-based replication
- Disabling GTID (this fixed the cross-dc replication)
- Skipping statements on the replicas with CHANGE MASTER until they started replicating STATEMENT binlog positions
Approximately 1 hour of data (cache) from x2 was lost and the servers ended up replicating but with different data each. Because both data and logs were purged at earlier states of debugging, later root cause findings were much harder to find and required .
All times in UTC.
- 01:38:00 mediawiki.org moved to multi-dc
Delete query sent to codfw x2 primary: #220816 3:08:12 server id 180363291 end_log_pos 1038450247 CRC32 0xcc3de9f1 Annotate_rows: #Q> DELETE /* SqlBagOStuff::deleteServerObjectsExpiringBefore */ FROM `objectstash` WHERE (exptime < '20220816020812') AND keyname IN ('azwikibooks:captcha:1072943127',...)(full query on internal doc)
Note this was the query that blew up, not necessarily the one that triggered the issue- logs and data were deleted so research is hard- the main thing is both primary servers had, even if for some time, different data when executing this.
SERVICE ALERT: db2144;MariaDB Replica SQL: x2 #page;CRITICAL;SOFT;1;CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table mainstash.objectstash: Can't find record in 'objectstash', Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the event's master log db1151-bin.000810, end_log_pos 5460559
SERVICE ALERT: db1151;MariaDB Replica SQL: x2 #page;CRITICAL;SOFT;1;CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table mainstash.objectstash: Can't find record in 'objectstash', Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the event's master log db2144-bin.000813, end_log_pos 1038458517- At this moment, there is a split brain, but given the, at the time, load on codfw, no big user impact.
- 03:42 Cwhite escalated by page to Amir, Manuel, Jaime. Amir was waken up by this. (Manuel was on holidays at the time)
- 03:54 Reverted multi-dc patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/823231
- 04:07:58 Amir tries a few things on db1151:
sql_skip_slave_counterseveral times (OUTAGE STARTS HERE), then
TRUNCATE TABLE, then
RESET SLAVE, unsuccessfully.
SERVICE ALERT: db2143;MariaDB Replica IO: x2 #page;CRITICAL;SOFT;1;CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1236, Errmsg: Got fatal error 1236 from master when reading data from binary log: 'could not find next log: the first event 'db2144-bin.000670' at 517376598, the last event read from 'db2144-bin.000813' at 1044345737, the last byte read from 'db2144-bin.000813' at 1044345768.'
- 04:20 Amir called in Jaime
- 04:26 Incident opened. cwhite becomes IC.
- 04:37 Switched MainStash back to Redis patch is prepared (never deployed) https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/823275/
- 04:39 Jaime switched master to STATEMENT, flushed logs to apply it
- 04:39 Switched codfw master to disable GTID, that made it work (this was done on codfw first as a test)
- 04:39 Switched eqiad replicas to both disable GTID and jump to latest STATEMENT coord, skipping ~1h of data. Untouched codfw replicas for debugging later. Replicas start caching up.
- 04:43:22 “Explicit transaction still active” errors stopped. OUTAGE ENDS HERE
- 04:44 Reverted switching MainStash back to Redis https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/823232
- 04:45:40 SERVICE ALERT: db1152;MariaDB Replica Lag: x2 #page;OK;HARD;10;OK slave_sql_lag Replication lag: 0.00 seconds
- 04:48 eqiad X2 replication restored, users report able to save edits again
- 05:00 Decision is made to leave replication broken on db2142 and db2143 for forensics during working hours.
2 pages were sent as soon as replication broke on the primary servers, which were promptly attended. The user-facing outage created more pages on the replicas- although unsure if sufficiently different from the first 2 pages to indicate the severity. User reports reached SRE at this point too, but people were already working on a resolution by then.
What went well?
- Coordination of responders and decisions went as well as it can be organized- pages were attended, people escalated the issue appropriately and promptly
- This didn't happen on a regular metadata or content section db, which would have required mass, multi-terabyte-data backup recovery
- Decision to depool codfw for mediawiki was taken quickly
What went poorly?
- Expected and documented application behavior (local dc replicas being fully passive) was not actual observed behavior: passive hosts were actually vital for the application to consider the service in good health
- Configuration on servers did not match expectations by the application (ROW vs STATEMENT)
- Application was not tested with the same configuration as production beforehand, or was not tested under common failure scenarios
- Exceptions should have not been generated for errors that were not fatal
- Unsafe, discouraged operational procedures were run on a section's primary server (sql_slave_skip_counter)
- Production testers were not immediately aware of production errors/pages triggered by multi-dc work.
- Page text doesn't make immediately clear what is the potential impact (the same text for db1151 and db1152 can mean user-impacting outage or minor issue depending on the context)
- Application purges may need tuning to be fully IDEMPOTENT (e.g. DELETE IGNORE?)
- "Doing nothing" would have been way less impacting than trying to fix issues in 2 ways- errors would have not been produced to users at all, and even fixing it could have been possible in a non-dba way (fixing the split brain by depooling codfw, fixing the exceptions by leaving pooled a single "good" server). However this was not possible to be known given the information known by operators at the time.
- Radical actions, such as deleting all data and deleting all logs made debugging during the issue and after it much harder- generating a copy or stopping replication on one dc or host would have helped. Also testing those on passive hosts to verify they work would have been wiser, rather than applying it on all at once.
- Service details (peculiarities) of the service were not known for some of the outage responders, important actions were not documented or promptly shared. Information didn't flow smoothly and proactively, even after the outage end - delaying the fix and root cause research.
How many people were involved in the remediation?
Links to relevant documentation
- task T315427 Switch x2 to statement-based replication, and in general re-review db configuration (disable GTID, read-write, slave_exec_mode)
- Review runbooks/procedures for x2 (and parsercache), specifically regarding replication issues
- E.g. Sometimes doing nothing and having a split brain is better than trying to fix stuff manually (e.g. waiting & depooling a dc, and cloning afterwards)
- Restore replication on codfw replicas, repool them, remove downtimes/silences, recheck read-only mode Done
- Validate MW’s concept of multi-master conflict resolution by performing simultaneous writes and simultaneous purges on both DCs
- As far as I was told, UPSERTs seemed to work but purges may need review (DELETE IGNORE?)
- Fix uncaught exception from LoadBalancer::approvePrimaryChanges() which caused total failure rather than graceful failure https://gerrit.wikimedia.org/r/c/mediawiki/core/+/823791
- Remove chronology protector checks so databases don't go into read only if local replicas crash or get replication broken: task T312809
- Re-enable multi-DC mode on testwiki, test2wiki and mediawikiwiki https://gerrit.wikimedia.org/r/c/operations/puppet/+/824039
- Make attempts to mistakenly depool a primary db have a more helpful message: task T314658
- See tasks task T315271 (initial replication breakage/multi-dc implications) and task T315274 (user visible outage, mw errors, within-dc replication issues) for discussion. See https://docs.google.com/document/d/1_nDpRvLEK9dGI2XVNeZVC1TcA89EYBCptiVCKItlWug for full logs and other data.
|People||Were the people responding to this incident sufficiently different than the previous five incidents?||yes||overlap of 4|
|Were the people who responded prepared enough to respond effectively||yes|
|Were fewer than five people paged?||no|
|Were pages routed to the correct sub-team(s)?||yes|
|Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours.||no|
|Process||Was the incident status section actively updated during the incident?||yes|
|Was the public status page updated?||no|
|Is there a phabricator task for the incident?||yes|
|Are the documented action items assigned?||yes*||only 1 unassigned|
|Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence?||yes|
|Tooling||To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are
open tasks that would prevent this incident or make mitigation easier if implemented.
|Were the people responding able to communicate effectively during the incident with the existing tooling?||yes|
|Did existing monitoring notify the initial responders?||yes|
|Were the engineering tools that were to be used during the incident, available and in service?||yes|
|Were the steps taken to mitigate guided by an existing runbook?||no|
|Total score (count of all “yes” answers above)||11|