
Help:Toolforge/Database/Replica drift

From Wikitech
<noinclude>{{Template:Toolforge nav}}</noinclude>
 
{{note|This page is historical and documents a problem which has been addressed.}}

== Overview ==

This page documents an issue known as Replica drift.

This problem was '''solved''' for the Wiki Replicas by introducing row-based replication starting from the production databases, so most issues, if not all, should already have disappeared (only anecdotal cases could reappear). {{tracked|T138967|Resolved}}

If you detect what you think is a drift, report a ticket on Wikimedia's [[mw:Phabricator|Phabricator]] (https://phabricator.wikimedia.org) with the SQL query you ran, the results you expected, and the results you obtained, and tag it with #data-services and #dba.
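For instance, a minimal report might contain something like the following sketch (the page title and the counts are hypothetical); run the query on the replica and compare the result with what the live wiki shows:

<syntaxhighlight lang="sql">
-- Hypothetical evidence for a drift report; the page title and counts are made up.
-- Run against the replica, e.g. enwiki_p on enwiki.analytics.db.svc.eqiad.wmflabs.
SELECT COUNT(*) AS replica_revisions
FROM revision
WHERE rev_page = (SELECT page_id
                  FROM page
                  WHERE page_namespace = 0
                    AND page_title = 'Example');
-- Expected (from the page history on the live wiki): 42
-- Obtained (from the replica):                       40
</syntaxhighlight>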

== History ==

Replica drift was a recurring problem for the Wiki Replicas prior to the [[phab:phame/post/view/70/new_wiki_replica_servers_ready_for_use/|introduction]] of [https://dev.mysql.com/doc/refman/5.7/en/replication-rbr-usage.html row-based replication (RBR)] between the [[Labsdb redaction|sanitarium server(s)]] and their upstream sources. The RBR replication used to populate the <code>*.{analytics,web}.db.svc.eqiad.wmflabs</code> servers does not allow arbitrary differences in data to be synchronized: if there is a replication failure, it halts all replication from the master server, which in turn raises an alert so that the problem is noticed and corrected.
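As a rough illustration of the mechanism (not the actual Wikimedia configuration, which is managed through Puppet), row-based replication is selected on a MySQL/MariaDB source through the binary log format, and a halted replica is visible in its replication status:

<syntaxhighlight lang="sql">
-- Illustrative only; settings shown here are not the real Wikimedia configuration.

-- On the source: log changes as row images rather than as SQL statements.
SET GLOBAL binlog_format = 'ROW';

-- On a replica: check whether replication has halted and why.
SHOW SLAVE STATUS\G
-- Relevant fields include Slave_SQL_Running (No when applying has stopped),
-- Last_SQL_Errno / Last_SQL_Error (the row event that could not be applied),
-- and Seconds_Behind_Master.
</syntaxhighlight>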

Prior to the switch to RBR, the replicated databases were not exact copies of the production database, which caused the database to slowly drift from the production contents. This was visible in various queries, but queries that involved recently deleted/restored pages seemed to be affected the most. The impact of this was kept as small as possible by regular database re-imports.

=== Why did this happen? ===

The cause of the drift was that certain data-altering MediaWiki queries, under certain circumstances, produced different results on the Wiki Replicas than on production. Simply repeating on the replicas every statement sent to the master server did not guarantee identical results, so the databases would slowly drift apart.

For example, when a revision is undeleted by MediaWiki, it is done with a query something like:

INSERT INTO revision SELECT * FROM archive WHERE ...

That query can produce different output when executed on different servers. The id given to the restored revision can differ because, on one server, that id was blocked by another connection; if the locks are different, the ids are different, and the replicas drift.
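The following sketch uses simplified, hypothetical tables (not the real MediaWiki schema) to show why such a statement is not deterministic: nothing in it pins down the new ids, so each server assigns them from its own auto_increment counter, in whatever order it happens to read the archived rows.

<syntaxhighlight lang="sql">
-- Hypothetical, heavily simplified tables; the real archive/revision tables
-- have many more columns and constraints.
CREATE TABLE demo_archive (
  ar_title VARBINARY(255) NOT NULL,
  ar_text  BLOB NOT NULL
);

CREATE TABLE demo_revision (
  rev_id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  rev_title VARBINARY(255) NOT NULL,
  rev_text  BLOB NOT NULL
);

-- No explicit ids and no ORDER BY: the rev_id values created here depend on
-- each server's auto_increment counter and on the order in which the archive
-- rows are read. Replayed as a statement on a replica, the same undelete can
-- therefore produce different rows than it did on the master.
INSERT INTO demo_revision (rev_title, rev_text)
SELECT ar_title, ar_text
FROM demo_archive
WHERE ar_title = 'Example';
</syntaxhighlight>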

Based on reports, the main offenders were probably deleting/undeleting pages and auto_increment ids. In the long term, this should be solved on the MediaWiki side. (See phab:T108255, phab:T112637)

=== Why doesn't this happen on production? ===

The solution in production is the nuclear option: if a server is detected to have a difference, we nuke it and clone it, which takes 1 hour. This is not possible in Cloud Services due to several differences between production and the Wiki Replicas:

* We cannot simply copy from production because the table contents have to be sanitized.
* The copy cannot be a binary copy because the Wiki Replica servers use extra compression.

=== How did things get better? ===

* [[phab:T136860|3 new database servers were ordered]]. With these servers, we migrated back to InnoDB tables, which reduced clone time dramatically.
* We switched to row-based replication between the sanitarium servers and their upstream sources.
* A full reimport was done to bring the servers to a stable starting state.

== Communication and support ==

We communicate and provide support through several primary channels. Please reach out with questions and to join the conversation.

{| class="wikitable"
|+ Communicate with us
! Connect !! Best for
|-
| Phabricator Workboard #Cloud-Services || Task tracking and bug reporting
|-
| IRC Channel #wikimedia-cloud || General discussion and support
|-
| Mailing List cloud@ || Information about ongoing initiatives, general discussion and support
|-
| Announcement emails cloud-announce@ || Information about critical changes (all messages mirrored to cloud@)
|-
| News wiki page || Information about major near-term plans
|-
| Blog Clouds & Unicorns || Learning more details about some of our work
|}