Switch Datacenter

{{See|See also the blog posts from the "[https://blog.wikimedia.org/2016/04/11/wikimedia-failover-test/ 2016 Failover]" and "[https://blog.wikimedia.org/2017/04/18/codfw-temporary-editing-pause/ 2017 Failover]" and "[https://techblog.wikimedia.org/2021/07/23/june-2021-data-center-switchover/ 2021 Switchover]" on the Wikimedia Blogs.}}


== Introduction ==
Datacenter switchovers are a standard response to certain types of disasters ([https://duckduckgo.com/?q=Datacenter+switchover web search]). Technology companies regularly practice them to make sure that everything will work properly when the process is needed. Switching between datacenters also makes it easier to do some maintenance work on non-active servers (e.g. database upgrades/changes): while we're serving traffic from datacenter A, we can do all that work at datacenter B.


A Wikimedia datacenter switchover (from eqiad to codfw, or vice-versa) comprises switching over multiple different components, some of which can happen independently and many of which need to happen in lockstep. This page documents all the steps needed to switch over from a master datacenter to another one, broken up by component. SRE Service Operations maintains the process and software necessary to run the switchover.


__TOC__
 
== Weeks in advance preparation ==
 
* 10 weeks before: Coordinate dates and communication plan with involved groups: [[Switch_Datacenter/Coordination]]
* 3 weeks before: Run a "live test" of the cookbooks by "switching" from the passive DC to the active DC. The <code>--live-test</code> flag will skip actions that could harm the active DC or do them on the passive DC. This will exercise most of the code paths used in the switchover and help identify issues. This process will !log to SAL, so you should coordinate with others, but it is otherwise non-disruptive. Due to changes since the last switchover you can expect code changes to become necessary, so factor in the time and assistance needed.
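For illustration, a live test of a single MediaWiki switchover cookbook could look like the sketch below. The cookbook name comes from the per-service instructions further down; the argument order and exact flags are assumptions, so check <code>--help</code> before running anything.
<syntaxhighlight lang="bash">
# Sketch: exercise one switchover cookbook in live-test mode, "switching" from the
# passive DC (here codfw) to the active DC (here eqiad). The --live-test flag skips
# or redirects actions that could harm the active datacenter.
sudo cookbook sre.switchdc.mediawiki.03-set-db-readonly --live-test codfw eqiad
</syntaxhighlight>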
 
=== Live-test limitations ===
==== 03-set-db-readonly ====
This step of the cookbook will fail if circular replication is not already enabled everywhere. It can be skipped if the live-test is run before circular replication is enabled, but it '''must''' be retested during [[Switch_Datacenter#Days_in_advance_preparation]] to ensure it is correctly set up.
 
== Overall switchover flow ==
In a controlled switchover we first deactivate services in the primary datacenter, then deactivate caching there. The next step is to switch MediaWiki itself. About a week later we reactivate caching in that datacenter, as roughly a week of running without caching there is considered a sufficient test.


Typical scheduling looks like:
* Monday 14:00 UTC Services
* Monday 15:00 UTC Caching (traffic)
* Tuesday 14:00 UTC Mediawiki


The following week we reactivate caching. Six or more weeks later we switch MediaWiki back.


== Per-service switchover instructions ==
=== MediaWiki ===
We divide the process in logical phases that should be executed sequentially. Within any phase, top-level tasks can be executed in parallel to each other, while subtasks are to be executed sequentially to each other. The phase number is referred to in the names of the tasks in the [[git:operations/cookbooks/|operations/cookbooks]] repository, in the [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/|cookbooks/sre/switchdc/mediawiki/]] path.


==== Days in advance preparation ====
# '''OPTIONAL: SKIP IN AN EMERGENCY''': Make sure databases are in a good state. Normally this requires no operation, as the passive datacenter databases are always prepared to receive traffic, so there are no actionables. Some sanity checks that DBAs should normally perform to ensure the most optimal state possible:
#* There is no ongoing long-running maintenance that affects database availability or lag (schema changes, upgrades, hardware issues, etc.). Depool those servers that are not ready.
#* Replication is flowing from eqiad -> codfw and from codfw -> eqiad (replication is usually stopped in the passive -> active direction to facilitate maintenance)
#* All database servers have their buffer pools filled up. This is taken care of automatically by the [[MariaDB/buffer pool dump|automatic buffer pool warmup functionality]]. As a sanity check, some sample load could be sent to the MediaWiki application servers to verify that requests are served as quickly as in the active datacenter.
#* These were the [[Switch Datacenter/planned db maintenance#2018 Switch Datacenter|things we prepared/checked for the 2018 switch]]
# Make absolutely sure that parsercache replication is working from the active to the passive DC. Verify that the parsercache servers are set to read-write in the passive DC (a spot-check is sketched after this list). '''This is important'''
# Check appserver weights on servers in the passive DC; make sure that newer hardware is weighted higher (usually 30) and older hardware lower (usually 25).
# Run <code>sre.switchdc.mediawiki.03-set-db-readonly</code> and <code>sre.switchdc.mediawiki.06-set-db-readwrite</code> in live-test mode back to back once circular replication is enabled. '''This is important'''
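A minimal spot-check for the parsercache point above, assuming local MariaDB client access on a parsercache replica in the passive DC (the commands are illustrative, not the official check):
<syntaxhighlight lang="bash">
# On a parsercache host in the passive datacenter:
# replication threads should be running and lag should be low...
sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'
# ...and the server should already be read-write (expect read_only = 0).
sudo mysql -e "SELECT @@GLOBAL.read_only"
</syntaxhighlight>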


{{Infobox2|data1='''Start the following steps about half an hour to an hour before the scheduled switchover time, in a tmux or a screen.'''}}
==== Phase 0 - preparation ====
# Add a scheduled maintenance on [https://manage.statuspage.io/pages/nnqjzz7cd4tj/incidents StatusPage] (Maintenances -> Schedule Maintenance) '''This is not covered by the switchdc script.'''
# Add a scap lock on the deployment server <code>scap lock --all "Datacenter Switchover - T12345"</code>. Do this in another tmux window, as it will stay there for you to unlock at the end of the procedure. '''This is not covered by the switchdc script.'''
# Disable puppet on maintenance hosts in both eqiad and codfw: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/00-disable-puppet.py|00-disable-puppet.py]]
# Reduce the TTL on <code>appservers-rw, api-rw, jobrunner, videoscaler, parsoid-php</code> to 10 seconds: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/00-reduce-ttl.py|00-reduce-ttl.py]] '''Make sure that at least 5 minutes (the old TTL) have passed before moving to Phase 1; the cookbook should force you to wait.''' A quick way to verify the lowered TTL is sketched after this list.
# '''Optional''' Warm up APC running the mediawiki-cache-warmup on the new site clusters. The warmup queries will repeat automatically until the response times stabilize: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/00-warmup-caches.py|00-warmup-caches.py]]
#* The global "urls-cluster" warmup against the appservers cluster
#* The "urls-server" warmup against all hosts in the appservers cluster.
#*The "urls-server" warmup against all hosts in the api-appservers cluster.
# Set downtime for the read-only checks on the MariaDB masters that are changed in Phase 3, so they don't page: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/00-downtime-db-readonly-checks.py|00-downtime-db-readonly-checks.py]]
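To verify the lowered TTL mentioned in the TTL step above, you can query the records directly (a sketch; run it from any production host that can resolve <code>.discovery.wmnet</code> names):
<syntaxhighlight lang="bash">
# The second field of each answer line is the remaining TTL; after the cookbook has
# run it should be 10 seconds or less (similarly for the other discovery records).
dig +noall +answer appservers-rw.discovery.wmnet api-rw.discovery.wmnet
</syntaxhighlight>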
{{Infobox2|data1='''Stop for GO/NOGO'''}}


==== Phase 1 - stop maintenance ====
# Stop maintenance jobs in both datacenters and kill all the periodic jobs (systemd timers) on maintenance hosts in both datacenters: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/01-stop-maintenance.py|01-stop-maintenance.py]]
 
{{Infobox2|data1='''Stop for final GO/NOGO before read-only.'''<br>The following steps until Phase 7 need to be executed in quick succession to minimize read-only time}}
==== Phase 2 - read-only mode ====
# Go to read-only mode by changing the <code>ReadOnly</code> conftool value: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/02-set-readonly.py|02-set-readonly.py]]
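For reference, the value the cookbook flips can be inspected with conftool. This is a sketch: the <code>mwconfig</code> object type and selector are assumptions about how the value is stored, so verify with <code>confctl --help</code> before relying on it.
<syntaxhighlight lang="bash">
# Show the current ReadOnly conftool object(s); the cookbook sets the value to true
# for the duration of the read-only window.
sudo confctl --object-type mwconfig select 'name=ReadOnly' get
</syntaxhighlight>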


==== Phase 3 - lock down database masters ====
# Put old-site core DB masters (shards: s1-s8, x1, es4-es5) in read-only mode and wait for the new site's databases to catch up replication: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/03-set-db-readonly.py|03-set-db-readonly.py]]


==== Phase 4 - switch active datacenter configuration ====


# Switch the discovery records and MediaWiki active datacenter: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/04-switch-mediawiki.py|04-switch-mediawiki.py]]  
#* Flip <code>appservers-rw, api-rw, jobrunner, videoscaler, parsoid-php</code> to <code>pooled=true</code> in the new site. Since both sites are now pooled in etcd, this will not actually change the DNS records for the active datacenter.
#* Flip <code>WMFMasterDatacenter</code> from the old site to the new.
#* Flip <code>appservers-rw, api-rw, jobrunner, videoscaler, parsoid-php</code> to <code>pooled=false</code> in the old site. After this, DNS will be changed for the old DC and internal applications (except MediaWiki) will start hitting the new DC.
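The resulting pooled state of the discovery records can be double-checked with conftool (a sketch; the selector syntax is an assumption, verify with <code>confctl --help</code>):
<syntaxhighlight lang="bash">
# After the cookbook has run, each MediaWiki discovery record should show
# pooled=true for the new site and pooled=false for the old one.
sudo confctl --object-type discovery select 'dnsdisc=appservers-rw' get
sudo confctl --object-type discovery select 'dnsdisc=api-rw' get
</syntaxhighlight>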


==== Phase 5 - DEPRECATED - Invert Redis replication for MediaWiki sessions ====
{{Outdated-inline|year=2023|note=Redis is not used for MW sessions anymore. Cookbook has been removed, go directly to [[Switch_Datacenter#Phase_6_-_Set_new_site's_databases_to_read-write|Phase 6]]}}


==== Phase 6 - Set new site's databases to read-write ====
# Set new-site's core DB masters (shards: s1-s8, x1, es4-es5) in read-write mode: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/06-set-db-readwrite.py|06-set-db-readwrite.py]]


==== Phase 7 - Set MediaWiki to read-write ====
# Go to read-write mode by changing the <code>ReadOnly</code> conftool value: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/07-set-readwrite.py|07-set-readwrite.py]]
{{Infobox2|data1='''You are now out of read-only mode'''<br>Breathe.}}
==== Phase 8 - Restore rest of MediaWiki ====


# Restart Envoy on the jobrunners that are now inactive, to trigger changeprop to re-resolve the DNS name and connect to the new DC: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/08-restart-envoy-on-jobrunners.py|08-restart-envoy-on-jobrunners.py]]  
#*A steady rate of 500s is expected until this step is completed, because changeprop will still be sending edits to jobrunners in the old DC, where the database master will reject them.
# Start maintenance in the new DC: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/08-start-maintenance.py|08-start-maintenance.py]]
#* Run puppet on the [[Maintenance_server|maintenance hosts]] in both datacenters, which will reactivate the systemd timers in the primary DC
#*Most Wikidata-editing bots will restart once this is done and the "[https://grafana.wikimedia.org/d/TUJ0V-0Zk/wikidata-alerts?orgId=1&refresh=5s dispatch lag]" has recovered. This should bring us back to 100% of editing traffic.
# End the planned maintenance in [https://manage.statuspage.io/pages/nnqjzz7cd4tj/incidents StatusPage]
==== Phase 9 - Post read-only ====
# Set the TTL for the DNS records to 300 seconds again: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/09-restore-ttl.py|09-restore-ttl.py]]
# Update DNS records for new database masters deploying [[gerrit:#/c/operations/dns/+/458787/|eqiad->codfw]]; [[gerrit:#/c/operations/dns/+/458790/|codfw->eqiad]] '''This is not covered by the switchdc script'''. Please use the following to SAL log <code>!log Phase 9.5 Update DNS records for new database masters</code>
# Run Puppet on the database masters in both DCs, to update expected read-only state: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/09-run-puppet-on-db-masters.py|09-run-puppet-on-db-masters.py]]. This will remove the downtimes set in Phase 0.
# Make sure the CentralNotice banner informing users of readonly is removed. Keep in mind, there is some minor HTTP caching involved (~5mins)
# Cancel the scap lock. You will need to go back to the terminal where you added the lock and press enter '''This is not covered by the switchdc script.'''
# Re-order noc.wm.o's debug.json to have primary servers listed first (see [[phab:T289745|T289745]]) and backport it using scap. This will test scap2 deployment. '''This is not covered by the switchdc script.'''
#Update maintenance server DNS records [[gerrit:893675|eqiad -> codfw]] '''This is not covered by the switchdc script.'''


==== Phase 10 - verification and troubleshooting ====
'''This is not covered by the switchdc script'''
# Make sure reading & editing works! :)
# Make sure recent changes are flowing (see Special:RecentChanges, [[EventStreams]], and the IRC feeds) <code>curl -s -H 'Accept: application/json'  https://stream.wikimedia.org/v2/stream/recentchange | jq .</code>
# Make sure email works (<code>sudo -i; sudo exim4 -bp | exiqsumm | tail -n 5</code> on mx1001/mx2001; the queue age should fluctuate between 0m and a few minutes). Also send a test email.
 
==== Audible indicator ====
Put [http://listen.hatnote.com/#fr,en,bn,de,es,ru,wikidata Listen to Wikipedia] in the background during the switchover. Silence indicates read-only; when it starts making sounds again, edits are flowing again.


==== Dashboards ====


*[https://grafana.wikimedia.org/d/000000327/apache-fcgi?orgId=1 Apache/FCGI]
*[https://grafana.wikimedia.org/dashboard/db/mediawiki-application-servers?orgId=1 App servers]
*[https://grafana.wikimedia.org/d/kHk7W6OZz/ats-cluster-view?orgId=1&var-datasource=esams%20prometheus%2Fops&var-layer=backend&var-cluster=text ATS cluster view (text)]
*[https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&var-datasource=esams%20prometheus%2Fops&var-cluster=text&var-origin=api-rw.discovery.wmnet&var-origin=appservers-rw.discovery.wmnet&var-origin=restbase.discovery.wmnet ATS backends<->Origin servers overview (appservers, api, restbase)]
*[https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors Logstash: mediawiki-errors]


=== ElasticSearch ===
==== General context on how to switchover ====


CirrusSearch talks by default to the local datacenter (<code>$wmgDatacenter</code>). No special actions are required when disabling a datacenter.


Manually switching CirrusSearch to a specific datacenter can always be done. Point CirrusSearch to codfw by editing <code>wgCirrusSearchDefaultCluster</code> [https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/ext-CirrusSearch.php#L12 ext-CirrusSearch.php].


To ensure coherence in case of lost updates, a reindex of the pages modified during the switch can be done by following [[Search#Recovering_from_an_Elasticsearch_outage.2Finterruption_in_updates|Recovering from an Elasticsearch outage / interruption in updates]].
==== Dashboards ====
* [https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles-prometheus?orgId=1 ElasticSearch Percentiles]


=== Traffic ===
It is relatively straightforward for us to depool an entire datacenter at the traffic level, and is regularly done during maintenance or outages. For that reason, we tend to only keep the datacenter depooled for about a week, which allows us to test for a full traffic cycle (in theory).


==== General information on generic procedures ====


See [[Global traffic routing]].


==== Switchover ====


GeoDNS (User-facing) Routing:


# gerrit: C+2 and Submit commit https://gerrit.wikimedia.org/r/#/c/operations/dns/+/458806
# <any authdns node>: authdns-update
# SAL Log using the following <code>!log Traffic: depool eqiad from user traffic</code>
(Running <code>authdns-update</code> from any authdns node will update all nameservers.)
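To confirm that the change has taken effect you can query one of the authoritative nameservers directly (a sketch; the record names are just examples of what to look at):
<syntaxhighlight lang="bash">
# The returned addresses should belong to the datacenter(s) still pooled for users
# once the GeoDNS change is live.
dig +short en.wikipedia.org @ns0.wikimedia.org
dig +short upload.wikimedia.org @ns0.wikimedia.org
</syntaxhighlight>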


==== Switchback ====


Same procedure as above, with reversion of the commit specified: [https://gerrit.wikimedia.org/r/#/c/operations/dns/+/458807 GeoDNS].


==== Dashboards ====
*[https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?orgId=1 Load Balancers]
*[https://grafana.wikimedia.org/d/000000479/frontend-traffic?refresh=1m&orgId=1&var-site=codfw&var-site=eqiad&var-cache_type=text&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4 Frontend traffic (text@eqiad/codfw)]
*[https://grafana.wikimedia.org/d/000000479/frontend-traffic?refresh=1m&orgId=1&var-site=codfw&var-site=eqiad&var-cache_type=upload&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4 Frontend traffic (upload@eqiad/codfw)]


=== Services ===
 
==== General procedure ====
All services are active/active in DNS discovery, apart from restbase, which needs special treatment. The procedure to fail over to a single site is the same for each of them:
# reduce the TTL of the DNS discovery records to 10 seconds
# depool the datacenter we're moving away from in confctl / discovery
# restore the original TTL
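For a single service, the manual equivalent of the depool step looks roughly like the following (a sketch with an illustrative service name; the cookbooks below are the supported way to do this):
<syntaxhighlight lang="bash">
# Depool one service from the datacenter being drained (service name is an example).
sudo confctl --object-type discovery select 'dnsdisc=cxserver,name=eqiad' set/pooled=false
# Check the resulting state in both datacenters.
sudo confctl --object-type discovery select 'dnsdisc=cxserver' get
</syntaxhighlight>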
 
In the case of a global switchover, all of the above is done using the <code>sre.discovery.datacenter</code> cookbook.


==== Switchover ====
<syntaxhighlight lang="bash">
# Switch all services to codfw
$ sudo cookbook sre.discovery.datacenter depool eqiad --all --reason "Datacenter Switchover" --task-id T12345
</syntaxhighlight>This will depool all active/active services, and prompt you to move or skip active/passive services.
==== Switchback ====
<syntaxhighlight lang="bash">
# Repool eqiad
$ sudo cookbook sre.discovery.datacenter pool eqiad --all --reason "Datacenter Switchback" --task-id T12345
</syntaxhighlight>This will repool all active/active services, and prompt you to move or skip active/passive services.


=== Special cases ===


==== Exclusions ====
If you need to exclude some services, using the old <code>sre.switchdc.services</code> cookbook is still necessary until exclusion support is implemented.<syntaxhighlight lang="bash">
# Switch all services to codfw, excluding parsoid and cxserver
$ sudo cookbook sre.switchdc.services --exclude parsoid cxserver -- eqiad codfw
</syntaxhighlight>


==== Single service ====
If you are switching only one service, using the old <code>sre.switchdc.services</code> is still necessary<syntaxhighlight lang="bash">
# Switch the service "parsoid" to codfw-only
$ sudo cookbook sre.switchdc.services --services parsoid -- eqiad codfw
</syntaxhighlight>


==== apt ====
apt.wikimedia.org needs a [[gerrit:893409|puppet change]].


==== restbase-async ====
Restbase is a bit of a special case, and needs an additional step, if we're just switching active traffic over and not simulating a complete failover:


# pool restbase-async everywhere<syntaxhighlight lang="bash">
sudo cookbook sre.discovery.service-route --reason T123456 pool --wipe-cache $dc_from restbase-async
sudo cookbook sre.discovery.service-route --reason T123456 pool --wipe-cache $dc_to restbase-async
</syntaxhighlight>
# depool restbase-async in the newly active dc, so that async traffic is separated from real-users traffic as much as possible.<syntaxhighlight lang="bash">
sudo cookbook sre.discovery.service-route --reason T123456 depool --wipe-cache $dc_to restbase-async
</syntaxhighlight>
 
When simulating a complete failover, keep restbase pooled in $dc_to for as long as possible to test capacity, then switch it to $dc_from by using the above procedure.
 
==== Manual switch ====
 
These services require manual changes to be switched over and have not yet been included in <code>service::catalog</code>.
 
* [[Switch Datacenter/DeploymentServer|Deployment server]]
 
* [[planet.wikimedia.org]]
** The DNS discovery name planet.discovery.wmnet needs to be switched from one backend to another as in example change [[gerrit:891369]]. No other change is needed.
 
* [[people.wikimedia.org]]
** In puppet hieradata the rsync_src and rsync_dst hosts need to be flipped as in example change [[gerrit:891382]].
** FIXME: manual rsync command has to be run
** The DNS discovery name peopleweb.discovery.wmnet needs to be switched from one backend to another as in example change [[gerrit:891381]].
 
* [[noc.wikimedia.org]]
** The [[noc.wikimedia.org]] DNS name points to DNS discovery name mwmaint.discovery.wmnet that needs to be switched from one backend to another as in example change [[gerrit:896118]]. No other change is needed.
 
==== Dashboards ====
*[https://grafana.wikimedia.org/dashboard/db/eventstreams?refresh=1m&orgId=1 EventStreams]
*[https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&orgId=1 ORES]
*[https://grafana.wikimedia.org/dashboard/db/maps-performances?orgId=1 Maps]
*[https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?refresh=1m WDQS]
 
=== Databases ===
Main document: [[MariaDB/Switch Datacenter]]


=== Other miscellaneous ===
* [[Analytics/Systems/EventLogging|EventLogging]]
* [[Irc.wikimedia.org|IRC]], <s>[[RCStream]]</s>, [[EventStreams]]
 
== Upcoming and past switches ==
{{Anchor|Schedule of past switches}}
=== 2023 switches ===
{{tracked|T327920}}
 
;Schedule
* Services: [https://zonestamp.toolforge.org/1677592832 Tuesday, February 28th, 2023 14:00 UTC]
* Traffic: [https://zonestamp.toolforge.org/1677596427 Tuesday, February 28th, 2023 15:00 UTC]
* MediaWiki: [https://zonestamp.toolforge.org/1677679227 Wednesday, March 1st, 2023 14:00 UTC]
'''Reports'''
 
* [[listarchive:list/wikitech-l@lists.wikimedia.org/thread/QXNSWHT7G2TUZRTYKLOGJR7IHEAHXWK7/|Recap]]
* Read only: 1 minute 59 seconds
 
 
'''Switching back:'''
 
'''Schedule'''
* Services: [https://zonestamp.toolforge.org/1682431229 Tuesday, April 25th, 2023 14:00 UTC]
* Traffic: March 14 2023
* MediaWiki: [https://zonestamp.toolforge.org/1682517645 Wednesday, April 26th 2023 14:00 UTC]
'''Reports'''
 
* Read only: 3 minutes 1 second
 
=== 2021 switches ===
{{tracked|T281515}}
 
;Schedule
* Services: [https://zonestamp.toolforge.org/1624888854 Monday, June 28th, 2021 14:00 UTC]
* Traffic: [https://zonestamp.toolforge.org/1624892434 Monday, June 28th, 2021 15:00 UTC]
* MediaWiki: [https://zonestamp.toolforge.org/1624975258 Tuesday, June 29th, 2021 14:00 UTC]
;Reports
* [https://techblog.wikimedia.org/2021/07/23/june-2021-data-center-switchover/ June 2021 Data Center Switchover] on the Wikimedia Tech blog
*[https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/XI57Z6T2DK7IC345VFTENM5RLTQBQDEQ/ Services and Traffic] on wikitech-l
* [https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/ENL3P5SA7RSOHPN4ILMXQ2BGBF5XR776/ MediaWiki] on wikitech-l
* [[Incident documentation/2021-06-29 trwikivoyage primary db]]
* Read only duration: 1 minute 57 seconds
 
'''Switching back:'''
 
{{Tracked|T287539}}
* Services: [https://zonestamp.toolforge.org/1631541650 Monday, Sept 13th 14:00 UTC]
*Traffic: [https://zonestamp.toolforge.org/1631545256 Monday, Sept 13th 15:00 UTC]
*MediaWiki: [https://zonestamp.toolforge.org/1631628029 Tuesday, Sept 14th 14:00 UTC]
;Reports
* [https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/6UZCCACCBCZLN5MHROZQXUG6ZOQTDCLO/ Datacenter switchover recap] on wikitech-l
* Read only duration: 2 minutes 42 seconds
 
=== 2020 switches ===
{{tracked|T243314}}
;Schedule
* Services: Monday, August 31st, 2020 14:00 UTC
* Traffic: Monday, August 31st, 2020 15:00 UTC
* MediaWiki: Tuesday, September 1st, 2020 14:00 UTC
;Reports
* [[Incident documentation/2020-09-01 data-center-switchover]]
* Read only duration: 2 minutes 49 seconds
 
'''Switching back:'''
* Traffic: Thursday, September 17th, 2020 17:00 UTC
* MediaWiki: Tuesday, October 27th, 2020 14:00 UTC
* Services: Wednesday, October 28th, 2020 14:00 UTC
 
=== 2018 switches ===
{{tracked|T199073}}
;Schedule
* Services: Tuesday, September 11th 2018 14:30 UTC
*Media storage/Swift: Tuesday, September 11th 2018 15:00 UTC
*Traffic: Tuesday, September 11th 2018 19:00 UTC
*MediaWiki: Wednesday, September 12th 2018: 14:00 UTC
;Reports
* [https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/LU63AEQYJDWRN4PL6OHFLT5ENMQBVFMW/ Datacenter Switchover recap]
* Read only duration: 7 minutes 34 seconds
 
'''Switching back:'''
;Schedule
* Traffic: Wednesday, October 10th 2018 09:00 UTC
*MediaWiki: Wednesday, October 10th 2018: 14:00 UTC
*Services: Thursday, October 11th 2018 14:30 UTC
*Media storage/Swift: Thursday, October 11th 2018 15:00 UTC
;Reports
* [https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/YSAZEXIMF73OA3OBC4Z4SYJKC6I4EWJH/ Datacenter Switchback recap]
* Read only duration: 4 minutes 41 seconds
 
=== 2017 switches ===
{{tracked|T138810}}
 
;Schedule
* Elasticsearch: elasticsearch is automatically following mediawiki switch
* Services: Tuesday, April 18th 2017 14:30 UTC
* Media storage/Swift: Tuesday, April 18th 2017 15:00 UTC
* Traffic: Tuesday, April 18th 2017 19:00 UTC
* MediaWiki: Wednesday, April 19th 2017 [https://www.timeanddate.com/worldclock/fixedtime.html?iso=20170419T14 14:00 UTC] (user visible, requires read-only mode)
* Deployment server: Wednesday, April 19th 2017 16:00 UTC
;Reports
* [https://blog.wikimedia.org/2017/04/18/codfw-temporary-editing-pause/ Editing pause for failover test] on Wikimedia Blog
* [[listarchive:list/wikitech-l@lists.wikimedia.org/thread/OLPWQECTXOT57IGPCFMCLHOG4ADCMM4S/|Recap]]
* Read only duration: 17 minutes


'''Switching back:'''
;Schedule
* Traffic: Pre-switchback in two phases: Mon May 1 and Tue May 2 (to avoid cold-cache issues Weds)
* MediaWiki:  Wednesday, May 3rd 2017 [https://www.timeanddate.com/worldclock/fixedtime.html?iso=20170503T14 14:00 UTC] (user visible, requires read-only mode)
* Elasticsearch: elasticsearch is automatically following mediawiki switch
* Services: Thursday, May 4th 2017 14:30 UTC
* Swift: Thursday, May 4th 2017 15:30 UTC
* Deployment server: Thursday, May 4th 2017 16:00 UTC
;Reports
* [[Incident documentation/2017-05-03 missing index]]
* [[Incident documentation/2017-05-03 x1 outage]]
* [[listarchive:list/wikitech-l@lists.wikimedia.org/thread/OLPWQECTXOT57IGPCFMCLHOG4ADCMM4S/|Recap]]
* Read only duration: 13 minutes


=== 2016 switches ===
;Schedule
* Deployment server: Wednesday, January 20th 2016
* Traffic: Thursday, March 10th 2016
* MediaWiki 5-minute read-only test: Tuesday, March 15th 2016, 07:00 UTC
* Elasticsearch: Thursday, April 7th 2016, 12:00 UTC
* Media storage/Swift: Thursday, April 14th 2016, 17:00 UTC
* Services: Monday, April 18th 2016, 10:00 UTC
* MediaWiki: Tuesday, April 19th 2016, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)
;Reports
* [https://blog.wikimedia.org/2016/04/11/wikimedia-failover-test/ Wikimedia failover test] on Wikimedia Blog


'''Switching back:'''
* MediaWiki: Thursday, April 21st 2016, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)
* Services, Elasticsearch, Traffic, Swift, Deployment server: Thursday, April 21st 2016, after the above is done
== Monitoring Dashboards ==
Aggregated list of interesting dashboards
*[https://grafana.wikimedia.org/dashboard/db/apache-hhvm?orgId=1 Apache/HHVM]
*[https://grafana.wikimedia.org/dashboard/db/mediawiki-application-servers?orgId=1 App servers]
*[https://grafana.wikimedia.org/dashboard/db/load-balancers?orgId=1 Load Balancers]
*[https://grafana.wikimedia.org/dashboard/db/frontend-traffic?refresh=1m&orgId=1 Frontend traffic]
*[https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-dc-stats?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=cache_upload&var-layer=backend Varnish eqiad]
*[https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-dc-stats?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=cache_upload&var-layer=backend Varnish codfw]
* [https://grafana.wikimedia.org/dashboard/file/swift.json?orgId=1 Swift eqiad]
* [https://grafana.wikimedia.org/dashboard/file/swift.json?orgId=1&var-DC=codfw Swift codfw]
[[Category:SRE Service Operations]]
