{{See|See also the blog posts about the "[https://blog.wikimedia.org/2016/04/11/wikimedia-failover-test/ 2016 Failover]", "[https://blog.wikimedia.org/2017/04/18/codfw-temporary-editing-pause/ 2017 Failover]" and "[https://techblog.wikimedia.org/2021/07/23/june-2021-data-center-switchover/ 2021 Switchover]" on the Wikimedia blogs.}}


== Introduction ==
Datacenter switchovers are a standard response to certain types of disasters ([https://duckduckgo.com/?q=Datacenter+switchover web search]). Technology companies regularly practice them to make sure that everything will work properly when the process is needed. Switching between datacenters also makes it easier to do some maintenance work on non-active servers (e.g. database upgrades/changes): while we're serving traffic from datacenter A, we can do all that work at datacenter B.


A Wikimedia datacenter switchover (from eqiad to codfw, or vice-versa) comprises switching over multiple different components, some of which can happen independently and many of which need to happen in lockstep. This page documents all the steps needed to switch over from a master datacenter to another one, broken up by component. SRE Service Operations maintains the process and software necessary to run the switchover.

__TOC__
== Weeks in advance preparation ==
 
* 10 weeks before: Coordinate dates and communication plan with involved groups: [[Switch_Datacenter/Coordination]]
* 3 weeks before: Run a "live test" of the cookbooks by "switching" from the passive DC to the active DC. The <code>--live-test</code> flag will skip actions that could harm the active DC, or run them against the passive DC instead. This exercises most of the code paths used in the switchover and helps identify issues. The process will !log to SAL, so coordinate with others, but it should otherwise be non-disruptive. Because the infrastructure changes between switchovers, expect that code changes will be necessary; plan for the time and assistance needed. A sketch of such an invocation is shown below.
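For illustration only, a live-test run of one of the MediaWiki switchdc phases might look like the sketch below. The cookbook entry point, argument order, and the host to run it from are assumptions here; take the authoritative invocation from the cookbooks' own <code>--help</code> output.
<syntaxhighlight lang="bash">
# Hypothetical live-test invocation, run from a cluster management host.
# Argument order is illustrative: with eqiad active and codfw passive, the
# "switch" is simulated in the passive -> active (codfw -> eqiad) direction.
sudo cookbook sre.switchdc.mediawiki.00-reduce-ttl --live-test codfw eqiad
</syntaxhighlight>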


== Overall switchover flow ==
In a controlled switchover we first deactivate services in the primary datacenter, then deactivate caching there. The next step is to switch MediaWiki itself. About a week later we activate caching in that datacenter again, as we consider a week of testing without caching in the datacenter to be sufficient.

Typical scheduling looks like:
* Monday 14:00 UTC Services
* Monday 15:00 UTC Caching (traffic)
* Tuesday 14:00 UTC MediaWiki


The following week, caching is reactivated. Six or more weeks later, MediaWiki is switched back.


== Per-service switchover instructions ==
=== MediaWiki ===
We divide the process into logical phases that should be executed sequentially. Within any phase, top-level tasks can be executed in parallel with each other, while subtasks are to be executed sequentially. The phase number appears in the names of the tasks in the [[git:operations/cookbooks/|operations/cookbooks]] repository, under the [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/|cookbooks/sre/switchdc/mediawiki/]] path.
==== Days in advance preparation ====
# '''OPTIONAL: SKIP IN AN EMERGENCY''': Make sure databases are in a good state. Normally this requires no action, as the passive datacenter databases are always prepared to receive traffic. Some sanity checks that DBAs should normally perform to ensure the best possible state:
#* There is no ongoing long-running maintenance that affects database availability or lag (schema changes, upgrades, hardware issues, etc.). Depool those servers that are not ready.
#* Replication is flowing from eqiad -> codfw and from codfw -> eqiad (replication is usually stopped in the passive -> active direction to facilitate maintenance).
#* All database servers have their buffer pools filled. This is taken care of automatically by the [[MariaDB/buffer pool dump|automatic buffer pool warmup functionality]]. As a sanity check, some sample load could be sent to the MediaWiki application servers to verify that requests complete as quickly as in the active datacenter.
#* These were the [[Switch Datacenter/planned db maintenance#2018 Switch Datacenter|things we prepared/checked for the 2018 switch]].
# Make absolutely sure that parsercache replication is working from the active to the passive DC, and verify that the parsercache servers are set to read-write in the passive DC (see the sketch after this list). '''This is important'''.
# Check appserver weights on servers in the passive DC; make sure that newer hardware is weighted higher (usually 30) and older hardware lower (usually 25).
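As a minimal sketch of the replication and parsercache checks above (assuming direct MySQL access on the relevant hosts; host selection and any wrapper tooling are intentionally left out):
<syntaxhighlight lang="bash">
# On a core master in the passive DC: replication from the active DC should be running.
mysql --skip-ssl -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'

# On a parsercache host in the passive DC: it must be read-write (expected output: 0).
mysql --skip-ssl --batch --skip-column-names -e "SELECT @@global.read_only"
</syntaxhighlight>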


==== Phase 0 - preparation ====
# Disable puppet on maintenance hosts in both eqiad and codfw: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/00-disable-puppet.py|00-disable-puppet.py]]
# Reduce the TTL on <code>appservers-ro, appservers-rw, api-ro, api-rw, jobrunner, videoscaler, parsoid-php</code> to 10 seconds: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/00-reduce-ttl.py|00-reduce-ttl.py]]. '''Make sure that at least 5 minutes (the old TTL) have passed before moving to Phase 1; the cookbook should force you to wait.''' A way to spot-check the TTL is sketched after this list.
# Warm up APC by running the mediawiki-cache-warmup on the new site's clusters. The warmup queries will repeat automatically until the response times stabilize: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/00-warmup-caches.py|00-warmup-caches.py]]
#* The global "urls-cluster" warmup against the appservers cluster
#* The "urls-server" warmup against all hosts in the appservers cluster.
#*The "urls-server" warmup against all hosts in the api-appservers cluster.
# Set downtime for the read-only checks on the MariaDB masters that will be changed in Phase 3, so they don't page. '''This is not covered by the switchdc script.'''
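As an illustrative spot-check of step 2 (the record name below is just one of the listed discovery records; run it from a host that resolves production DNS):
<syntaxhighlight lang="bash">
# The second field of each answer line is the remaining TTL in seconds; after
# 00-reduce-ttl.py has run and the old TTL has expired, it should be at most 10.
dig +noall +answer appservers-rw.discovery.wmnet
</syntaxhighlight>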


==== Phase 1 - stop maintenance ====
# Stop maintenance jobs in both datacenters and kill all the periodic jobs (systemd timers) on maintenance hosts in both datacenters: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/01-stop-maintenance.py|01-stop-maintenance.py]]


==== Phase 2 - read-only mode ====
# Go to read-only mode by changing the <code>ReadOnly</code> conftool value: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/02-set-readonly.py|02-set-readonly.py]]


==== Phase 3 - lock down database masters ====
# Put old-site core DB masters (shards: s1-s8, x1, es4-es5) in read-only mode and wait for the new site's databases to catch up replication: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/03-set-db-readonly.py|03-set-db-readonly.py]]
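As a minimal sanity check, borrowed from earlier revisions of this runbook (host selection is omitted), the flag can be inspected directly on each old-site core master:
<syntaxhighlight lang="bash">
# Expected output on the old-site core masters after this phase: 1 (read-only).
mysql --skip-ssl --batch --skip-column-names -e "SELECT @@global.read_only"
</syntaxhighlight>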


==== Phase 4 - switch active datacenter configuration ====
# Switch the discovery records and MediaWiki active datacenter: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/04-switch-mediawiki.py|04-switch-mediawiki.py]]  
#* Flip <code>appservers-ro, appservers-rw, api-ro, api-rw, jobrunner, videoscaler, parsoid-php</code> to <code>pooled=true</code> in the new site. Since both sites are now pooled in etcd, this will not actually change the DNS records for the active datacenter.
#* Flip <code>WMFMasterDatacenter</code> from the old site to the new.
#* Flip <code>appservers-ro, appservers-rw, api-ro, api-rw, jobrunner, videoscaler, parsoid-php</code> to <code>pooled=false</code> in the old site. After this, DNS will be changed for the old DC and internal applications (except MediaWiki) will start hitting the new DC. (The underlying conftool operation is sketched below.)
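For context, what the cookbook does for each of these records corresponds roughly to the conftool operations below. This is a sketch only (record and datacenter names are examples for an eqiad-to-codfw switch); during a real switchover use the cookbook, not manual confctl calls.
<syntaxhighlight lang="bash">
# Pool a discovery record in the new site, then depool it in the old site (example: appservers-rw).
confctl --object-type discovery select 'dnsdisc=appservers-rw,name=codfw' set/pooled=true
confctl --object-type discovery select 'dnsdisc=appservers-rw,name=eqiad' set/pooled=false
# Inspect the resulting state.
confctl --object-type discovery select 'dnsdisc=appservers-rw' get
</syntaxhighlight>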
==== Phase 5 - Invert Redis replication for MediaWiki sessions ====
# Invert the Redis replication for the <code>sessions</code> cluster: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/05-invert-redis-sessions.py|05-invert-redis-sessions.py]]
==== Phase 6 - Set new site's databases to read-write ====
# Set new-site's core DB masters (shards: s1-s8, x1, es4-es5) in read-write mode: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/06-set-db-readwrite.py|06-set-db-readwrite.py]]
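Mirroring the Phase 3 check, a minimal verification on each new-site core master (host selection omitted):
<syntaxhighlight lang="bash">
# Expected output on the new-site core masters after this phase: 0 (read-write).
mysql --skip-ssl --batch --skip-column-names -e "SELECT @@global.read_only"
</syntaxhighlight>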
==== Phase 7 - Set MediaWiki to read-write ====
# Go to read-write mode by changing the <code>ReadOnly</code> conftool value: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/07-set-readwrite.py|07-set-readwrite.py]]
==== Phase 8 - Restore rest of MediaWiki ====


# Restart Envoy on the jobrunners that are now inactive, to trigger changeprop to re-resolve the DNS name and connect to the new DC: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/08-restart-envoy-on-jobrunners.py|08-restart-envoy-on-jobrunners.py]]
#*A steady rate of 500s is expected until this step is completed, because changeprop will still be sending edits to jobrunners in the old DC, where the database master will reject them.
# Start maintenance in the new DC: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/08-start-maintenance.py|08-start-maintenance.py]]
#* Run Puppet on the [[Maintenance_server|maintenance hosts]] in both datacenters, which will reactivate the systemd timers in the new primary DC (a quick check is sketched after this list).
#*Most Wikidata-editing bots will restart once this is done and the "[https://grafana.wikimedia.org/d/TUJ0V-0Zk/wikidata-alerts?orgId=1&refresh=5s dispatch lag]" has recovered. This should bring us back to 100% of editing traffic.
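A lightweight sanity check after this phase might look like the sketch below; the unit names are assumptions, so treat it as illustrative only:
<syntaxhighlight lang="bash">
# On the maintenance host of the new primary DC: periodic jobs should be scheduled again
# (adjust the pattern to the actual timer unit names).
sudo systemctl list-timers --all | grep -iE 'mediawiki|mwmaint'
# On a jobrunner in the old DC: confirm Envoy came back cleanly after the restart
# (the unit name "envoyproxy" is an assumption).
sudo systemctl status envoyproxy --no-pager | head -n 5
</syntaxhighlight>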


==== Phase 9 - Post read-only ====
# Update tendril for new database masters: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/09-update-tendril.py|09-update-tendril.py]]  
#* Purely cosmetic change, with no effect on production. No changes are required for the zarcillo database (which has a different master for eqiad and codfw).
#* The parsercache hosts and x2 will need to be updated manually in tendril; see [[phab:T266723|T266723]]. '''This is not covered by the switchdc script.'''
# Set the TTL for the DNS records to 300 seconds again: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/09-restore-ttl.py|09-restore-ttl.py]]
# Update DNS records for the new database masters by deploying [[gerrit:#/c/operations/dns/+/458787/|eqiad->codfw]] or [[gerrit:#/c/operations/dns/+/458790/|codfw->eqiad]]. '''This is not covered by the switchdc script.''' SAL-log it with <code>!log Phase 8.5 Update DNS records for new database masters</code>
# Run Puppet on the database masters in both DCs, to update expected read-only state: [[git:operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/09-run-puppet-on-db-masters.py|09-run-puppet-on-db-masters.py]].
# Make sure the CentralNotice banner informing users of the read-only period is removed. Keep in mind there is some minor HTTP caching involved (~5 minutes).
# Remove the downtime added in phase 0.
#Update [[git:operations/puppet/+/refs/heads/production/modules/profile/files/configmaster/disc_desired_state.py|disc_desired_state.py]] to reflect which services are pooled in which DCs. See [[phab:T286231|T286231]] as an example. '''This is not covered by the switchdc script.'''
#Re-order noc.wm.o's debug.json to have primary servers listed first, see [[phab:T289745|T289745]]. '''This is not covered by the switchdc script.'''


==== Phase 10 - verification and troubleshooting ====
'''This is not covered by the switchdc script'''
# Make sure reading & editing works! :)
# Make sure recent changes are flowing (see Special:RecentChanges, [[EventStreams]], and the IRC feeds); an example check is sketched after this list
# Make sure email works (<code>exim4 -bp</code> on mx1001/mx2001, test an email)
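For example, a quick way to eyeball that recent changes are flowing is to tail the public EventStreams recentchange stream from anywhere with outbound HTTPS access:
<syntaxhighlight lang="bash">
# Events should start streaming within a few seconds; the pipe to head stops it after 20 lines.
curl -s https://stream.wikimedia.org/v2/stream/recentchange | head -n 20
</syntaxhighlight>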


==== Dashboards ====
*[https://grafana.wikimedia.org/d/000000327/apache-fcgi?orgId=1 Apache/FCGI]
*[https://grafana.wikimedia.org/dashboard/db/mediawiki-application-servers?orgId=1 App servers]
*[https://grafana.wikimedia.org/d/kHk7W6OZz/ats-cluster-view?orgId=1&var-datasource=esams%20prometheus%2Fops&var-layer=backend&var-cluster=text ATS cluster view (text)]
*[https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&var-datasource=esams%20prometheus%2Fops&var-cluster=text&var-origin=api-rw.discovery.wmnet&var-origin=appservers-rw.discovery.wmnet&var-origin=restbase.discovery.wmnet ATS backends<->Origin servers overview (appservers, api, restbase)]
*[https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors MediaWiki errors]
*[https://logstash.wikimedia.org/app/kibana#/dashboard/87348b60-90dd-11e8-8687-73968bebd217 Database Errors logs]


=== ElasticSearch ===
==== General context on how to switchover ====


CirrusSearch talks by default to the local datacenter (<code>$wmfDatacenter</code>). If MediaWiki switches datacenter, Elasticsearch will automatically follow.


Manually switching CirrusSearch to a specific datacenter can always be done. For example, point CirrusSearch to codfw by editing <code>wmgCirrusSearchDefaultCluster</code> in [https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/InitialiseSettings.php#L16025-L16027 InitialiseSettings.php].


To ensure coherence in case of lost updates, a reindex of the pages modified during the switch can be done by following [[Search#Recovering_from_an_Elasticsearch_outage.2Finterruption_in_updates|Recovering from an Elasticsearch outage / interruption in updates]].
==== Preserving more_like query cache performance ====
CirrusSearch has a caching layer that caches the results of Elasticsearch queries such as "more like this" queries (which are used, among other things, to generate "Related Articles" at the bottom of mobile Wikipedia pages).

Switching datacenters will result in degraded performance while the cache fills back up.

To avoid this performance degradation, a mitigation should be deployed that hardcodes <code>more_like</code> queries to keep routing to the "old" datacenter for 24 hours following the switchover.

Hardcoding the cirrus cluster will allow the stampede of cache misses to be sent to the secondary search cluster, which, once typical traffic has migrated to the new datacenter, has enough capacity to serve the increased load.

This hardcoding should be deployed in advance of the switchover. Since it is effectively a no-op until the actual cutover, it can be deployed as far in advance as desired.

For example, if we are switching over from eqiad to codfw, <code>more_like</code> queries should be hardcoded to route to eqiad; this change should be deployed before the actual cutover.

Then, 24 hours after the cutover, the hardcoding can be removed, allowing <code>more_like</code> queries to route to the ''new'' cirrus DC - in this example, codfw.
===== Days in advance preparation =====
Deploy a patch to hardcode more_like query routing to the currently active DC (i.e. the datacenter we are switching over ''from'').

''Example Patch:'' [mediawiki-config] {{gerrit|635411}} cirrus: Temporarily hardcode more_like query routing

This mitigation should be left in place for 24 hours following the switchover (equivalent to the cache length), at which point there is no longer any performance penalty to removing the hardcoding.
===== One day after datacenter switch =====
Revert the earlier patch to hardcode more_like query routing; this will allow these queries to route to the newly active DC, and there will not be any performance degradation since the caches have been fully populated by this point.
==== Dashboards ====
* [https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles-prometheus?orgId=1 ElasticSearch Percentiles]


=== Traffic ===
It is relatively straightforward for us to depool an entire datacenter at the traffic level, and is regularly done during maintenance or outages. For that reason, we tend to only keep the datacenter depooled for about a week, which allows us to test for a full traffic cycle (in theory).
==== General information on generic procedures ====


See [[Global traffic routing]].


==== Switchover ====


GeoDNS (User-facing) Routing:


# gerrit: C+2 and Submit commit https://gerrit.wikimedia.org/r/#/c/operations/dns/+/458806
# <any authdns node>: authdns-update
# SAL Log using the following <code>!log Traffic: depool eqiad from user traffic</code>
(Running <code>authdns-update</code> from any authdns node will update all nameservers.)


==== Switchback ====


Same procedure as above, with reversion of the commit specified: [https://gerrit.wikimedia.org/r/#/c/operations/dns/+/458807 GeoDNS].


==== Dashboards ====
*[https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?orgId=1 Load Balancers]
*[https://grafana.wikimedia.org/d/000000479/frontend-traffic?refresh=1m&orgId=1&var-site=codfw&var-site=eqiad&var-cache_type=text&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4 Frontend traffic (text@eqiad/codfw)]
*[https://grafana.wikimedia.org/d/000000479/frontend-traffic?refresh=1m&orgId=1&var-site=codfw&var-site=eqiad&var-cache_type=upload&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4 Frontend traffic (upload@eqiad/codfw)]


=== Services ===
All services are active-active in DNS discovery, apart from RESTBase, which needs special treatment. The procedure to fail over to a single site is the same for each of them:
# reduce the TTL of the DNS discovery records to 10 seconds
# depool the datacenter we're moving away from in confctl / discovery
# restore the original TTL


* All of the above is done using the <code>sre.switchdc.services</code> cookbooks:
<syntaxhighlight lang="bash">
# Switch the service "parsoid" to codfw-only
$ cookbook sre.switchdc.services --services parsoid -- eqiad codfw
# Switch all active-active services to codfw, excluding parsoid and cxserver
$ cookbook sre.switchdc.services --exclude parsoid cxserver -- eqiad codfw
</syntaxhighlight>
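After the cookbook has run, an illustrative way to confirm that a record now resolves to the new site (the record name is an example; compare the returned address against the service's LVS IP in each site):
<syntaxhighlight lang="bash">
# The discovery record should now point at the codfw service IP.
dig +short parsoid.discovery.wmnet
</syntaxhighlight>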


Restbase is a bit of a special case and needs an additional step if we're just switching active traffic over and not simulating a complete failover:


# Pool restbase-async everywhere, then depool restbase-async in the newly active DC, so that async traffic is separated from real users' traffic as much as possible. (A sketch of the conftool operations is shown below.)
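As a sketch of what that step amounts to in conftool (datacenter names assume a switchover from eqiad to codfw; during the real event this is normally driven by the cookbooks rather than typed by hand):
<syntaxhighlight lang="bash">
# Pool restbase-async everywhere...
confctl --object-type discovery select 'dnsdisc=restbase-async' set/pooled=true
# ...then depool it in the newly active DC (codfw in this example), so async
# traffic lands on the now-passive site.
confctl --object-type discovery select 'dnsdisc=restbase-async,name=codfw' set/pooled=false
</syntaxhighlight>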


==== Manual steps ====
#Update WDQS lag reporting to point to the new primary DC, see [[gerrit:701927]] as an example and [[phab:T285710|T285710]] for more details. '''This is not covered by the switchdc script.'''


==== Dashboards ====
*[https://grafana.wikimedia.org/dashboard/db/eventstreams?refresh=1m&orgId=1 EventStreams]
*[https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&orgId=1 ORES]
*[https://grafana.wikimedia.org/dashboard/db/maps-performances?orgId=1 Maps]
*[https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?refresh=1m WDQS]


=== Databases ===
Main document: [[MariaDB/Switch Datacenter]]


=== Other miscellaneous ===
* [[Switch Datacenter/DeploymentServer|Deployment server]]
* [[Analytics/Systems/EventLogging|EventLogging]]
* [[Irc.wikimedia.org|IRC]], <s>[[RCStream]]</s>, [[EventStreams]]
 
== Upcoming and past switches ==
{{Anchor|Schedule of past switches}}
 
=== 2021 switches ===
{{tracked|T281515}}
 
;Schedule
* Services: [https://zonestamp.toolforge.org/1624888854 Monday, June 28th, 2021 14:00 UTC]
* Traffic: [https://zonestamp.toolforge.org/1624892434 Monday, June 28th, 2021 15:00 UTC]
* MediaWiki: [https://zonestamp.toolforge.org/1624975258 Tuesday, June 29th, 2021 14:00 UTC]
;Reports
* [https://techblog.wikimedia.org/2021/07/23/june-2021-data-center-switchover/ June 2021 Data Center Switchover] on the Wikimedia Tech blog
*[https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/XI57Z6T2DK7IC345VFTENM5RLTQBQDEQ/ Services and Traffic] on wikitech-l
* [https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/ENL3P5SA7RSOHPN4ILMXQ2BGBF5XR776/ MediaWiki] on wikitech-l (1m57s of read-only time)
* [[Incident documentation/2021-06-29 trwikivoyage primary db]]
 
'''Switching back:'''
 
{{Tracked|T287539}}
* Services: [https://zonestamp.toolforge.org/1631541650 Monday, Sept 13th 14:00 UTC]
*Traffic: [https://zonestamp.toolforge.org/1631545256 Monday, Sept 13th 15:00 UTC]
*MediaWiki: [https://zonestamp.toolforge.org/1631628029 Tuesday, Sept 14th 14:00 UTC]
;Reports
* [https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/6UZCCACCBCZLN5MHROZQXUG6ZOQTDCLO/ Datacenter switchover recap] on wikitech-l (2m42s of read-only time)
 
=== 2020 switches ===
{{tracked|T243314}}
;Schedule
* Services: Monday, August 31st, 2020 14:00 UTC
* Traffic: Monday, August 31st, 2020 15:00 UTC
* MediaWiki: Tuesday, September 1st, 2020 14:00 UTC
;Reports
* [[Incident documentation/2020-09-01 data-center-switchover]] (2m49s of read-only time)
 
'''Switching back:'''
* Traffic: Thursday, September 17th, 2020 17:00 UTC
* MediaWiki: Tuesday, October 27th, 2020 14:00 UTC
* Services: Wednesday, October 28th, 2020 14:00 UTC
 
=== 2018 switches ===
{{tracked|T199073}}
;Schedule
* Services: Tuesday, September 11th 2018 14:30 UTC
*Media storage/Swift: Tuesday, September 11th 2018 15:00 UTC
*Traffic: Tuesday, September 11th 2018 19:00 UTC
*MediaWiki: Wednesday, September 12th 2018: 14:00 UTC
;Reports
* [https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/LU63AEQYJDWRN4PL6OHFLT5ENMQBVFMW/ Datacenter Switchover recap] (7m34s of read-only time)
 
'''Switching back:'''
;Schedule
* Traffic: Wednesday, October 10th 2018 09:00 UTC
*MediaWiki: Wednesday, October 10th 2018: 14:00 UTC
*Services: Thursday, October 11th 2018 14:30 UTC
*Media storage/Swift: Thursday, October 11th 2018 15:00 UTC
;Reports
* [https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/YSAZEXIMF73OA3OBC4Z4SYJKC6I4EWJH/ Datacenter Switchback recap] (4m41s of read-only time)
 
=== 2017 switches ===
{{tracked|T138810}}
 
;Schedule
* Elasticsearch: elasticsearch is automatically following mediawiki switch
* Services: Tuesday, April 18th 2017 14:30 UTC
* Media storage/Swift: Tuesday, April 18th 2017 15:00 UTC
* Traffic: Tuesday, April 18th 2017 19:00 UTC
* MediaWiki: Wednesday, April 19th 2017 [https://www.timeanddate.com/worldclock/fixedtime.html?iso=20170419T14 14:00 UTC] (user visible, requires read-only mode)
* Deployment server: Wednesday, April 19th 2017 16:00 UTC
;Reports
* [https://blog.wikimedia.org/2017/04/18/codfw-temporary-editing-pause/ Editing pause for failover test] on Wikimedia Blog
 
'''Switching back:'''
;Schedule
* Traffic: Pre-switchback in two phases: Mon May 1 and Tue May 2 (to avoid cold-cache issues Weds)
* MediaWiki:  Wednesday, May 3rd 2017 [https://www.timeanddate.com/worldclock/fixedtime.html?iso=20170503T14 14:00 UTC] (user visible, requires read-only mode)
* Elasticsearch: elasticsearch is automatically following mediawiki switch
* Services: Thursday, May 4th 2017 14:30 UTC
* Swift: Thursday, May 4th 2017 15:30 UTC
* Deployment server: Thursday, May 4th 2017 16:00 UTC
;Reports
* [[Incident documentation/2017-05-03 missing index]]
* [[Incident documentation/2017-05-03 x1 outage]]
 
=== 2016 switches ===
;Schedule
* Deployment server: Wednesday, January 20th 2016
* Traffic: Thursday, March 10th 2016
* MediaWiki 5-minute read-only test: Tuesday, March 15th 2016, 07:00 UTC
* Elasticsearch: Thursday, April 7th 2016, 12:00 UTC
* Media storage/Swift: Thursday, April 14th 2016, 17:00 UTC
* Services: Monday, April 18th 2016, 10:00 UTC
* MediaWiki: Tuesday, April 19th 2016, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)
;Reports
* [https://blog.wikimedia.org/2016/04/11/wikimedia-failover-test/ Wikimedia failover test] on Wikimedia Blog
 
'''Switching back:'''
* MediaWiki: Thursday, April 21st 2016, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)
* Services, Elasticsearch, Traffic, Swift, Deployment server: Thursday, April 21st 2016, after the above is done
 
== Monitoring Dashboards ==
Aggregated list of interesting dashboards
 
*[https://grafana.wikimedia.org/dashboard/db/apache-hhvm?orgId=1 Apache/HHVM]
*[https://grafana.wikimedia.org/dashboard/db/mediawiki-application-servers?orgId=1 App servers]
*[https://grafana.wikimedia.org/dashboard/db/load-balancers?orgId=1 Load Balancers]
*[https://grafana.wikimedia.org/dashboard/db/frontend-traffic?refresh=1m&orgId=1 Frontend traffic]
*[https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-dc-stats?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=cache_upload&var-layear-layer=backend Varnish eqiad]
*[https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-dc-stats?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=cache_upload&var-layer=backend Varnish codfw]
* [https://grafana.wikimedia.org/dashboard/file/swift.json?orgId=1 Swift eqiad]
* [https://grafana.wikimedia.org/dashboard/file/swift.json?orgId=1&var-DC=codfw Swift codfw]
 
[[Category:SRE Service Operations]]
