
Switch Datacenter: Difference between revisions



== Per-service switchover instructions ==
=== MediaWiki ===
We divide the process into logical phases that should be executed sequentially. Within any phase, top-level tasks can be executed in parallel with each other, while subtasks are to be executed sequentially. The phase number is referred to in the names of the tasks in operations/switchdc [https://github.com/wikimedia/operations-switchdc].

==== Phase 0 - preparation ====
# (days in advance) Warm up databases; see [[MariaDB/buffer_pool_dump]].
# (days in advance) Prepare puppet patches:
#* Switch mw_primary [https://gerrit.wikimedia.org/r/346321]
#* Add direct route to $new_site for all mw-related cache::app_directors [https://gerrit.wikimedia.org/r/346320]
#* Comment-out direct route to $old_site in all mw-related cache::app_directors [https://gerrit.wikimedia.org/r/346322]
# Disable puppet on all jobrunners/videoscalers, on the maintenance hosts, and on the varnishes
# Merge the mediawiki-config [https://gerrit.wikimedia.org/r/346251 switchover changes] but don't sync. '''This is not covered by the switchdc script'''
# Reduce the TTL on <tt>appservers-rw, api-rw, imagescaler-rw</tt> to 10 seconds (a quick check is sketched below)
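A quick way to confirm the lowered TTL; the record names come from the step above, while serving them under <tt>discovery.wmnet</tt> is an assumption of this sketch:
<syntaxhighlight lang="bash">
# The second field of each answer line is the TTL, which should now read 10.
for rec in appservers-rw api-rw imagescaler-rw; do
  dig +noall +answer "${rec}.discovery.wmnet"
done
</syntaxhighlight>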


==== Phase 1 - stop maintenance ====
# Stop the jobqueues in the ''active site''
#* Merge https://gerrit.wikimedia.org/r/#/c/284403/
#* run <code>salt -C 'G@cluster:jobrunner and G@site:codfw' cmd.run 'service jobrunner stop; service jobchron stop;'</code>
# Kill all the cronjobs on the maintenance host in the active site
#* Merge https://gerrit.wikimedia.org/r/#/c/284404/
#* run <code>ssh wasat.codfw.wmnet 'sudo puppet agent -t; sudo killall php; sudo killall php5;'</code>
#* manually check for any scripts that still need to be killed (see the sketch below)
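A minimal sketch of that manual check, assuming the maintenance host named in the step above:
<syntaxhighlight lang="bash">
# List any PHP maintenance scripts or cron-started jobs still running; anything
# found here has to be killed or allowed to finish before going read-only.
ssh wasat.codfw.wmnet 'ps -eo pid,etime,args | grep -E "[p]hp|[m]wscript"'
</syntaxhighlight>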


==== Phase 2 - read-only mode ====
# Go to read-only mode by deploying mediawiki-config with all shards set to read-only, i.e. syncing <tt>wmf-config/db-$old-site.php</tt> (a quick API check is sketched below)
#* Merge https://gerrit.wikimedia.org/r/#/c/284402/
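An illustrative sanity check that MediaWiki really is read-only; it assumes the siteinfo API exposes a readonly flag while read-only mode is active:
<syntaxhighlight lang="bash">
# Should print a "readonly" entry (with the configured reason) while in read-only mode.
curl -s 'https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=general&format=json' \
  | grep -o '"readonly[^,}]*'
</syntaxhighlight>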
 
==== Phase 3 - lock down database masters ====
# Put the old site's core DB masters (shards: s1-s7, x1, es2-es3) in [[/db read-only|read-only mode]]. Excluded are the parsercache masters (dual masters), the standalone es1 servers (always read-only), and the misc and labs servers (for now, as they are independent from MediaWiki and do not yet have clients in codfw).
#* Check with: <code>sudo salt -C 'G@mysql_role:master and G@site:eqiad and G@mysql_group:core' cmd.run 'mysql --skip-ssl --batch --skip-column-names -e "SELECT @@global.read_only"'</code>
#* Change with: <code>sudo salt -C 'G@mysql_role:master and G@site:eqiad and G@mysql_group:core' cmd.run 'mysql --skip-ssl --batch --skip-column-names -e "SET GLOBAL read_only=1"'</code>
# Wait for the new site's databases to catch up on replication (see the check below)
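An illustrative replication check, reusing the salt grains from the commands elsewhere on this page (adjust the site grain to the new site):
<syntaxhighlight lang="bash">
# Replication has caught up once Seconds_Behind_Master reports 0 on the new site's masters.
sudo salt -C 'G@mysql_role:master and G@site:codfw and G@mysql_group:core' cmd.run \
  'mysql --skip-ssl -e "SHOW SLAVE STATUS\G" | grep -E "Seconds_Behind_Master|Slave_SQL_Running"'
</syntaxhighlight>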
 
==== Phase 4.1 - Wipe caches ====
# Wipe the ''new site''<nowiki/>'s memcached to prevent stale values, but only once the new site's read-only master/slaves have caught up.
#* run <code>salt -C 'G@cluster:memcached and G@site:eqiad' cmd.run 'service memcached restart'</code> (adjust the site grain to the newly active site)
# Restart all HHVM servers in the new site to clear the APC cache (an illustrative command is sketched below)
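A sketch of the HHVM restart, batched so the new site's appservers are never all down at once; the cluster grain values are assumptions made by analogy with the other salt targets on this page:
<syntaxhighlight lang="bash">
# Restart HHVM in 10% batches on the new site's appserver and API clusters to clear APC.
salt -b 10% -C 'G@cluster:appserver and G@site:codfw' cmd.run 'service hhvm restart'
salt -b 10% -C 'G@cluster:api_appserver and G@site:codfw' cmd.run 'service hhvm restart'
</syntaxhighlight>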


==== Phase 4.2 - Warmup caches in the new site ====
''This phase will be executed by the t04_cache_wipe task of switchdc, because there is no speed gain from splitting phase 4.1 and phase 4.2 into separate tasks, and they are logically related.''
# Warm up memcached and APC by running the mediawiki-cache-warmup against the new site's clusters (a minimal stand-in is sketched below), specifically:
#* the global warmup against the appservers cluster
#* the apc-warmup against all hosts in at least the appservers and api clusters
# Alternatively, warm up memcached and APC with: <syntaxhighlight lang="bash">
apache-fast-test wiki-urls-warmup1000.txt <new_active_dc>
</syntaxhighlight>
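If neither tool is at hand, a plain URL loop against the new site's appservers does the same job; the LVS hostname follows the <tt>appserver.svc.$site.wmnet</tt> naming used later on this page, and the URL file is assumed to contain one path per line:
<syntaxhighlight lang="bash">
# Not the mediawiki-cache-warmup tool itself, just an illustrative warmup loop.
while read -r path; do
  curl -s -o /dev/null -H 'Host: en.wikipedia.org' "http://appserver.svc.codfw.wmnet${path}"
done < wiki-urls-warmup1000.txt
</syntaxhighlight>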


==== Phase 5 - switch active datacenter configuration ====
# Switch the datacenter in puppet: merge the switch of <tt>$mw_primary</tt> (the <code>$app_routes['mediawiki']</code> setting) and add the direct route to $new_site for all mw-related cache::app_directors. Both changes can be puppet-merged together. '''This is not covered by the switchdc script.''' (Puppet is only involved in managing traffic, db alerts, and the jobrunners.)
#* Merge https://gerrit.wikimedia.org/r/#/c/284397
# Switch the discovery and the datacenter in mediawiki-config (<code>$wmfMasterDatacenter</code>); see the confctl sketch below
#* Flip <tt>appservers-rw, api-rw, imagescaler-rw</tt> to <tt>pooled=true</tt> in the new site. This will not actually change the DNS records, but the on-disk redis config will change.
#* Deploy the <tt>wmf-config/ConfigSettings.php</tt> changes to switch the datacenter in MediaWiki
#:* Merge https://gerrit.wikimedia.org/r/#/c/284398/
#* Flip <tt>appservers-rw, api-rw, imagescaler-rw</tt> to <tt>pooled=false</tt> in the old site. After this, DNS will be changed and internal applications will start hitting the new DC.
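A sketch of the discovery flip with confctl; the selector syntax for DNS discovery objects is an assumption and should be checked against the conftool documentation before use:
<syntaxhighlight lang="bash">
# Pool the discovery records in the new site first...
confctl --object-type discovery select 'dnsdisc=(appservers-rw|api-rw|imagescaler-rw),name=codfw' set/pooled=true
# ...then depool them in the old site once mediawiki-config has been deployed.
confctl --object-type discovery select 'dnsdisc=(appservers-rw|api-rw|imagescaler-rw),name=eqiad' set/pooled=false
</syntaxhighlight>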


==== Phase 6 - apply configuration ====
# Switch the live redis configuration. This can either be scripted, or all redises can be restarted (first in the new site, then in the old one). Verify the redises are indeed replicating correctly (see the check below).
#* stop puppet on all redises: <code>salt 'mc*' cmd.run 'puppet agent --disable'; salt 'rdb*' cmd.run 'puppet agent --disable'</code>
#* switch the Redis replication on all redises from the ''old site'' (codfw) to the ''new site'' (eqiad) at runtime with [https://gist.github.com/lavagetto/c3d22c22a4ccd27f38e14723d9144171 switch_redis_replication.py]
#:* Run <code>switch_redis_replication.py memcached.yaml codfw eqiad</code>
#:* Run <code>switch_redis_replication.py jobqueue.yaml codfw eqiad</code>
#:* Verify those are now masters by running <code>check_redis.py memcached.yaml,jobqueue.yaml</code>
#* Alternatively via puppet:
#:* run <code>salt 'mc1*' cmd.run 'puppet agent --enable; puppet agent -t'; salt 'rdb1*' cmd.run 'puppet agent --enable; puppet agent -t'</code>
#:* verify the eqiad redises now think they're masters
#:* run <code>salt 'mc2*' cmd.run 'puppet agent --enable; puppet agent -t'; salt 'rdb2*' cmd.run 'puppet agent --enable; puppet agent -t'</code>
#:* verify the codfw redises are replicating
# Run puppet on the text caches in $new_site and $old_site. '''This starts the PII leak''' [TODO: check with traffic for the whole procedure]
#* Switch the Varnish backend to <code>appserver.svc.$newsite.wmnet</code>/<code>api.svc.$newsite.wmnet</code>
#:* Merge https://gerrit.wikimedia.org/r/#/c/284400/
#:* run <code>salt -G 'cluster:cache_text' cmd.run 'puppet agent -t'</code>
# RESTBase (for the action API endpoint)
#* run <code>salt -b10% -G 'cluster:restbase' cmd.run 'puppet agent -t; service restbase restart'</code>
# Misc services cluster (for the action API endpoint)
#* run <code>salt 'sc*' cmd.run 'puppet agent -t'</code>
# Parsoid (for the action API endpoint)
#* Merge https://gerrit.wikimedia.org/r/#/c/284399/
#* Deploy
# Point the Swift imagescalers to the active MediaWiki
#* Merge [[gerrit:284401|https://gerrit.wikimedia.org/r/284401]]
#* restart swift in eqiad: <code>salt -b1 'ms-fe1*' cmd.run 'puppet agent -t; service swift-proxy restart'</code>
#* restart swift in codfw: <code>salt -b1 'ms-fe2*' cmd.run 'puppet agent -t; service swift-proxy restart'</code>
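An illustrative check that the redis replication actually flipped; run it on one redis host per site (port and authentication handling depend on the local setup and are left out here):
<syntaxhighlight lang="bash">
# Expected: role:master in the new site; role:slave with master_link_status:up in the old site.
redis-cli -p 6379 info replication | grep -E '^(role|master_host|master_link_status)'
</syntaxhighlight>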


==== Phase 7 - Set new site's databases to read-write ====
# Set the new site's core DB masters (shards: s1-s7, x1, es2-es3; not the always read-only es1) in read-write mode.
#* Check with: <code>sudo salt -C 'G@mysql_role:master and G@site:codfw and G@mysql_group:core' cmd.run 'mysql --skip-ssl --batch --skip-column-names -e "SELECT @@global.read_only"'</code>
#* Change with: <code>sudo salt -C 'G@mysql_role:master and G@site:codfw and G@mysql_group:core' cmd.run 'mysql --skip-ssl --batch --skip-column-names -e "SET GLOBAL read_only=0"'</code>
# Deploy mediawiki-config with all shards set to read-write
#* Merge https://gerrit.wikimedia.org/r/#/c/284396/


==== Phase 8 - Set MediaWiki to read-write ====
# Deploy mediawiki-config <tt>wmf-config/db-$new-site.php</tt> with all shards set to read-write
#* Merge https://gerrit.wikimedia.org/r/#/c/284394/

==== Phase 9 - post read-only ====
# Start the jobqueue in the ''new site'' by running puppet there (mw_primary controls it)
#* Merge https://gerrit.wikimedia.org/r/#/c/284395/
#* run <code>salt -b 6 -C 'G@cluster:jobrunner and G@site:eqiad' cmd.run 'puppet agent -t'</code>
# Run puppet on the maintenance hosts (mw_primary controls it) to restart the cron jobs there
#* run <code>ssh terbium.eqiad.wmnet 'sudo puppet agent -t'</code>
# Update DNS records for new database masters
#* Merge [[gerrit:284667|https://gerrit.wikimedia.org/r/284667]]
# Update tendril for new database masters
# Set the TTL for the DNS records back to 300 seconds.
# Varnish final reconfiguration:
#* Merge the second traffic puppet patch, commenting out the direct route to $old_site in all mw-related cache::app_directors. '''This is not covered by the switchdc script.'''
#* Run puppet on all the cache nodes in $old_site. '''This ends the PII leak.'''
# [Optional] Run the script to fix broken wikidata entities on the maintenance host of the active datacenter: <code>sudo -u www-data mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildEntityPerPage.php --wiki=wikidatawiki --force</code> '''This is not covered by the switchdc script.'''


==== Phase 10 - verification and troubleshooting ====
# Make sure reading & editing works! :)
# Make sure recent changes are flowing (see Special:RecentChanges, [[EventStreams]], [[RCStream]] and the IRC feeds); a quick API check is sketched below.
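An illustrative spot check that edits are flowing again through the API:
<syntaxhighlight lang="bash">
# The most recent changes should have timestamps from the last couple of minutes.
curl -s 'https://en.wikipedia.org/w/api.php?action=query&list=recentchanges&rclimit=5&rcprop=timestamp|title&format=json'
</syntaxhighlight>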
=== Media storage/Swift ===
==== Ahead of the switchover, originals and thumbs ====
# '''MediaWiki:''' Write synchronously to both sites with https://gerrit.wikimedia.org/r/284652
# '''Cache->app:''' Change varnish backends for <tt>swift</tt> and <tt>swift_thumbs</tt> to point to the ''new site'' with https://gerrit.wikimedia.org/r/284651
## Force a puppet run on cache_upload in ''both sites'': <tt>salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and ( G@site:eqiad or G@site:codfw )' cmd.run 'puppet agent --test'</tt> (a spot check is sketched below)
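An illustrative spot check on a single cache_upload host that the swift backends now point at the new site; the hostname below is only a placeholder:
<syntaxhighlight lang="bash">
# List the configured backends on one upload cache and look at the swift directors.
salt 'cp1048*' cmd.run 'varnishadm backend.list | grep -i swift'
</syntaxhighlight>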


=== Services ===
All services are active-active in DNS discovery, apart from RESTBase, which needs special treatment. The procedure to fail over to a single site is the same for every one of them:
# reduce the TTL of the dns discovery records to 10 seconds
# If the service is not active-active in varnish, make it active-active
# depool the datacenter we're moving away from in confctl / discovery
# Make traffic go to the only still-active datacenter by restoring the active-passive status in cache::app_directors
# restore the original TTL

RESTBase is a bit of a special case and needs an additional step if we're just switching active traffic over rather than simulating a complete failover (see the sketch after this list):
# pool restbase-async everywhere, then depool restbase-async in the newly active dc, so that async traffic is separated from real-user traffic as much as possible.

During the previous switchover ([[phab:T127974|tracker / checklist]]), RESTBase and Parsoid were already active in codfw using the eqiad MW API; traffic was shifted to codfw by updating the Varnish backend config for public traffic and the RESTBase and Flow configs in mediawiki-config, and during the MediaWiki switchover RESTBase and Parsoid were pointed at the MW API in codfw via puppet / Parsoid deploy or DNS (see [[phab:T125069]]).
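A sketch of the restbase-async step with confctl, under the same selector-syntax assumption as the MediaWiki discovery sketch above:
<syntaxhighlight lang="bash">
# Pool restbase-async everywhere, then depool it in the newly active DC (codfw here)
# so asynchronous traffic stays away from the site serving real users.
confctl --object-type discovery select 'dnsdisc=restbase-async' set/pooled=true
confctl --object-type discovery select 'dnsdisc=restbase-async,name=codfw' set/pooled=false
</syntaxhighlight>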


=== Other miscellaneous ===

Revision as of 17:16, 4 April 2017

Introduction

A datacenter switchover (from eqiad to codfw, or vice versa) comprises switching over multiple different components, some of which can happen independently and many of which need to happen in lockstep. This page documents all the steps needed to switch over from one master datacenter to the other, broken up by component.

Schedule for 2017 switch

See phab:T138810 for tasks to be undertaken during the switch

  • Deployment server:
  • Traffic:
  • MediaWiki 5-minute read-only test: (?)
  • ElasticSearch:
  • Media storage/Swift:
  • Services: Tuesday, April 18th 2017
  • MediaWiki: Wednesday, April 19th 2017 (requires read-only mode)

Switching back

  • MediaWiki: Wednesday, May 3rd 2017 (requires read-only mode)
  • Services, ElasticSearch, Traffic, Swift, Deployment server: Thursday, May 4th 2017 (after the above is done)

Schedule for Q3 FY2015-2016 rollout

  • Deployment server: Wednesday, January 20th 2016
  • Traffic: Thursday, March 10th 2016
  • MediaWiki 5-minute read-only test: Tuesday, March 15th 2016, 07:00 UTC
  • ElasticSearch: Thursday, April 7th 2016, 12:00 UTC
  • Media storage/Swift: Thursday, April 14th 2016, 17:00 UTC
  • Services: Monday, April 18th 2016, 10:00 UTC
  • MediaWiki: Tuesday, April 19th 2016, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)

Switching back

  • MediaWiki: Thursday, April 21st 2016, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)
  • Services, ElasticSearch, Traffic, Swift, Deployment server: Thursday, April 21st 2016, after the above is done


Per-service switchover instructions

MediaWiki

We divide the process into logical phases that should be executed sequentially. Within any phase, top-level tasks can be executed in parallel with each other, while subtasks are to be executed sequentially. The phase number is referred to in the names of the tasks in operations/switchdc [1]

Phase 0 - preparation

  1. (days in advance) Warm up databases; see MariaDB/buffer_pool_dump.
  2. (days in advance) Prepare puppet patches:
    • Switch mw_primary [2]
    • Add direct route to $new_site for all mw-related cache::app_directors [3]
    • Comment-out direct route to $old_site in all mw-related cache::app_directors [4]
  3. Disable puppet on all jobrunners/videoscalers, on the maintenance hosts, and on the varnishes
  4. Merge the mediawiki-config switchover changes but don't sync. This is not covered by the switchdc script
  5. Reduce the TTL on appservers-rw, api-rw, imagescaler-rw to 10 seconds

Phase 1 - stop maintenance

  1. Stop jobqueues in the active site
  2. Kill all the cronjobs on the maintenance host in the active site

Phase 2 - read-only mode

  1. Go to read-only mode by syncing wmf-config/db-$old-site.php

Phase 3 - lock down database masters

  1. Put old-site core DB masters (shards: s1-s7, x1, es2-es3) in read-only mode.
  2. Wait for the new site's databases to catch up replication

Phase 4.1 - Wipe caches

  1. Wipe new site's memcached to prevent stale values — only once the new site's read-only master/slaves are caught up.
  2. Restart all HHVM servers in the new site to clear the APC cache

Phase 4.2 - Warmup caches in the new site

This phase will be executed by the t04_cache_wipe task of switchdc, because there is no speed gain from splitting phase 4.1 and phase 4.2 into separate tasks, and they are logically related.

  1. Warm up memcached and APC running the mediawiki-cache-warmup on the new site clusters, specifically:
    • The global warmup against the appservers cluster
    • The apc-warmup against all hosts in the appservers and api clusters at least.

Phase 5 - switch active datacenter configuration

  1. Merge the switch of $mw_primary at this point and add the direct route to $new_site for all mw-related cache::app_directors. Both changes can be puppet-merged together. This is not covered by the switchdc script. (Puppet is only involved in managing traffic, db alerts, and the jobrunners.)
  2. Switch the discovery
    • Flip appservers-rw, api-rw, imagescaler-rw to pooled=true in the new site. This will not actually change the DNS records, but the on-disk redis config will change.
    • Deploy wmf-config/ConfigSettings.php changes to switch the datacenter in MediaWiki
    • Flip appservers-rw, api-rw, imagescaler-rw to pooled=false in the old site. After this, DNS will be changed and internal applications will start hitting the new DC

Phase 6 - apply configuration

  1. Switch the live redis configuration. This can be either scripted, or all redises can be restarted (first in the new site, then in the old one). Verify redises are indeed replicating correctly.
  2. Run puppet on the text caches in $new_site and $old_site. This starts the PII leak [TODO: check with traffic for the whole procedure]

Phase 7 - Set new site's databases to read-write

  1. Set new-site's core DB masters (shards: s1-s7, x1, es2-es3) in read-write mode.

Phase 8 - Set MediaWiki to read-write

  1. Deploy mediawiki-config wmf-config/db-$new-site.php with all shards set to read-write

Phase 9 - post read-only

  1. Start the jobqueue in the new site by running puppet there (mw_primary controls it)
  2. Run puppet on the maintenance hosts (mw_primary controls it)
  3. Update DNS records for new database masters
  4. Update tendril for new database masters
  5. Set the TTL for the DNS records to 300 seconds again.
  6. Varnish final reconfiguration:
    • Merge the second traffic puppet patch, commenting out the direct route to $old_site in all mw-related cache::app_directors. This is not covered by the switchdc script
    • Run puppet on all the cache nodes in $old_site. This ends the PII leak
  7. [Optional] Run the script to fix broken wikidata entities on the maintenance host of the active datacenter: sudo -u www-data mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildEntityPerPage.php --wiki=wikidatawiki --force This is not covered by the switchdc script

Phase 10 - verification and troubleshooting

  1. Make sure reading & editing works! :)
  2. Make sure recent changes are flowing (see Special:RecentChanges, EventStreams, RCStream and the IRC feeds)
  3. Make sure email works (exim4 -bp on mx1001/mx2001, test an email)
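A quick illustrative check of the mail queues (the host domain and sudo policy on the MX hosts are assumptions):

  # a short or empty queue listing means exim is processing mail normally
  ssh mx1001.wikimedia.org 'sudo exim4 -bp | head -n 20'
  ssh mx2001.wikimedia.org 'sudo exim4 -bp | head -n 20'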

Media storage/Swift

Ahead of the switchover, originals and thumbs

  1. Cache->app: Change varnish backends for swift and swift_thumbs to point to the new site with https://gerrit.wikimedia.org/r/284651
    1. Force a puppet run on cache_upload in both sites: salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and ( G@site:eqiad or G@site:codfw )' cmd.run 'puppet agent --test'
  2. Inter-Cache: Switch the new site from the active site to 'direct' in cache::route_table for upload with https://gerrit.wikimedia.org/r/284650
    1. Force a puppet run on cache_upload in the new site: salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:codfw' cmd.run 'puppet agent --test'
  3. Users: De-pool the active site in GeoDNS with https://gerrit.wikimedia.org/r/#/c/284694/ + authdns-update
  4. Inter-Cache: Switch all caching sites currently pointing at the active site to the new site in cache::route_table for upload with https://gerrit.wikimedia.org/r/284649
    1. Force a puppet run on cache_upload in the caching sites: salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:esams' cmd.run 'puppet agent --test'
  5. Inter-Cache: Switch the active site from 'direct' to the new site in cache::route_table for upload with https://gerrit.wikimedia.org/r/284648
    1. Force a puppet run on cache_upload in the active site: salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:eqiad' cmd.run 'puppet agent --test'

Switching back

Repeat the steps above in reverse order, with suitable revert commits

ElasticSearch

Point CirrusSearch to codfw by editing wmgCirrusSearchDefaultCluster in InitialiseSettings.php. The usual default value is "local", which means that if MediaWiki switches DC, everything should be automatic. For this specific switch, the value has been set to "codfw" to switch Elasticsearch ahead of time.

To ensure coherence in case of lost updates, a reindex of the pages modified during the switch can be done by following Recovering from an Elasticsearch outage / interruption in updates.
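A minimal way to eyeball the current value before and after the change (file name as given above; the exact array layout inside wmf-config is not shown here):

  grep -n -A3 'wmgCirrusSearchDefaultCluster' wmf-config/InitialiseSettings.php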

Traffic

GeoDNS user routing

Inter-Cache routing

Cache->App routing

Specifics for Switchover Test Week

After successfully switching all the applayer services we plan to switch, we'll switch user and inter-cache traffic away from eqiad:

  • The Upload cluster will be following similar instructions on the 14th during the Swift switch.
  • Maps and Misc clusters are not participating (low traffic, special issues, validated by the other moves)
  • This leaves just the text cluster to operate on below:
  1. Inter-Cache: Switch codfw from 'eqiad' to 'direct' in cache::route_table for the text cluster.
  2. Users: De-pool eqiad in GeoDNS for the text cluster.
  3. Inter-Cache: Switch esams from 'eqiad' to 'codfw' in cache::route_table for the text cluster.
  4. Inter-Cache: Switch eqiad from 'direct' to 'codfw' in cache::route_table for the text cluster.

Before reversion of applayer services to eqiad, we'll revert the above steps in reverse order to undo them:

  1. Inter-Cache: Switch eqiad from 'codfw' to 'direct' in cache::route_table for all clusters.
  2. Inter-Cache: Switch esams from 'codfw' to 'eqiad' in cache::route_table for all clusters.
  3. Users: Re-pool eqiad in GeoDNS.
  4. Inter-Cache: Switch codfw from 'direct' to 'eqiad' in cache::route_table for all clusters.
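An illustrative way to watch the GeoDNS steps take effect from a shell; the nameserver and LVS record names are assumptions based on standard Wikimedia naming:

  # ask an authoritative nameserver directly; once eqiad is de-pooled, the text
  # record handed out should no longer point at text-lb.eqiad.wikimedia.org
  dig +short en.wikipedia.org @ns0.wikimedia.org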

Services

All services are active-active in DNS discovery, apart from RESTBase, which needs special treatment. The procedure to fail over to a single site is the same for every one of them:

  1. reduce the TTL of the dns discovery records to 10 seconds
  2. If the service is not active-active in varnish, make it active-active
  3. depool the datacenter we're moving away from in confctl / discovery
  4. Make traffic go to the only still-active datacenter by restoring the active-passive status in cache::app_directors
  5. restore the original TTL

RESTBase is a bit of a special case and needs an additional step if we're just switching active traffic over rather than simulating a complete failover:

  1. pool restbase-async everywhere, then depool restbase-async in the newly active dc, so that async traffic is separated from real-user traffic as much as possible.

Other miscellaneous