A datacenter switchover (from eqiad to codfw, or vice versa) comprises switching over multiple components, some of which can happen independently and many of which need to happen in lockstep. This page documents all the steps needed to switch over from one master datacenter to the other, broken down by component.
Schedule for 2017 switch
See phab:T138810 for tasks to be undertaken during the switch
- Traffic: Tuesday, April 18th 2017
- Elasticsearch: elasticsearch is automatically following mediawiki switch
- Media storage/Swift: Tuesday, April 18th 2017
- Services: Tuesday, April 18th 2017
- MediaWiki: Wednesday, April 19th 2017 14:00 UTC (user visible, requires read-only mode)
- Deployment server: (sometime after the switch)
- Traffic: Pre-switchback in two phases: Mon May 1 and Tues May 2 (to avoid cold-cache issues Weds)
- MediaWiki: Wednesday, May 3rd 2017 14:00 UTC (user visible, requires read-only mode)
- Services, Elasticsearch, Swift, Deployment server: Thursday, May 4th 2017 (after the above is done)
Schedule for Q3 FY2015-2016 rollout
- Deployment server: Wednesday, January 20th 2016
- Traffic: Thursday, March 10th 2016
- MediaWiki 5-minute read-only test: Tuesday, March 15th 2016, 07:00 UTC
- Elasticsearch: Thursday, April 7th 2016, 12:00 UTC
- Media storage/Swift: Thursday, April 14th 2016, 17:00 UTC
- Services: Monday, April 18th 2016, 10:00 UTC
- MediaWiki: Tuesday, April 19th 2016, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)
- MediaWiki: Thursday, April 21st 2016, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)
- Services, Elasticsearch, Traffic, Swift, Deployment server: Thursday, April 21st 2016, after the above is done
Per-service switchover instructions
We divide the process into logical phases that should be executed sequentially. Within any phase, top-level tasks can be executed in parallel with each other, while subtasks must be executed sequentially. The phase number is referenced in the names of the tasks in operations/switchdc.
Phase 0 - preparation
- (days in advance) Warm up databases; see MariaDB/buffer_pool_dump.
- (days in advance) Prepare puppet patches:
- (days in advance) Prepare the mediawiki-config patch or patches (example)
- Disable puppet on all jobqueues/videoscalers and maintenance hosts
- Merge the mediawiki-config switchover changes but don't sync. This is not covered by the switchdc script.
- Reduce the TTL on appservers-rw, api-rw, imagescaler-rw to 10 seconds
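Once the TTL patch is live, it is worth confirming that the discovery records really answer with a 10-second TTL. A minimal sketch of such a check, assuming standard `dig +noall +answer` output (the record names come from this page; the helper itself is hypothetical):

```shell
# Hypothetical verification helper: extract the TTL field from a
# `dig +noall +answer` line so reduced TTLs can be confirmed.
ttl_of() {
    # dig answer line format: NAME TTL CLASS TYPE RDATA
    echo "$1" | awk '{print $2}'
}

# Real check (requires the production resolvers, so not run here):
# for rec in appservers-rw api-rw imagescaler-rw; do
#     dig +noall +answer "$rec.discovery.wmnet" | while read -r line; do
#         [ "$(ttl_of "$line")" -le 10 ] || echo "TTL still high: $line"
#     done
# done
```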
Phase 1 - stop maintenance
- Stop jobqueues in the active site
- Kill all the cronjobs on the maintenance host in the active site
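"Killing all the cronjobs" amounts to emptying (or commenting out) the maintenance user's crontab. A sketch of the idea, assuming the `www-data` user; the exact user and mechanism on the maintenance hosts may differ:

```shell
# Sketch only: prefix every non-comment, non-empty crontab line with '#'
# so no further jobs fire. Empty lines and existing comments pass through.
comment_out() {
    sed 's/^\([^#]\)/#\1/'
}

# e.g. crontab -l -u www-data | comment_out | crontab -u www-data -
```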
Phase 2 - read-only mode
- Go to read-only mode by syncing wmf-config/db-$old-site.php
Phase 3 - lock down database masters
- Put old-site core DB masters (shards: s1-s7, x1, es2-es3) in read-only mode.
- Wait for the new site's databases to catch up replication
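The two steps above can be sketched as a read-only flip followed by a lag poll. Hostnames, credentials, and the exact lag query are placeholders here; the real task lives in switchdc:

```shell
# Hedged sketch: a replica has caught up when Seconds_Behind_Master is 0.
caught_up() {
    [ "$1" = "0" ]
}

# On each old-site master:
# mysql -h "$old_master" -e 'SET GLOBAL read_only = 1'
#
# Then poll each new-site master until replication has caught up:
# until caught_up "$(mysql -h "$new_master" -e 'SHOW SLAVE STATUS\G' \
#         | awk '/Seconds_Behind_Master/ {print $2}')"; do sleep 1; done
```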
Phase 4.1 - Wipe caches
- Wipe new site's memcached to prevent stale values — only once the new site's read-only master/slaves are caught up.
- Restart all HHVM servers in the new site to clear the APC cache
Phase 4.2 - Warmup caches in the new site
This phase will be executed by the t04_cache_wipe task of switchdc: there is no speed gain from running phases 4.1 and 4.2 separately, and they are logically related.
- Warm up memcached and APC running the mediawiki-cache-warmup on the new site clusters, specifically:
- The global warmup against the appservers cluster
- The apc-warmup against all hosts in the appservers and api clusters at least.
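The real warmup uses the mediawiki-cache-warmup tool; the loop below just illustrates the underlying idea of replaying popular request paths against each new-site appserver so memcached/APC refill before traffic arrives. The hostname and paths file are hypothetical:

```shell
# Compose the request URL for one appserver + path (names illustrative).
warm_url() {
    printf 'http://%s%s\n' "$1" "$2"
}

# while read -r path; do
#     curl -s -o /dev/null -H 'Host: en.wikipedia.org' \
#         "$(warm_url mw2017.codfw.wmnet "$path")"
# done < popular-paths.txt
```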
Phase 5 - switch active datacenter configuration
- Send the traffic layer to active-active:
- disable puppet on cache::text in both datacenters
- merge the varnish patch. This is not covered by the switchdc script.
- enable and run puppet on cache::text in $new_site. This starts the active-active traffic phase (traffic will go to both MW clusters)
- Run puppet on the text caches in $old_site. This ends the active-active phase.
- Merge the switch of $mw_primary at this point. This change can actually be puppet-merged together with the varnish one. This is not covered by the switchdc script. (Puppet is only involved in managing traffic, db alerts, and the jobrunners).
- Switch the discovery
- Flip appservers-rw, api-rw, imagescaler-rw to pooled=true in the new site. This will not actually change the DNS records, but the on-disk redis config will change.
- Deploy wmf-config/ConfigSettings.php changes to switch the datacenter in MediaWiki
- Flip appservers-rw, api-rw, imagescaler-rw to pooled=false in the old site. After this, DNS will be changed and internal applications will start hitting the new DC
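A non-authoritative sketch of the discovery flips: generate the conftool commands for one site/state so they can be reviewed before running. The selector syntax follows conftool's discovery objects, but verify it locally before use:

```shell
# Emit one confctl command per discovery record (review, then run each line).
gen_pool_cmds() {
    site="$1" state="$2"
    for svc in appservers-rw api-rw imagescaler-rw; do
        echo "confctl --object-type discovery select 'dnsdisc=$svc,name=$site' set/pooled=$state"
    done
}

# gen_pool_cmds codfw true    # pool the new site first
# gen_pool_cmds eqiad false   # then depool the old one
```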
Phase 6 - Redis replicas
- Switch the live redis configuration. This can either be scripted, or all Redis instances can be restarted (first in the new site, then in the old one). Verify that the Redis instances are indeed replicating correctly.
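For the verification step, a small helper can pull the role out of `redis-cli INFO replication` output (which is CRLF-terminated). The hostnames in the usage comment are hypothetical:

```shell
# Extract "master" or "slave" from `redis-cli INFO replication` output.
role_of() {
    grep '^role:' | cut -d: -f2 | tr -d '\r'
}

# redis-cli -h rdb2001.codfw.wmnet info replication | role_of   # expect "master"
# redis-cli -h rdb1001.eqiad.wmnet info replication | role_of   # expect "slave"
```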
Phase 7 - Set new site's databases to read-write
- Set new-site's core DB masters (shards: s1-s7, x1, es2-es3) in read-write mode.
Phase 8 - Set MediaWiki to read-write
- Deploy mediawiki-config wmf-config/db-$new-site.php with all shards set to read-write
Phase 9 - post read-only
- Start the jobqueue in the new site by running puppet there (mw_primary controls it)
- Run puppet on the maintenance hosts (mw_primary controls it)
- Update DNS records for new database masters
- Update tendril for new database masters
- Set the TTL for the DNS records to 300 seconds again.
- [Optional] Run the script to fix broken wikidata entities on the maintenance host of the active datacenter:
sudo -u www-data mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildEntityPerPage.php --wiki=wikidatawiki --force
This is not covered by the switchdc script.
Phase 10 - verification and troubleshooting
- Make sure reading & editing works! :)
- Make sure recent changes are flowing (see Special:RecentChanges, EventStreams, RCStream and the IRC feeds)
- Make sure email works (run exim4 -bp on mx1001/mx2001, and send a test email)
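An illustrative check for the mail step: exim4 -bpc prints the number of queued messages, so a quick sanity bound catches a stuck queue. The threshold below is an arbitrary assumption:

```shell
# Flag the queue as unhealthy when it holds an unusually large backlog.
queue_ok() {
    [ "$1" -lt 100 ]
}

# On mx1001/mx2001:
# queue_ok "$(sudo exim4 -bpc)" || echo "mail queue unusually large"
```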
Ahead of the switchover, originals and thumbs
- Cache->app: Change varnish backends for swift and swift_thumbs to point to the new site
- Force a puppet run on cache_upload in both sites: salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and ( G@site:eqiad or G@site:codfw )' cmd.run 'puppet agent --test'
- Inter-Cache: Switch new site from active site to 'direct' in cache::route_table for upload
- Force a puppet run on cache_upload in new site: salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:codfw' cmd.run 'puppet agent --test'
- Users: De-pool active site in GeoDNS
https://gerrit.wikimedia.org/r/#/c/283416/ + https://gerrit.wikimedia.org/r/#/c/284694/ + authdns-update
- Inter-Cache: Switch all caching sites currently pointing from active site to new site in cache::route_table for upload
- Force a puppet run on cache_upload in caching sites: salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:esams' cmd.run 'puppet agent --test'
- Inter-Cache: Switch active site from 'direct' to new site in cache::route_table for upload
- Force a puppet run on cache_upload in active site: salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:eqiad' cmd.run 'puppet agent --test'
Repeat the steps above in reverse order, with suitable revert commits
CirrusSearch talks by default to the local datacenter ($wmfDatacenter). If MediaWiki switches datacenter, Elasticsearch will automatically follow.
Manually switching CirrusSearch to a specific datacenter is always possible. Point CirrusSearch to codfw by editing wmgCirrusSearchDefaultCluster in InitialiseSettings.php.
To ensure coherence in case of lost updates, a reindex of the pages modified during the switch can be done by following Recovering from an Elasticsearch outage / interruption in updates.
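A minimal sketch of what the manual override might look like in wmf-config/InitialiseSettings.php, assuming wmgCirrusSearchDefaultCluster is a plain per-wiki setting (the key name comes from this page; the exact array shape should be checked against the live config):

```php
// Hypothetical fragment: pin CirrusSearch to codfw regardless of
// $wmfDatacenter. Key name from this page; array shape assumed.
'wmgCirrusSearchDefaultCluster' => [
    'default' => 'codfw',
],
```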
General information on generic procedures
- gerrit: C+2 and Submit commit https://gerrit.wikimedia.org/r/#/c/347613/
- <any puppetmaster>:
- <any cumin master>:
sudo cumin 'cp3*.esams.wmnet' 'run-puppet-agent -q'
GeoDNS (User-facing) Routing:
- gerrit: C+2 and Submit commit https://gerrit.wikimedia.org/r/#/c/347616
- <any authdns node>: authdns-update
Same procedures as above, with reversions of the commits specified. The switchback will happen in two stages over two days: first reverting the inter-cache routing, then the user routing, to minimize cold-cache issues.
All services are active-active in DNS discovery, apart from restbase, which needs special treatment. The procedure to fail over to a single site is the same for each of them:
- reduce the TTL of the dns discovery records to 10 seconds
- If the service is not active-active in varnish, make it active-active
- depool the datacenter we're moving away from in confctl / discovery
- Make traffic go to the only still-active datacenter by restoring the active-passive status in cache::app_directors
- restore the original TTL
Restbase is a bit of a special case and needs an additional step if we're just switching active traffic over and not simulating a complete failover:
- pool restbase-async everywhere, then depool restbase-async in the newly active dc, so that async traffic is separated from real-users traffic as much as possible.
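The restbase-async shuffle can be sketched as reviewable conftool commands: pool restbase-async everywhere, then depool it in the newly active DC so async traffic lands on the passive site. The selector syntax follows conftool's discovery objects and should be verified before running:

```shell
# Emit the pool-everywhere-then-depool-active command sequence for review.
gen_async_cmds() {
    active="$1"
    for dc in eqiad codfw; do
        echo "confctl --object-type discovery select 'dnsdisc=restbase-async,name=$dc' set/pooled=true"
    done
    echo "confctl --object-type discovery select 'dnsdisc=restbase-async,name=$active' set/pooled=false"
}

# gen_async_cmds codfw   # codfw newly active: async ends up on eqiad only
```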