Switch Datacenter
Introduction
A datacenter switchover (from eqiad to codfw, or vice-versa) comprises switching over multiple different components, some of which can happen independently and many of which need to happen in lockstep. This page documents all the steps needed to switch over from a master datacenter to another one, broken up by component.
Schedule for 2017 switch
- Deployment server:
- Traffic:
- MediaWiki 5-minute read-only test: (?)
- ElasticSearch:
- Media storage/Swift:
- Services:
- MediaWiki: (requires read-only mode)
Switching back
- MediaWiki: (requires read-only mode)
- Services, ElasticSearch, Traffic, Swift, Deployment server: (after the above is done)
Schedule for Q3 FY2015-2016 rollout
- Deployment server: Wednesday, January 20th
- Traffic: Thursday, March 10th
- MediaWiki 5-minute read-only test: Tuesday, March 15th 07:00 UTC
- ElasticSearch: Thursday, April 7th, 12:00 UTC
- Media storage/Swift: Thursday, April 14th 17:00 UTC
- Services: Monday, April 18th, 10:00 UTC
- MediaWiki: Tuesday, April 19th, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)
Switching back
- MediaWiki: Thursday, April 21st, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)
- Services, ElasticSearch, Traffic, Swift, Deployment server: Thursday, April 21st, after the above is done
Per-service switchover instructions
Phase 1 - preparation
- (days in advance) Warm up databases; see MariaDB/buffer_pool_dump.
- Stop jobqueues in the active site
  - Merge https://gerrit.wikimedia.org/r/282880 / https://gerrit.wikimedia.org/r/#/c/284403/, then run for whichever site is currently active:
    salt -C 'G@cluster:jobrunner and G@site:eqiad' cmd.run 'service jobrunner stop; service jobchron stop;'
    salt -C 'G@cluster:jobrunner and G@site:codfw' cmd.run 'service jobrunner stop; service jobchron stop;'
- Stop all jobs running on the maintenance host
  - Merge https://gerrit.wikimedia.org/r/#/c/283952/ / https://gerrit.wikimedia.org/r/#/c/284404/, then run for whichever site is currently active:
    ssh terbium.eqiad.wmnet 'sudo puppet agent -t; sudo killall php; sudo killall php5;'
    ssh wasat.codfw.wmnet 'sudo puppet agent -t; sudo killall php; sudo killall php5;'
  - Manually check for any scripts that need to be killed
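A quick way to confirm this phase took effect, reusing the same salt and ssh targets as above (shown for eqiad as the active site; swap the site for the other direction):
# jobrunner and jobchron should report as stopped on the active site
salt -C 'G@cluster:jobrunner and G@site:eqiad' cmd.run 'service jobrunner status; service jobchron status'
# no stray PHP maintenance scripts should remain on the maintenance host
ssh terbium.eqiad.wmnet 'pgrep -l php; pgrep -l php5'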
Phase 2 - read-only mode
- Deploy mediawiki-config with all shards set to read-only (set ro on active site)
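A minimal sketch of what this deploy usually looks like, assuming the read-only flag is toggled in wmf-config/db-eqiad.php and pushed with scap from the deployment server (the file name and commit message here are illustrative):
# on the deployment server, after the read-only change has been merged and pulled
cd /srv/mediawiki-staging
scap sync-file wmf-config/db-eqiad.php 'Set all shards read-only for datacenter switchover'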
Phase 3 - lock down database masters, cache wipes
- Set the active site's databases (masters) to read-only mode, except parsercache ones (which are dual masters), standalone es1 servers (which are always read-only), and misc servers (for now, as they are independent from MediaWiki and do not yet have clients in codfw).
  - Check with:
    sudo salt -C 'G@mysql_role:master and G@site:eqiad and not G@mysql_group:misc and not G@mysql_group:parsercache' cmd.run 'mysql --defaults-file=.my.cnf --batch --skip-column-names -e "SELECT @@global.read_only"'
  - Change with:
    sudo salt -C 'G@mysql_role:master and G@site:eqiad and not G@mysql_group:misc and not G@mysql_group:parsercache' cmd.run 'mysql --defaults-file=.my.cnf --batch --skip-column-names -e "SET GLOBAL read_only=1"'
- Wipe the new site's memcached to prevent stale values, but only once the new site's read-only masters/slaves are caught up (a lag-check sketch follows this list). Run the command for the new site:
  salt -C 'G@cluster:memcached and G@site:codfw' cmd.run 'service memcached restart'
  salt -C 'G@cluster:memcached and G@site:eqiad' cmd.run 'service memcached restart'
- Warm up memcached and APC
apache-fast-test wiki-urls-warmup1000.txt <new_active_dc>
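The lag check mentioned above can reuse the same salt targeting as this phase; a sketch, shown for codfw as the new site (whose masters still replicate from the old site until Phase 6):
# Seconds_Behind_Master should be 0 on every new-site master before wiping memcached
sudo salt -C 'G@mysql_role:master and G@site:codfw and not G@mysql_group:misc and not G@mysql_group:parsercache' cmd.run 'mysql --defaults-file=.my.cnf --batch -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master'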
Phase 4 - switch active datacenter configuration
- Switch the datacenter in puppet by setting $app_routes['mediawiki']
- Switch the datacenter in mediawiki-config ($wmfMasterDatacenter) and merge the change
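Before merging either change above, the current values can be eyeballed in local checkouts; a sketch, assuming checkouts of operations/puppet and operations/mediawiki-config (the paths are illustrative):
# in the operations/puppet checkout
grep -rn "app_routes" hieradata/ | grep mediawiki
# in the operations/mediawiki-config checkout
grep -rn "wmfMasterDatacenter" wmf-config/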
Phase 5 - apply configuration
- Redis replication
  - Stop puppet on all redises:
    salt 'mc*' cmd.run 'puppet agent --disable'; salt 'rdb*' cmd.run 'puppet agent --disable'
  - Switch the Redis replication on all Redises from the old site (codfw) to the new site (eqiad) at runtime with switch_redis_replication.py:
    switch_redis_replication.py memcached.yaml codfw eqiad
    switch_redis_replication.py jobqueue.yaml codfw eqiad
  - Verify those are now masters by running:
    check_redis.py memcached.yaml,jobqueue.yaml
  - Alternatively, via puppet:
    - Run:
      salt 'mc1*' cmd.run 'puppet agent --enable; puppet agent -t'; salt 'rdb1*' cmd.run 'puppet agent --enable; puppet agent -t'
    - Verify the eqiad Redises now think they're masters
    - Run:
      salt 'mc2*' cmd.run 'puppet agent --enable; puppet agent -t'; salt 'rdb2*' cmd.run 'puppet agent --enable; puppet agent -t'
    - Verify the codfw Redises are replicating
- RESTBase (for the action API endpoint)
  - Run:
    salt -b10% -G 'cluster:restbase' cmd.run 'puppet agent -t; service restbase restart'
- Misc services cluster (for the action API endpoint)
  - Run:
    salt 'sc*' cmd.run 'puppet agent -t'
- Parsoid (for the action API endpoint)
  - Merge https://gerrit.wikimedia.org/r/#/c/284399/
  - Deploy
- Switch the Varnish backends to appserver.svc.$newsite.wmnet / api.svc.$newsite.wmnet
  - Merge https://gerrit.wikimedia.org/r/#/c/284400/
  - Run:
    salt -G 'cluster:cache_text' cmd.run 'puppet agent -t'
- Point Swift imagescalers to the active MediaWiki
  - Merge https://gerrit.wikimedia.org/r/284401
  - Restart swift in eqiad:
    salt -b1 'ms-fe1*' cmd.run 'puppet agent -t; swift-init all restart'
  - Restart swift in codfw:
    salt -b1 'ms-fe2*' cmd.run 'puppet agent -t; swift-init all restart'
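Before and after the Varnish backend switch above, the new site's MediaWiki endpoints can be probed directly; a sketch using the appserver.svc/api.svc names mentioned above (the Host header and the plain-HTTP port are assumptions):
# both should answer with an HTTP 200 from the new site
curl -sI -H 'Host: en.wikipedia.org' 'http://appserver.svc.codfw.wmnet/wiki/Main_Page' | head -n1
curl -s -H 'Host: en.wikipedia.org' 'http://api.svc.codfw.wmnet/w/api.php?action=query&meta=siteinfo&format=json' | head -c 200; echo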
Phase 6 - database master swap
- Database master swap for every core (s1-7), External Storage (es2-3, not es1) and extra (x1) database
  - Check with:
    sudo salt -C 'G@mysql_role:master and G@site:codfw and not G@mysql_group:misc and not G@mysql_group:parsercache' cmd.run 'mysql --defaults-file=.my.cnf --batch --skip-column-names -e "SELECT @@global.read_only"'
  - Change with:
    sudo salt -C 'G@mysql_role:master and G@site:eqiad and not G@mysql_group:misc and not G@mysql_group:parsercache' cmd.run 'mysql --defaults-file=.my.cnf --batch --skip-column-names -e "SET GLOBAL read_only=0"'
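After the swap, exactly one site's masters should be writable; a quick cross-check reusing the targeting from the commands above:
# expect read_only = 0 on the new site's masters and read_only = 1 on the old site's
for site in eqiad codfw; do
  sudo salt -C "G@mysql_role:master and G@site:${site} and not G@mysql_group:misc and not G@mysql_group:parsercache" cmd.run 'mysql --defaults-file=.my.cnf --batch --skip-column-names -e "SELECT @@hostname, @@global.read_only"'
done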
Phase 7 - Undo read-only
- Deploy mediawiki-config eqiad with all shards set to read-write
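One way to confirm from the outside that the wikis are writable again; a sketch using the public API, assuming the standard siteinfo output where the readonly field is only present while a wiki is read-only (so this should print nothing once the deploy is out):
curl -s 'https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=general&format=json' | grep -o '"readonly"[^,}]*'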
Phase 8 - post read-only
- Start the jobqueue in the new site
  - Merge https://gerrit.wikimedia.org/r/282881 / https://gerrit.wikimedia.org/r/#/c/284394/, then run for the new site:
    salt -b 6 -C 'G@cluster:jobrunner and G@site:codfw' cmd.run 'puppet agent -t'
    salt -b 6 -C 'G@cluster:jobrunner and G@site:eqiad' cmd.run 'puppet agent -t'
- Start the cron jobs on the maintenance host in the new site
  - Merge https://gerrit.wikimedia.org/r/#/c/283954/ / https://gerrit.wikimedia.org/r/#/c/284395/, then run for the new site's maintenance host:
    ssh wasat.codfw.wmnet 'sudo puppet agent -t'
    ssh terbium.eqiad.wmnet 'sudo puppet agent -t'
- Re-enable puppet on all eqiad and codfw database masters
sudo salt -C 'G@mysql_role:master or pc*' cmd.run "puppet agent --enable"
- Update DNS records for new database masters
- Update tendril for new database masters
- Run the script to fix broken wikidata entities on the maintenance host of the active datacenter:
sudo -u www-data mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildEntityPerPage.php --wiki=wikidatawiki --force
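To confirm jobs are flowing again, the jobrunner services and queue sizes can be checked; a sketch reusing this page's salt and mwscript conventions (the wiki is illustrative; run mwscript on the maintenance host):
salt -C 'G@cluster:jobrunner and G@site:eqiad' cmd.run 'service jobrunner status; service jobchron status'
mwscript showJobs.php --wiki=enwiki --group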
Phase 9 - verification and troubleshooting
- Make sure reading & editing works! :)
- Make sure recent changes are flowing (see Special:RecentChanges, rcstream and the IRC feeds)
- Make sure email works (exim4 -bp on mx1001/mx2001; send a test email)
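For the email check, a minimal sketch of what to look at on mx1001/mx2001 (the log path is the Debian default and an assumption here):
# the queue should be small and draining
exim4 -bp | head
# watch deliveries while sending yourself a test message
tail -f /var/log/exim4/mainlog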
Media storage/Swift
Ahead of the switchover, for originals and thumbs:
- MediaWiki: Write synchronously to both sites with https://gerrit.wikimedia.org/r/#/c/282888/ / https://gerrit.wikimedia.org/r/284652
- Cache->app: Change varnish backends for swift and swift_thumbs to point to the new site with https://gerrit.wikimedia.org/r/#/c/282890/ / https://gerrit.wikimedia.org/r/284651
  - Force a puppet run on cache_upload in both sites:
    salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and ( G@site:eqiad or G@site:codfw )' cmd.run 'puppet agent --test'
- Inter-Cache: Switch the new site from the active site to 'direct' in cache::route_table for upload with https://gerrit.wikimedia.org/r/#/c/282891/ / https://gerrit.wikimedia.org/r/284650
  - Force a puppet run on cache_upload in the new site:
    salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:codfw' cmd.run 'puppet agent --test'
- Users: De-pool the active site in GeoDNS with https://gerrit.wikimedia.org/r/#/c/283416/ / https://gerrit.wikimedia.org/r/#/c/284694/ + authdns-update
- Inter-Cache: Switch all caching sites currently pointing to the active site over to the new site in cache::route_table for upload with https://gerrit.wikimedia.org/r/#/c/283418/ / https://gerrit.wikimedia.org/r/284649
  - Force a puppet run on cache_upload in the caching sites:
    salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:esams' cmd.run 'puppet agent --test'
- Inter-Cache: Switch the active site from 'direct' to the new site in cache::route_table for upload with https://gerrit.wikimedia.org/r/#/c/282892/ / https://gerrit.wikimedia.org/r/284648
  - Force a puppet run on cache_upload in the active site:
    salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:eqiad' cmd.run 'puppet agent --test'
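After each step above, the serving path can be verified from the edge; a sketch (the example file path is illustrative; the cache hostnames in X-Cache indicate which site served the request):
# originals and thumbs should be served via the intended site's cache_upload hosts
curl -sI 'https://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg' | grep -i '^x-cache'
curl -sI 'https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Example.jpg/200px-Example.jpg' | grep -i '^x-cache'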
Switching back
Repeat the steps above in reverse order, with suitable revert commits
ElasticSearch
Point CirrusSearch to codfw by editing wmgCirrusSearchDefaultCluster in InitialiseSettings.php. The usual default value is "local", which means that if MediaWiki switches DC, everything should be automatic. For this specific switch, the value has been set to "codfw" to switch Elasticsearch ahead of time.
To ensure coherence in case of lost updates, a reindex of the pages modified during the switch can be done by following Recovering from an Elasticsearch outage / interruption in updates.
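To check that the codfw Elasticsearch cluster is healthy before and after the change, a sketch (the LVS service name and port are assumptions based on the usual naming):
# cluster health should be green (or at least yellow)
curl -s 'http://search.svc.codfw.wmnet:9200/_cluster/health?pretty'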
Traffic
GeoDNS user routing
- Traffic-layer only, no interdependencies elsewhere
- Granularity is per-cache-cluster (misc, maps, text, upload)
- Documented at: https://wikitech.wikimedia.org/wiki/Global_Traffic_Routing#GeoDNS
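A sketch of how a GeoDNS change can be observed from the outside (the resolver and record are assumptions; compare the returned addresses against the per-site text-lb records):
dig +short en.wikipedia.org @ns0.wikimedia.org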
Inter-Cache routing
- Traffic-layer only, no interdependencies elsewhere
- Granularity is per-cache-cluster (misc, maps, text, upload)
- Documented at: https://wikitech.wikimedia.org/wiki/Global_Traffic_Routing#Inter-Cache_Routing
Cache->App routing
- Normally will have inter-dependencies with application-level work
- Granularity is per-application-service (how they're defined at the back end of varnish)
- Documented at: https://wikitech.wikimedia.org/wiki/Global_Traffic_Routing#Cache-to-Application_Routing
Specifics for Switchover Test Week
After successfully switching all the applayer services we plan to switch, we'll move user and inter-cache traffic away from eqiad:
- The Upload cluster will be following similar instructions on the 14th during the Swift switch.
- Maps and Misc clusters are not participating (low traffic, special issues, validated by the other moves)
- This leaves just the text cluster to operate on below:
- Inter-Cache: Switch codfw from 'eqiad' to 'direct' in cache::route_table for the text cluster.
- https://gerrit.wikimedia.org/r/283430
- Force a puppet run on affected caches:
salt -v -t 10 -b 17 -C 'G@site:codfw and G@cluster:cache_text' cmd.run 'puppet agent --test'
- Users: De-pool eqiad in GeoDNS for the text cluster.
- https://gerrit.wikimedia.org/r/283433
- Run authdns-update on any one of the authdns servers (radon, baham, eeden)
- Inter-Cache: Switch esams from 'eqiad' to 'codfw' in cache::route_table for the text cluster.
- https://gerrit.wikimedia.org/r/283431
- Force a puppet run on affected caches:
salt -v -t 10 -b 17 -C 'G@site:esams and G@cluster:cache_text' cmd.run 'puppet agent --test'
- Inter-Cache: Switch eqiad from 'direct' to 'codfw' in cache::route_table for the text cluster.
- https://gerrit.wikimedia.org/r/283432
- Force a puppet run on affected caches:
salt -v -t 10 -b 17 -C 'G@site:eqiad and G@cluster:cache_text' cmd.run 'puppet agent --test'
Before moving the applayer services back to eqiad, we'll undo the above steps in reverse order:
- Inter-Cache: Switch eqiad from 'codfw' to 'direct' in cache::route_table for all clusters.
- https://gerrit.wikimedia.org/r/284687
- Force a puppet run on affected caches:
salt -v -t 10 -b 17 -C 'G@site:eqiad and G@cluster:cache_text' cmd.run 'puppet agent --test'
- Inter-Cache: Switch esams from 'codfw' to 'eqiad' in cache::route_table for all clusters.
- https://gerrit.wikimedia.org/r/284688
- Force a puppet run on affected caches:
salt -v -t 10 -b 17 -C 'G@site:esams and G@cluster:cache_text' cmd.run 'puppet agent --test'
- Users: Re-pool eqiad in GeoDNS.
- https://gerrit.wikimedia.org/r/284692
- Run authdns-update on any one of the authdns servers (radon, baham, eeden)
- Inter-Cache: Switch codfw from 'direct' to 'eqiad' in cache::route_table for all clusters.
- https://gerrit.wikimedia.org/r/284689
- Force a puppet run on affected caches:
salt -v -t 10 -b 17 -C 'G@site:codfw and G@cluster:cache_text' cmd.run 'puppet agent --test'
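After each routing change above, the X-Cache response header shows the chain of cache hosts a request traversed, which makes it easy to confirm the new inter-cache path; a sketch (host naming convention: cp1xxx = eqiad, cp2xxx = codfw, cp3xxx = esams):
curl -sI 'https://en.wikipedia.org/wiki/Special:BlankPage' | grep -i '^x-cache'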
Services
- RESTBase and Parsoid already active in codfw, using eqiad MW API.
- Shift traffic to codfw:
- Public traffic: Update Varnish backend config.
- Update RESTBase and Flow configs in mediawiki-config to use codfw.
- During MW switch-over:
- Update RESTBase and Parsoid to use MW API in codfw, either using puppet / Parsoid deploy, or DNS. See https://phabricator.wikimedia.org/T125069.
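A spot-check that RESTBase keeps answering through the public endpoint during and after the move; a sketch (the page title is illustrative):
# should return HTTP 200 with fresh content
curl -sI 'https://en.wikipedia.org/api/rest_v1/page/html/Main_Page' | head -n1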
Other miscellaneous
- Deployment server
- EventLogging
- IRC/RCstream