You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org

Difference between revisions of "Switch Datacenter"

From Wikitech-static
imported>Jcrespo
imported>Filippo Giunchedi
Line 24: Line 24:
=== MediaWiki-related ===
=== MediaWiki-related ===


==== Phase 1 - preparation ====
# Warm up databases; also see [[/Manual cache warmup|manual cache warmup]].
# Warm up databases; also see [[/Manual cache warmup|manual cache warmup]].
# Stop jobqueues in eqiad
# Stop jobqueues in the ''active site''
#* cherry-pick https://gerrit.wikimedia.org/r/282880
#* Merge [https://gerrit.wikimedia.org/r/282880 <s>https://gerrit.wikimedia.org/r/282880</s>]https://gerrit.wikimedia.org/r/#/c/284403/
#* run <code>salt -C 'G@cluster:jobrunner and G@site:eqiad' cmd.run 'service jobrunner stop; service jobchron stop;'</code>
#* <s>run <code>salt -C 'G@cluster:jobrunner and G@site:eqiad' cmd.run 'service jobrunner stop; service jobchron stop;'</code></s>
#* run <code>salt -C 'G@cluster:jobrunner and G@site:codfw' cmd.run 'service jobrunner stop; service jobchron stop;'</code>
# Stop all jobs running on the maintenance host
# Stop all jobs running on the maintenance host
#* cherry-pick https://gerrit.wikimedia.org/r/#/c/283952/
#* Merge [https://gerrit.wikimedia.org/r/#/c/283952/ <s>https://gerrit.wikimedia.org/r/#/c/283952/</s>] https://gerrit.wikimedia.org/r/#/c/284404/
#* run <code>ssh terbium.eqiad.wmnet sudo 'puppet agent -t'</code>
#* <s>run <code>ssh terbium.eqiad.wmnet sudo 'puppet agent -t;sudo  killall php; sudo killall php5;'</code></s>
#* manually kill any long-running scripts
#* run <code>ssh wasat.codfw.wmnet 'sudo puppet agent -t;sudo  killall php; sudo killall php5;'</code>
#* manually check for any scripts that need to be killed
#Disable puppet on all eqiad and codfw databases masters
#*<code>salt -L "db2016.codfw.wmnet,db2017.codfw.wmnet,db2018.codfw.wmnet,db2019.codfw.wmnet,db2023.codfw.wmnet,db2028.codfw.wmnet,db2029.codfw.wmnet,es2015.codfw.wmnet,es2018.codfw.wmnet,db2009.codfw.wmnet,pc2004.codfw.wmnet,pc2005.codfw.wmnet,pc2006.codfw.wmnet" cmd.run "puppet agent --disable 'switchover'"</code>
#*<code>salt -L "db1057.eqiad.wmnet,db1018.eqiad.wmnet,db1075.eqiad.wmnet,db1042.eqiad.wmnet,db1049.eqiad.wmnet,db1050.eqiad.wmnet,db1041.eqiad.wmnet,es1015.eqiad.wmnet,es1019.eqiad.wmnet,db1031.eqiad.wmnet,pc1004.eqiad.wmnet,pc1005.eqiad.wmnet,pc1006.eqiad.wmnet" cmd.run "puppet agent --disable 'switchover'"</code>
#Set final <code>$master</code> status for databases in advance (puppet disabled)
#* Merge https://gerrit.wikimedia.org/r/#/c/284514/
# Switch <code>pt-heartbeat</code> from ''active site'' (codfw) to ''new site'' (eqiad) masters (TODO: clean this up/make it generic)
#* Check codfw masters with: <code>salt -L "db2016.codfw.wmnet,db2017.codfw.wmnet,db2018.codfw.wmnet,db2019.codfw.wmnet,db2023.codfw.wmnet,db2028.codfw.wmnet,db2029.codfw.wmnet,es2015.codfw.wmnet,es2018.codfw.wmnet,db2009.codfw.wmnet,pc2004.codfw.wmnet,pc2005.codfw.wmnet,pc2006.codfw.wmnet" cmd.run "pgrep -f '/usr/bin/perl\s/usr/local/bin/pt-heartbeat-wikimedia'"</code>
#* Kill pt-heartbeat on codfw masters with: <code>salt -L "db2016.codfw.wmnet,db2017.codfw.wmnet,db2018.codfw.wmnet,db2019.codfw.wmnet,db2023.codfw.wmnet,db2028.codfw.wmnet,db2029.codfw.wmnet,es2015.codfw.wmnet,es2018.codfw.wmnet,db2009.codfw.wmnet,pc2004.codfw.wmnet,pc2005.codfw.wmnet,pc2006.codfw.wmnet" cmd.run "pkill -f '/usr/bin/perl\s/usr/local/bin/pt-heartbeat-wikimedia'"</code>
#* Check eqiad masters with: <code>salt -L "db1057.eqiad.wmnet,db1018.eqiad.wmnet,db1075.eqiad.wmnet,db1042.eqiad.wmnet,db1049.eqiad.wmnet,db1050.eqiad.wmnet,db1041.eqiad.wmnet,es1015.eqiad.wmnet,es1019.eqiad.wmnet,db1031.eqiad.wmnet,pc1004.eqiad.wmnet,pc1005.eqiad.wmnet,pc1006.eqiad.wmnet" cmd.run "pgrep -f '/usr/bin/perl\s/usr/local/bin/pt-heartbeat-wikimedia'"</code>
#* Start pt-heartbeat on eqiad masters with: <code>/home/volans/eqiad-start-pt-heartbeat.sh</code>
 
==== Phase 2 - read-only mode ====
# Deploy mediawiki-config with all shards set to read-only
# Deploy mediawiki-config with all shards set to read-only
#* [[gerrit:283953|https://gerrit.wikimedia.org/r/283953]]
#* [[gerrit:283953|<s>https://gerrit.wikimedia.org/r/283953</s>]] https://gerrit.wikimedia.org/r/#/c/284402/
# Set codfw databases (masters) in [[/db read-only|read-only mode]].
 
#* Also set parsercaches read_only=off for the new datacenter
==== Phase 3 - lock down database masters, cache wipes, warmups ====
# Wipe codfw memcached to prevent stale values. Once eqiad is read-only and codfw read-only master/slaves are caught up, any subsequent requests to codfw that set memcached are fine. Run <code>salt -C 'G@cluster:memcached and G@site:codfw' cmd.run 'service memcached restart'</code>
# Set ''active site''<nowiki/>'s databases (masters) in [[/db read-only|read-only mode]] except parsercache ones.
# Switch the datacenter in puppet
#* Check with:  <code>while read master; do echo $master; mysql -h $master --batch --skip-column-names -e "SELECT @@global.read_only"; done < /home/volans/codfw-nopc-masters.txt</code>
#* Set <code>$app_routes['mediawiki'] = 'codfw'</code> in puppet (cherry-pick [[gerrit:282898|https://gerrit.wikimedia.org/r/282898]])
#* Change with:  <code>while read master; do echo $master; mysql -h $master --batch --skip-column-names -e "SET GLOBAL read_only=1"; done < /home/volans/codfw-nopc-masters.txt</code>
#* <code>$wmfMasterDatacenter</code> in mediawiki-config (https://gerrit.wikimedia.org/r/#/c/282897/). This has several consequences:
# Wipe ''new site''<nowiki/>'s memcached to prevent stale values only once the new site's read-only master/slaves are caught up. Run<code><s>salt -C 'G@cluster:memcached and G@site:codfw' cmd.run 'service memcached restart'</s></code>  <code>salt -C 'G@cluster:memcached and G@site:eqiad' cmd.run 'service memcached restart'</code>
#* Redis replication will be flowing codfw => eqiad once puppet has run<br />first in codfw <code>salt 'mc2*' cmd.run 'puppet agent -t'; salt 'rdb2*' cmd.run 'puppet agent -t'</code><br />and then in eqiad <code>salt 'mc1*' cmd.run 'puppet agent -t'; salt 'rdb1*' cmd.run 'puppet agent -t'</code>
# Warm up memcached and APC <TODO: script by Timo>
#* RESTBase (needs puppet run + service restart): <code>salt -b10% -G 'cluster:restbase' cmd.run 'puppet agent -t; service restbase restart'</code>
 
#* All other services will be automatically reconfigured whenever puppet runs after <code>$app_routes</code> is modified <code>salt 'sc*' cmd.run 'puppet agent -t'</code>
==== Phase 4 - switch active datacenter configuration ====
# Parsoid switch of the action API endpoint (manages its own config, will need a deploy + restart of its own)
# Switch the datacenter in puppet, by setting <code>$app_routes['mediawiki']</code>
#* merge https://gerrit.wikimedia.org/r/#/c/282904/
#* Merge https://gerrit.wikimedia.org/r/#/c/284397
# Deploy Varnish to switch backend to <code>appserver.svc.codfw.wmnet</code>/<code>api.svc.codfw.wmnet</code>
# Switch the datacenter in mediawiki-config (<code>$wmfMasterDatacenter</code>)
#* merge https://gerrit.wikimedia.org/r/#/c/282910/
#* Merge https://gerrit.wikimedia.org/r/#/c/284398/
#* run puppet on all cache_text
 
==== Phase 5 - apply configuration ====
# Redis replication
#* stop puppet on all redises <code>salt 'mc*' cmd.run 'puppet agent --disable'; salt 'rdb*' cmd.run 'puppet agent --disable'</code>
#* switch the Redis replication on all Redises from the ''old site'' (codfw) to the ''new site'' (eqiad) at runtime with [https://gist.github.com/lavagetto/c3d22c22a4ccd27f38e14723d9144171 switch_redis_replication.py]
#:* Run <code>switch_redis_replication.py memcached.yaml codfw eqiad</code>
#:* Run <code>switch_redis_replication.py jobqueue.yaml codfw eqiad</code>
#:* Verify those are now masters by running <code>check_redis.py memcached.yaml,jobqueue.yaml</code>
#* Alternatively via puppet:
#:* run <code>salt 'mc1*' cmd.run 'puppet agent --enable; puppet agent -t'; salt 'rdb1*' cmd.run 'puppet agent --enable; puppet agent -t'</code>
#:* verify eqiad Redises now think they're masters
#:* run <code>salt 'mc2*' cmd.run 'puppet agent --enable; puppet agent -t'; salt 'rdb2*' cmd.run 'puppet agent --enable; puppet agent -t'</code>
#:* verify codfw redises are replicating
# RESTBase (for the action API endpoint)
#* run <code>salt -b10% -G 'cluster:restbase' cmd.run 'puppet agent -t; service restbase restart'</code>
# Misc services cluster (for the action API endpoint)
#* run <code>salt 'sc*' cmd.run 'puppet agent -t'</code>
# Parsoid (for the action API endpoint)
#*Merge https://gerrit.wikimedia.org/r/#/c/284399/
#*Deploy
#Switch parsercache RO/RW
#* Check with: <code>while read master; do echo $master; mysql -h $master --batch --skip-column-names -e "SELECT @@global.read_only"; done < /home/volans/all-pasercaches.txt</code>
#* codfw RO: <code>while read master; do echo $master; mysql -h $master --batch --skip-column-names -e "SET GLOBAL read_only=1"; done < /home/jynus/codfw-parsercaches.txt</code>
#* eqiad RW: <code>while read master; do echo $master; mysql -h $master --batch --skip-column-names -e "SET GLOBAL read_only=0"; done < /home/jynus/eqiad-parsercaches.txt</code>
# Switch Varnish backend to <code>appserver.svc.$newsite.wmnet</code>/<code>api.svc.$newsite.wmnet</code>
#* Merge https://gerrit.wikimedia.org/r/#/c/284400/
#* run <code>salt -G 'cluster:cache_text' cmd.run 'puppet agent -t'</code>
# Point Swift imagescalers to the active MediaWiki
# Point Swift imagescalers to the active MediaWiki
#* Merge https://gerrit.wikimedia.org/r/#/c/268080/
#* Merge [[gerrit:284401|https://gerrit.wikimedia.org/r/284401]]
#* roll-restart swift: in eqiad <code>salt -b1 'ms-fe1*' cmd.run 'puppet agent -t; swift-init all restart'</code> and codfw <code>salt -b1 'ms-fe2*' cmd.run 'puppet agent -t; swift-init all restart'</code>
#* restart swift in eqiad: <code>salt -b1 'ms-fe1*' cmd.run 'puppet agent -t; swift-init all restart'</code>
# Database master swap for every core (s1-7), External Storage (es2-3, not es1), parsercache (pc) and extra (x1) database
#* restart swift in codfw: <code>salt -b1 'ms-fe2*' cmd.run 'puppet agent -t; swift-init all restart'</code>
#* Deploy puppet <code>$master = true / false</code> for the appropriate hosts. https://gerrit.wikimedia.org/r/284144
 
#* Manually kill and start pt-heartbeat-wikimedia on the appropriate masters (do not wait for puppet)
==== Phase 6 - database master swap ====
#* [[/db read-only|Set eqiad masters databases mysql as read-write manually]]
# Database master swap for every core (s1-7), External Storage (es2-3, not es1) and extra (x1) database
#* Check with: <code>while read master; do echo $master; mysql -h $master --batch --skip-column-names -e "SELECT @@global.read_only"; done < /home/volans/eqiad-nopc-masters.txt</code>
#* Change with: <code>while read master; do echo $master; mysql -h $master --batch --skip-column-names -e "SET GLOBAL read_only=0"; done < /home/volans/eqiad-nopc-masters.txt</code>
 
==== Phase 7 - Undo read-only ====
# Deploy mediawiki-config eqiad with all shards set to read-write
# Deploy mediawiki-config eqiad with all shards set to read-write
#* https://gerrit.wikimedia.org/r/284157
#* [https://gerrit.wikimedia.org/r/284157 <s>https://gerrit.wikimedia.org/r/284157</s>]https://gerrit.wikimedia.org/r/#/c/284396/
#* Make sure recent changes are flowing (see Special:RecentChanges, rcstream and the IRC feeds) - if not, revert
 
# Start the jobqueue in codfw
==== Phase 8 - post read-only ====
#* Merge https://gerrit.wikimedia.org/r/282881
# Start the jobqueue in the ''new site''
#* run <code>salt -b 6 -C 'G@cluster:jobrunner and G@site:codfw' cmd.run 'puppet agent -t'</code>
#* Merge [https://gerrit.wikimedia.org/r/282881 <s>https://gerrit.wikimedia.org/r/282881</s>]https://gerrit.wikimedia.org/r/#/c/284394/
#* <s>run <code>salt -b 6 -C 'G@cluster:jobrunner and G@site:codfw' cmd.run 'puppet agent -t'</code></s> run <code>salt -b 6 -C 'G@cluster:jobrunner and G@site:eqiad' cmd.run 'puppet agent -t'</code>
# Start the cron jobs on the maintenance host in codfw
# Start the cron jobs on the maintenance host in codfw
#* Merge https://gerrit.wikimedia.org/r/#/c/283954/
#* Merge [https://gerrit.wikimedia.org/r/#/c/283954/ <s>https://gerrit.wikimedia.org/r/#/c/283954/</s>]https://gerrit.wikimedia.org/r/#/c/284395/
#* run <code>ssh wasat.codfw.wmnet sudo 'puppet agent -t'</code>
#* <s>run <code>ssh wasat.codfw.wmnet 'sudo puppet agent -t'</code></s> run <code>ssh terbium.codfw.wmnet 'sudo puppet agent -t'</code>
==== Debugging ====
# Re-enable puppet on all eqiad and codfw databases masters
You can force Varnish to pass a request to a backend in codfw or eqiad using the [[X-Wikimedia-Debug]] header.
#* codfw masters: <code>salt -L "db2016.codfw.wmnet,db2017.codfw.wmnet,db2018.codfw.wmnet,db2019.codfw.wmnet,db2023.codfw.wmnet,db2028.codfw.wmnet,db2029.codfw.wmnet,es2015.codfw.wmnet,es2018.codfw.wmnet,db2009.codfw.wmnet,pc2004.codfw.wmnet,pc2005.codfw.wmnet,pc2006.codfw.wmnet" cmd.run "puppet agent --enable"</code>
 
#* eqiad masters: <code>salt -L "db1057.eqiad.wmnet,db1018.eqiad.wmnet,db1075.eqiad.wmnet,db1042.eqiad.wmnet,db1049.eqiad.wmnet,db1050.eqiad.wmnet,db1041.eqiad.wmnet,es1015.eqiad.wmnet,es1019.eqiad.wmnet,db1031.eqiad.wmnet,pc1004.eqiad.wmnet,pc1005.eqiad.wmnet,pc1006.eqiad.wmnet" cmd.run "puppet agent --enable"</code>
For codfw, use <code>X-Wikimedia-Debug: backend=mw2017.codfw.wmnet</code>
# Update DNS records for new database masters
#* Merge [[gerrit:284667|https://gerrit.wikimedia.org/r/284667]]
# Run the script to fix broken wikidata entities on the maintenance host of the active datacenter: <code>sudo -u www-data mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildEntityPerPage.php --wiki=wikidatawiki --force</code>


For eqiad, use <code>X-Wikimedia-Debug: backend=mw1017.eqiad.wmnet</code>
==== Phase 9 - verification and troubleshooting ====
# Make sure reading & editing works! :)
# Make sure recent changes are flowing (see Special:RecentChanges, rcstream and the IRC feeds)
# Make sure email works (<code>exim4 -bp</code> on mx1001/mx2001, test an email)


=== Media storage/Swift ===
=== Media storage/Swift ===
==== Ahead of the switchover, originals and thumbs ====
==== Ahead of the switchover, originals and thumbs ====
# '''MediaWiki:''' Write synchronously to both eqiad/codfw with https://gerrit.wikimedia.org/r/#/c/282888/
# '''MediaWiki:''' Write synchronously to both sites with <s>https://gerrit.wikimedia.org/r/#/c/282888/</s>https://gerrit.wikimedia.org/r/284652
# '''Cache->app:''' Change varnish backends for <tt>swift</tt> and <tt>swift_thumbs</tt> to point to codfw with https://gerrit.wikimedia.org/r/#/c/282890/
# '''Cache->app:''' Change varnish backends for <tt>swift</tt> and <tt>swift_thumbs</tt> to point to ''new site'' with <s>https://gerrit.wikimedia.org/r/#/c/282890/</s>https://gerrit.wikimedia.org/r/284651
## Force a puppet run on cache_upload in eqiad + codfw: <tt>salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and ( G@site:eqiad or G@site:codfw )' cmd.run 'puppet agent --test'</tt>
## Force a puppet run on cache_upload in ''both sites'': <tt>salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and ( G@site:eqiad or G@site:codfw )' cmd.run 'puppet agent --test'</tt>
# '''Inter-Cache:''' Switch codfw from 'eqiad' to 'direct' in cache::route_table for upload https://gerrit.wikimedia.org/r/#/c/282891/
# '''Inter-Cache:''' Switch ''new site'' from ''active site'' to 'direct' in cache::route_table for upload <s>https://gerrit.wikimedia.org/r/#/c/282891/</s>https://gerrit.wikimedia.org/r/284650
## Force a puppet run on cache_upload in codfw: <tt>salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:codfw' cmd.run 'puppet agent --test'</tt>
## Force a puppet run on cache_upload in ''new site'': <tt>salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:codfw' cmd.run 'puppet agent --test'</tt>
# '''Users:''' De-pool eqiad in GeoDNS https://gerrit.wikimedia.org/r/#/c/283416/ + <tt>authdns-update</tt>
# '''Users:''' De-pool ''active site'' in GeoDNS <s>https://gerrit.wikimedia.org/r/#/c/283416/</s>https://gerrit.wikimedia.org/r/#/c/284694/ + <tt>authdns-update</tt>
# '''Inter-Cache:''' Switch esams from 'eqiad' to 'codfw' in cache::route_table for upload https://gerrit.wikimedia.org/r/#/c/283418/
# '''Inter-Cache:''' Switch all ''caching sites'' currently pointing from ''active site'' to ''new site''  in cache::route_table for upload <s>https://gerrit.wikimedia.org/r/#/c/283418/</s>https://gerrit.wikimedia.org/r/284649
## Force a puppet run on cache_upload in esams: <tt>salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:esams' cmd.run 'puppet agent --test'</tt>
## Force a puppet run on cache_upload in ''caching sites'': <tt>salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:esams' cmd.run 'puppet agent --test'</tt>
# '''Inter-Cache:''' Switch eqiad from 'direct' to 'codfw' in cache::route_table for upload https://gerrit.wikimedia.org/r/#/c/282892/
# '''Inter-Cache:''' Switch ''active site'' from 'direct' to ''new site'' in cache::route_table for upload <s>https://gerrit.wikimedia.org/r/#/c/282892/</s>https://gerrit.wikimedia.org/r/284648
## Force a puppet run on cache_upload in eqiad: <tt>salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:eqiad' cmd.run 'puppet agent --test'</tt>
## Force a puppet run on cache_upload in ''active site'': <tt>salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:eqiad' cmd.run 'puppet agent --test'</tt>
 
==== Switching back ====
 
Repeat the steps above in reverse order, with suitable revert commits


=== ElasticSearch ===
=== ElasticSearch ===


Point CirrusSearch to codfw by editing wmgCirrusSearchDefaultCluster [https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/InitialiseSettings.php#L16025-L16027 InitialiseSettings.php]. The usual default value is "local", which means that if mediawiki switches DC, everything should be automatic. For this specific switch, the value has been set to "codfw" to switch Elasticsearch ahead of time.
Point CirrusSearch to codfw by editing wmgCirrusSearchDefaultCluster [https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/InitialiseSettings.php#L16025-L16027 InitialiseSettings.php]. The usual default value is "local", which means that if mediawiki switches DC, everything should be automatic. For this specific switch, the value has been set to "codfw" to switch Elasticsearch ahead of time.
To ensure coherence in case of lost updates, a reindex of the pages modified during the switch can be done by following [[Search#Recovering_from_an_Elasticsearch_outage.2Finterruption_in_updates|Recovering from an Elasticsearch outage / interruption in updates]].


=== Traffic ===
=== Traffic ===
Line 135: Line 192:


# '''Inter-Cache:''' Switch eqiad from 'codfw' to 'direct' in cache::route_table for all clusters.
# '''Inter-Cache:''' Switch eqiad from 'codfw' to 'direct' in cache::route_table for all clusters.
#* https://gerrit.wikimedia.org/r/284687
#* Force a puppet run on affected caches:
#* <code>salt -v -t 10 -b 17 -C 'G@site:eqiad and G@cluster:cache_text' cmd.run 'puppet agent --test'</code>
# '''Inter-Cache:''' Switch esams from 'codfw' to 'eqiad' in cache::route_table for all clusters.
# '''Inter-Cache:''' Switch esams from 'codfw' to 'eqiad' in cache::route_table for all clusters.
#* https://gerrit.wikimedia.org/r/284688
#* Force a puppet run on affected caches:
#* <code>salt -v -t 10 -b 17 -C 'G@site:esams and G@cluster:cache_text' cmd.run 'puppet agent --test'</code>
# '''Users:''' Re-pool eqiad in GeoDNS.
# '''Users:''' Re-pool eqiad in GeoDNS.
#* https://gerrit.wikimedia.org/r/284692
#* <code>authdns-update</code> on any one of the authdns servers (radon, baham, eeden)
# '''Inter-Cache:''' Switch codfw from 'direct' to 'eqiad' in cache::route_table for all clusters.
# '''Inter-Cache:''' Switch codfw from 'direct' to 'eqiad' in cache::route_table for all clusters.
#* https://gerrit.wikimedia.org/r/284689
#* Force a puppet run on affected caches:
#* <code>salt -v -t 10 -b 17 -C 'G@site:codfw and G@cluster:cache_text' cmd.run 'puppet agent --test'</code>


=== Services ===
=== Services ===

Revision as of 15:42, 21 April 2016

Introduction

A datacenter switchover (from eqiad to codfw, or vice-versa) comprises switching over multiple different components, some of which can happen independently and many of which need to happen in lockstep. This page documents all the steps needed to switch over from a master datacenter to another one, broken up by component.

Schedule for Q3 FY2015-2016 rollout

  • Deployment server: Wednesday, January 20th
  • Traffic: Thursday, March 10th
  • MediaWiki 5-minute read-only test: Tuesday, March 15th 07:00 UTC
  • ElasticSearch: Thursday, April 7th, 12:00 UTC
  • Media storage/Swift: Thursday, April 14th 17:00 UTC
  • Services: Monday, April 18th, 10:00 UTC
  • MediaWiki: Tuesday, April 19th, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)

Switching back

  • MediaWiki: Thursday, April 21st, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)
  • Services, ElasticSearch, Traffic, Swift, Deployment server: Thursday, April 21st, after the above is done


Per-service switchover instructions

MediaWiki-related

Phase 1 - preparation

  1. Warm up databases; also see manual cache warmup.
  2. Stop jobqueues in the active site
  3. Stop all jobs running on the maintenance host
  4. Disable puppet on all eqiad and codfw databases masters
    • salt -L "db2016.codfw.wmnet,db2017.codfw.wmnet,db2018.codfw.wmnet,db2019.codfw.wmnet,db2023.codfw.wmnet,db2028.codfw.wmnet,db2029.codfw.wmnet,es2015.codfw.wmnet,es2018.codfw.wmnet,db2009.codfw.wmnet,pc2004.codfw.wmnet,pc2005.codfw.wmnet,pc2006.codfw.wmnet" cmd.run "puppet agent --disable 'switchover'"
    • salt -L "db1057.eqiad.wmnet,db1018.eqiad.wmnet,db1075.eqiad.wmnet,db1042.eqiad.wmnet,db1049.eqiad.wmnet,db1050.eqiad.wmnet,db1041.eqiad.wmnet,es1015.eqiad.wmnet,es1019.eqiad.wmnet,db1031.eqiad.wmnet,pc1004.eqiad.wmnet,pc1005.eqiad.wmnet,pc1006.eqiad.wmnet" cmd.run "puppet agent --disable 'switchover'"
  5. Set final $master status for databases in advance (puppet disabled)
  6. Switch pt-heartbeat from active site (codfw) to new site (eqiad) masters (TODO: clean this up/make it generic)
    • Check codfw masters with: salt -L "db2016.codfw.wmnet,db2017.codfw.wmnet,db2018.codfw.wmnet,db2019.codfw.wmnet,db2023.codfw.wmnet,db2028.codfw.wmnet,db2029.codfw.wmnet,es2015.codfw.wmnet,es2018.codfw.wmnet,db2009.codfw.wmnet,pc2004.codfw.wmnet,pc2005.codfw.wmnet,pc2006.codfw.wmnet" cmd.run "pgrep -f '/usr/bin/perl\s/usr/local/bin/pt-heartbeat-wikimedia'"
    • Kill pt-heartbeat on codfw masters with: salt -L "db2016.codfw.wmnet,db2017.codfw.wmnet,db2018.codfw.wmnet,db2019.codfw.wmnet,db2023.codfw.wmnet,db2028.codfw.wmnet,db2029.codfw.wmnet,es2015.codfw.wmnet,es2018.codfw.wmnet,db2009.codfw.wmnet,pc2004.codfw.wmnet,pc2005.codfw.wmnet,pc2006.codfw.wmnet" cmd.run "pkill -f '/usr/bin/perl\s/usr/local/bin/pt-heartbeat-wikimedia'"
    • Check eqiad masters with: salt -L "db1057.eqiad.wmnet,db1018.eqiad.wmnet,db1075.eqiad.wmnet,db1042.eqiad.wmnet,db1049.eqiad.wmnet,db1050.eqiad.wmnet,db1041.eqiad.wmnet,es1015.eqiad.wmnet,es1019.eqiad.wmnet,db1031.eqiad.wmnet,pc1004.eqiad.wmnet,pc1005.eqiad.wmnet,pc1006.eqiad.wmnet" cmd.run "pgrep -f '/usr/bin/perl\s/usr/local/bin/pt-heartbeat-wikimedia'"
    • Start pt-heartbeat on eqiad masters with: /home/volans/eqiad-start-pt-heartbeat.sh
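
The salt -L commands above repeat the same list of master hostnames several times. As a minimal sketch, assuming the hostnames are kept one per line in a plain text file (like the /home/volans/*-masters.txt files used in later phases), the target string could be generated rather than pasted. The helper name hosts_to_salt_list and the file name masters.txt are hypothetical and not part of the original procedure.

```shell
#!/bin/bash
# Hypothetical helper: turn a one-hostname-per-line file into the
# comma-separated target string that `salt -L` expects.
# Blank lines and lines starting with '#' are skipped.
hosts_to_salt_list() {
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" | paste -sd, -
}

# Example (not executed here; masters.txt is an assumed file name):
#   salt -L "$(hosts_to_salt_list masters.txt)" cmd.run "puppet agent --disable 'switchover'"
```

Keeping a single host file per site would also let the disable, check, kill and re-enable steps share one source of truth.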

Phase 2 - read-only mode

  1. Deploy mediawiki-config with all shards set to read-only

Phase 3 - lock down database masters, cache wipes, warmups

  1. Set active site's databases (masters) in read-only mode except parsercache ones.
    • Check with: while read master; do echo $master; mysql -h $master --batch --skip-column-names -e "SELECT @@global.read_only"; done < /home/volans/codfw-nopc-masters.txt
    • Change with: while read master; do echo $master; mysql -h $master --batch --skip-column-names -e "SET GLOBAL read_only=1"; done < /home/volans/codfw-nopc-masters.txt
  2. Wipe new site's memcached to prevent stale values, but only once the new site's read-only master/slaves are caught up. Run: salt -C 'G@cluster:memcached and G@site:eqiad' cmd.run 'service memcached restart'
  3. Warm up memcached and APC <TODO: script by Timo>
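
The "Check with" loop above prints each host followed by its read_only value and leaves the comparison to the operator. A minimal sketch of a stricter variant follows, assuming the same mysql invocation and host files; the function names query_read_only and assert_read_only are hypothetical. It returns non-zero if any master does not report the expected value.

```shell
#!/bin/bash
# query_read_only HOST: print @@global.read_only for HOST, using the
# same mysql invocation as the check loop in the procedure.
query_read_only() {
    mysql -h "$1" --batch --skip-column-names -e "SELECT @@global.read_only"
}

# assert_read_only EXPECTED FILE: hypothetical wrapper that checks every
# host listed in FILE and returns non-zero if any host does not report
# EXPECTED (0 or 1). A failed query yields an empty value and therefore
# also counts as a mismatch.
assert_read_only() {
    local expected=$1 file=$2 master value status=0
    while read -r master; do
        [ -z "$master" ] && continue
        # </dev/null keeps the query from consuming the host list on stdin
        value=$(query_read_only "$master" </dev/null)
        if [ "$value" != "$expected" ]; then
            echo "MISMATCH: $master read_only=$value (expected $expected)" >&2
            status=1
        fi
    done < "$file"
    return $status
}

# Example (not executed here):
#   assert_read_only 1 /home/volans/codfw-nopc-masters.txt && echo "all masters read-only"
```

The same wrapper could be reused with an expected value of 0 when masters are set back to read-write.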

Phase 4 - switch active datacenter configuration

  1. Switch the datacenter in puppet, by setting $app_routes['mediawiki']
  2. Switch the datacenter in mediawiki-config ($wmfMasterDatacenter)

Phase 5 - apply configuration

  1. Redis replication
    • stop puppet on all redises salt 'mc*' cmd.run 'puppet agent --disable'; salt 'rdb*' cmd.run 'puppet agent --disable'
    • switch the Redis replication on all Redises from the old site (codfw) to the new site (eqiad) at runtime with switch_redis_replication.py
    • Run switch_redis_replication.py memcached.yaml codfw eqiad
    • Run switch_redis_replication.py jobqueue.yaml codfw eqiad
    • Verify those are now masters by running check_redis.py memcached.yaml,jobqueue.yaml
    • Alternatively via puppet:
    • run salt 'mc1*' cmd.run 'puppet agent --enable; puppet agent -t'; salt 'rdb1*' cmd.run 'puppet agent --enable; puppet agent -t'
    • verify eqiad Redises now think they're masters
    • run salt 'mc2*' cmd.run 'puppet agent --enable; puppet agent -t'; salt 'rdb2*' cmd.run 'puppet agent --enable; puppet agent -t'
    • verify codfw redises are replicating
  2. RESTBase (for the action API endpoint)
    • run salt -b10% -G 'cluster:restbase' cmd.run 'puppet agent -t; service restbase restart'
  3. Misc services cluster (for the action API endpoint)
    • run salt 'sc*' cmd.run 'puppet agent -t'
  4. Parsoid (for the action API endpoint)
  5. Switch parsercache RO/RW
    • Check with: while read master; do echo $master; mysql -h $master --batch --skip-column-names -e "SELECT @@global.read_only"; done < /home/volans/all-pasercaches.txt
    • codfw RO: while read master; do echo $master; mysql -h $master --batch --skip-column-names -e "SET GLOBAL read_only=1"; done < /home/jynus/codfw-parsercaches.txt
    • eqiad RW: while read master; do echo $master; mysql -h $master --batch --skip-column-names -e "SET GLOBAL read_only=0"; done < /home/jynus/eqiad-parsercaches.txt
  6. Switch Varnish backend to appserver.svc.$newsite.wmnet/api.svc.$newsite.wmnet
  7. Point Swift imagescalers to the active MediaWiki
    • Merge https://gerrit.wikimedia.org/r/284401
    • restart swift in eqiad: salt -b1 'ms-fe1*' cmd.run 'puppet agent -t; swift-init all restart'
    • restart swift in codfw: salt -b1 'ms-fe2*' cmd.run 'puppet agent -t; swift-init all restart'
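
For the "verify eqiad Redises now think they're masters" step in the Redis item above, the behaviour of check_redis.py is not shown here, so the following is an independent sketch rather than a description of that script. It assumes redis-cli is available and uses the standard INFO replication output, which contains a role: line; the function name redis_role is hypothetical, and host, port and authentication are left as placeholders.

```shell
#!/bin/bash
# redis_role: read `INFO replication` output on stdin and print the
# value of the "role" field (for example "master" or "slave").
# INFO output uses CRLF line endings, so the trailing \r is stripped.
redis_role() {
    awk -F: '$1 == "role" { sub(/\r$/, "", $2); print $2 }'
}

# Example (not executed here; $host, $port and any auth are assumptions):
#   redis-cli -h "$host" -p "$port" info replication | redis_role
```

Running this against each instance after the switch would show the new site's instances as masters and the old site's instances as replicas.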

Phase 6 - database master swap

  1. Database master swap for every core (s1-7), External Storage (es2-3, not es1) and extra (x1) database
    • Check with: while read master; do echo $master; mysql -h $master --batch --skip-column-names -e "SELECT @@global.read_only"; done < /home/volans/eqiad-nopc-masters.txt
    • Change with: while read master; do echo $master; mysql -h $master --batch --skip-column-names -e "SET GLOBAL read_only=0"; done < /home/volans/eqiad-nopc-masters.txt

Phase 7 - Undo read-only

  1. Deploy mediawiki-config eqiad with all shards set to read-write

Phase 8 - post read-only

  1. Start the jobqueue in the new site
  2. Start the cron jobs on the maintenance host in codfw
  3. Re-enable puppet on all eqiad and codfw databases masters
    • codfw masters: salt -L "db2016.codfw.wmnet,db2017.codfw.wmnet,db2018.codfw.wmnet,db2019.codfw.wmnet,db2023.codfw.wmnet,db2028.codfw.wmnet,db2029.codfw.wmnet,es2015.codfw.wmnet,es2018.codfw.wmnet,db2009.codfw.wmnet,pc2004.codfw.wmnet,pc2005.codfw.wmnet,pc2006.codfw.wmnet" cmd.run "puppet agent --enable"
    • eqiad masters: salt -L "db1057.eqiad.wmnet,db1018.eqiad.wmnet,db1075.eqiad.wmnet,db1042.eqiad.wmnet,db1049.eqiad.wmnet,db1050.eqiad.wmnet,db1041.eqiad.wmnet,es1015.eqiad.wmnet,es1019.eqiad.wmnet,db1031.eqiad.wmnet,pc1004.eqiad.wmnet,pc1005.eqiad.wmnet,pc1006.eqiad.wmnet" cmd.run "puppet agent --enable"
  4. Update DNS records for new database masters
  5. Run the script to fix broken wikidata entities on the maintenance host of the active datacenter: sudo -u www-data mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildEntityPerPage.php --wiki=wikidatawiki --force

Phase 9 - verification and troubleshooting

  1. Make sure reading & editing works! :)
  2. Make sure recent changes are flowing (see Special:RecentChanges, rcstream and the IRC feeds)
  3. Make sure email works (exim4 -bp on mx1001/mx2001, test an email)

Media storage/Swift

Ahead of the switchover, originals and thumbs

  1. MediaWiki: Write synchronously to both sites with https://gerrit.wikimedia.org/r/284652
  2. Cache->app: Change varnish backends for swift and swift_thumbs to point to new site with https://gerrit.wikimedia.org/r/284651
    1. Force a puppet run on cache_upload in both sites: salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and ( G@site:eqiad or G@site:codfw )' cmd.run 'puppet agent --test'
  3. Inter-Cache: Switch new site from active site to 'direct' in cache::route_table for upload https://gerrit.wikimedia.org/r/284650
    1. Force a puppet run on cache_upload in new site: salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:codfw' cmd.run 'puppet agent --test'
  4. Users: De-pool active site in GeoDNS https://gerrit.wikimedia.org/r/#/c/284694/ + authdns-update
  5. Inter-Cache: Switch all caching sites currently pointing at the active site to the new site in cache::route_table for upload https://gerrit.wikimedia.org/r/284649
    1. Force a puppet run on cache_upload in caching sites: salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:esams' cmd.run 'puppet agent --test'
  6. Inter-Cache: Switch active site from 'direct' to new site in cache::route_table for upload https://gerrit.wikimedia.org/r/284648
    1. Force a puppet run on cache_upload in active site: salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:eqiad' cmd.run 'puppet agent --test'

Switching back

Repeat the steps above in reverse order, with suitable revert commits

ElasticSearch

Point CirrusSearch to codfw by editing wmgCirrusSearchDefaultCluster InitialiseSettings.php. The usual default value is "local", which means that if mediawiki switches DC, everything should be automatic. For this specific switch, the value has been set to "codfw" to switch Elasticsearch ahead of time.

To ensure coherence in case of lost updates, a reindex of the pages modified during the switch can be done by following Recovering from an Elasticsearch outage / interruption in updates.

Traffic

GeoDNS user routing

Inter-Cache routing

Cache->App routing

Specifics for Switchover Test Week

Once all the applayer services we plan to switch have been switched successfully, we'll move user and inter-cache traffic away from eqiad:

  • The Upload cluster will be following similar instructions on the 14th during the Swift switch.
  • Maps and Misc clusters are not participating (low traffic, special issues, validated by the other moves)
  • This leaves just the text cluster to operate on below:
  1. Inter-Cache: Switch codfw from 'eqiad' to 'direct' in cache::route_table for the text cluster.
  2. Users: De-pool eqiad in GeoDNS for the text cluster.
  3. Inter-Cache: Switch esams from 'eqiad' to 'codfw' in cache::route_table for the text cluster.
  4. Inter-Cache: Switch eqiad from 'direct' to 'codfw' in cache::route_table for the text cluster.

Before reverting applayer services to eqiad, we'll undo the above steps in reverse order:

  1. Inter-Cache: Switch eqiad from 'codfw' to 'direct' in cache::route_table for all clusters.
  2. Inter-Cache: Switch esams from 'codfw' to 'eqiad' in cache::route_table for all clusters.
  3. Users: Re-pool eqiad in GeoDNS.
  4. Inter-Cache: Switch codfw from 'direct' to 'eqiad' in cache::route_table for all clusters.

Services

  • RESTBase and Parsoid already active in codfw, using eqiad MW API.
  • Shift traffic to codfw:
    • Public traffic: Update Varnish backend config.
    • Update RESTBase and Flow configs in mediawiki-config to use codfw.
  • During MW switch-over:

Other miscellaneous