Switch Datacenter

See also "Failover Test" on the Wikimedia Techblog: https://blog.wikimedia.org/2016/04/11/wikimedia-failover-test/ (April 11, 2016)

Introduction

A datacenter switchover (from eqiad to codfw, or vice-versa) comprises switching over multiple components, some of which can happen independently and many of which need to happen in lockstep. This page documents all the steps needed to switch over from one master datacenter to the other, broken down by component.

Schedule for Q3 FY2015-2016 rollout

  • Deployment server: Wednesday, January 20th
  • Traffic: Thursday, March 10th
  • MediaWiki 5-minute read-only test: Tuesday, March 15th 07:00 UTC
  • ElasticSearch: Thursday, April 7th, 12:00 UTC
  • Media storage/Swift: Thursday, April 14th 17:00 UTC
  • Services: Monday, April 18th, 10:00 UTC
  • MediaWiki: Tuesday, April 19th, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)

Switching back

  • MediaWiki: Thursday, April 21st, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)
  • Services, ElasticSearch, Traffic, Swift, Deployment server: Thursday, April 21st, after the above is done


Per-service switchover instructions

MediaWiki-related

The overall plan for an eqiad -> codfw switchover is:

  1. Warm up databases; also see the manual cache warmup subpage.
  2. Stop jobqueues in eqiad
    • cherry-pick https://gerrit.wikimedia.org/r/282880
    • run salt -C 'G@cluster:jobrunner and G@site:eqiad' cmd.run 'service jobrunner stop; service jobchron stop;'
  3. Stop all jobs running on the maintenance host
    • cherry-pick https://gerrit.wikimedia.org/r/#/c/283952/
    • run ssh terbium.eqiad.wmnet sudo 'puppet agent -t'
    • manually kill any long-running scripts
  4. Deploy mediawiki-config with all shards set to read-only
    • https://gerrit.wikimedia.org/r/283953
  5. Set eqiad databases (masters) in read-only mode.
  6. Wipe codfw memcached to prevent stale values. Once eqiad is read-only and the codfw masters/slaves have caught up, any subsequent requests to codfw that set memcached are fine (a verification sketch follows this list). Run salt -C 'G@cluster:memcached and G@site:codfw' cmd.run 'service memcached restart'
  7. Switch the datacenter in puppet
    • Set $app_routes['mediawiki'] = 'codfw' in puppet (cherry-pick https://gerrit.wikimedia.org/r/282898)
    • Set $wmfMasterDatacenter in mediawiki-config (https://gerrit.wikimedia.org/r/#/c/282897/). This has several consequences:
    • Redis replication will flow codfw => eqiad once puppet has run,
      first in codfw: salt 'mc2*' cmd.run 'puppet agent -t'; salt 'rdb2*' cmd.run 'puppet agent -t'
      and then in eqiad: salt 'mc1*' cmd.run 'puppet agent -t'; salt 'rdb1*' cmd.run 'puppet agent -t'
    • RESTBase (needs puppet run + service restart): salt -b10% -G 'cluster:restbase' cmd.run 'puppet agent -t; service restbase restart'
    • All other services will be automatically reconfigured whenever puppet runs after $app_routes is modified: salt 'sc*' cmd.run 'puppet agent -t'
  8. Parsoid switch of the action API endpoint (manages its own config; needs a deploy + restart of its own)
    • merge https://gerrit.wikimedia.org/r/#/c/282904/
  9. Deploy Varnish to switch backend to appserver.svc.codfw.wmnet/api.svc.codfw.wmnet
    • merge https://gerrit.wikimedia.org/r/#/c/282910/
    • run puppet on all cache_text
  10. Point Swift imagescalers to the active MediaWiki
    • Merge https://gerrit.wikimedia.org/r/#/c/268080/
    • run puppet on ms-fe* plus swift-init all reload for the changes to take effect: eqiad: salt -b1 'ms-fe1*' cmd.run 'puppet agent -t; swift-init all reload'; codfw: salt -b1 'ms-fe2*' cmd.run 'puppet agent -t; swift-init all reload'
  11. Database master swap for every core (s1-7), External Storage (es2-3, not es1), parsercache (pc) and extra (x1) database
    • Deploy puppet $master = true for the appropriate hosts: https://gerrit.wikimedia.org/r/284144 (make sure pt-heartbeat-wikimedia is running on the new hosts and not on the old ones by applying the parameter change on all masters)
    • Make sure all masters, including the inactive ones, are using STATEMENT-based replication (required for labs filtering)
    • Set codfw masters (MySQL) as read-write
    • Change *-master DNS to the new masters (only used by humans)
  12. Deploy mediawiki-config codfw with all shards set to read-write
    • https://gerrit.wikimedia.org/r/284157
    • Make sure recent changes are flowing (see Special:RecentChanges, rcstream and the IRC feeds) - if not, revert
  13. Start the jobqueue in codfw
    • Merge https://gerrit.wikimedia.org/r/282881
    • run salt -b 6 -C 'G@cluster:jobrunner and G@site:codfw' cmd.run 'puppet agent -t'
  14. Start the cron jobs on the maintenance host in codfw
    • Merge https://gerrit.wikimedia.org/r/#/c/283954/
    • run ssh wasat.codfw.wmnet sudo 'puppet agent -t'
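
A minimal verification sketch for steps 5, 6 and 11, assuming the *-master CNAMEs mentioned in step 11 resolve to the current masters and that the operator has MySQL client access from an ops host; the shard (s1) and expected values are illustrative:

  # 1. Before wiping codfw memcached (step 6), confirm the eqiad master is read-only
  mysql -h s1-master.eqiad.wmnet -e "SELECT @@global.read_only;"        # expect 1

  # 2. Confirm the codfw master has caught up with eqiad replication
  mysql -h s1-master.codfw.wmnet -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master   # expect 0

  # 3. After the master swap (step 11), confirm the codfw master is writable again
  mysql -h s1-master.codfw.wmnet -e "SELECT @@global.read_only;"        # expect 0

Repeat per section: core (s1-s7), External Storage (es2-3), parsercache (pc) and extra (x1).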

Debugging

You can force Varnish to pass a request to a backend in codfw or eqiad using the X-Wikimedia-Debug header.

For codfw, use X-Wikimedia-Debug: backend=mw2017.codfw.wmnet

For eqiad, use X-Wikimedia-Debug: backend=mw1017.eqiad.wmnet
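
A hedged way to exercise this from the command line; the URL is illustrative and the exact response headers to inspect (e.g. X-Cache, Server) may vary:

  # Ask the text caches to use a specific codfw backend and dump the response headers
  curl -s -D - -o /dev/null \
       -H 'X-Wikimedia-Debug: backend=mw2017.codfw.wmnet' \
       'https://en.wikipedia.org/wiki/Special:Version'

Swap the header value for backend=mw1017.eqiad.wmnet to test the eqiad path.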

Media storage/Swift

Ahead of the switchover, for both originals and thumbs (a spot-check sketch follows the list):

  1. MediaWiki: Write synchronously to both eqiad/codfw with https://gerrit.wikimedia.org/r/#/c/282888/
  2. Cache->app: Change varnish backends for swift and swift_thumbs to point to codfw with https://gerrit.wikimedia.org/r/#/c/282890/
    1. Force a puppet run on cache_upload in eqiad + codfw: salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and ( G@site:eqiad or G@site:codfw )' cmd.run 'puppet agent --test'
  3. Inter-Cache: Switch codfw from 'eqiad' to 'direct' in cache::route_table for upload https://gerrit.wikimedia.org/r/#/c/282891/
    1. Force a puppet run on cache_upload in codfw: salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:codfw' cmd.run 'puppet agent --test'
  4. Users: De-pool eqiad in GeoDNS https://gerrit.wikimedia.org/r/#/c/283416/ + authdns-update
  5. Inter-Cache: Switch esams from 'eqiad' to 'codfw' in cache::route_table for upload https://gerrit.wikimedia.org/r/#/c/283418/
    1. Force a puppet run on cache_upload in esams: salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:esams' cmd.run 'puppet agent --test'
  6. Inter-Cache: Switch eqiad from 'direct' to 'codfw' in cache::route_table for upload https://gerrit.wikimedia.org/r/#/c/282892/
    1. Force a puppet run on cache_upload in eqiad: salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:eqiad' cmd.run 'puppet agent --test'
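
Once the Varnish and GeoDNS changes above are in, a hedged spot check from any client; the thumbnail URL, the cp20xx naming of codfw caches in X-Cache, and the authoritative nameserver are assumptions:

  # Confirm upload traffic is being served through the codfw caches
  curl -s -D - -o /dev/null \
       'https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Example.jpg/120px-Example.jpg' \
       | grep -i x-cache

  # Confirm eqiad is de-pooled in GeoDNS for upload
  dig +short upload.wikimedia.org @ns0.wikimedia.org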

ElasticSearch

Point CirrusSearch to codfw by editing wmgCirrusSearchDefaultCluster in InitialiseSettings.php. The usual default value is "local", which means that if MediaWiki switches DC, everything should be automatic. For this specific switch, the value has been set to "codfw" to switch Elasticsearch ahead of time.
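
A hedged sketch of verifying the change, assuming the standard mediawiki-config checkout on the deployment host and the codfw search LVS endpoint (path, hostname and port are assumptions):

  # Confirm the override is in place in the config
  grep -n 'wmgCirrusSearchDefaultCluster' /srv/mediawiki-staging/wmf-config/InitialiseSettings.php

  # Check that the codfw Elasticsearch cluster is healthy before sending it traffic
  curl -s 'http://search.svc.codfw.wmnet:9200/_cluster/health?pretty'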

Traffic

GeoDNS user routing

Inter-Cache routing

Cache->App routing

Specifics for Switchover Test Week

After successfully switching all of the applayer services we plan to switch, we'll switch user and inter-cache traffic away from eqiad:

  • The Upload cluster will be following similar instructions on the 14th during the Swift switch.
  • Maps and Misc clusters are not participating (low traffic, special issues, validated by the other moves)
  • This leaves just the text cluster to operate on below:
  1. Inter-Cache: Switch codfw from 'eqiad' to 'direct' in cache::route_table for the text cluster.
  2. Users: De-pool eqiad in GeoDNS for the text cluster.
  3. Inter-Cache: Switch esams from 'eqiad' to 'codfw' in cache::route_table for the text cluster.
  4. Inter-Cache: Switch eqiad from 'direct' to 'codfw' in cache::route_table for the text cluster.

Before reverting the applayer services to eqiad, we'll undo the above steps in reverse order (a spot-check sketch follows this list):

  1. Inter-Cache: Switch eqiad from 'codfw' to 'direct' in cache::route_table for all clusters.
  2. Inter-Cache: Switch esams from 'codfw' to 'eqiad' in cache::route_table for all clusters.
  3. Users: Re-pool eqiad in GeoDNS.
  4. Inter-Cache: Switch codfw from 'direct' to 'eqiad' in cache::route_table for all clusters.
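
For each of the route_table and GeoDNS steps above, a hedged sketch of applying and checking the change on the text cluster, mirroring the cache_upload commands in the Media storage/Swift section (batch size, timeout, record names and nameserver are assumptions):

  # Force a puppet run on the text caches in the affected site (here codfw)
  salt -v -t 10 -b 17 -C 'G@cluster:cache_text and G@site:codfw' cmd.run 'puppet agent --test'

  # Spot-check GeoDNS after de-pooling / re-pooling eqiad for the text cluster
  dig +short text-lb.codfw.wikimedia.org
  dig +short en.wikipedia.org @ns0.wikimedia.org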

Services

  • RESTBase and Parsoid already active in codfw, using eqiad MW API.
  • Shift traffic to codfw:
    • Public traffic: Update Varnish backend config.
    • Update RESTBase and Flow configs in mediawiki-config to use codfw.
  • During MW switch-over:
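
A hedged spot check that RESTBase in codfw is answering, useful before and after the Varnish backend update in the traffic-shift bullets above; the service hostname, port and page title are assumptions:

  # Expect a 200 from the codfw RESTBase endpoint
  curl -s -o /dev/null -w '%{http_code}\n' \
       'http://restbase.svc.codfw.wmnet:7231/en.wikipedia.org/v1/page/html/Main_Page'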

Other miscellaneous