Switch Datacenter

Introduction

A datacenter switchover (from eqiad to codfw, or vice versa) comprises switching over multiple components, some of which can be switched independently and many of which need to happen in lockstep. This page documents all the steps needed to switch over from one master datacenter to the other, broken up by component.

Schedule for Q3 FY2015-2016 rollout

  • Deployment server: Wednesday, January 20th
  • Media storage/Swift: Thursday, April 14th 17:00 UTC
  • Traffic: Thursday, March 10th
  • MediaWiki 5-minute read-only test: Tuesday, March 15th 07:00 UTC
  • Services: Thursday, March 17th, 10:00 UTC
  • Services (second test): (week of March 28th)
  • ElasticSearch: Thursday, April 7th, 12:00 UTC
  • MediaWiki: Tuesday, April 19th, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)

Switching back

  • MediaWiki: Thursday, April 21st, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)
  • Services, ElasticSearch, Traffic, Swift, Deployment server: Thursday, April 21st, after the above is done

Per-service switchover instructions

MediaWiki-related

Before switching over (after any local testing within codfw):

  • Wipe memcached to prevent stale values (MediaWiki isn't ready for Multi-DC yet); see the sketch after this list.
    • Once eqiad is read-only and the codfw (read-only) master/slaves have caught up, any subsequent requests to codfw that set memcached are fine.
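
A hypothetical sketch of the wipe itself, assuming memcached is simply restarted across the cluster (the grain and service name are assumptions, not the recorded procedure):

 # Restart memcached everywhere to drop cached values; batched so the
 # databases are not hammered all at once. Grain name is an assumption.
 salt -b 10% -G 'cluster:memcached' cmd.run 'service memcached restart'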

The overall plan for an eqiad->codfw switchover is:

  • Warm up the codfw databases
  • Warm up memcached with parsercache entries
  • Stop jobqueues in eqiad: cherry-pick https://gerrit.wikimedia.org/r/282880 and run salt -C 'G@cluster:jobrunner and G@site:eqiad' cmd.run 'service jobrunner stop; service jobchron stop;'
  • Deploy mediawiki-config with all shards set to read-only in eqiad: revive https://gerrit.wikimedia.org/r/#/c/277462/
  • Set eqiad databases (masters) in read-only mode; stop pt-heartbeat (?)
  • Switch the datacenter in puppet: set $app_routes['mediawiki'] = 'codfw' in puppet (cherry-pick https://gerrit.wikimedia.org/r/282898) and $wmfMasterDatacenter in mediawiki-config (https://gerrit.wikimedia.org/r/#/c/282897/). This has several consequences:
    • Redis replication will be flowing codfw => eqiad once puppet has run, first in codfw (salt 'mc2*' cmd.run 'puppet agent -t'; salt 'rdb2*' cmd.run 'puppet agent -t') and then in eqiad (salt 'mc1*' cmd.run 'puppet agent -t'; salt 'rdb1*' cmd.run 'puppet agent -t'); a verification sketch follows this list
    • RESTBase (uses puppet's $app_routes, needs a puppet run + service restart): salt -b10% -G 'cluster:restbase' cmd.run 'puppet agent -t; service restbase restart'
    • All other services will be automatically reconfigured whenever puppet runs after $app_routes is modified: salt 'sc*' cmd.run 'puppet agent -t'
  • Switch Parsoid's action API endpoint (Parsoid manages its own config and needs a deploy + restart of its own): merge https://gerrit.wikimedia.org/r/#/c/282904/
  • Deploy Varnish to switch backends to appserver.svc.codfw.wmnet/api.svc.codfw.wmnet: merge https://gerrit.wikimedia.org/r/#/c/282910/ and run puppet on all cache_text hosts
  • Master swap for every core (s1-7), ES (es1-3), parsercache (pc) and External store (x1) database
    • Technically there is nothing to do at the database level once circular replication is set up
    • In reality, a few small deploys: set the codfw MySQL masters read-write and start pt-heartbeat-wikimedia there; change the *-master DNS entries to the new masters (only used by humans); optionally, puppetize $master = true
  • Deploy mediawiki-config (only codfw?) with all shards set to read-write
  • Start the jobqueue in codfw: cherry-pick https://gerrit.wikimedia.org/r/282881 and run salt -b 6 -C 'G@cluster:jobrunner and G@site:codfw' cmd.run 'puppet agent -t'
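
To confirm the new redis replication direction after the puppet runs above, standard redis introspection works; the host name below is illustrative, and any authentication flags are omitted:

 # On an eqiad redis host the role should now be 'slave', with a codfw master.
 redis-cli -h rdb1001.eqiad.wmnet info replication | grep -E '^(role|master_host)'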

The plan for switching back is the reverse of the above, with the following extra step:

  • Wipe memcached to clear invalidated cached memory

Databases

See the separate page on how to promote a new slave to master. Note that no topology change happens on this failover, so very little from that page applies.
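
The read-only flips themselves are plain MySQL; a minimal sketch, with placeholder host names (at WMF these steps are wrapped in the small deploys mentioned above):

 # Put the eqiad master in read-only mode (host names are placeholders):
 mysql -h db1052.eqiad.wmnet -e "SET GLOBAL read_only = 1;"
 # Check that the codfw master has caught up on replication...
 mysql -h db2016.codfw.wmnet -e "SHOW SLAVE STATUS\G" | grep -E 'Seconds_Behind_Master|Slave_SQL_Running'
 # ...then make it writable:
 mysql -h db2016.codfw.wmnet -e "SET GLOBAL read_only = 0;"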

Job queue

  • Jobrunners in eqiad get stopped. This is done by setting jobrunner_state: 'stopped' in hiera
  • mediawiki goes read-only; this should ensure no new jobs get enqueued.
  • $mw_primary is set to 'codfw'
  • mediawiki primary gets switched in mediawiki-config
  • force a puppet run on the codfw redis hosts and on eqiad hosts after that: salt 'rdb2*' cmd.run 'puppet agent --enable; puppet agent -t'
  • mediawiki is set read-write in codfw
  • We start the jobrunners in codfw and they will consume the jobs left over from eqiad. This is done by setting jobrunner_state: 'running'; a sketch of the hiera toggle follows.
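
As a sketch, the hiera toggle and rollout look like this (the exact hieradata file paths are assumptions; the key, values and salt command are the ones used above):

 # hieradata, eqiad jobrunners (file path is an assumption):
 #   jobrunner_state: 'stopped'
 # hieradata, codfw jobrunners:
 #   jobrunner_state: 'running'
 # Push the change out with puppet, batched as in the plan above:
 salt -b 6 -C 'G@cluster:jobrunner and G@site:codfw' cmd.run 'puppet agent -t'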

Debugging

You can force Varnish to pass a request to a backend in codfw or eqiad using the X-Wikimedia-Debug header.

For codfw, use X-Wikimedia-Debug: backend=mw2017.codfw.wmnet

For eqiad, use X-Wikimedia-Debug: backend=mw1017.eqiad.wmnet
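
For example, to send a single request to a codfw backend and inspect the response headers (the URL is just an example page):

 curl -sI -H 'X-Wikimedia-Debug: backend=mw2017.codfw.wmnet' \
     'https://en.wikipedia.org/wiki/Special:BlankPage'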

Media storage/Swift

Ahead of the switchover, originals and thumbs

  • Instruct mediawiki to write synchronously to both eqiad/codfw with https://gerrit.wikimedia.org/r/#/c/282888/
  • Change varnish backends for swift and swift_thumbs to point to codfw with https://gerrit.wikimedia.org/r/#/c/282890/
  • Force a puppet run on cache_upload in eqiad:
 salt -v -t 10 -b 17 -C 'G@cluster:cache_upload and G@site:eqiad' cmd.run 'puppet agent --test'
  • Change route_table in hieradata/role/common/cache/upload.yaml so that codfw is direct with https://gerrit.wikimedia.org/r/#/c/282891/
  • Force a puppet run on cache_upload:
 salt -v -t 10 -b 17 -C 'G@cluster:cache_upload' cmd.run 'puppet agent --test'
  • Change route_table in hieradata/role/common/cache/upload.yaml so that eqiad points to codfw with https://gerrit.wikimedia.org/r/#/c/282892/
  • Force a puppet run on cache_upload:
 salt -v -t 10 -b 17 -C 'G@cluster:cache_upload' cmd.run 'puppet agent --test'
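
To spot-check where upload traffic is served from after these changes, the X-Cache response header can be inspected; the URL path below is a placeholder for any known original:

 # After the route_table changes, cp2* (codfw) hosts should appear in X-Cache.
 curl -sI 'https://upload.wikimedia.org/<path-to-any-original>' | grep -i '^x-cache'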

Once mediawiki has been switched to codfw

  • Point swift rewrite middleware to codfw with https://gerrit.wikimedia.org/r/#/c/268080/ and run puppet on ms-fe* plus swift-init all reload for the changes to take effect. Now 404s for thumbs will hit codfw imagescalers.
  • Change varnish backends for app rendering in hieradata/common/cache/text.yaml to point to codfw with https://gerrit.wikimedia.org/r/282893

ElasticSearch

Point CirrusSearch to codfw by editing wmgCirrusSearchDefaultCluster in InitialiseSettings.php. The default value is "local", which means that if mediawiki switches DC, everything should be automatic.
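
A sketch of what pinning the cluster explicitly could look like in mediawiki-config (the exact array shape in InitialiseSettings.php is an assumption):

 // Pin CirrusSearch to codfw instead of the default 'local' (shape is an assumption).
 'wmgCirrusSearchDefaultCluster' => [
     'default' => 'codfw',
 ],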

Traffic

GeoDNS user routing

Inter-Cache routing

Cache->App routing

Services

  • RESTBase and Parsoid are already active in codfw, using the eqiad MW API.
  • Shift traffic to codfw:
    • Public traffic: Update Varnish backend config.
    • Update RESTBase and Flow configs in mediawiki-config to use codfw.
  • During MW switch-over:

Other miscellaneous