Difference between revisions of "Switch Datacenter"

From Wikitech-static
imported>Giuseppe Lavagetto
imported>Jcrespo
(→‎MediaWiki-related: Database overview steps)
Line 1: Line 1:
This page documents all the steps needed to switch over from a master datacenter to a second one.
== Introduction ==
A datacenter switchover (from eqiad to codfw, or vice-versa) comprises switching over multiple different components, some of which can happen independently and many of which need to happen in lockstep. This page documents all the steps needed to switch over from a master datacenter to another one, broken up by component.


=== List of systems that need to switchover ===
=== Schedule for Q3 FY2015-2016 rollout ===
Here are the links to the individual switchover procedures:
* Deployment server: Jan 20th
* [[Switch Datacenter/DeploymentServer|DeploymentServer]]
* Media storage/Swift: March 10th
* Traffic: Mar 10th
* ElasticSearch:
* Services:
* MediaWiki: March 22nd (fallback: April 18th)
 
== Per-service switchover instructions ==
 
=== MediaWiki-related ===
 
Before switch over (after any local testing within codfw):
* Wipe memcached to prevent stale values (MediaWiki isn't ready for Multi-DC yet).
** Once eqiad is read-only and the codfw read-only master/slaves are caught up, any subsequent requests to codfw that set memcached are fine.
 
The overall plan for an eqiad->codfw switchover is:
* Warm up codfw databases
* Warm up memcached with parsercache entries
* Deploy mediawiki-config with all shards set to read-only
* Set eqiad databases (masters) in read-only mode; stop pt-heartbeat (?)
* Redis failover
* Deploy Varnish to switch backend to appserver.svc.codfw.wmnet/api.svc.codfw.wmnet
* Deploy services to change their MediaWiki action API endpoint
** RESTBase (uses puppet's $mw_primary, needs puppet run + service restart)
** Parsoid (manages its own config, will need a deploy + restart of its own)
** Mobileapps (no manual intervention needed)
* Master swap for every core (s1-7), ES (es1-3), parsercache (pc) and External store (x1) database
** Technically there is nothing to do at database level once circular replication is set up
** In practice, a few small deploys are needed: set the codfw masters' mysql to read-write and start pt-heartbeat-wikimedia there; change the *-master dns entries to the new masters (only used by humans); optionally, puppetize $master = true
* Deploy mediawiki-config (only codfw?) with all shards set to read-write
 
The plan for switching back is the reverse of the above, with the following extra step:
* Wipe memcached to clear cached values that have been invalidated in the meantime
 
==== Databases ====
See the separate page on [[MariaDB/troubleshooting#Depooling_a_master_.28a.k.a._promoting_a_new_slave_to_master.29|how to promote a new slave to master]].
 
==== Job queue ====


==== Mediawiki-related ====
* [[MariaDB/troubleshooting#Depooling_a_master_.28a.k.a._promoting_a_new_slave_to_master.29|Databases]]
The switchover of the following subsystems is driven by the puppet variable $mw_primary and therefore depends on puppet:
* [[Switch_Datacenter/MemcachedRedisSessions]]
==== Debugging ====
You can force Varnish to pass a request to a backend in CODFW or EQIAD using the [[X-Wikimedia-Debug]] header.
For CODFW, use <code>X-Wikimedia-Debug: backend=mw2017.codfw.wmnet</code>
For EQIAD, use <code>X-Wikimedia-Debug: backend=mw1017.eqiad.wmnet</code>
=== Media storage/Swift ===
==== Originals ====
* Instruct mediawiki to write synchronously to both eqiad/codfw with https://gerrit.wikimedia.org/r/276071
* Change varnish backends for app <tt>swift</tt> in <tt>hieradata/common/cache/upload.yaml</tt> to point to codfw
* Change <tt>route_table</tt> in <tt>hieradata/role/common/cache/upload.yaml</tt> so that codfw is direct.
* Change <tt>route_table</tt> in <tt>hieradata/role/common/cache/upload.yaml</tt> so that eqiad points to codfw.
==== Thumbnails ====
* Point swift <tt>rewrite</tt> middleware to codfw with https://gerrit.wikimedia.org/r/#/c/268080/ and run puppet on <tt>ms-fe*</tt> plus <tt>swift-init all reload</tt> for the changes to take effect. Now 404s for thumbs will hit codfw imagescalers.
* Change varnish backends for app <tt>rendering</tt> in <tt>hieradata/common/cache/text.yaml</tt> to point to codfw
* Change varnish backends for app <tt>swift_thumbs</tt> in <tt>hieradata/common/cache/upload.yaml</tt> to point to codfw. User traffic for thumbs now hits swift in codfw.
=== ElasticSearch ===
=== Traffic ===
==== GeoDNS user routing ====
* Traffic-layer only, no interdependencies elsewhere
* Granularity is per-cache-cluster (misc, maps, text, upload)
* Documented at: https://wikitech.wikimedia.org/wiki/Global_Traffic_Routing#GeoDNS
==== Inter-Cache routing ====
* Traffic-layer only, no interdependencies elsewhere
* Granularity is per-cache-cluster (misc, maps, text, upload)
* Documented at: https://wikitech.wikimedia.org/wiki/Global_Traffic_Routing#Inter-Cache_Routing
==== Cache->App routing ====
* Normally will have inter-dependencies with application-level work
* Granularity is per-application-service (how they're defined at the back end of varnish)
* Documented at: https://wikitech.wikimedia.org/wiki/Global_Traffic_Routing#Cache-to-Application_Routing
=== Services ===
* RESTBase and Parsoid already active in codfw, using eqiad MW API.
* Shift traffic to codfw:
** Public traffic: Update Varnish backend config.
** Update RESTBase and Flow configs in mediawiki-config to use codfw.
* During MW switch-over:
** Update RESTBase and Parsoid to use MW API in codfw, either using puppet / Parsoid deploy, or DNS. See [[phab:T125069|https://phabricator.wikimedia.org/T125069]].
* [[phab:T127974|Tracker / checklist]]
=== Other miscellaneous ===
* [[Switch Datacenter/DeploymentServer|Deployment server]]
* EventLogging
* IRC/RCstream

Revision as of 20:44, 9 March 2016

Introduction

A datacenter switchover (from eqiad to codfw, or vice-versa) comprises switching over multiple different components, some of which can happen independently and many of which need to happen in lockstep. This page documents all the steps needed to switch over from a master datacenter to another one, broken up by component.

Schedule for Q3 FY2015-2016 rollout

  • Deployment server: Jan 20th
  • Media storage/Swift: March 10th
  • Traffic: Mar 10th
  • ElasticSearch:
  • Services:
  • MediaWiki: March 22nd (fallback: April 18th)

Per-service switchover instructions

MediaWiki-related

Before switch over (after any local testing within codfw):

  • Wipe memcached to prevent stale values (MediaWiki isn't ready for Multi-DC yet).
    • Once eqiad is read-only and the codfw read-only master/slaves are caught up, any subsequent requests to codfw that set memcached are fine.

The overall plan for an eqiad->codfw switchover is:

  • Warm up codfw databases
  • Warm up memcached with parsercache entries
  • Deploy mediawiki-config with all shards set to read-only
  • Set eqiad databases (masters) in read-only mode; stop pt-heartbeat (?)
  • Redis failover
  • Deploy Varnish to switch backend to appserver.svc.codfw.wmnet/api.svc.codfw.wmnet
  • Deploy services to change their MediaWiki action API endpoint
    • RESTBase (uses puppet's $mw_primary, needs puppet run + service restart)
    • Parsoid (manages its own config, will need a deploy + restart of its own)
    • Mobileapps (no manual intervention needed)
  • Master swap for every core (s1-7), ES (es1-3), parsercache (pc) and External store (x1) database
    • Technically there is nothing to do at database level once circular replication is set up
    • In practice, a few small deploys are needed: set the codfw masters' mysql to read-write and start pt-heartbeat-wikimedia there; change the *-master dns entries to the new masters (only used by humans); optionally, puppetize $master = true
  • Deploy mediawiki-config (only codfw?) with all shards set to read-write

The plan for switching back is the reverse of the above, with the following extra step:

  • Wipe memcached to clear cached values that have been invalidated in the meantime
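
The ordering in the eqiad->codfw plan above matters: writes are stopped before the master swap and only re-enabled afterwards. A minimal Python sketch of checking that ordering is shown here. The step names are shorthand invented for this example, and the three constraints are one reading of the plan, not an official checklist.

```python
# Step names are hypothetical shorthand for the bullets in the plan above.
PLAN = [
    "warm_up_codfw_databases",
    "warm_up_memcached",
    "deploy_config_read_only",
    "set_eqiad_masters_read_only",
    "redis_failover",
    "switch_varnish_backends",
    "switch_service_api_endpoints",
    "master_swap",
    "deploy_config_read_write",
]

# Assumed constraints: each pair means (earlier step, later step).
MUST_PRECEDE = [
    ("deploy_config_read_only", "set_eqiad_masters_read_only"),
    ("set_eqiad_masters_read_only", "master_swap"),
    ("master_swap", "deploy_config_read_write"),
]


def ordering_violations(plan, constraints):
    """Return the (earlier, later) pairs that the plan runs out of order."""
    position = {step: index for index, step in enumerate(plan)}
    return [
        (earlier, later)
        for earlier, later in constraints
        if position[earlier] >= position[later]
    ]
```

Calling ordering_violations(PLAN, MUST_PRECEDE) returns an empty list for the plan as written; moving the read-write deploy ahead of the master swap would be reported as a violation.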

Databases

See the separate page on how to promote a new slave to master.

Job queue

The switchover of the following subsystems is driven by the puppet variable $mw_primary and therefore depends on puppet.

Debugging

You can force Varnish to pass a request to a backend in CODFW or EQIAD using the X-Wikimedia-Debug header.

For CODFW, use X-Wikimedia-Debug: backend=mw2017.codfw.wmnet

For EQIAD, use X-Wikimedia-Debug: backend=mw1017.eqiad.wmnet
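
As a rough illustration, the header values above can be attached to a request from Python's standard library. The helper names and the example URL are assumptions made for this sketch; only the header name and the two backend hosts come from the text.

```python
import urllib.request

# Backend hosts quoted in the text above.
DEBUG_BACKENDS = {
    "codfw": "mw2017.codfw.wmnet",
    "eqiad": "mw1017.eqiad.wmnet",
}


def debug_header(datacenter):
    """Return the X-Wikimedia-Debug header for the given datacenter."""
    return {"X-Wikimedia-Debug": "backend=" + DEBUG_BACKENDS[datacenter]}


def build_debug_request(url, datacenter):
    """Build (but do not send) a GET request pinned to a debug backend."""
    return urllib.request.Request(url, headers=debug_header(datacenter))


# Usage (hypothetical URL; uncomment to actually send the request):
# req = build_debug_request("https://en.wikipedia.org/wiki/Main_Page", "codfw")
# with urllib.request.urlopen(req) as response:
#     print(response.status)
```

Building the request separately from sending it makes it easy to inspect the header before any traffic reaches the backend.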

Media storage/Swift

Originals

  • Instruct mediawiki to write synchronously to both eqiad/codfw with https://gerrit.wikimedia.org/r/276071
  • Change varnish backends for app swift in hieradata/common/cache/upload.yaml to point to codfw
  • Change route_table in hieradata/role/common/cache/upload.yaml so that codfw is direct.
  • Change route_table in hieradata/role/common/cache/upload.yaml so that eqiad points to codfw.
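
The effect of the two route_table changes can be sketched in Python by modelling the table as a mapping from each datacenter to its next hop. This is a simplified model, not the hieradata format itself: the "direct" marker, the dictionary layout, and the assumed starting state (codfw routed through eqiad) are illustrative assumptions.

```python
DIRECT = "direct"

# Assumed state before the switch: codfw caches fetch through eqiad.
BEFORE = {"eqiad": DIRECT, "codfw": "eqiad"}
# State after both changes above: codfw is direct, eqiad points to codfw.
AFTER = {"eqiad": "codfw", "codfw": DIRECT}


def route_path(route_table, start):
    """Return the datacenters a request passes through before going direct."""
    path = [start]
    while route_table[path[-1]] != DIRECT:
        next_hop = route_table[path[-1]]
        if next_hop in path:
            # Two datacenters pointing at each other would loop forever.
            raise ValueError("routing loop: " + " -> ".join(path + [next_hop]))
        path.append(next_hop)
    return path
```

Under these assumptions, applying only the second change (eqiad points to codfw) while codfw still points to eqiad would produce a loop, which is one reason to make codfw direct first.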

Thumbnails

  • Point swift rewrite middleware to codfw with https://gerrit.wikimedia.org/r/#/c/268080/ and run puppet on ms-fe* plus swift-init all reload for the changes to take effect. Now 404s for thumbs will hit codfw imagescalers.
  • Change varnish backends for app rendering in hieradata/common/cache/text.yaml to point to codfw
  • Change varnish backends for app swift_thumbs in hieradata/common/cache/upload.yaml to point to codfw. User traffic for thumbs now hits swift in codfw.

ElasticSearch

Traffic

GeoDNS user routing

Inter-Cache routing

Cache->App routing

Services

  • RESTBase and Parsoid already active in codfw, using eqiad MW API.
  • Shift traffic to codfw:
    • Public traffic: Update Varnish backend config.
    • Update RESTBase and Flow configs in mediawiki-config to use codfw.
  • During MW switch-over:

Other miscellaneous