Switch Datacenter

{{See|See also the blog posts from the "[https://blog.wikimedia.org/2016/04/11/wikimedia-failover-test/ 2016 Failover]" and "[https://blog.wikimedia.org/2017/04/18/codfw-temporary-editing-pause/ 2017 Failover]" on the Wikimedia Blog.}}


== Introduction ==
A datacenter switchover (from eqiad to codfw, or vice versa) comprises switching over multiple components, some of which can happen independently and many of which need to happen in lockstep. This page documents all the steps needed to switch over from one master datacenter to the other, broken up by component.


=== Schedule for 2018 switch ===

* Services: Tuesday, September 11th 2018 14:30 UTC
* Media storage/Swift: Tuesday, September 11th 2018 15:00 UTC
* Traffic: Tuesday, September 11th 2018 19:00 UTC
* MediaWiki: '''Wednesday, September 12th 2018 14:00 UTC''' (user visible, requires read-only mode)


'''Switching back:'''
* Traffic: Wednesday, October 10th 2018 09:00 UTC
* MediaWiki: '''Wednesday, October 10th 2018 14:00 UTC''' (user visible, requires read-only mode)
* Services: Thursday, October 11th 2018 14:30 UTC
* Media storage/Swift: Thursday, October 11th 2018 15:00 UTC


__TOC__

== Per-service switchover instructions ==
=== MediaWiki ===
We divide the process into logical phases that should be executed sequentially. Within any phase, top-level tasks can be executed in parallel, while subtasks must be executed sequentially. The phase number is referred to in the names of the tasks in the [[gerrit:plugins/gitiles/operations/cookbooks/|operations/cookbooks]] repository, in the [[gerrit:plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/|cookbooks/sre/switchdc/mediawiki/]] path.


==== Days in advance preparation ====
# '''OPTIONAL: SKIP IN AN EMERGENCY''': Make sure databases are in a good state. Normally this requires no action, as the passive datacenter's databases are always prepared to receive traffic, so there are no actionables. Some sanity checks DBAs should run to ensure the most optimal state possible:
#* There is no ongoing long-running maintenance that affects database availability or lag (schema changes, upgrades, hardware issues, etc.). Depool those servers that are not ready.
#* Replication is flowing from eqiad -> codfw and from codfw -> eqiad (sometimes replication gets stopped in the passive -> active direction to facilitate maintenance).
#* All database servers have their buffer pools filled up. This is taken care of automatically by the [[MariaDB/buffer_pool_dump|automatic buffer pool warmup functionality]]. As a sanity check, some sample load can be sent to the MediaWiki application servers to verify that requests are served as quickly as in the active datacenter.
#* These were the [[Switch_Datacenter/planned_db_maintenance#2018_Switch_Datacenter|things we prepared/checked for the 2018 switch]].
# Prepare puppet patches:
#* Switch cache::app_directors backends from old_site-active to new_site-active ([https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458772/ eqiad->codfw]; [https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458773/ codfw->eqiad])
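The bidirectional replication check above can be sketched as a pure function over pt-heartbeat-style timestamps. This is an illustrative sketch only: the 10-second threshold, the direction labels, and the data shape are assumptions, not what the actual tooling uses.

```python
import datetime

# Illustrative sketch: replication is "flowing" in a direction if the newest
# heartbeat replicated in that direction is recent enough. 10s is an assumed
# threshold, not an operational value.
MAX_LAG = datetime.timedelta(seconds=10)

def replication_flowing(heartbeats, now):
    """heartbeats maps a direction label (e.g. 'eqiad->codfw') to the newest
    heartbeat timestamp seen on the replica side of that direction."""
    stale = {d: (now - ts) for d, ts in heartbeats.items() if now - ts > MAX_LAG}
    return not stale, stale

now = datetime.datetime(2018, 9, 11, 14, 0, 0)
ok, stale = replication_flowing({
    "eqiad->codfw": now - datetime.timedelta(seconds=1),
    "codfw->eqiad": now - datetime.timedelta(minutes=5),  # stopped for maintenance
}, now)
# ok is False here: the codfw->eqiad direction must be restarted before the switch
```

Both directions matter because the old site becomes the passive replica after the switchover.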


==== Phase 0 - preparation ====
# Disable puppet on maintenance hosts and cache::text in both eqiad and codfw: [[gerrit:plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/00-disable-puppet.py|00-disable-puppet.py]]
# Reduce the TTL on <tt>appservers-rw, api-rw, jobrunner, videoscaler</tt> to 10 seconds: [[gerrit:plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/00-reduce-ttl.py|00-reduce-ttl.py]] For operational reasons (to be able to run multiple steps one after the other), this step does not wait for the old TTL (300s) to expire. '''Make sure that at least 5 minutes have passed before moving to Phase 1.'''
# Warm up APC by running the mediawiki-cache-warmup on the new site's clusters: [[gerrit:plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/00-wipe-and-warmup-caches.py|00-wipe-and-warmup-caches.py]]
#* Restart all HHVM servers in the new site to clear the APC cache
#* The global warmup against the appservers cluster
#* The apc-warmup against all hosts in the appservers cluster
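The 5-minute rule in step 2 follows directly from DNS caching: resolvers may keep serving the old answer for up to the old TTL after it is lowered. A trivial sketch of that constraint, using only the values quoted above (nothing here comes from the cookbook itself):

```python
import datetime

# Resolvers can cache the discovery records for up to the *old* TTL, so the
# lowered 10s TTL is only fully effective once the old 300s TTL has drained.
OLD_TTL = datetime.timedelta(seconds=300)

def safe_to_start_phase_1(ttl_lowered_at):
    """Earliest time at which no resolver can still hold the old answer."""
    return ttl_lowered_at + OLD_TTL

lowered_at = datetime.datetime(2018, 9, 12, 14, 0, 0)
deadline = safe_to_start_phase_1(lowered_at)  # 14:05:00 UTC at the earliest
```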


==== Phase 1 - stop maintenance and merge traffic changes ====
# Stop maintenance jobs in the ''active site'' and kill all the cronjobs on the maintenance host in the active site: [[gerrit:plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/01-stop-maintenance.py|01-stop-maintenance.py]]
# Merge and puppet-merge the traffic change for text caches: [https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458772/ eqiad->codfw]; [https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458773/ codfw->eqiad]. '''This is not covered by the switchdc script.''' Note: this does not yet switch traffic, since puppet is disabled.


==== Phase 2 - read-only mode ====
# Go to read-only mode by changing the <code>ReadOnly</code> conftool value: [[gerrit:plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/02-set-readonly.py|02-set-readonly.py]]


==== Phase 3 - lock down database masters ====
# Put old-site core DB masters (shards: s1-s8, x1, es2-es3) in read-only mode and wait for the new site's databases to catch up replication: [[gerrit:plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/03-set-db-readonly.py|03-set-db-readonly.py]]
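The catch-up condition being waited on can be sketched minimally: once the old masters are read-only, no new writes arrive, so the new site has caught up when it has executed the same binlog coordinates. The `(file, position)` tuples below are invented for illustration.

```python
# Minimal sketch of the replication catch-up check. Coordinates are
# illustrative (binlog file name, byte offset) tuples; tuple comparison
# orders by file first, then offset.
def caught_up(old_master_pos, new_master_pos):
    return new_master_pos >= old_master_pos

old_master = ("db-bin.002047", 981235)
assert not caught_up(old_master, ("db-bin.002047", 850000))  # still applying events
assert caught_up(old_master, ("db-bin.002047", 981235))      # in sync, safe to proceed
```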


==== Phase 4 - switch active datacenter configuration ====
# Send the traffic layer to active-active: [[gerrit:plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/04-switch-traffic.py|04-switch-traffic.py]]
#* enable and run puppet on cache::text in $new_site. This starts the active-active traffic phase (traffic will go to both MW clusters)
#* ensure that the change was applied on all hosts in $new_site
#* Run puppet on the text caches in $old_site. This ends the active-active phase.
# Switch the discovery records and MediaWiki active datacenter: [[gerrit:plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/04-switch-mediawiki.py|04-switch-mediawiki.py]]
#* Flip <tt>appservers-rw, api-rw, jobrunner, videoscaler</tt> to <tt>pooled=true</tt> in the new site. This will not actually change the DNS records for the active datacenter.
#* Flip <code>WMFMasterDatacenter</code> from the old site to the new.
#* Flip <tt>appservers-rw, api-rw, jobrunner, videoscaler</tt> to <tt>pooled=false</tt> in the old site. After this, the DNS records for the old DC will be changed and internal applications (but not MediaWiki) will start hitting the new DC.
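The ordering of the <tt>pooled</tt> flips above matters: pooling the new site before depooling the old one creates a brief active-active window and guarantees the discovery records never resolve to zero datacenters. A toy model of that invariant (conftool itself is not modeled here):

```python
# Toy model of the pooling-order invariant: at least one datacenter must be
# pooled at every step, which is why the new site is pooled *before* the old
# one is depooled.
pooled = {"eqiad": True, "codfw": False}

def flip(site, state):
    pooled[site] = state
    assert any(pooled.values()), "discovery record must never go empty"
    return dict(pooled)

after_pool = flip("codfw", True)     # active-active window: both sites pooled
after_depool = flip("eqiad", False)  # only the new site remains pooled
```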


==== Phase 5 - Invert Redis replication for MediaWiki sessions ====
# Invert the Redis replication for the <code>sessions</code> cluster: [[gerrit:plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/05-invert-redis-sessions.py|05-invert-redis-sessions.py]]
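In Redis terms, "inverting" replication for a shard means promoting the replica in the new site and re-pointing the old master at it (standard <code>SLAVEOF</code> semantics). The sketch below only builds that command plan; the hostnames and port are invented for illustration and nothing here is taken from the cookbook.

```python
# Hypothetical sketch: build the SLAVEOF command plan that inverts replication
# for each sessions shard. Hostnames and the port are invented examples.
def inversion_plan(shards, new_master_site="codfw", port=6379):
    plan = []
    for shard in shards:
        new_master = shard[new_master_site]
        old_master = next(h for s, h in shard.items() if s != new_master_site)
        plan.append((new_master, "SLAVEOF NO ONE"))                # promote replica
        plan.append((old_master, f"SLAVEOF {new_master} {port}"))  # re-point old master
    return plan

plan = inversion_plan([{"eqiad": "rdb1001", "codfw": "rdb2001"}])
# [('rdb2001', 'SLAVEOF NO ONE'), ('rdb1001', 'SLAVEOF rdb2001 6379')]
```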


==== Phase 6 - Set new site's databases to read-write ====
# Set new-site's core DB masters (shards: s1-s8, x1, es2-es3) in read-write mode: [[gerrit:plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/06-set-db-readwrite.py|06-set-db-readwrite.py]]


==== Phase 7 - Set MediaWiki to read-write ====
# Go to read-write mode by changing the <code>ReadOnly</code> conftool value: [[gerrit:plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/07-set-readwrite.py|07-set-readwrite.py]]


==== Phase 8 - post read-only ====
# Start maintenance in the new DC: [[gerrit:plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/08-start-maintenance.py|08-start-maintenance.py]]
## Run puppet on the maintenance hosts
# Update tendril for new database masters: [[gerrit:plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/08-update-tendril.py|08-update-tendril.py]] This is a purely cosmetic change with no effect on production. No changes are required for the zarcillo database (which has a different master for eqiad and codfw).
# Restart parsoid: [[gerrit:plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/08-restart-parsoid.py|08-restart-parsoid.py]]
# Set the TTL for the DNS records to 300 seconds again: [[gerrit:plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/switchdc/mediawiki/08-restore-ttl.py|08-restore-ttl.py]]
# Update DNS records for new database masters deploying [https://gerrit.wikimedia.org/r/#/c/operations/dns/+/458787/ eqiad->codfw]; [https://gerrit.wikimedia.org/r/#/c/operations/dns/+/458790/ codfw->eqiad] '''This is not covered by the switchdc script'''

==== Phase 9 - verification and troubleshooting ====
'''This is not covered by the switchdc script.'''
# Make sure reading & editing works! :)
# Make sure recent changes are flowing (see Special:RecentChanges, [[EventStreams]], and the IRC feeds)
# Make sure email works (<code>exim4 -bp</code> on mx1001/mx2001, test an email)
==== Dashboards ====
* [https://grafana.wikimedia.org/dashboard/db/apache-hhvm?orgId=1 Apache/HHVM]
* [https://grafana.wikimedia.org/dashboard/db/mediawiki-application-servers?orgId=1 App servers]
* [https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-dc-stats?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=cache_upload&var-layer=backend Varnish backends]
* [https://logstash.wikimedia.org/app/kibana#/dashboard/Fatal-Monitor Fatalmonitor (mostly MediaWiki logs)]
* [https://logstash.wikimedia.org/app/kibana#/dashboard/87348b60-90dd-11e8-8687-73968bebd217 Database error logs]


=== Media storage/Swift ===


* Set temporary active/active for Swift:
# gerrit: C+2 and Submit commit https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458794
# <any puppetmaster>: <code>puppet-merge</code>
# <any cumin master>: <code>sudo cumin 'A:cp-upload_eqiad or A:cp-upload_codfw' 'run-puppet-agent'</code>
* The above must complete correctly and fully (applying the change)
* Set Swift to active/passive in codfw only:
# gerrit: C+2 and Submit commit https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458795
# <any puppetmaster>: <code>puppet-merge</code>
# <any cumin master>: <code>sudo cumin 'A:cp-upload_eqiad or A:cp-upload_codfw' 'run-puppet-agent'</code>


==== Switching back ====

Repeat the steps above in reverse order with the revert commits https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458796 and https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458797 respectively.
==== Dashboards ====
* [https://grafana.wikimedia.org/dashboard/file/swift.json?orgId=1 Swift eqiad]
* [https://grafana.wikimedia.org/dashboard/file/swift.json?orgId=1&var-DC=codfw Swift codfw]
* [https://grafana.wikimedia.org/dashboard/db/thumbor?orgId=1 Thumbor]


=== ElasticSearch ===


To ensure coherence in case of lost updates, a reindex of the pages modified during the switch can be done by following [[Search#Recovering_from_an_Elasticsearch_outage.2Finterruption_in_updates|Recovering from an Elasticsearch outage / interruption in updates]].
==== Dashboards ====
* [https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles-prometheus?orgId=1 ElasticSearch Percentiles]


=== Traffic ===


Inter-Cache Routing:
# gerrit: C+2 and Submit commit https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458808
# <any puppetmaster>: <code>puppet-merge</code>
# <any cumin master>: <code>sudo cumin 'A:cp-esams' 'run-puppet-agent -q'</code>


GeoDNS (User-facing) Routing:

# gerrit: C+2 and Submit commit https://gerrit.wikimedia.org/r/#/c/operations/dns/+/458806
# <any authdns node>: <code>authdns-update</code>


==== Switchback ====

Same procedure as above, reverting the commits specified. The switchback will happen in two stages over two days: first reverting the inter-cache routing, then the user routing, to minimize cold-cache issues. Respective changes: [https://gerrit.wikimedia.org/r/#/c/operations/dns/+/458807 GeoDNS], [https://gerrit.wikimedia.org/r/#/c/operations/dns/+/458809 Inter-Cache Routing].
==== Dashboards ====
* [https://grafana.wikimedia.org/dashboard/db/load-balancers?orgId=1 Load Balancers]
* [https://grafana.wikimedia.org/dashboard/db/frontend-traffic?refresh=1m&orgId=1 Frontend traffic]
* [https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-dc-stats?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=cache_upload&var-layer=backend Varnish eqiad]
* [https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-dc-stats?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=cache_upload&var-layer=backend Varnish codfw]


=== Traffic - Services ===
* cache_text:
** restbase (active/passive for now)
** cxserver (active/active)
** citoid (active/active)
** noc (active/active)
** pybal_config (active/active)
** wdqs (active/active)
** ores (active/active)
** eventstreams (active/active)
* cache_upload:
** kartotherian (active/active) - aka maps


==== Switchover ====


* Set temporary active/active for the active/passive services above:
# gerrit: C+2 and Submit commit https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458802
# <any frontend puppetmaster>: <code>puppet-merge</code>
# <any cumin master>: <code>sudo cumin 'A:cp-text_eqiad or A:cp-text_codfw' 'run-puppet-agent'</code>
* The above must complete correctly and fully (applying the change)
* Set all active/active (including temps above) to active/passive in codfw only:
# gerrit: C+2 and Submit commit https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458803
# <any frontend puppetmaster>: <code>puppet-merge</code>
# <any cumin master>: <code>sudo cumin '(A:cp-text or A:cp-upload) and (A:eqiad or A:codfw)' 'run-puppet-agent'</code>


==== Switchback ====


Reverse the above with the revert commits https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458804 and https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458805 respectively.
==== Dashboards ====
* [https://grafana.wikimedia.org/dashboard/db/eventstreams?refresh=1m&orgId=1 EventStreams]
* [https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&orgId=1 ORES]
* [https://grafana.wikimedia.org/dashboard/db/maps-performances?orgId=1 Maps]
* [https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?refresh=1m WDQS]


=== Services ===
All services are active/active in DNS discovery, apart from restbase, which needs special treatment. The procedure to fail over to one site only is the same for each of them:
# reduce the TTL of the DNS discovery records to 10 seconds
# depool the datacenter we're moving away from in confctl / discovery
# restore the original TTL
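The three steps above can be sketched as one function. The action strings are illustrative stand-ins for dnsdisc TTL changes and confctl depool operations, not real tooling output:

```python
# Illustrative stand-in for the generic per-service failover: the returned
# action list mirrors the three steps above, in order. TTL values are the
# ones used elsewhere on this page (10s lowered, 300s normal).
def failover_actions(service, from_site, ttl_low=10, ttl_normal=300):
    return [
        f"dnsdisc/{service}: set TTL={ttl_low}s",
        f"confctl/discovery: depool {service} in {from_site}",
        f"dnsdisc/{service}: restore TTL={ttl_normal}s",
    ]

actions = failover_actions("cxserver", "eqiad")
```

The TTL is lowered first so that, once the depool happens, clients converge on the remaining datacenter within seconds rather than minutes.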
* [[Switch Datacenter/DeploymentServer|Deployment server]]
* [[Analytics/Systems/EventLogging|EventLogging]]
* [[Irc.wikimedia.org|IRC]], <s>[[RCStream]]</s>, [[EventStreams]]


== Schedule of past switches ==
=== Schedule for 2017 switch ===
{{See|See also "[https://blog.wikimedia.org/2017/04/18/codfw-temporary-editing-pause/ Editing pause for failover test]" blog post (18 April 2017), on the Wikimedia Blog.}}
[[phab:T138810|T138810]] on Phabricator tracks tasks to be undertaken during the 2017 switch.
* Elasticsearch: elasticsearch is automatically following mediawiki switch
* Services: Tuesday, April 18th 2017 14:30 UTC
* Media storage/Swift: Tuesday, April 18th 2017 15:00 UTC
* Traffic: Tuesday, April 18th 2017 19:00 UTC
* MediaWiki: Wednesday, April 19th 2017 [https://www.timeanddate.com/worldclock/fixedtime.html?iso=20170419T14 14:00 UTC] (user visible, requires read-only mode)
* Deployment server: Wednesday, April 19th 2017 16:00 UTC
'''Switching back:'''
* Traffic: Pre-switchback in two phases: Mon May 1 and Tue May 2 (to avoid cold-cache issues Weds)
* MediaWiki:  Wednesday, May 3rd 2017 [https://www.timeanddate.com/worldclock/fixedtime.html?iso=20170503T14 14:00 UTC] (user visible, requires read-only mode)
* Elasticsearch: elasticsearch is automatically following mediawiki switch
* Services: Thursday, May 4th 2017 14:30 UTC
* Swift: Thursday, May 4th 2017 15:30 UTC
* Deployment server: Thursday, May 4th 2017 16:00 UTC


=== Schedule for 2016 switch ===
{{See|See also "[https://blog.wikimedia.org/2016/04/11/wikimedia-failover-test/ Wikimedia failover test]" blog post (11 April 2016), on the Wikimedia Blog.}}
* Deployment server: Wednesday, January 20th 2016
* Traffic: Thursday, March 10th 2016
* MediaWiki: Tuesday, April 19th 2016, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)


'''Switching back:'''
* MediaWiki: Thursday, April 21st 2016, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)
* Services, Elasticsearch, Traffic, Swift, Deployment server: Thursday, April 21st 2016, after the above is done

Revision as of 14:31, 12 September 2018

Introduction

A datacenter switchover (from eqiad to codfw, or vice-versa) comprises switching over multiple different components, some of which can happen independently and many of which need to happen in lockstep. This page documents all the steps needed to switch over from a master datacenter to another one, broken up by component.

Schedule for 2018 switch

  • Services: Tuesday, September 11th 2018 14:30 UTC
  • Media storage/Swift: Tuesday, September 11th 2018 15:00 UTC
  • Traffic: Tuesday, September 11th 2018 19:00 UTC
  • MediaWiki: Wednesday, September 12th 2018: 14:00 UTC

Switching back:

  • Traffic: Wednesday, October 10th 2018 09:00 UTC
  • MediaWiki: Wednesday, October 10th 2018: 14:00 UTC
  • Services: Thursday, October 11th 2018 14:30 UTC
  • Media storage/Swift: Thursday, October 11th 2018 15:00 UTC

Per-service switchover instructions

MediaWiki

We divide the process into logical phases that should be executed sequentially. Within any phase, top-level tasks can be executed in parallel with each other, while subtasks must be executed sequentially. The phase number is referenced in the names of the tasks in the operations/cookbooks repository, under the cookbooks/sre/switchdc/mediawiki/ path.

Days in advance preparation

  1. OPTIONAL: SKIP IN AN EMERGENCY: Make sure databases are in a good state. Normally this requires no action, as the passive datacenter's databases are always prepared to receive traffic, so there are no actionables. Some sanity checks the DBAs should run to ensure the most optimal state possible:
    • There is no ongoing long-running maintenance that affects database availability or lag (schema changes, upgrades, hardware issues, etc.). Depool any servers that are not ready.
    • Replication is flowing from eqiad -> codfw and from codfw -> eqiad (sometimes replication is stopped in the passive -> active direction to facilitate maintenance).
    • All database servers have their buffer pools filled up. This is taken care of automatically by the buffer pool warmup functionality. As a sanity check, some sample load can be sent to the MediaWiki application servers to verify that requests are served as quickly as in the active datacenter.
    • These were the things we prepared/checked for the 2018 switch.
  2. Prepare puppet patches:
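As an illustration, the replication-lag sanity check above can be modelled with heartbeat timestamps of the kind pt-heartbeat-style tooling provides; the function name and the 1-second threshold below are assumptions for the sketch, not part of the actual tooling:

```python
from datetime import datetime, timedelta

def replication_ok(master_heartbeat: datetime,
                   replica_heartbeat: datetime,
                   max_lag: timedelta = timedelta(seconds=1)) -> bool:
    """Replication is considered healthy when the replica has applied a
    heartbeat written on the master no more than max_lag ago."""
    return master_heartbeat - replica_heartbeat <= max_lag
```

Such a check would be run in both directions (eqiad -> codfw and codfw -> eqiad), since both replication streams must be flowing before the switch.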

Phase 0 - preparation

  1. Disable puppet on maintenance hosts and cache::text in both eqiad and codfw: 00-disable-puppet.py
  2. Reduce the TTL on appservers-rw, api-rw, jobrunner, videoscaler to 10 seconds: 00-reduce-ttl.py. For operational reasons (so that multiple steps can be run one after the other), this step does not wait for the old TTL (300s) to expire. Make sure that at least 5 minutes have passed before moving to Phase 1.
  3. Warm up APC by running the mediawiki-cache-warmup against the new site's clusters: 00-wipe-and-warmup-caches.py
    • Restart all HHVM servers in the new site to clear the APC cache
    • Run the global warmup against the appservers cluster
    • Run the apc-warmup against all hosts in the appservers cluster.
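The 5-minute wait demanded by step 2 can be expressed as a simple guard: resolvers may keep the records cached for up to the old TTL after the reduction. The function name and the 300 s default below are illustrative, not part of the cookbooks:

```python
def safe_to_proceed(reduced_at: float, now: float, old_ttl: int = 300) -> bool:
    """Resolvers may cache the discovery records for up to old_ttl seconds
    after the TTL reduction, so Phase 1 must not start before that window
    has fully elapsed."""
    return now - reduced_at >= old_ttl
```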

Phase 1 - stop maintenance and merge traffic changes

  1. Stop maintenance jobs in the active site and kill all the cronjobs on the maintenance host in the active site: 01-stop-maintenance.py
  2. Merge and puppet-merge the traffic change for text caches eqiad->codfw; codfw->eqiad. This is not covered by the switchdc script. Note: This does not yet switch traffic since puppet is disabled.

Phase 2 - read-only mode

  1. Go to read-only mode by changing the ReadOnly conftool value: 02-set-readonly.py

Phase 3 - lock down database masters

  1. Put old-site core DB masters (shards: s1-s8, x1, es2-es3) in read-only mode and wait for the new site's databases to catch up replication: 03-set-db-readonly.py

Phase 4 - switch active datacenter configuration

  1. Send the traffic layer to active-active: 04-switch-traffic.py
    • enable and run puppet on cache::text in $new_site. This starts the active-active traffic phase (traffic will go to both MW clusters)
    • ensure that the change was applied on all hosts in $new_site
    • Run puppet on the text caches in $old_site. This ends the active-active phase.
  2. Switch the discovery records and MediaWiki active datacenter: 04-switch-mediawiki.py
    • Flip appservers-rw, api-rw, jobrunner, videoscaler to pooled=true in the new site. This will not actually change the DNS records for the active datacenter.
    • Flip WMFMasterDatacenter from the old site to the new.
    • Flip appservers-rw, api-rw, jobrunner, videoscaler to pooled=false in the old site. After this, DNS will be changed for the old DC and internal applications (but not MediaWiki!) will start hitting the new DC.
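The ordering above matters: the new site is pooled before the old one is depooled, so the discovery records always have at least one pooled backend. A minimal sketch of that state transition follows; the data structures are illustrative, not confctl's:

```python
RECORDS = ("appservers-rw", "api-rw", "jobrunner", "videoscaler")

def switch_mediawiki(pooled: dict, old_site: str, new_site: str) -> dict:
    """pooled maps (record, site) -> bool. Pool the new site first, then
    depool the old one, so no record is ever left without a pooled site."""
    for record in RECORDS:
        pooled[(record, new_site)] = True
    # The WMFMasterDatacenter flip happens between the two loops.
    for record in RECORDS:
        pooled[(record, old_site)] = False
    return pooled
```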

Phase 5 - Invert Redis replication for MediaWiki sessions

  1. Invert the Redis replication for the sessions cluster: 05-invert-redis-sessions.py
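Conceptually this step turns the new site's Redis session instances into masters and makes the old site replicate from them. A toy model of the role swap (not the actual cookbook logic):

```python
def invert_redis_sessions(roles: dict, new_active: str) -> dict:
    """roles maps dc -> 'master' or 'replica'. After inversion, the new
    active DC holds the masters and every other DC replicates from it."""
    for dc in roles:
        roles[dc] = "master" if dc == new_active else "replica"
    return roles
```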

Phase 6 - Set new site's databases to read-write

  1. Set new-site's core DB masters (shards: s1-s8, x1, es2-es3) in read-write mode: 06-set-db-readwrite.py

Phase 7 - Set MediaWiki to read-write

  1. Go to read-write mode by changing the ReadOnly conftool value: 07-set-readwrite.py

Phase 8 - post read-only

  1. Start maintenance in the new DC: 08-start-maintenance.py
    1. Run puppet on the maintenance hosts
  2. Update tendril for new database masters: 08-update-tendril.py. This is a purely cosmetic change with no effect on production. No changes are required for the zarcillo database (which has a different master for eqiad and codfw).
  3. Restart parsoid: 08-restart-parsoid.py
  4. Set the TTL for the DNS records to 300 seconds again: 08-restore-ttl.py
  5. Update DNS records for new database masters, deploying eqiad->codfw; codfw->eqiad. This is not covered by the switchdc script.

Phase 9 - verification and troubleshooting

This is not covered by the switchdc script

  1. Make sure reading & editing works! :)
  2. Make sure recent changes are flowing (see Special:RecentChanges, EventStreams, and the IRC feeds)
  3. Make sure email works (exim4 -bp on mx1001/mx2001, test an email)

Dashboards

Media storage/Swift

Switchover

  • Set temporary active/active for Swift
  1. gerrit: C+2 and Submit commit https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458794
  2. <any puppetmaster>: puppet-merge
  3. <any cumin master>: sudo cumin 'A:cp-upload_eqiad or A:cp-upload_codfw' 'run-puppet-agent'
  • The above must complete correctly and fully (applying the change)
  • Set Swift to active/passive in codfw only:
  1. gerrit: C+2 and Submit commit https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458795
  2. <any puppetmaster>: puppet-merge
  3. <any cumin master>: sudo cumin 'A:cp-upload_eqiad or A:cp-upload_codfw' 'run-puppet-agent'

Switching back

Repeat the steps above in reverse order with the commits https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458796 and https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458797 respectively.

Dashboards

ElasticSearch

CirrusSearch talks by default to the local datacenter ($wmfDatacenter). If MediaWiki switches datacenter, Elasticsearch will automatically follow.

CirrusSearch can always be switched to a specific datacenter manually. Point CirrusSearch to codfw by editing wmgCirrusSearchDefaultCluster in InitialiseSettings.php.

To ensure coherence in case of lost updates, a reindex of the pages modified during the switch can be done by following Recovering from an Elasticsearch outage / interruption in updates.

Dashboards

Traffic

General information on generic procedures

https://wikitech.wikimedia.org/wiki/Global_Traffic_Routing

Switchover

Inter-Cache Routing:

  1. gerrit: C+2 and Submit commit https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458808
  2. <any puppetmaster>: puppet-merge
  3. <any cumin master>: sudo cumin 'A:cp-esams' 'run-puppet-agent -q'

GeoDNS (User-facing) Routing:

  1. gerrit: C+2 and Submit commit https://gerrit.wikimedia.org/r/#/c/operations/dns/+/458806
  2. <any authdns node>: authdns-update

Switchback

Same procedures as above, with reversions of the commits specified. The switchback happens in two stages over two days: first reverting the inter-cache routing, then the user routing, to minimize cold-cache issues. Respective changes: GeoDNS, Inter-Cache Routing.

Dashboards

Traffic - Services

For reference, the public-facing services involved that are confirmed active/active or failover-capable (other than MW and Swift, handled elsewhere):

  • cache_text:
    • restbase (active/passive for now)
    • eventstreams (active/active)
    • cxserver (active/active)
    • noc (active/active)
    • pybal_config (active/active)
    • wdqs (active/active)
    • ores (active/active)
  • cache_upload:
    • kartotherian (active/active) - aka maps

Switchover

  • Set temporary active/active for active/passive services above:
  1. gerrit: C+2 and Submit commit https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458802
  2. <any frontend puppetmaster>: puppet-merge
  3. <any cumin master>: sudo cumin 'A:cp-text_eqiad or A:cp-text_codfw' 'run-puppet-agent'
  • The above must complete correctly and fully (applying the change)
  • Set all active/active (including temps above) to active/passive in codfw only:
  1. gerrit: C+2 and Submit commit https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458803
  2. <any frontend puppetmaster>: puppet-merge
  3. <any cumin master>: sudo cumin '(A:cp-text or A:cp-upload) and (A:eqiad or A:codfw)' 'run-puppet-agent'

Switchback

Reverse the above with reverted commits, https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458804 and https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458805 respectively

Dashboards

Services

All services are active-active in DNS discovery, apart from RESTBase, which needs special treatment. The procedure to fail over to a single site is the same for each of them:

  1. reduce the TTL of the DNS discovery records to 10 seconds
  2. depool the datacenter we're moving away from in confctl / discovery
  3. restore the original TTL
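The three steps can be modelled as one transaction on the service's discovery state; the dict layout and TTL values below are illustrative (in production the work is done with confctl):

```python
def failover(state: dict, service: str, away_from: str,
             low_ttl: int = 10, normal_ttl: int = 300) -> dict:
    """state maps service -> {'ttl': int, 'pooled': set of DCs}.
    Lower the TTL, depool the site we are moving away from, restore the TTL."""
    svc = state[service]
    svc["ttl"] = low_ttl               # 1. make the upcoming change propagate fast
    svc["pooled"].discard(away_from)   # 2. depool the old site
    assert svc["pooled"], "refusing to depool the last remaining site"
    svc["ttl"] = normal_ttl            # 3. restore the original TTL
    return state
```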

RESTBase is a bit of a special case and needs an additional step if we are just switching active traffic over, not simulating a complete failover:

  1. Pool restbase-async everywhere, then depool restbase-async in the newly active DC, so that async traffic is separated from real-user traffic as much as possible.

Other miscellaneous

Schedule of past switches

Schedule for 2017 switch

T138810 on Phabricator tracks tasks to be undertaken during the 2017 switch.

  • Elasticsearch: automatically follows the MediaWiki switch
  • Services: Tuesday, April 18th 2017 14:30 UTC
  • Media storage/Swift: Tuesday, April 18th 2017 15:00 UTC
  • Traffic: Tuesday, April 18th 2017 19:00 UTC
  • MediaWiki: Wednesday, April 19th 2017 14:00 UTC (user visible, requires read-only mode)
  • Deployment server: Wednesday, April 19th 2017 16:00 UTC

Switching back:

  • Traffic: Pre-switchback in two phases: Mon May 1 and Tue May 2 (to avoid cold-cache issues Weds)
  • MediaWiki: Wednesday, May 3rd 2017 14:00 UTC (user visible, requires read-only mode)
  • Elasticsearch: automatically follows the MediaWiki switch
  • Services: Thursday, May 4th 2017 14:30 UTC
  • Swift: Thursday, May 4th 2017 15:30 UTC
  • Deployment server: Thursday, May 4th 2017 16:00 UTC

Schedule for 2016 switch

  • Deployment server: Wednesday, January 20th 2016
  • Traffic: Thursday, March 10th 2016
  • MediaWiki 5-minute read-only test: Tuesday, March 15th 2016, 07:00 UTC
  • Elasticsearch: Thursday, April 7th 2016, 12:00 UTC
  • Media storage/Swift: Thursday, April 14th 2016, 17:00 UTC
  • Services: Monday, April 18th 2016, 10:00 UTC
  • MediaWiki: Tuesday, April 19th 2016, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)

Switching back:

  • MediaWiki: Thursday, April 21st 2016, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)
  • Services, Elasticsearch, Traffic, Swift, Deployment server: Thursday, April 21st 2016, after the above is done

Monitoring Dashboards

Aggregated list of interesting dashboards:

  • Apache/HHVM: https://grafana.wikimedia.org/dashboard/db/apache-hhvm?orgId=1
  • App servers: https://grafana.wikimedia.org/dashboard/db/mediawiki-application-servers?orgId=1
  • Load balancers: https://grafana.wikimedia.org/dashboard/db/load-balancers?orgId=1
  • Frontend traffic: https://grafana.wikimedia.org/dashboard/db/frontend-traffic?refresh=1m&orgId=1
  • Varnish eqiad: https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-dc-stats?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=cache_upload&var-layer=backend
  • Varnish codfw: https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-dc-stats?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=cache_upload&var-layer=backend
  • Swift eqiad: https://grafana.wikimedia.org/dashboard/file/swift.json?orgId=1
  • Swift codfw: https://grafana.wikimedia.org/dashboard/file/swift.json?orgId=1&var-DC=codfw