Switch Datacenter
Introduction
A datacenter switchover (from eqiad to codfw, or vice versa) comprises switching over multiple components, some of which can happen independently and many of which need to happen in lockstep. This page documents all the steps needed to switch over from one master datacenter to the other, broken up by component.
Schedule for 2020 switch
- Services: Monday, August 31st, 2020 14:00 UTC
- Traffic: Monday, August 31st, 2020 15:00 UTC
- MediaWiki: Tuesday, September 1st, 2020 14:00 UTC
Switching back:
- Traffic: Thursday, September 17th, 2020 17:00 UTC
- MediaWiki: Tuesday, October 27th, 2020 14:00 UTC
- Services: Wednesday, October 28th, 2020 14:00 UTC
Per-service switchover instructions
MediaWiki
We divide the process into logical phases that should be executed sequentially. Within any phase, top-level tasks can be executed in parallel with each other, while subtasks must be executed sequentially. The phase number is referred to in the names of the tasks in the operations/cookbooks repository, under the cookbooks/sre/switchdc/mediawiki/ path.
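As a rough sketch, an individual phase is run via its cookbook. The invocation below is hypothetical: it assumes the mediawiki cookbooks take the same dc_from dc_to positional arguments as the sre.switchdc.services example shown later on this page, so check the cookbook's --help output before running anything.
# Hypothetical example: run the Phase 0 "disable puppet" step when moving from eqiad to codfw.
$ cookbook sre.switchdc.mediawiki.00-disable-puppet eqiad codfw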
Days in advance preparation
- OPTIONAL: SKIP IN AN EMERGENCY: Make sure databases are in a good state. Normally this requires no operation, as the passive datacenter's databases are always prepared to receive traffic, so there are no actionables. Some things that DBAs should normally check to ensure the most optimal state possible (sanity checks):
- There is no ongoing long-running maintenance that affects database availability or lag (schema changes, upgrades, hardware issues, etc.). Depool any servers that are not ready.
- Replication is flowing from eqiad -> codfw and from codfw -> eqiad (sometimes replication gets stopped in the passive -> active direction to facilitate maintenance); see the sanity-check sketch after this list.
- All database servers have their buffer pools filled up. This is taken care of automatically by the buffer pool warmup functionality. As a sanity check, some sample load could be sent to the MediaWiki application servers to verify that requests complete as quickly as in the active datacenter.
- These were the things we prepared/checked for the 2018 switch.
- Make absolutely sure that parsercache replication is working from the active to the passive DC. This is important.
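A minimal sanity-check sketch for the replication checks above, assuming shell access to a host that can reach the database masters; the host names are placeholders, not real entries.
# Hypothetical replication check; <old-site-master> / <new-site-master> are placeholders.
# Both directions should show the IO/SQL threads running and low lag.
$ mysql -h <new-site-master> -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'
$ mysql -h <old-site-master> -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'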
Phase 0 - preparation
- Disable puppet on maintenance hosts in both eqiad and codfw: 00-disable-puppet.py
- Reduce the TTL on appservers-ro, appservers-rw, api-ro, api-rw, jobrunner, videoscaler, parsoid-php to 10 seconds: 00-reduce-ttl.py. The warmup step (below) typically takes longer than the old TTL (5 minutes), so this step doesn't wait for it to expire. Make sure that at least 5 minutes have passed before moving to Phase 1; see the verification sketch after this list.
- Warm up APC by running the mediawiki-cache-warmup on the new site's clusters. The warmup queries are automatically sent three times, because empirically that's when the response times stabilize: 00-warmup-caches.py
- The global warmup against the appservers cluster
- The apc-warmup against all hosts in the appservers cluster.
- Set downtime for the read-only checks on the MariaDB masters changed in Phase 3 (they will complain after the etcd primary master changes and until puppet runs there).
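A hedged sketch for checking that the lowered TTL is actually being served; the record name is an assumption based on the service names above, so substitute the real discovery record and run it from a production host.
# Hypothetical TTL check; the discovery record name is an assumption.
# The second column of the answer is the remaining TTL, which should be <= 10.
$ dig +noall +answer appservers-rw.discovery.wmnet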
Phase 1 - stop maintenance
- Stop maintenance jobs in the active site and kill all the cronjobs on the maintenance host in the active site: 01-stop-maintenance.py
Phase 2 - read-only mode
- Go to read-only mode by changing the ReadOnly conftool value: 02-set-readonly.py
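A hedged sketch of roughly what this step does under the hood; the mwconfig object type, field name, and value format are all assumptions about conftool's configuration, so prefer the cookbook over any manual invocation.
# Hypothetical manual equivalent; object type, field name and value format are assumptions.
$ confctl --object-type mwconfig select 'name=ReadOnly' set/val=true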
Phase 3 - lock down database masters
- Put old-site core DB masters (shards: s1-s8, x1, es4-es5) in read-only mode and wait for the new site's databases to catch up replication: 03-set-db-readonly.py
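Conceptually, the database side of this step looks like the sketch below; host names are placeholders, and in practice the cookbook performs this across all of the listed sections.
# Hypothetical manual equivalent; host names are placeholders, the cookbook handles all sections.
$ mysql -h <old-site-master> -e "SET GLOBAL read_only = 1"
# Wait for the new site's master to catch up (lag should reach 0).
$ mysql -h <new-site-master> -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master
# Phase 6 later reverses this on the new masters with SET GLOBAL read_only = 0.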
Phase 4 - switch active datacenter configuration
- Switch the discovery records and MediaWiki active datacenter: 04-switch-mediawiki.py
- Flip appservers-ro, appservers-rw, api-ro, api-rw, jobrunner, videoscaler, parsoid-php to pooled=true in the new site. Since both sites are now pooled in etcd, this will not actually change the DNS records for the active datacenter.
- Flip WMFMasterDatacenter from the old site to the new.
- Flip appservers-ro, appservers-rw, api-ro, api-rw, jobrunner, videoscaler, parsoid-php to pooled=false in the old site. After this, DNS will be changed for the old DC and internal applications (except MediaWiki) will start hitting the new DC.
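A hedged illustration of what flipping a discovery record means in conftool; the object type and tag names are assumptions, and the cookbook performs these changes for all of the records listed above.
# Hypothetical manual equivalents; tag names are assumptions, the cookbook does this for you.
# Pool the new site (codfw in this example) for one of the discovery records...
$ confctl --object-type discovery select 'dnsdisc=appservers-rw,name=codfw' set/pooled=true
# ...and depool the old site (eqiad) once WMFMasterDatacenter has been flipped.
$ confctl --object-type discovery select 'dnsdisc=appservers-rw,name=eqiad' set/pooled=false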
Phase 5 - Invert Redis replication for MediaWiki sessions
- Invert the Redis replication for the sessions cluster: 05-invert-redis-sessions.py
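Conceptually, inverting replication for a single pair of hosts looks like the sketch below; the host names are placeholders, authentication is omitted, and the cookbook does this for every sessions shard.
# Hypothetical illustration for one host pair; real hosts, ports and auth are omitted.
$ redis-cli -h <new-site-redis> SLAVEOF NO ONE
$ redis-cli -h <old-site-redis> SLAVEOF <new-site-redis> 6379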
Phase 6 - Set new site's databases to read-write
- Set new-site's core DB masters (shards: s1-s8, x1, es4-es5) in read-write mode: 06-set-db-readwrite.py
Phase 7 - Set MediaWiki to read-write
- Go to read-write mode by changing the ReadOnly conftool value: 07-set-readwrite.py
Phase 8 - post read-only
- Start maintenance in the new DC: 08-start-maintenance.py
- Run puppet on the maintenance hosts
- Update tendril for the new database masters: 08-update-tendril.py. This is a purely cosmetic change with no effect on production. No changes are required for the zarcillo database (which has a different master for eqiad and codfw).
- Set the TTL for the DNS records to 300 seconds again: 08-restore-ttl.py
- Update the DNS records for the new database masters, deploying eqiad->codfw; codfw->eqiad. This is not covered by the switchdc script; see the verification sketch after this list. Use the following for the SAL log:
!log Phase 8.4 Update DNS records for new database masters
- Make sure the CentralNotice banner informing users of read-only mode is removed. Keep in mind that there is some minor HTTP caching involved (~5 minutes).
- Remove the downtime added in Phase 0.
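A hedged verification sketch for the master DNS update; the per-section master record name is an assumption based on the usual naming convention, so substitute the records actually touched by the change.
# Hypothetical check; 's1-master' is an assumed record name -- use the records actually changed.
$ dig +short s1-master.eqiad.wmnet
$ dig +short s1-master.codfw.wmnet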
Phase 9 - verification and troubleshooting
This is not covered by the switchdc script
- Make sure reading & editing works! :)
- Make sure recent changes are flowing (see Special:RecentChanges, EventStreams, and the IRC feeds)
- Make sure email works (exim4 -bp on mx1001/mx2001, test an email)
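A minimal sketch of the email check; the recipient address is a placeholder and the availability of a mail client on the MX hosts is an assumption.
# Hypothetical verification on mx1001/mx2001; recipient and the 'mail' client are assumptions.
$ sudo exim4 -bp
$ echo "switchover test" | mail -s "DC switchover test" <your-address>@wikimedia.org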
Dashboards
- Apache/HHVM
- App servers
- ATS cluster view (text)
- ATS backends<->Origin servers overview (appservers, api, restbase)
- MediaWiki errors
- Database Errors logs
Media storage/Swift
Switchover
Cache-wise, Swift is active-active. However, the background synchronization job swiftrepl needs to run in what MediaWiki considers the primary datacenter. A patch such as 622522 accomplishes this and needs to be merged after MediaWiki has been switched over (up to e.g. 24/48 hours later is acceptable).
Switching back
Revert 622522 to move swiftrepl back to eqiad.
Dashboards
- Swift eqiad
- Swift codfw
- Thumbor
- ATS cluster view (upload)
- ATS backends<->Origin servers overview (swift)
ElasticSearch
General context on how to switchover
CirrusSearch talks by default to the local datacenter ($wmfDatacenter). If MediaWiki switches datacenter, Elasticsearch will automatically follow.
Manually switching CirrusSearch to a specific datacenter can always be done. Point CirrusSearch to codfw by editing wmgCirrusSearchDefaultCluster in InitialiseSettings.php.
To ensure coherence in case of lost updates, a reindex of the pages modified during the switch can be done by following Recovering from an Elasticsearch outage / interruption in updates.
Preserving more_like query cache performance
CirrusSearch has a caching layer that caches the result of Elasticsearch queries such as "more like this" queries (which are used, among other things, to generate "Related Articles" at the bottom of mobile Wikipedia pages).
Switching datacenters will result in degraded performance while the cache fills back up.
In order to avoid the aforementioned performance degradation, a mitigation should be deployed that will hardcode more_like queries to keep routing to the "old" datacenter for 24 hours following the switchover.
Hardcoding the Cirrus cluster sends the stampede of cache misses to the secondary search cluster, which (once typical traffic has migrated to the new datacenter) has enough capacity to serve the increased load.
This hardcoding should be deployed in advance of the switchover. Since it is effectively a no-op until the actual cutover, it can be deployed as far in advance as desired.
For example, if we are switching over from eqiad to codfw, more_like queries should be hardcoded to route to eqiad; this change should be deployed before the actual cutover. Then, 24 hours after the cutover, the hardcoding can be removed, allowing more_like queries to route to the new Cirrus DC - in this example, codfw.
Days in advance preparation
Deploy a patch to hardcode more_like query routing to the currently active DC (i.e. the datacenter we are switching over from).
Example Patch: [mediawiki-config] 635411 cirrus: Temporarily hardcode more_like query routing
This mitigation should be left in place for 24 hours following the switchover (equivalent to the cache length), at which point there is no longer any performance penalty to removing the hardcoding.
One day after datacenter switch
Revert the earlier patch to hardcode more_like query routing; this will allow these queries to route to the newly active DC, and there will not be any performance degradation since the caches have been fully populated by this point.
Dashboards
Traffic
General information on generic procedures
https://wikitech.wikimedia.org/wiki/Global_Traffic_Routing
Switchover
GeoDNS (User-facing) Routing:
- gerrit: C+2 and Submit commit https://gerrit.wikimedia.org/r/#/c/operations/dns/+/458806
- <any authdns node>: authdns-update
- SAL Log using the following
!log Traffic: depool eqiad from user traffic
Switchback
Same procedure as above (GeoDNS), reverting the commit specified.
Dashboards
Services
All services are active-active in DNS discovery, apart from restbase, which needs special treatment. The procedure to fail over to one site only is the same for each of them:
- reduce the TTL of the DNS discovery records to 10 seconds
- depool the datacenter we're moving away from in confctl / discovery
- restore the original TTL
- All of the above is done using the sre.switchdc.services cookbooks:
# Switch the service "parsoid" to codfw-only
$ cookbook sre.switchdc.services --services parsoid -- eqiad codfw
# Switch all active-active services to codfw, excluding parsoid and cxserver
$ cookbook sre.switchdc.services --exclude parsoid cxserver -- eqiad codfw
Restbase is a bit of a special case and needs an additional step if we're just switching active traffic over and not simulating a complete failover:
- pool restbase-async everywhere, then depool restbase-async in the newly active DC, so that async traffic is separated from real-user traffic as much as possible (see the sketch below).
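A hedged sketch of the restbase-async step; the discovery object name and tags are assumptions, so verify against the actual conftool data (or use the cookbooks) before running anything.
# Hypothetical illustration; object and tag names are assumptions.
# Pool restbase-async in both sites, then depool it in the newly active DC (codfw here).
$ confctl --object-type discovery select 'dnsdisc=restbase-async' set/pooled=true
$ confctl --object-type discovery select 'dnsdisc=restbase-async,name=codfw' set/pooled=false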
Dashboards
Other miscellaneous
Schedule of past switches
Schedule for 2018 switch
- Services: Tuesday, September 11th 2018 14:30 UTC
- Media storage/Swift: Tuesday, September 11th 2018 15:00 UTC
- Traffic: Tuesday, September 11th 2018 19:00 UTC
- MediaWiki: Wednesday, September 12th 2018: 14:00 UTC
Switching back:
- Traffic: Wednesday, October 10th 2018 09:00 UTC
- MediaWiki: Wednesday, October 10th 2018: 14:00 UTC
- Services: Thursday, October 11th 2018 14:30 UTC
- Media storage/Swift: Thursday, October 11th 2018 15:00 UTC
Schedule for 2017 switch
T138810 on Phabricator tracks tasks to be undertaken during the 2017 switch.
- Elasticsearch: elasticsearch is automatically following mediawiki switch
- Services: Tuesday, April 18th 2017 14:30 UTC
- Media storage/Swift: Tuesday, April 18th 2017 15:00 UTC
- Traffic: Tuesday, April 18th 2017 19:00 UTC
- MediaWiki: Wednesday, April 19th 2017 14:00 UTC (user visible, requires read-only mode)
- Deployment server: Wednesday, April 19th 2017 16:00 UTC
Switching back:
- Traffic: Pre-switchback in two phases: Mon May 1 and Tue May 2 (to avoid cold-cache issues Weds)
- MediaWiki: Wednesday, May 3rd 2017 14:00 UTC (user visible, requires read-only mode)
- Elasticsearch: elasticsearch is automatically following mediawiki switch
- Services: Thursday, May 4th 2017 14:30 UTC
- Swift: Thursday, May 4th 2017 15:30 UTC
- Deployment server: Thursday, May 4th 2017 16:00 UTC
Schedule for 2016 switch
- Deployment server: Wednesday, January 20th 2016
- Traffic: Thursday, March 10th 2016
- MediaWiki 5-minute read-only test: Tuesday, March 15th 2016, 07:00 UTC
- Elasticsearch: Thursday, April 7th 2016, 12:00 UTC
- Media storage/Swift: Thursday, April 14th 2016, 17:00 UTC
- Services: Monday, April 18th 2016, 10:00 UTC
- MediaWiki: Tuesday, April 19th 2016, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)
Switching back:
- MediaWiki: Thursday, April 21st 2016, 14:00 UTC / 07:00 PDT / 16:00 CEST (requires read-only mode)
- Services, Elasticsearch, Traffic, Swift, Deployment server: Thursday, April 21st 2016, after the above is done
Monitoring Dashboards
Aggregated list of interesting dashboards