Cache servers

WMF currently runs around 100 physical servers in 5 data centres to cache HTTP traffic. The servers are split into logical clusters according to the type of content they cache, and each logical cluster corresponds to a puppet role, for example cache::text and cache::upload.

Move server to another cluster

The following procedure describes how to move a cache server to another cluster. The example shows how cp3043.esams.wmnet can be moved from cache_text to cache_upload.

Depool and downtime

On the cluster::management node (neodymium.eqiad.wmnet at the time of this writing):

sudo -i confctl select name=cp3043.esams.wmnet set/pooled=no

Make sure that depooling took place correctly, for instance by looking at the frontend and backend traffic instance breakdown dashboards.
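
You can also confirm the state from conftool itself, using the same selector syntax as the depool command:

sudo -i confctl select name=cp3043.esams.wmnet get

Each returned object should now show "pooled": "no".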

On the alerting_host node (currently einsteinium.wikimedia.org):

sudo -i icinga-downtime -h cp3043 -d 7200 -r "move cp3043 to cache_upload --$USER"

Remove host from its current cluster

Remove the host from the list of cache_text machines in conftool and hiera (commit example)
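
The conftool side of that commit is a change to conftool-data in the puppet repository. As a rough sketch only (the file layout and service names here are assumptions, not the exact current schema), the entry to delete looks something like:

# conftool-data/node/esams.yaml (illustrative)
esams:
  cache_text:
    cp3043.esams.wmnet: [varnish-fe, varnish-be]   # remove this line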

Puppet needs to be run on all other (non-cp3043) cache_text nodes to reflect the hiera changes. Conftool changes take effect automatically upon puppet-merge. You can double-check whether the node has been removed from the list of cache_text servers in esams.
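
One way to run puppet across the cluster is with cumin from the cluster management host; a sketch, where the host alias and query grammar are assumptions to adapt to the real setup:

sudo cumin 'A:cp-text and not cp3043*' 'run-puppet-agent'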

Add host to new cluster

Disable puppet on all cache nodes belonging to the new cluster (cache_upload in this example).
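
With cumin, this is roughly (alias and reason string are illustrative):

sudo cumin 'A:cp-upload' 'disable-puppet "move cp3043 to cache_upload"'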

Add the node to cache_upload in hiera, change the server role (example), and follow the Server Lifecycle/Reimage procedure.
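
The reimage step is driven by the reimage cookbook; a sketch of the invocation, where the OS name and the task ID are placeholders (check the cookbook's --help for the current flags):

sudo cookbook sre.hosts.reimage --os bullseye -t T000000 cp3043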

Add node to conftool.
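
This mirrors the removal sketched earlier: add the host back to conftool-data, this time under cache_upload (again an illustrative layout, not the exact schema):

esams:
  cache_upload:
    cp3043.esams.wmnet: [varnish-fe, varnish-be]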

Run puppet on all cache_upload nodes.
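
If puppet was disabled with a reason earlier, re-enable it with the same reason before triggering the runs; roughly:

sudo cumin 'A:cp-upload' 'enable-puppet "move cp3043 to cache_upload"'
sudo cumin 'A:cp-upload' 'run-puppet-agent'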

Final verification and pooling

Ensure that the new node is working as expected (e.g. run varnishtest -k /usr/share/varnish/tests/upload/*.vtc, and test requests locally against varnish-fe and varnish-be).
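
For the local test requests, something along these lines works from the host itself; the URL path is hypothetical, and the backend port 3128 is an assumption based on the historical varnish-be setup, so adjust to the actual instance configuration:

curl -sv -H 'Host: upload.wikimedia.org' 'http://localhost/wikipedia/commons/a/aa/Example.jpg' -o /dev/null
curl -sv -H 'Host: upload.wikimedia.org' 'http://localhost:3128/wikipedia/commons/a/aa/Example.jpg' -o /dev/null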

Repool.
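
On the cluster management node, mirroring the earlier depool command:

sudo -i confctl select name=cp3043.esams.wmnet set/pooled=yes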