Service restarts

This page collects procedures to restart services (or reboot the underlying server) in the WMF production cluster.

acmechief hosts

Every server running the acme_chief::cert class profile will fail to run Puppet while the servers are down. To avoid Puppet spam you can disable Puppet on the hosts during reboots (apart from the Puppet spam there are no errors caused by the temporary non-availability of the acmechief servers):

sudo cumin 'R:acme_chief::cert' "disable-puppet 'acmechief maintenance - ${USER}'"

After rebooting you need to rearm the keyholder:

sudo keyholder arm

Application servers (also image/video scalers and job runners and parsoid)

php-fpm restart

Restarts of PHP-FPM should be spread out a little, e.g. by waiting 30 seconds between each restart, but we can run multiple at the same time:

cumin -b 1 -s 30 'A:mw or A:mw-api or A:parsoid or A:mw-jobrunner' 'restart-php7.4-fpm'

Other services restart

If you need to restart any other system service on these servers, you should also run the corresponding restart-<service-name> scripts if available.

Currently, this list is:

restart-apache2                                                                                                               
restart-envoyproxy                                                                                                            
restart-mcrouter
restart-nginx # only on parsoid servers
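
For example, a hedged sketch of rolling one of these scripts across the appserver hosts with cumin, reusing the batching from the php-fpm example above (treat the alias/script pairing as illustrative):

 sudo cumin -b 1 -s 30 'A:mw' 'restart-envoyproxy'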

To restart systemd services on multiple application servers, use the sre.mediawiki.restart_appservers cookbook. This cookbook is only available for the api_appserver, appserver, and jobrunner clusters.

# Restart mcrouter and nutcracker on the appservers in codfw and eqiad:
sudo cookbook sre.mediawiki.restart_appservers -c appserver -d eqiad codfw -- mcrouter nutcracker

# Restart php7.4-fpm on the api appservers in eqiad (no more than 10% at a time):
sudo cookbook sre.mediawiki.restart_appservers -p 10 -c api_appserver -d eqiad -- php7.4-fpm

Server Reboot

The mw* servers can be rebooted in batches of 10 (5 API servers and 5 job-runners). The API servers normally stop receiving requests within seconds of being depooled; this can be validated by checking the Apache logs for incoming requests (e.g. by tailing /var/log/apache2/other_vhosts_access.log) or the Apache workers / incoming requests panels on the grafana dashboard. To reboot, use the sre.hosts.reboot-single cookbook for single servers or the sre.hosts.reboot-cluster cookbook to perform a rolling reboot.

Example to reboot a single mediawiki API server in eqiad:

sudo -i confctl --quiet select 'name=mw1312.eqiad.wmnet' set/pooled=no
# wait until server gets no more traffic (logs, Grafana)
sudo cookbook sre.hosts.reboot-single mw1312.eqiad.wmnet
sudo -i confctl --quiet select 'name=mw1312.eqiad.wmnet' set/pooled=yes

Example to reboot all mediawiki API servers in codfw:

sudo cookbook sre.hosts.reboot-cluster -D codfw -c api_appserver --percentage 5 --grace_sleep 60

The jobrunners and videoscalers are responsible for transcoding video thumbnails and can often have long-running jobs. To ensure the job runners have been successfully depooled, one can pgrep for ffmpeg processes.

A rough example to reboot a single mediawiki jobrunner (we hope to create a dedicated playbook for this process):

sudo -i confctl --quiet select 'name=mw1221.eqiad.wmnet' set/pooled=no
sudo cumin 'mw1221.eqiad.wmnet' 'pgrep ffmpeg'
# proceed if no ffmpeg process is found
sudo cookbook sre.hosts.reboot-single mw1221.eqiad.wmnet
sudo -i confctl --quiet select 'name=mw1221.eqiad.wmnet' set/pooled=yes

To reboot multiple jobrunners or videoscalers, the sre.hosts.reboot-cluster cookbook can also be used with a higher --grace_sleep parameter to give ffmpeg jobs more time to finish. However, this is no guarantee that all jobs are finished, so make sure to check for ffmpeg jobs during cookbook execution. A rough example to reboot all mediawiki jobrunners in codfw (we hope to create a dedicated playbook for this process):

sudo cookbook sre.hosts.reboot-cluster -D codfw -c jobrunner --percentage 10 --grace_sleep 180
# check for ffmpeg jobs before each server is restarted
sudo cumin 'P:cumin::target%cluster = jobrunner and *.codfw.wmnet' 'pgrep ffmpeg'

aqs

There are two services running on the AQS hosts: Cassandra and the HTTP nodejs service.

Cassandra needs to be roll-restarted one node at a time, see the section about Cassandra on this page.

The aqs service is stateless and can be restarted on the individual servers (but only one at a time). The following commands can be used to restart the aqs service:

From one of the AQS servers

$ sudo depool ; sleep 5 ; sudo systemctl restart aqs.service ; sleep 5 ; sudo pool

From the Cumin server

$ sudo cumin -m async -b 1 -s 20 A:aqs 'depool' 'sleep 5' 'systemctl restart aqs.service' 'sleep 5' 'pool'

From the Cumin server (cookbook)

$ sudo cookbook sre.aqs.roll-restart-reboot restart-daemons

ATS

The ats-backend-restart script is installed on all ATS nodes and takes care of depooling the service, restarting it, and repooling. If trafficserver needs to be restarted on a single host, just run the script as root. Cluster-wide rolling restarts can be performed with cumin; see the following example:

sudo cumin -b1 -s30 A:cp-codfw ats-backend-restart

Be aware that ats-backend-restart also sleeps for some 60 seconds, so the cumin command will take over an hour. However, the ATS cache_upload and cache_text servers can be restarted in parallel, as can different datacenters.
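
As a sketch of that parallelism, the per-datacenter roll-restarts can simply be started in separate terminals (A:cp-codfw is taken from the example above; A:cp-eqiad is assumed by analogy):

 sudo cumin -b1 -s30 A:cp-codfw ats-backend-restart
 sudo cumin -b1 -s30 A:cp-eqiad ats-backend-restart   # in a second terminal; alias assumed by analogy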

Bacula

Before rebooting a storage host or the director make sure no backup run is currently in progress. This can be checked on helium via:

 sudo bconsole 
 status director
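
A non-interactive variant of the same check (a sketch; bconsole reads commands from stdin):

 echo "status director" | sudo bconsole | grep -i running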

Bastions

To reboot those, it's best to announce the reboots in advance via the ops list, so that people are aware and no critical work is disrupted. There might still be people who overlook or forget it, so it's best to ping logged-in, non-idle users on IRC before proceeding with the reboots.

Cache proxies (varnish) (cp)

The Varnish servers will depool themselves on clean shutdown.

Alternatively you can run the depool and pool commands as root.

Or also, on puppetmaster1001 as root:

 confctl select 'name=<fqdn>' set/pooled=yes

<fqdn> can be a regex if tackling several machines; change yes to no for a depool.
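
For example, a hypothetical depool of several cp hosts at once with a regex (the hostnames are invented for illustration):

 confctl select 'name=cp105[1-4].eqiad.wmnet' set/pooled=no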

When restarting nginx

 cumin 'foo*' -b 1 -s 15 'service nginx upgrade'

performs a graceful online restart with a 15-second delay in between.

When restarting Varnishkafka

systemctl restart varnishkafka-webrequest

Important note: restarting Varnishkafka means that its internal sequence number variable is reset to 0, affecting the JSON messages/events sent to Kafka (they all carry that field). This is usually not a big problem, but if all the caching hosts are restarted at once it may cause alarms for Analytics due to inconsistent data in Hadoop (hours after the restarts). Please do the restarts in small batches and alert the Analytics team in advance.

sudo cumin -s5 -b3 'C:varnishkafka' 'systemctl restart varnishkafka-webrequest.service'

Cassandra (as used in aqs, sessionstore and restbase)

Cassandra as used in restbase/sessionstore uses a multi-instance setup, i.e. one host runs multiple Cassandra processes, typically named "a", "b", etc. For each instance there is a corresponding nodetool-NAME binary that can be used, e.g. nodetool-a status -r.

A restart of cassandra as used for restbase does not require a depooling of the server (restbase will pick a different cassandra node if the local one is unavailable).

Before starting with a reboot/restart, check whether there are ongoing maintenance tasks (e.g. with Eric Evans).

The restart of the Cassandra instances can be performed using the c-foreach-restart command; it figures out how many instances are running and proceeds step by step:

 sudo c-foreach-restart --delay 10 --attempts 20 --retry 12

If you want to reboot a Cassandra server, the instances can be drained using c-foreach-nt, after the instances are drained, the server can be restarted:

 sudo c-foreach-nt drain

Before proceeding with the next node (both for restarts and reboots), ensure the restarted node has correctly rejoined the cluster (the name of the tool is relative to the restarted service instance):

 c-any-nt status -r

Note: The c-foreach-restart utility already ensures that restarted nodes are fully online before continuing, so this manual check should not be necessary when using it.

Directly after the restart the tool might throw either an exception reading "No nodes are present in the cluster" or another reading "Failed to connect to 'localhost:7189'", but this usually sorts itself out within a few seconds. If the node has correctly rejoined the cluster, it should be listed with the "UN" prefix, e.g.:

UN  xenon-a.eqiad.wmnet              224.65 GB  256     ?       0d691414-4132-4854-a00d-1d2671e15728  rack1
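
Putting the reboot steps together for a single multi-instance host (a sketch built from the commands above; the hostname is a placeholder):

 sudo c-foreach-nt drain                                        # on the host: drain all instances
 sudo cookbook sre.hosts.reboot-single restbase1019.eqiad.wmnet # from a cumin host; placeholder hostname
 c-any-nt status -r | grep restbase1019                         # afterwards: wait until all instances show UN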

Ceph (WMCS CloudVPS)

The Ceph storage cluster is a critical component of CloudVPS. Any action on these services should be coordinated with the WMCS team, either through IRC or Phabricator.

Detailed information on restarting Ceph services (should be handled by the WMCS team)

ChartMuseum

See ChartMuseum#Operations

cloudcontrol

nova-{api,conductor,scheduler}, neutron-server, keystone-all and glance-{api,registry} are all part of OpenStack and are very disruptive to restart. Restarts of these services are only performed for serious updates and are generally bundled with a reboot of the server.

Any action should be coordinated with the WMCS team and communicated to WMCS users.

clouddb (Wikireplicas)

These are Mariadb servers, but they have a peculiar failover setup on haproxy.

  • Depool using the hiera method described in WMCS docs. The docs for the legacy wikireplicas (named labsdb1*) are also on that page.
  • Wait for queries to finish or kill queries as needed on the instance
  • Stop replication. On a legacy wikireplica (labsdb*) run sudo -i mysql -e "STOP SLAVE". On the clouddb* wikireplicas, you need to connect using the appropriate socket (such as sudo -i mysql -e "STOP SLAVE" -S /var/run/mysqld/mysqld.s1.sock for the s1 instance).
  • Stop the server: on a labsdb host run systemctl stop mariadb; on a clouddb host stop both instances, i.e. systemctl stop mariadb@s1 and systemctl stop mariadb@s3, then reboot (see the sketch after this list)
  • When it comes back up, start mariadb and then make sure you run sudo -i mysql -e "START SLAVE" (on labsdb) or the same with the appropriate sockets (clouddb), and check that SHOW SLAVE STATUS\G looks healthy
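
A minimal sketch of that sequence for a clouddb host carrying the s1 and s3 instances (socket paths and unit names as in the list above; illustrative only):

 sudo -i mysql -S /var/run/mysqld/mysqld.s1.sock -e "STOP SLAVE"
 sudo -i mysql -S /var/run/mysqld/mysqld.s3.sock -e "STOP SLAVE"
 sudo systemctl stop mariadb@s1 mariadb@s3
 sudo reboot
 # once the host is back up:
 sudo systemctl start mariadb@s1 mariadb@s3
 sudo -i mysql -S /var/run/mysqld/mysqld.s1.sock -e "START SLAVE; SHOW SLAVE STATUS\G"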

Charon

Charon is used for IPsec. You can restart it one host at a time with: sudo depool ; sudo systemctl restart ipsec.service ; sudo pool

Cumin

For reboots make sure no one is currently using the host. After a reboot, the keyholder needs to be rearmed:

 sudo keyholder arm

(The passphrase is in pwstore in the cumin-master-key-passphrase file).

Druid

Druid is used for the Analytics/Data_Lake. There are two druid clusters:

  • druid analytics - serves queries for Analytics UIs (Turnilo/Superset/etc..)
  • druid public - serves queries for the public AQS service

If Druid analytics goes down, the Analytics team will notice, and some Hadoop jobs may fail, but no real world service will be impacted. If Druid public goes down, API queries to AQS may fail since part of the API needs to fetch content from Druid (Mediawiki edit history related queries).

Please note two important things:

  • Zookeeper is running on the Druid hosts (used by the Druid daemons).
  • Turnilo (turnilo.wikimedia.org) and Superset (superset.wikimedia.org) use Druid analytics as backend storage to show data.

To roll restart Druid daemons use the related cookbook:

$ sudo cookbook sre.druid.roll-restart-workers public

$ sudo cookbook sre.druid.roll-restart-workers analytics

Also see: Analytics/Cluster/Druid#Full Restart of services

Authoritative DNS

For restarting authdns servers (currently dns[12]001 and dns300[12]), homer changes are NOT required.

The following hosts are the current (March 2023) authoritative DNS servers:

 DC      Hostname                    AKA
 eqiad   dns100[123].wikimedia.org   ns0.wikimedia.org
 codfw   dns200[123].wikimedia.org   ns1.wikimedia.org
 esams   dns300[12].wikimedia.org    ns2.wikimedia.org

DNS traffic needs to be routed to the closest authoritative server prior to rebooting. Each authoritative server has all 3 IP addresses (ns[012]) on the loopback interface.

 Host to reboot             Action
 dns1001.wikimedia.org      route ns0.wikimedia.org to dns2001 (codfw)
 dns2001.wikimedia.org      route ns1.wikimedia.org to dns1001 (eqiad)
 dns300[12].wikimedia.org   route ns2.wikimedia.org to dns1001 (eqiad)

For example, to route ns0 to dns2001, on cr1 and cr2-eqiad, apply the following changes:

# delete routing-options static route 208.80.154.238/32 next-hop 208.80.154.10
# set routing-options static route 208.80.154.238/32 next-hop 208.80.153.77
# commit

Where 208.80.154.238/32 is ns0 and 208.80.153.77 is dns2001. The VIPs are only IPv4 for now so no need to do the same with v6.

Make sure to revert the changes before proceeding to the next host.

The DNS dashboard can be used to confirm whether routing changes (and their revert when done rebooting) have taken place correctly.
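
In addition to the dashboard, a quick sanity check is to query the VIP directly and confirm it still answers after the routing change (IP as in the example above):

 dig +short @208.80.154.238 wikimedia.org A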

DNS recursors

Rebooting the servers will cause some currently unavoidable BGP alerts, e.g. for a reboot in eqsin, simply give a headsup in #wikimedia-operations before you start:

PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status

The DNS servers in eqiad/codfw/esams co-host the authdns service, so when rebooting dns[12]001 and dns300[12] servers, see the instructions for authdns reboots above.

Docker-registry

Depool the server, then run systemctl restart docker-registry and systemctl restart nginx, then repool.
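
Spelled out as individual commands on the registry host (a sketch of the sequence above):

 sudo depool
 sudo systemctl restart docker-registry
 sudo systemctl restart nginx
 sudo pool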

Elasticsearch

The cluster continues to work fine as long as elasticsearch is only restarted on one node at a time (or the host rebooted). The overall cluster state can be queried from any node.

On an arbitrary elasticsearch node the following command returns the overall state of the elasticsearch cluster:

 curl -s localhost:9200/_cluster/health?pretty

Initially the "status" field should be "green". After elasticsearch has been stopped/rebooted, the "number_of_nodes" will go down by one and the "status" will switch to "yellow". The search cluster will resync, but it might take a while (possibly hours) to reach that state. Once it has recovered, the next node can be restarted/rebooted. See search cluster administration for more details about elasticsearch administration.
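
To follow the recovery without re-typing the command, something like the following works (standard watch/grep; field names as in the health output):

 watch -n 30 "curl -s localhost:9200/_cluster/health?pretty | grep -E 'status|number_of_nodes|unassigned_shards'"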

Disgusting script to view Elasticsearch cluster health

Apologies in advance, but you can use es-maint-viewer.sh to view cluster health during a maintenance operation. Run the script from a cumin host and it will open several tmux windows to measure the health of all Elastic clusters in the DC (there are two small and one large clusters per DC).

Restart Procedure

For cluster-wide restarts, reboots, or upgrades, use the rolling-operation cookbook.

Node ban procedure (for DC maintenances)

To ban specific Elastic nodes from the cluster (such as for DC maintenances), use the node ban cookbook

etcd

Etcdmirror reads from one datacenter and replicates its data to the etcd cluster in the other. This means that if you reboot the hosts in the master cluster (currently codfw) you need to downtime the etcdmirror service on the server in the slave cluster doing the replication, or that will page.

Log on the server that is replicating and check what is the source host for the replication:

elukey@conf2002:~$ sudo systemctl status etcdmirror-conftool-eqiad-wmnet.service
● etcdmirror-conftool-eqiad-wmnet.service - Etcd mirrormaker
   Loaded: loaded (/lib/systemd/system/etcdmirror-conftool-eqiad-wmnet.service; enabled)
   Active: active (running) since Mon 2017-07-17 14:52:18 UTC; 55min ago
 Main PID: 23540 (etcd-mirror)
   CGroup: /system.slice/etcdmirror-conftool-eqiad-wmnet.service
           └─23540 /usr/bin/python /usr/bin/etcd-mirror --strip --src-prefix /conftool --dst-prefix /conftool https://conf1001.eqiad.wmnet:2379 http://localhost:2378

In this case etcdmirror on conf2002 is pulling data from conf1001.

The etcd nodes are internally clustered and can be rebooted one at a time, or the etcd service can be restarted one host at a time with sudo systemctl restart etcd.service.

After a reboot, the cluster health can be checked via one of the following:

sudo etcdctl -C https://$(hostname -f):2379 cluster-health
 /usr/local/bin/nrpe_etcd_cluster_health --url https://conf1001.eqiad.wmnet:2379

More in depth info about the etcd cluster can be found in Etcd#Operations

Exim

The exim service/the mx* hosts can be restarted/rebooted individually without external impact; mail servers trying to deliver mail will simply retry at a later point if the SMTP service is unavailable:

service exim4 restart

EventLogging

EventLogging is a python based service that reads/writes from Analytics Kafka (more info in Analytics/EventLogging). If you need to restart the service or reboot the host you can follow Analytics/EventLogging/Oncall#Restart EventLogging, but please reach out to the Analytics IRC channel first just to be sure (#wikimedia-analytics).

Event Schema Service

schema.svc.$site.wmnet is hosted on the schema* servers. This service is a very simple nginx http file server that allows for remote http requests for schemas cloned from git repositories. Rebooting these servers just requires the usual rolling depool; reboot; pool for each.

failoid

Failoid is used for DNS discovery to indicate that a service is failing. Its iptables setup rejects a connection immediately instead of letting the client run into a timeout. As such, Failoid instances can be rebooted one at a time unless there's currently an ongoing service outage.

Ganeti

Ganeti nodes can be upgraded without impact on the running VMs. To reboot a node, its virtual machines need to be migrated to other hosts, with the master node needing special attention.

Gerrit

The restart should be pre-announced on #wikimedia-operations (for maybe 15 minutes) to give people a heads-up:

service gerrit restart

GitLab

The restart should be pre-announced on #wikimedia-gitlab (for maybe 15 minutes) to give people a heads-up. To restart all GitLab components run:

gitlab-ctl restart

GitLab needs some time to recover and answers with 5XX for some minutes. It is also possible to restart individual components of GitLab using gitlab-ctl restart nginx. Get a list of all services using gitlab-ctl status. The GitLab SSH daemon can be restarted using:

systemctl restart ssh-gitlab

GitLab runners

GitLab runners are separate from GitLab servers and, unlike servers, can be rebooted anytime. Use the cookbook sre.gitlab.reboot-runner to gracefully reboot Runners.

To restart gitlab-runner process:

systemctl restart gitlab-runner

They exist both in production and in cloud.

See also GitLab Cheat Sheet.

Hadoop workers

Please coordinate with the Analytics team before taking any action, there are multiple dependencies to consider before proceeding. For example, Gobblin might need to be stopped to prevent data loss/lag in HDFS.

Hadoop's master node (an-master1001.eqiad.wmnet) and its standby replica (an-master1002.eqiad.wmnet) are configured for automatic failover, but please read the following page: Analytics/Cluster/Hadoop/Administration#Manual Failover


Three of the Hadoop workers run an additional JournalNode process to ensure that the standby master node is kept in sync with the active one. These are configured in the puppet manifest. When rebooting JournalNode hosts it must be ensured that two additional JournalNode hosts are up and running.

systemctl restart hadoop-hdfs-journalnode

The other Hadoop workers are running two services (hadoop-hdfs-datanode and hadoop-yarn-nodemanager). The services on the Hadoop workers should be restarted in this order:

systemctl restart hadoop-yarn-nodemanager
systemctl restart hadoop-hdfs-datanode

The service restarts have no user-visible impact (and the machines can also be rebooted). It's best to wait a few minutes before proceeding with the next node.

The Yarn node managers support graceful reload, so all the Yarn containers running on the same node are not killed/restarted at the same time (the node manager dumps its state on disk and it is able to restore its config and running containers while starting). This means that until a container finishes, new package upgrades (like openjdk ones) will not be picked up and will show up in commands like lsof.

There is now a cumin cookbook to restart all the Hadoop workers (the master nodes are delicate and need to be done manually):

$ sudo cookbook sre.hadoop.roll-restart-workers --hdfs-dn-sleep-seconds 120 --hdfs-jn-sleep-seconds 120 analytics

HAProxy

Please check the DNS section to depool servers one at a time before restarting haproxy on DNS servers.

HAProxy servers are used for routing misc servers. They are currently a SPOF, so if you need to restart them, make sure they are not in use by using a different proxy.

HAProxy configuration can be reloaded without stopping it. However, HAProxy needs explicit configuration of the files used for config. If the name or number of files changes (not only the contents), a reload doesn't work and a full service restart is required.
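
As a sketch of that distinction (assuming the standard systemd unit name):

 sudo systemctl reload haproxy    # sufficient when only the contents of existing config files changed
 sudo systemctl restart haproxy   # required when config files were added, removed or renamed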

Hive

The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Since it is not a stateless service, please contact the Analytics team before restarting it, so that running jobs can be paused and failures in the Hadoop cluster avoided. It is composed of two daemons, the server and the metastore.

On an-coord1001:

$ systemctl restart hive-server2

$ systemctl restart hive-metastore

Icinga

Icinga is externally monitored by a custom script. To downtime the external meta-monitoring of one of the Icinga hosts, comment the related crontab entry in the Wikitech-static host and re-enable it once the maintenance is completed.

IDP

When rebooting the servers, first ensure you update DNS so the record for idp.wikimedia.org points to the correct address (the one not being rebooted) and wait 5 minutes for the TTL to expire. We also need to make sure we dump and restore the memcached datastore:

$ sudo /usr/local/sbin/memcached-dump dump -f  /srv/cas/memcached.$(date +%s).dump 
$ reboot
$ sudo /usr/local/sbin/memcached-dump restore -f  /srv/cas/memcached.1608203240.dump 
$ rm /srv/cas/memcached.1608203240.dump

Kafka brokers (analytics)

Several consumers might get upset by metadata changes due to broker restarts, so please make sure that the Analytics team is alerted beforehand.

One Kafka broker can be restarted/rebooted at a time:

service kafka restart

It needs to be ensured that all replicas are fully replicated. After restarting a broker, a replica election should be performed. Note that restarting a broker triggers a bug in the Kafka 0.9 log truncate function that sets all the logs' mtime to now (more info: https://phabricator.wikimedia.org/T136690); this interferes with the regular cleanup policy (removing files with mtime older than 7 days) and could lead to excessive data on one disk partition and disk-full alarms. The Analytics team has deployed a limit on total topic partition size (500GiB), so even with restarts we should be OK, but alert the Analytics team on IRC to give them a heads-up.

Kafka brokers (main clusters)

kafka-main100[123] and kafka-main200[123] were also running the EventBus service (EventBus/Administration), but they no longer do.

Please sync with the Services team to coordinate the restart/reboot of kafka-main[123]00[123], since they might need to temporarily watch services like ChangeProp.

To roll restart Kafka daemons on them:

sudo cookbook sre.kafka.roll-restart-brokers main-eqiad

sudo cookbook sre.kafka.roll-restart-brokers main-codfw

Kubernetes masters

They are behind LVS, so the standard depool/repool methods apply.
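
A sketch of the usual cycle via confctl (the fqdn is a placeholder):

 sudo -i confctl --quiet select 'name=<master-fqdn>' set/pooled=no
 # restart the service or reboot the host
 sudo -i confctl --quiet select 'name=<master-fqdn>' set/pooled=yes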

Kubernetes workers

Further information can be found here: Kubernetes#Administration

Logstash

After rebooting a Logstash node running Elasticsearch (1004-1006), wait until the cluster state has recovered to "green" (see the "Elasticsearch" section for details). 1001 to 1003 and 1007 to 1009 run Logstash, Kibana (Apache proxied via Varnish) and a data-less Elasticsearch node. The multiple logstash endpoints are behind LVS. They can also be rebooted/restarted one at a time after depooling them; from experience it can take ~5 min for logstash to listen again on its ports, so allow enough time between (de)pools.
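
A sketch for a single node (the logstash unit name is an assumption; the sleep reflects the ~5 min mentioned above):

 sudo depool
 sudo systemctl restart logstash   # unit name assumed
 sleep 300                         # give logstash time to listen on its ports again
 sudo pool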

LVS

The LVS servers are configured in primary/backup pairs (configured on the routers and visible in puppet in modules/lvs/manifests/configuration.pp). To redirect the traffic from a primary to the backup, pybal can be stopped (traffic is then being redirected to the backup). Before stopping Pybal, make sure to disable Puppet first as a Puppet run might enable it.

maps

The maps servers can be depooled/repooled via conftool (one at a time) for the kartotherian service, and by stopping the tilerator service on the box itself. Before repooling a server, make sure cassandra is resynced via nodetool (see the Cassandra section for details). Restarts of postgres on the master require puppet to be disabled and the imposm service stopped beforehand.

Mcrouter

The mcrouter service can be restarted as long as the server has been depooled, e.g.:

sudo depool; sleep 5 ; sudo systemctl restart mcrouter ; sleep 5; sudo pool

MySQL/MariaDB

By default MySQL is not configured to start after a reboot, so it must be started manually.

Long-running queries from mwmaint1001 maintenance and SPOFs in certain MySQL services (masters, specialized slave roles, etc.) prevent easy restarts.

The procedure is, for a core production slave:

  • Depool from mediawiki
  • Wait for queries to finish
  • Stop replication: mysql -e "STOP SLAVE"
  • Stop the server (/etc/init.d/mysql stop), then reboot (the full sequence is sketched after this list)
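
A minimal sketch of that sequence, once the replica is depooled and queries have drained (remember that MariaDB does not start automatically after a reboot):

 sudo -i mysql -e "STOP SLAVE"
 sudo /etc/init.d/mysql stop
 sudo reboot
 # after the reboot, start MariaDB manually and resume replication:
 sudo /etc/init.d/mysql start      # or systemctl start mariadb, depending on the host
 sudo -i mysql -e "START SLAVE; SHOW SLAVE STATUS\G"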

For a core production master:

For a misc server:

  • Failover using HAProxy (dbproxy1***)
  • Some services need a reload due to long-running connections or persistent connections. This is documented on: MariaDB/misc

For wikireplicas servers look up at Service_restarts#clouddb_(Wikireplicas)

More info on ways to speed up this at MariaDB and MariaDB/troubleshooting

Memcached

Memcached is used as caching layer for MediaWiki and it is co-hosted with Redis on mcXXXX machines (eqiad and codfw). MediaWiki uses nutcracker (https://github.com/twitter/twemproxy) to abstract the connection to the memcached cluster with one local socket and to avoid "manual" data partitioning.

Restarting the service is very easy but please remember that the cache is only in memory and it is not persisted on disk before restarts. Direct consequences of a restart might be:

A complete restart of the memcached cluster must be coordinated carefully with ops and the performance team to establish a good procedure to avoid performance hits. If you need to stop memcached for a long maintenance (e.g. OS re-install, etc.) please remove the related host from Hiera first (example https://gerrit.wikimedia.org/r/#/c/273430/).

If you want to rapidly check if memcached is working after a restart or an upgrade:

mc1008:~$ echo stats | nc localhost 11211

cmd_set, cmd_get, total_items and current_items should show values greater than zero (increasing over time). This is not exhaustive of course!
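
To narrow the output down to the counters mentioned above:

 echo stats | nc localhost 11211 | grep -E 'cmd_get|cmd_set|items'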

Please remember that memcached on mcXXXX hosts is co-hosted with Redis, read carefully its section on this page if you need to operate on the whole host rather than only memcached:

Redis is running with a special service name to allow its use as multi-instance (several Redis processes on the same node).

sudo service redis-instance-tcp_6379 restart

It is used in various places for different tasks like:

  • Storage of user sessions on mcXXXX hosts (co-hosted with Memcached)
  • Queue for Job tasks on rdbXXXX hosts

Restarting Redis is generally a safe operation since the daemon persists its data to disk before restarting (unlike Memcached). Please note that if you need to perform a complete stop of the service (e.g. OS re-install, etc..) you will need to depool the related host from service first (example https://gerrit.wikimedia.org/r/#/c/273430/). Useful references:

Please note that removing a mcXXXX host from the Redis pool will cause user sessions to be dropped. This is unavoidable since each mcXXXX host holds a partition of the sessions that is not replicated elsewhere (this will not be true once codfw replication is fully working, but hopefully this page will have been updated by then). Please carefully plan a complete cluster maintenance to avoid a massive loss of user sessions in a short time window. Please also inform the Wikitech Ambassadors (https://lists.wikimedia.org/pipermail/wikitech-ambassadors/) and the performance team one day in advance.

Puppet will take time to roll out a change like de-pooling a Redis host from its pool because it won't update all the hosts at once. This means that it usually takes ~30 minutes for all the connections to drain from a host. In this timeframe you will see errors in logstash. Please also make sure that all the client connections drop to zero before operating on the host (rebooting, re-installing the OS, etc.) using commands like:

redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_6379.conf)" client list | wc -l
redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_6379.conf)" monitor

memcached on other services

  • graphite uses memcached to cache queries; it's safe to upgrade.
  • swift frontend servers use memcached to cache lookups for container/account existence and auth tokens. It's safe to upgrade, but the frontend servers should be depooled for the restart.

ncredir

ncredir is working in active/active mode, so to roll-restart you have to depool and repool in sequence. For example:

 confctl select name=ncredir2001.codfw.wmnet set/pooled=no
 ... reboot
 confctl select name=ncredir2001.codfw.wmnet set/pooled=yes

To get the current status:

 # confctl select 'cluster=ncredir' get
{"ncredir2002.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=ncredir,service=nginx"}
{"ncredir2001.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=ncredir,service=nginx"}
{"ncredir1001.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=ncredir,service=nginx"}
{"ncredir1002.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=ncredir,service=nginx"}

You can also use the restart-nginx script provided by profile::lvs::realserver to restart nginx safely:

vgutierrez@ncredir5001:~$ sudo -i restart-nginx 
2022-03-18 11:34:27,945 [INFO] Depooling currently pooled services
2022-03-18 11:34:28,435 [INFO] Waiting 3 seconds before restarting the service...
2022-03-18 11:34:31,439 [INFO] Restarting the service
2022-03-18 11:34:31,522 [INFO] Repooling previously pooled services

netflow

These hosts can be rebooted anytime unless there's currently an incident ongoing.

Netmon

After a reboot, the keyholder needs to be rearmed:

 sudo keyholder arm

(The passphrase is in pwstore in the rancid-key-passphrase file).

ntpd

We run four ntpd servers (chromium, hydrogen, acamar, achenar) and all of these are configured for use by the other servers in the cluster. As such, as long as only one server is restarted/rebooted at a time, everything is fine. The ntpd running locally on the individual servers can easily be restarted at any time.

Oozie

Oozie is a workflow scheduler system to manage Apache Hadoop jobs written in Java. Since it is not a stateless service, please contact the Analytics team before restarting it to pause its Bundles/Coordinators/Workflows to avoid any failure in the Hadoop cluster.

On an-coord1001:

$ sudo systemctl restart oozie

Ores

Standard depool/repool mode for reboots. For service restarts to pick up e.g. a new library, the services uwsgi-ores.service and celery-ores-worker.service need to be restarted.

openldap

We run two openldap installations (the oit mirror and the one for labs). Both use mirror-mode replication; the respective clients are mail servers for the oit mirror and (primarily) labs instances for openldap-labs. The openldap servers (or the slapd process) can be rebooted/restarted one at a time; the clients will transparently try to reconnect to the other host of the respective cluster. The number of connected clients is shown in grafana for openldap-labs.

Parsoid

Parsoid has moved from JS to PHP. This means nowadays a parsoid restart equals a php-fpm restart. Use "restart-php7.2-fpm" via cumin. See Service_restarts#Application_servers_(also_image/video_scalers_and_job_runners)

ping

For each reboot, the routers need to temporarily disable/redirect the ICMP redirection for the instance being rebooted.

Pivot

Pivot is stateless and can be restarted at any time.

Pool counters

The pool counters in the inactive data center can be rebooted right away. For the active data center, they should be removed from and re-added to mediawiki-config one by one (example commit: https://gerrit.wikimedia.org/r/#/c/418870/)

Before rebooting, it can be double-checked with "ss" (on port 7531) that no further MediaWiki servers are connected. Thumbor uses poolcounter1001 independent of the mediawiki-config, but it's acceptable for a reboot to miss it.
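
For example, a quick check for remaining established client connections on the poolcounter port (a sketch; adjust the filter as needed):

 ss -tn state established 'sport = :7531'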

PKI

The pki-root* servers can simply be rebooted at any time.

For pki* servers, the passive servers can simply be rebooted. The active server (the one pointed to by pki.discovery.wmnet) will need updating in DNS first. As this domain is only used internally, you can clear the DNS cache instead of waiting for the TTL to expire (sudo cookbook sre.dns.wipe-cache pki.discovery.wmnet).

Postgres

puppetdb

Clients only talk to the Java-based frontend processes, but Puppet runs will fail during the Postgres update, so follow the approach described for puppetdb reboots.

Prometheus

Prometheus runs in active/active mode, so to roll-restart you have to depool and repool the hosts in sequence. For example:

 confctl select name=<name>.codfw.wmnet set/pooled=no
 ... reboot
 confctl select name=<name>.codfw.wmnet set/pooled=yes

To get the current status:

 # confctl select service=prometheus get
 {"prometheus2003.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=prometheus,service=prometheus"}
 {"prometheus2004.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=codfw,cluster=prometheus,service=prometheus"}
 {"prometheus1003.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=prometheus,service=prometheus"}
 {"prometheus1004.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=prometheus,service=prometheus"}

The Prometheus servers in esams/ulsfo (running on the bastions) are not redundant; when restarting/rebooting, a temporary loss of metrics data is acceptable.

puppetdb

Restarting puppetdb.service or rebooting the instance will cause significant Puppet errors (on roughly 10% of all hosts), so it's best to disable Puppet beforehand and re-enable it after the restart/reboot:

  • Disable Puppet
  • Enable Puppet
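
A hedged sketch using Cumin, following the usual disable-puppet/enable-puppet pattern (the 'A:all' alias and the reason string are assumptions; pick whatever host selection and message fit the maintenance, and use the same reason for both commands):

 sudo cumin 'A:all' "disable-puppet 'puppetdb maintenance'"
 # restart puppetdb.service or reboot the host, then:
 sudo cumin 'A:all' "enable-puppet 'puppetdb maintenance'"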

Puppet masters

For short maintenance tasks (service restarts, reboots), it's best to disable Puppet temporarily, otherwise there'll be a shower of alerts.

For anything which takes a little longer (reimages, hardware maintenance by DC ops etc.), it's best to depool the puppet masters. This requires the following Puppet/DNS changes:

Frontends are found by Puppet clients via DNS resolution, so we need to redirect the (site-specific) CNAMEs to the other puppetmaster frontend (example commit: https://gerrit.wikimedia.org/r/#/c/operations/dns/+/543280/). If the server being taken down currently holds the Puppet CA, we also need to fail that over to the other frontend (example commit: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/540392/)

Puppetboard

The Puppetboard service (puppetboard.wikimedia.org) is a Python application that runs with uWSGI behind Apache2. It is configured as active/active at the traffic layer and is completely stateless, so each of its components can be restarted/rebooted at any time; it is also only used by the SRE team and is not a public service. In case of longer maintenance, one host at a time can be depooled in the varnish director's hiera configuration.
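
For a service-only restart, something like the following should work; the uWSGI unit name uwsgi-puppetboard is an assumption, so check the actual unit name on the host first:

 sudo systemctl restart uwsgi-puppetboard.service apache2.service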

Redis

(Redis is also used on the memcached servers; please see that section for details.)

rdb* servers (hosting the Redis job queue)

Redis is running with a special service name to allow its use as multi-instance (several Redis processes on the same node).

sudo systemctl restart redis-instance-tcp_6378
sudo systemctl restart redis-instance-tcp_6379
sudo systemctl restart redis-instance-tcp_6380
sudo systemctl restart redis-instance-tcp_6381

oresrdb* servers

These servers can be restarted with the following commands. However, restarting will cause an outage, so you should create a ticket and liaise with the engineers responsible for the ORES services first.

sudo systemctl restart redis-instance-tcp_6379
sudo systemctl restart redis-instance-tcp_6380

mwlog* servers

sudo systemctl restart redis-instance-tcp_6379


Restarting Redis is generally a safe operation since the daemon persists its data to disk before restarting.

Putting Redis entirely out of service for a longer period is a little more complex: the rdb hosts have a fallback slave (using the subsequent number), e.g. rdb1001 has rdb1002 as its fallback. The fallback hosts can be rebooted without impact. For the time being (Jul 2019), Redis is used by changeprop and the docker registry. It is safe to reboot rdb* servers at any time, but do give a shout to the service owners.
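
Before taking a master out of service for longer, it is worth confirming that its fallback slave is in sync. A sketch for one instance, reusing the masterauth trick shown earlier and assuming the same /etc/redis/tcp_6379.conf layout:

 redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_6379.conf)" -p 6379 info replication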

relforge

The relforge* cluster is very similar to the elastic* search clusters, but it only consists of two hosts, so rebooting/restarting the master causes a service interruption. The service is only used internally, but #wikimedia-discovery should still be notified. For the restart of relforge, replication should be stopped and both nodes rebooted at the same time:

 sudo systemctl restart elasticsearch_6@relforge-eqiad.service elasticsearch_6@relforge-eqiad-small-alpha.service

restbase

The restbase service should be restarted one host at a time. Before restarting it, depool the host with sudo depool-restbase. Repool it with sudo pool-restbase only once restbase is listening on its external port again:

 ss -tl 'sport = :7231'
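
A minimal end-to-end sketch for one host, assuming the restbase systemd unit is simply named restbase (unit name not verified):

 sudo depool-restbase
 sudo systemctl restart restbase
 # wait until the port check above shows restbase listening again, then:
 sudo pool-restbase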

RPKI

The RPKI servers can be restarted anytime for routine maintenance as long as there is at least one with all its Icinga checks green.

sca servers

The sca servers can be depooled/repooled via conftool (one at a time). They run multiple services, so it is better to use "confctl pool/depool".

scb servers

The scb servers can be depooled/repooled via conftool (one at a time). They run multiple services, so it is better to use "confctl pool/depool".

sudo service pdfrender restart
sudo service apertium-apy restart 
sudo service changeprop restart
sudo service citoid restart
sudo service cpjobqueue restart
sudo service cxserver restart
sudo service eventstreams restart
sudo service graphoid restart
sudo service mobileapps restart
sudo service recommendation_api restart

stat* servers

Some users might have long-running scripts on those servers; in case of a reboot, it's best to send a heads-up mail to analytics@lists.wikimedia.org a day ahead.

sessionstore

These hosts run a Cassandra cluster; see the Cassandra section for details on restarts/reboots.

Swift

Frontend servers (ms-fe*) should be depooled via pybal when making service restarts or reboots. Whether a server has been correctly depooled can be checked by tailing /var/log/swift/proxy-access.log.

Backend servers can simply be rebooted/restarted one at a time (with a 30 second delay in between when restarting); an unresponsive host is automatically handled by the frontend servers.

Backend service restarts
systemctl restart 'swift*'

Frontend service restarts
systemctl restart swift-proxy

As an example, to restart the swift-proxy in codfw:

sudo cumin -b 1 -s 5 'A:codfw and P{O:swift::proxy}' 'depool && sleep 3 && systemctl restart swift-proxy && sleep 3 && pool'

Thanos

Thanos has both a frontend and a backend. For the backends, the same rules as for Swift above apply. Ditto for the frontends, except that there are two services per host: swift-proxy and thanos-query, serving thanos-swift.discovery.wmnet and thanos-query.discovery.wmnet respectively.
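
A hedged sketch for restarting both services on one Thanos frontend, assuming the standard pool/depool scripts and that the query component's unit is simply named thanos-query (not verified):

 sudo depool
 sudo systemctl restart swift-proxy thanos-query
 sudo pool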

Thumbor

Thumbor nodes must be depooled/repooled when making service restarts or reboots. The restart occurs via the "thumbor-instances" service. Use the following to check whether a server has been (de)pooled:

tail -f /srv/log/thumbor/thumbor.404.log

To restart thumbor instances

systemctl restart 'thumbor*'

Volunteer Response Team (VRTS/OTRS)

Service and server restarts should be announced in the IRC channel first (#wikimedia-vrt). Please note that VRTS is also known as OTRS (its old name) or Znuny (the ticket system software).

Service Restart

VRTS runs multiple services:

cron:

sudo systemctl restart cron.service

exim4 (email):

sudo systemctl restart exim4.service

apache2 (webserver):

sudo systemctl restart apache2.service

vrts-daemon (znuny ticket system):

sudo systemctl restart vrts-daemon.service

Server Reboot

To reboot use the sre.hosts.reboot-single cookbook:

sudo cookbook sre.hosts.reboot-single otrs1001.eqiad.wmnet

Wikidata Query Service (WDQS)

WDQS nodes can be restarted one at a time at any time. They must be depooled before and repooled after the restart (see the sketch after the list below).

The following services are part of WDQS:

  • wdqs-blazegraph
  • wdqs-categories
  • wdqs-updater
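
A minimal sketch for one node, assuming the standard pool/depool scripts are present on the wdqs hosts and that the unit names match the service names listed above:

 sudo depool
 sudo systemctl restart wdqs-blazegraph wdqs-categories wdqs-updater
 sudo pool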

Wikidough DoH/DoT Service

Service Restart

Wikidough hosts: doh[123456]00[12]

Wikidough has two components, dnsdist and pdns-recursor. To restart them:

sudo systemctl restart dnsdist.service
sudo systemctl restart pdns-recursor.service

Note that restarting the services clears the in-memory cache for both dnsdist and pdns-recursor.

Server Reboot

As Wikidough is anycasted, the following steps need to be performed to depool a server:

  • Disable Puppet on the host:
sudo disable-puppet "depooling host"
  • Stop the bird service:
sudo systemctl stop bird.service
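
To repool after the maintenance, re-enabling Puppet should be sufficient, since Puppet manages the bird service and will start it again on the next agent run (an assumption; double-check the host afterwards):

 sudo enable-puppet "depooling host"
 sudo run-puppet-agent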

Yubico authentication servers

The authentication servers can be rebooted one at a time. After each reboot the keystore on the YubiHSM needs to be unlocked using

 sudo yhsm-keystore-unlock

Zookeeper

Zookeeper is used by Kafka and Hadoop for configuration management and leader election.

There are multiple clusters in Wikimedia:

  • main eqiad (conf1004-6) - SRE cluster to support Kafka main eqiad and logging eqiad.
  • main codfw (conf2001-3) - SRE cluster to support Kafka main codfw and logging codfw.
  • analytics (an-conf1001-3) - Analytics cluster to support Hadoop.
  • druid-public/analytics (druid1001-3, druid1004-6) - Analytics clusters to support Druid (co-located with the service).

Zookeeper nodes can be restarted one at a time via systemctl restart zookeeper. Once a node is restarted, before proceeding with the next one please verify its status using the following commands:

elukey@conf2001:~$ echo ruok | nc localhost 2181
imok

elukey@conf2001:~$ echo stats | nc localhost 2181
Zookeeper version: 3.4.5--1, built on 05/31/2017 10:10 GMT
Clients:
[...]

Latency min/avg/max: 0/0/947
Received: 91891
Sent: 91894
Connections: 4
Outstanding: 0
Zxid: $someid
Mode: [follower|leader]
Node count: 1473

Please also verify that there is an active leader using Cumin:

elukey@cumin1001:~$ sudo cumin 'conf2*' 'echo stats | nc localhost 2181'

Please note that only executing "ruok" is not enough, since it will only tell you the status of the daemon, not the health of the cluster! This caused Incident documentation/20170831-Zookeeper.

Please also double check https://grafana.wikimedia.org/d/000000261/zookeeper after each restart.

There is also a cumin cookbook:

$ sudo cookbook sre.zookeeper.roll-restart-zookeeper analytics

Zuul

See instructions at mediawiki.org/CI

One-off hosts

sodium

Package installations/upgrades will fail, so this needs to be announced shortly in advance.