Service restarts
This page collects procedures to restart services (or reboot the underlying server) in the WMF production cluster.
Application servers (also image/video scalers and job runners)
When rebooting an application server, it should be depooled before the reboot. Whether a server has been correctly depooled can be checked by tailing /var/log/apache2/other_vhosts_access.log.
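A depool/repool might look like the following (a minimal sketch; the exact conftool invocation depends on the installed version, mw1234 is just an example hostname, and the commands are run from a host with conftool installed):
sudo confctl select 'name=mw1234.eqiad.wmnet' set/pooled=no
tail -f /var/log/apache2/other_vhosts_access.log   # requests should stop arriving shortly after the depool
sudo confctl select 'name=mw1234.eqiad.wmnet' set/pooled=yes   # repool after the reboot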
Restarts of HHVM should be spread out a little, e.g. by waiting 30 seconds between each restart:
salt -b1 'mw1*' cmd.run 'service hhvm restart; sleep 30;'
While this works, it takes a lot of time, so it is better to run the restarts in parallel on the various clusters:
for cluster in appserver api_appserver imagescaler jobrunner videoscaler; do
  for site in eqiad codfw; do
    salt -b1 -C "G@cluster:${cluster} and G@site:${site}" cmd.run 'service hhvm restart; sleep 30;'
  done
done
aqs
The aqs servers can be depooled/repooled via conftool (one at a time). Before repooling a server, make sure cassandra is resynced via nodetool (see the Cassandra section for details).
Bacula
Before rebooting a storage host or the director make sure no backup run is currently in progress. This can be checked on helium via:
echo "status director" | sudo bconsole
Cache proxies (varnish) (cp)
The cache proxy (cp) servers can be depooled/repooled via conftool.
Cassandra (as used in aqs and restbase)
Cassandra as used in restbase uses a multi-instance setup, i.e. one host runs multiple cassandra processes, typically named "a", "b", etc. For each instance there is a corresponding nodetool-NAME binary that can be used, e.g. nodetool-a status -r. The aqs Cassandra cluster doesn't use multi-instance; in that case the name of the tool is simply nodetool (but the commands are equivalent):
Before restarting an instance it is a good idea to drain it first.
nodetool-a drain && systemctl restart cassandra-a
nodetool-b drain && systemctl restart cassandra-b
Before proceeding with the next node, you should check whether the restarted node has correctly rejoined the cluster (the name of the tool is relative to the restarted service instance):
nodetool-a status -r
(Directly after the restart the tool might throw an exception "No nodes are present in the cluster"; this usually sorts itself out within a few seconds.) If the node has correctly rejoined the cluster, it should be listed with the "UN" prefix, e.g.:
UN xenon-a.eqiad.wmnet 224.65 GB 256 ? 0d691414-4132-4854-a00d-1d2671e15728 rack1
Elasticsearch
The cluster continues to work fine as long as elasticsearch is only restarted on one node at a time (or the host rebooted). The overall cluster state can be queried from the master node. First we need to find the master node by running the following command on an arbitrary elasticsearch host:
curl 'localhost:9200/_cat/master?v'
On the elasticsearch master node the following command returns the overall state of the elasticsearch cluster:
curl -s localhost:9200/_cluster/health?pretty
Initially the "status" field should be "green". After elasticsearch has been stopped/rebooted, "number_of_nodes" will go down by one and the "status" will switch to "yellow". The search cluster will resync, but it might take 1-2 hours to return to "green". Once it has recovered, the next node can be restarted/rebooted. See search cluster administration for more details about elasticsearch administration.
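A minimal sketch to wait for recovery before proceeding to the next node, polling the health endpoint shown above:
until curl -s 'localhost:9200/_cluster/health?pretty' | grep -q '"status" : "green"'; do
  sleep 60   # re-check once a minute until the cluster is green again
done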
Exim
The exim service/the mx* hosts can be restarted/rebooted individually without external impact; mail servers trying to deliver mails will simply re-try at a later point if the SMTP service is unavailable:
service exim4 restart
Ganeti
Ganeti nodes can be upgraded without impact on the running VMs. To reboot a node, the virtual machines running on it need to be migrated to other nodes first; the master node needs special attention.
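A rough sketch with the standard Ganeti command line tools (ganeti1001.eqiad.wmnet is just an example node name; run the commands on the current master and see the Ganeti page for the authoritative procedure):
sudo gnt-node migrate -f ganeti1001.eqiad.wmnet   # live-migrate primary instances to their secondaries
sudo gnt-node list ganeti1001.eqiad.wmnet         # verify the node no longer has primary instances
# If the node to be rebooted is the current master, promote another node first:
sudo gnt-cluster master-failover                  # run on the node that should become the new master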
Gerrit
The restart should be pre-announced on #wikimedia-operations (maybe 15 minutes in advance) to give people a heads-up:
service gerrit restart
Hadoop workers
Please coordinate with the Analytics team before taking any action, there are multiple dependencies to consider before proceeding. For example, Camus might need to be stopped to prevent data loss/lag in hdfs.
Hadoop's master node (analytics1001.eqiad.wmnet) and its standby replica (analytics1002.eqiad.wmnet) are not configured for automatic failover yet, so extra steps must be taken to avoid an outage: Analytics/Cluster/Hadoop/Administration#Manual Failover
Three of the Hadoop workers run an additional JournalNode process to ensure that the standby master node is kept in sync with the active one. These are configured in the puppet manifest. When rebooting a JournalNode host, make sure that the other two JournalNode hosts are up and running.
service hadoop-hdfs-journalnode restart
The other Hadoop workers are running two services (hadoop-hdfs-datanode and hadoop-yarn-nodemanager). The services on the Hadoop workers should be restarted in this order:
service hadoop-yarn-nodemanager restart
service hadoop-hdfs-datanode restart
The service restarts have no user-visible impact (and the machines can also be rebooted). It's best to wait a few minutes before proceeding with the next node.
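To check that a restarted worker has rejoined the cluster before moving on, something like the following can be used (a sketch; users and wrapper scripts may differ on the Analytics cluster):
sudo -u hdfs hdfs dfsadmin -report   # the DataNode should be listed among the live nodes again
yarn node -list                      # the NodeManager should show up in the RUNNING state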
Kafka brokers
Please do not proceed without taking a look at https://phabricator.wikimedia.org/T125084 first!
Several consumers might get upset by metadata changes due to broker restarts, please make sure that the Analytics team is alerted beforehand:
- EventLogging
- Graphite#statsv (running on hafnium)
One Kafka broker can be restarted/rebooted at a time:
service kafka restart
Make sure that all partitions are fully replicated before moving on to the next broker. After restarting a broker, a replica election should be performed (see the sketch below).
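A sketch of those checks using the stock Kafka tools (WMF brokers typically ship a kafka wrapper script instead; ZOOKEEPER_HOST is a placeholder):
# No output means all partitions are fully replicated again:
kafka-topics.sh --zookeeper ZOOKEEPER_HOST:2181 --describe --under-replicated-partitions
# Then move partition leadership back to the preferred replicas:
kafka-preferred-replica-election.sh --zookeeper ZOOKEEPER_HOST:2181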
kafka100[12] and kafka200[12] are not Analytics brokers but they are part of EventBus, so you will need to follow EventBus/Administration
LVS
The LVS servers are configured in primary/backup pairs (configured on the routers and visible in puppet in modules/lvs/manifests/configuration.pp). To redirect the traffic from a primary to the backup, pybal can be stopped (traffic is then being redirected to the backup).
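For example (a sketch; confirm which host is the primary for the affected service class before doing this):
sudo service pybal stop    # on the primary; the routers redirect traffic to the backup LVS
# ... perform the reboot/maintenance ...
sudo service pybal start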
maps
The maps servers can be depooled/repooled via conftool (one at a time). Before repooling a server, make sure cassandra is resynced via nodetool (see the Cassandra section for details). Restarts of postgres on the master should be avoided while the download of the OSM data is in progress.
MySQL/MariaDB
Long-running queries from maintenance scripts on terbium and single points of failure in certain mysql services (masters, specialized slave roles, etc.) prevent easy restarts.
The procedure is, for a core production slave:
- Depool from mediawiki
- Wait for queries to finish
- Stop replication
mysql -e "STOP SLAVE"
- Stop the server,
/etc/init.d/mysql stop
then reboot
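A minimal sketch of the checks for the "wait for queries" and "stop replication" steps above, run as root on the slave:
mysql -e "SHOW PROCESSLIST"                     # wait until no long-running queries remain
mysql -e "SHOW SLAVE STATUS\G" | grep Running   # after STOP SLAVE, Slave_IO_Running and Slave_SQL_Running should both be No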
For a core production master:
- Failover to another server first (which is not easy)
For a misc server:
- Failover using HAProxy (dbproxy1***)
More info on ways to speed this up can be found at MariaDB and MariaDB/troubleshooting.
Memcached
Memcached is used as caching layer for MediaWiki and it is co-hosted with Redis on mcXXXX machines (eqiad and soon codfw). MediaWiki uses nutcracker (https://github.com/twitter/twemproxy) to abstract the connection to the memcached cluster with one local socket and to avoid "manual" data partitioning.
Restarting the service is very easy but please remember that the cache is only in memory and it is not persisted on disk before restarts. Direct consequences of a restart might be:
- errors logged in https://logstash.wikimedia.org/#/dashboard/elasticsearch/memcached
- errors logged in https://logstash.wikimedia.org/#/dashboard/elasticsearch/hhvm
- memory usage variations in https://ganglia.wikimedia.org/latest/?c=Memcached%20eqiad
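To verify that a restarted memcached instance is serving again, a quick check like the following can be used (a sketch; assumes the default memcached port 11211 and a netcat that supports -q):
echo stats | nc -q 1 localhost 11211 | head    # STAT uptime should be small right after a restart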
A complete restart of the memcached cluster must be coordinated carefully with ops and the performance team to establish a good procedure and avoid performance hits. If you need to stop memcached for a long maintenance (e.g. an OS re-install) please remove the related host from Hiera first (example: https://gerrit.wikimedia.org/r/#/c/273430/).
Please remember that memcached on the mcXXXX hosts is co-hosted with Redis; read its section on this page carefully if you need to operate on the whole host rather than only on memcached.
ntpd
We run four ntpd servers (chromium, hydrogen, acamar, achenar) and all of these are configured for use by the other servers in the cluster. As such, as long as only one server is restarted/rebooted at a time, everything is fine. The ntpd running locally on the individual servers can easily be restarted at any time.
ocg
The ocg servers can be depooled/repooled via conftool (one at a time). Note https://phabricator.wikimedia.org/T120077 , though.
openldap
We run two openldap installations (the oit mirror and the labs one), both using mirrormode replication. The openldap servers (or the slapd process) can be rebooted/restarted one at a time; the respective clients (mail servers for the oit mirror and (primarily) labs instances for openldap-labs) will transparently try to reconnect to the other host of the respective cluster. The number of connected clients is shown in grafana for openldap-labs.
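A quick check that slapd answers again after a restart (a sketch; assumes ldap-utils is installed and anonymous base searches are permitted):
ldapsearch -x -H ldap://localhost -s base -b '' namingContexts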
Parsoid
When rebooting one of the wtp* hosts, they should be depooled via pybal/conftool (two systems at a time). Whether a server has been correctly depooled can be checked by tailing /var/log/parsoid/parsoid.log.
Redis
Redis is running with a special service name to allow its use as multi-instance (several Redis processes on the same node).
sudo service redis-instance-tcp_6379 restart
It is used in various places for different tasks like:
- Storage of user sessions on mcXXXX hosts (co-hosted with Memcached)
- Queue for Job tasks on rdbXXXX hosts
Restarting Redis is generally a safe operation since the daemon persists its data to disk before restarting (unlike Memcached). Please note that if you need to perform a complete stop of the service (e.g. OS re-install, etc..) you will need to depool the related host from service first (example https://gerrit.wikimedia.org/r/#/c/273430/). Useful references:
- https://phabricator.wikimedia.org/T123675 (Reinstall redis servers (Job queues) with Jessie)
- https://phabricator.wikimedia.org/T123711 (Reinstall eqiad memcache servers with Jessie)
Please note that removing a mcXXXX host from the Redis pool will cause user sessions to be dropped. This is unavoidable since each mcXXXX host holds a partition of the sessions that is not replicated elsewhere (this will no longer be true once codfw replication is fully working, but hopefully this page will have been updated by then). Please carefully plan a complete cluster maintenance to avoid a massive loss of user sessions in a short time window. Please also inform the Wikitech Ambassadors (https://lists.wikimedia.org/pipermail/wikitech-ambassadors/) and the performance team one day in advance.
Puppet will take time to roll out a change like de-pooling a Redis host from its pool because it won't update all the hosts at once. This means that it usually takes ~30 minutes for all the connections to drain from a host. In this timeframe you will see errors in logstash. Please also make sure that all the client connections drop to zero before operating on the host (rebooting, re-installing the OS, etc.) using commands like:
redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_6379.conf)" client list | wc -l
redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_6379.conf)" monitor
sca servers
The sca servers can be depooled/repooled via conftool (one at a time). They run multiple services, so it is better to use "confctl --find".
scb servers
The scb servers can be depooled/repooled via conftool (one at a time). They run multiple services, so it is better to use "confctl --find" (see the sketch below).
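For example (a sketch; scb1001.eqiad.wmnet is just an example hostname and the exact flags depend on the installed conftool version):
confctl --find --action get scb1001.eqiad.wmnet             # list every pool/service the host is part of
confctl --find --action set/pooled=no scb1001.eqiad.wmnet   # depool it from all of them
confctl --find --action set/pooled=yes scb1001.eqiad.wmnet  # repool after the reboot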
stat* servers
Some users might have long-running scripts on those servers; in case of a reboot, it's best to send a heads-up mail to analytics@lists.wikimedia.org a day ahead.
Swift
Frontend servers (ms-fe*) should be depooled via pybal when making service restarts or reboots. Backend servers can simply be rebooted/restarted one at a time; an unresponsive host is automatically handled by the frontend servers.
The Swift proxy on the frontend servers should be restarted with
swift-init all restart