Service restarts

From Wikitech-static
Revision as of 14:18, 3 December 2015 by imported>Muehlenhoff (Add gerrit)

This page collects procedures to restart services (or reboot the underlying server) in the WMF production cluster.

Cassandra (as used in aqs and restbase)

Cassandra as used in restbase uses a multi-instance setup, i.e. one host runs multiple Cassandra processes, typically named "a", "b", etc. For each instance there is a corresponding nodetool-NAME binary, e.g. nodetool-a status -r. The aqs Cassandra cluster does not use multi-instance; there the tool is simply called nodetool, but the commands are equivalent.

Before restarting an instance it is a good idea to drain it first.

 nodetool-a drain && systemctl restart cassandra-a
 nodetool-b drain && systemctl restart cassandra-b

Before proceeding with the next node, check whether the restarted node has correctly rejoined the cluster (use the nodetool binary that matches the restarted instance):

nodetool-a status -r

Directly after the restart the tool might throw an exception "No nodes are present in the cluster". This usually resolves itself within a few seconds. If the node has correctly rejoined the cluster, it is listed with the "UN" (Up/Normal) prefix, e.g.:

UN  xenon-a.eqiad.wmnet              224.65 GB  256     ?       0d691414-4132-4854-a00d-1d2671e15728  rack1
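The rejoin check can also be scripted. The following is a minimal sketch, assuming the status output has the layout shown above (state in the first column, hostname in the second). The function name cassandra_node_is_up is hypothetical and not part of any existing tooling.

```shell
# Hypothetical helper: reads "nodetool status -r" output on stdin and
# succeeds only if the given host is listed in state "UN" (Up/Normal).
cassandra_node_is_up() {
    host="$1"
    awk -v host="$host" '
        $1 == "UN" && $2 == host { found = 1 }
        END { exit !found }
    '
}

# Example usage on a multi-instance host, for instance "a":
#   nodetool-a status -r | cassandra_node_is_up xenon-a.eqiad.wmnet \
#       && echo "node has rejoined"
```

Since the tool may briefly report "No nodes are present in the cluster" right after a restart, a failing check is worth repeating after a few seconds before treating it as a problem.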


Exim

The exim service and the mx* hosts can be restarted or rebooted individually without external impact; mail servers trying to deliver mail will simply retry later if the SMTP service is unavailable:

service exim4 restart


Gerrit

The restart should be pre-announced on #wikimedia-operations (roughly 15 minutes in advance) to give people a heads-up:

service gerrit restart

Hadoop workers

Three of the hadoop workers run an additional JournalNode process. These are configured in the puppet manifest:

  • The other Hadoop workers are running two services (hadoop-hdfs-datanode and hadoop-yarn-nodemanager). The services on the Hadoop workers can be restarted in arbitrary order.
service hadoop-hdfs-datanode restart
service hadoop-yarn-nodemanager restart

The service restarts have no user-visible impact (and the machines can also be rebooted). It's best to wait a few minutes before proceeding with the next node.
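For the workers without a JournalNode process, the node-by-node procedure above could be sketched as a small loop. This is only an illustration: the function name, the RUN dry-run switch, the use of ssh, and the host names in the usage comment are all assumptions, not existing tooling.

```shell
# Hypothetical sketch: restart both worker services on each host in turn,
# pausing between hosts. By default the commands are only printed (dry run);
# set RUN to the empty string to actually execute them over ssh.
rolling_restart_hadoop_workers() {
    pause="$1"   # seconds to wait before moving on to the next host
    shift
    for host in "$@"; do
        ${RUN-echo} ssh "$host" 'service hadoop-hdfs-datanode restart && service hadoop-yarn-nodemanager restart'
        sleep "$pause"
    done
}

# Example usage (placeholder host names, five-minute pause, dry run):
#   rolling_restart_hadoop_workers 300 worker-a.example worker-b.example
```

Defaulting to a dry run is deliberate, so that the list of hosts can be reviewed before anything is restarted.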

  • TODO: Add notes for JournalNode hosts

Kafka brokers

One Kafka broker can be restarted/rebooted at a time:

service kafka restart

Before restarting the next broker, ensure that all replicas are fully replicated. After restarting a broker, a replica election should be performed.
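One way to verify replication is to list under-replicated partitions and proceed only when the list is empty. The sketch below assumes the stock kafka-topics.sh tool with its --describe --under-replicated-partitions options; the function name and the ZooKeeper address in the usage comment are placeholders, and a local wrapper script may differ.

```shell
# Hypothetical helper: reads a listing of under-replicated partitions on
# stdin and succeeds only if it contains no non-blank lines, i.e. every
# partition is fully replicated.
kafka_fully_replicated() {
    ! grep -q '[^[:space:]]'
}

# Example usage (ZOOKEEPER_HOST:2181 is a placeholder):
#   kafka-topics.sh --describe --under-replicated-partitions \
#       --zookeeper ZOOKEEPER_HOST:2181 | kafka_fully_replicated \
#       && echo "safe to restart the next broker"
```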


NTP

We run four ntpd servers (chromium, hydrogen, acamar, achenar), and all of them are configured for use by the other servers in the cluster. As long as only one server is restarted or rebooted at a time, everything is fine. The ntpd running locally on the individual servers can be restarted at any time.