Service restarts

This page collects procedures to restart services (or reboot the underlying server) in the WMF production cluster.


Application servers (also image/video scalers and job runners)

When rebooting an application server, it should be depooled before the reboot and repooled afterwards.
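
Depooling is done via conftool. The exact confctl syntax depends on the conftool version in use, so the following is only a rough sketch with a placeholder host name; check the Conftool documentation for the authoritative invocation:

 # Rough sketch only; the confctl syntax and the host name are assumptions.
 confctl select 'name=mw1017.eqiad.wmnet' set/pooled=no
 # ... reboot the host and wait for it to come back ...
 confctl select 'name=mw1017.eqiad.wmnet' set/pooled=yes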

Restarts of HHVM should be spread out a little, e.g. by waiting 30 seconds between each restart:

salt -b1 'mw1*' cmd.run 'service hhvm restart; sleep 30;'

While this is OK, it will take a lot of time, so I suggest running this in parallel on the various clusters:

for cluster in appserver api_appserver imagescaler jobrunner videoscaler; do
    for site in eqiad codfw; do
        salt -b1 -C "G@cluster:${cluster} and G@site:${site}" cmd.run 'service hhvm restart; sleep 30;'
    done
done

Cassandra (as used in aqs and restbase)

Cassandra as used in restbase uses a multi-instance setup, i.e. one host runs multiple Cassandra processes, typically named "a", "b", etc. For each instance there is a corresponding nodetool-NAME binary that can be used, e.g. nodetool-a status -r. The aqs Cassandra cluster doesn't use multi-instance; in that case the tool is simply called nodetool, but the commands are equivalent.

Before restarting an instance it is a good idea to drain it first.

 nodetool-a drain && systemctl restart cassandra-a
 nodetool-b drain && systemctl restart cassandra-b

Before proceeding with the next node, you should check whether the restarted node has correctly rejoined the cluster (the name of the tool is relative to the restarted service instance):

nodetool-a status -r

Directly after the restart the tool might throw an exception ("No nodes are present in the cluster"); this usually sorts itself out within a few seconds. If the node has correctly rejoined the cluster, it should be listed with the "UN" prefix, e.g.:

UN  xenon-a.eqiad.wmnet              224.65 GB  256     ?       0d691414-4132-4854-a00d-1d2671e15728  rack1
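
These steps can be chained into a small shell sketch, shown here only for illustration (the instance letter and the example hostname reuse the names above and need to be adapted):

 # Illustrative sketch only: drain and restart instance "a", then wait until it
 # reports itself as Up/Normal (UN) again before moving on.
 nodetool-a drain && systemctl restart cassandra-a
 until nodetool-a status -r 2>/dev/null | grep -q 'UN.*xenon-a.eqiad.wmnet'; do
     sleep 5
 done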

Exim

The exim service/the mx* hosts can be restarted/rebooted individually without external impact; mail servers trying to deliver mails will simply re-try at a later point if the SMTP service is unavailable:

service exim4 restart
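
As an optional sanity check around the restart, the size of the mail queue can be compared before and after (exim4 -bpc prints the number of queued messages on Debian):

 exim4 -bpc               # queued messages before the restart
 service exim4 restart
 exim4 -bpc               # should return to normal levels shortly afterwards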

Ganeti

Ganeti nodes can be upgraded without impact on the running VMs. To reboot a node, its virtual machines need to be migrated to other hosts first, with the master node needing special attention.
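
A rough sketch with the standard Ganeti command-line tools (the node name is a placeholder):

 # Sketch, assuming standard Ganeti tooling: live-migrate all primary instances
 # off the node before rebooting it (run on the cluster master).
 gnt-node migrate -f ganeti1001.eqiad.wmnet
 # If the node to be rebooted is the master itself, fail the master role over
 # first (run on the master candidate that should take over):
 gnt-cluster master-failover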

Gerrit

The restart should be pre-announced on #wikimedia-operations (maybe 15 minutes ahead) to give people a heads-up:

service gerrit restart

Hadoop workers

Please coordinate with the Analytics team before taking any action; there are multiple dependencies to consider before proceeding. For example, Camus might need to be stopped to prevent data loss/lag in HDFS.

Hadoop's master node (analytics1001.eqiad.wmnet) and its standby replica (analytics1002.eqiad.wmnet) are not configured for automatic failover yet, so extra steps must be taken to avoid an outage: Analytics/Cluster/Hadoop/Administration#Manual Failover
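
For reference, a manual failover is typically performed with hdfs haadmin; the NameNode service IDs below are placeholders, and the authoritative steps are in Analytics/Cluster/Hadoop/Administration#Manual Failover:

 # Sketch only; the NameNode service IDs depend on the cluster configuration.
 sudo -u hdfs hdfs haadmin -getServiceState analytics1001-eqiad-wmnet
 sudo -u hdfs hdfs haadmin -failover analytics1001-eqiad-wmnet analytics1002-eqiad-wmnet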

Three of the Hadoop workers run an additional JournalNode process to ensure that the standby master node is kept in sync with the active one. These are configured in the puppet manifest (https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/analytics/hadoop.pp#L88).

The other Hadoop workers are running two services (hadoop-hdfs-datanode and hadoop-yarn-nodemanager). The services on the Hadoop workers should be restarted in this order:

service hadoop-yarn-nodemanager restart
service hadoop-hdfs-datanode restart

The service restarts have no user-visible impact (and the machines can also be rebooted). It's best to wait a few minutes before proceeding with the next node. When rebooting a JournalNode host, make sure that the other two JournalNode hosts are up and running.
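
Before rebooting a JournalNode host, a quick check on the other two JournalNode hosts can confirm the service is running (assuming the CDH default service name):

 # Assumption: the JournalNode init script follows the CDH naming scheme.
 service hadoop-hdfs-journalnode status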

Kafka brokers

Please do not proceed without taking a look at https://phabricator.wikimedia.org/T125084 first!

One Kafka broker can be restarted/rebooted at a time:

service kafka restart

Before proceeding with the next broker, it needs to be ensured that all replicas are fully replicated. After restarting a broker, a replica election should be performed.
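
A sketch of how to verify this with the stock Kafka command-line tools (the ZooKeeper connection string is a placeholder; WMF's kafka wrapper may provide shorter equivalents):

 # Sketch using the upstream Kafka scripts; ZK_CONNECT is a placeholder.
 kafka-topics.sh --zookeeper ZK_CONNECT --describe --under-replicated-partitions
 # Once the output above is empty, trigger a preferred replica election:
 kafka-preferred-replica-election.sh --zookeeper ZK_CONNECT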

ntpd

We run four ntpd servers (chromium, hydrogen, acamar, achenar) and all of these are configured for use by the other servers in the cluster. As such, as long as only one server is restarted/rebooted at a time, everything is fine. The ntpd running locally on the individual servers can easily be restarted at any time.
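
After a restart, the peer status can be checked with ntpq (the service name is assumed to be ntp, as on Debian):

 service ntp restart      # assumed Debian service name
 ntpq -p                  # verify the upstream servers are reachable again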

Parsoid

When rebooting wtp* hosts, they should be depooled via pybal/conftool (two systems at a time), as sketched in the application servers section above.

stat* servers

Some users might have long-running scripts on those servers; in case of a reboot, it's best to send a heads-up mail to analytics@lists.wikimedia.org a day ahead.
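
To get an idea of what is currently running before a reboot, a quick, informal check like the following can help:

 w                                                     # logged-in users and their current activity
 ps -eo user,etimes,args --sort=-etimes | head -n 20   # longest-running processes first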

MySQL/MariaDB

Long-running queries from terbium maintenance jobs and SPOFs in certain MySQL services (masters, specialized slave roles, etc.) prevent easy restarts (see https://phabricator.wikimedia.org/T119626).

The procedure for a core production slave is (a combined sketch follows the list):

  • Depool from mediawiki
  • Wait for queries to finish
  • Stop replication: mysql -e "STOP SLAVE"
  • Stop the server (/etc/init.d/mysql stop), then reboot
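
Put together, the slave-side steps look roughly like this once the host has been depooled (a sketch of the commands already listed above):

 # Sketch only: run after the slave has been depooled from mediawiki.
 mysql -e "SHOW PROCESSLIST"    # wait until long-running queries have finished
 mysql -e "STOP SLAVE"
 /etc/init.d/mysql stop
 reboot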

For a core production master:

  • Failover to another server first (which is not easy)

For a misc server:

  • Failover using HAProxy (dbproxy1***)

More info on ways to speed this up can be found at MariaDB and MariaDB/troubleshooting.