Portal:Cloud VPS/Admin/Rabbitmq

Many OpenStack services communicate with one another via rabbitmq. For example:

- Nova services relay messages to nova-conductor via rabbitmq, and nova-conductor marshals database reads and writes
- When an instance is created via nova-api, nova-api passes a rabbitmq message to nova-scheduler, which then schedules a VM (again via a rabbit message) on a nova-compute node
- nova-scheduler assesses capacity of compute nodes via rabbitmq messages
- designate-sink subscribes to rabbitmq notifications in order to detect and respond to VM creation/deletion
- etc.

When VM creation is failing, very often the issue is with rabbitmq. Typically rabbit can be restarted with minimal harm, which will prompt all clients to reconnect. Restart each rabbit service one at a time and wait several minutes for it to stabilize before restarting the next.

root@cloudrabbit1001:~# systemctl restart rabbitmq-server
root@cloudrabbit1002:~# systemctl restart rabbitmq-server
root@cloudrabbit1003:~# systemctl restart rabbitmq-server
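
A quick way to confirm that a node has stabilized and rejoined before restarting the next one is to check the cluster status and make sure queue listings respond (a minimal sanity check using standard rabbitmqctl commands):

root@cloudrabbit1001:~# rabbitmqctl cluster_status                    # 'Running Nodes' should list all three cloudrabbit nodes again
root@cloudrabbit1001:~# rabbitmqctl list_queues name messages | head  # should return promptly, not hang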

Operations

  • Create a rabbitmq user (for example, for a new openstack service) and grant access privileges.
root@cloudrabbitXXXX:~# rabbitmqctl add_user username password
root@cloudrabbitXXXX:~# rabbitmqctl set_permissions "username" ".*" ".*" ".*"

It should be enough to create the user on one of the rabbit nodes; it will be replicated to the others.
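
To double-check that the new user and its permissions have replicated, you can list them on another node (a quick sanity check; "username" is whatever name was created above):

root@cloudrabbitXXXX:~# rabbitmqctl list_users
root@cloudrabbitXXXX:~# rabbitmqctl list_permissions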

HA setup

For redundancy we use a cluster of rabbitmq servers (currently cloudrabbit1001-1003) in a primary/secondary relationship. Some documentation about how this is set up can be found at https://docs.openstack.org/ha-guide/control-plane-stateful.html#messaging-service-for-high-availability. Most of the pieces of this are puppetized, but when adding a new node to the cluster a couple of manual steps are needed.

On the secondary host (where the primary host is cloudrabbit1001):

 root@cloudrabbit1002:~# rabbitmqctl stop_app
 root@cloudrabbit1002:~# rabbitmqctl join_cluster rabbit@cloudrabbit1001
 root@cloudrabbit1002:~# rabbitmqctl start_app
 root@cloudrabbit1002:~# rabbitmqctl set_policy ha-all '^(?!amq\.).*' '{"ha-mode": "all"}'
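
After the join it is worth confirming, from either node, that the cluster and the HA policy look right (a minimal check with standard rabbitmqctl commands):

 root@cloudrabbit1002:~# rabbitmqctl cluster_status    # all nodes should appear under running_nodes
 root@cloudrabbit1002:~# rabbitmqctl list_policies     # should show the ha-all policy set above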

Resetting the HA setup

Several times we have run into issues with the HA setup (e.g. T320232) that we could only fix by resetting the cluster completely, using the following procedure:

  • You'll want a shell on all three rabbit nodes, for starters: cloudrabbit100[123].wikimedia.org
  • On all nodes, stop puppet from messing with us by running sudo disable-puppet
  • sudo rabbitmqctl cluster_status on cloudrabbit1001 should claim that all three nodes are up (the 'Running Nodes' section is the interesting bit)
  • First on 1003, then on 1002, run sudo rabbitmqctl stop_app and then sudo rabbitmqctl reset, then confirm that 1001 agrees that the node is no longer part of the cluster by running sudo rabbitmqctl cluster_status on 1001.
  • Resetting the last node (1001) sometimes just works and sometimes is weird: it might complain with something like "I can't reset when nothing is running", or it might work (it is not clear why)
  • On 1001, you can now run sudo rabbitmqctl start_app, then run sudo enable-puppet and sudo run-puppet-agent. You should see a bunch of puppet output about creating Rabbit users
  • On 1002, then on 1003, run sudo rabbitmqctl join_cluster rabbit@cloudrabbit1001, then sudo rabbitmqctl start_app, then sudo rabbitmqctl set_policy ha-all '^(?!amq\.).*' '{"ha-mode": "all"}'
  • On 1002, then on 1003, run sudo enable-puppet and sudo run-puppet-agent
  • Check that all 3 nodes are part of the cluster by running sudo rabbitmqctl cluster_status on 1001 (see the verification sketch after this list)
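
As a final verification after the reset (a sketch of optional checks, run on any node), you can confirm that the cluster is whole again and that the puppet runs recreated the expected users and the HA policy:

 $ sudo rabbitmqctl cluster_status    # all three cloudrabbit nodes under running_nodes, empty partitions list
 $ sudo rabbitmqctl list_users        # users recreated by the puppet runs should be present
 $ sudo rabbitmqctl list_policies     # the ha-all policy set in the previous steps should be listed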

Troubleshooting

Lists

 $ sudo rabbitmqctl list_exchanges
 $ sudo rabbitmqctl list_channels
 $ sudo rabbitmqctl list_connections
 $ sudo rabbitmqctl list_consumers
 $ sudo rabbitmqctl list_queues
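
The list commands accept column names, which is handy for spotting queues that are backing up; for example (a convenience one-liner, and the sort assumes queue names without spaces):

 $ sudo rabbitmqctl list_queues name messages consumers
 $ sudo rabbitmqctl list_queues name messages | sort -n -k2 | tail    # deepest queues last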

Logs

 /var/log/rabbitmq/rabbit@<hostname>.log
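
When digging through the log, a plain grep such as the following can surface recent errors and partition events (nothing rabbitmq-specific; it assumes the node's short hostname matches the log file name):

 $ sudo grep -iE 'error|partition' /var/log/rabbitmq/rabbit@$(hostname -s).log | tail -n 50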

Check local server health

 $ sudo rabbitmqctl status

Checking cluster health

If cluster_status hangs, check for stuck processes.

 $ sudo rabbitmqctl cluster_status
 Cluster status of node rabbit@cloudcontrol1003 ...
 [{nodes,[{disc,[rabbit@cloudcontrol1003,rabbit@cloudcontrol1004]}]},
  {running_nodes,[rabbit@cloudcontrol1004,rabbit@cloudcontrol1003]},
  {cluster_name,<<"rabbit@cloudcontrol1003.wikimedia.org">>},
  {partitions,[]},
  {alarms,[{rabbit@cloudcontrol1004,[]},{rabbit@cloudcontrol1003,[]}]}]

Viewing stuck/suspicious processes

Note: Suspicious processes are not always a problem. However, if you find a large number of suspicious processes that are not decreasing, this usually indicates a larger issue.

 $ sudo rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'
 2019-07-23 21:34:54 There are 2247 processes.
 2019-07-23 21:34:54 Investigated 0 processes this round, 5000ms to go.
 ...
 2019-07-23 21:34:58 Investigated 0 processes this round, 500ms to go.
 2019-07-23 21:34:59 Found 0 suspicious processes.

Viewing unacknowledged messages

 $ sudo rabbitmqctl list_channels connection messages_unacknowledged
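
To narrow that down to only the channels that actually have unacknowledged messages, a simple awk filter on the last column works (a convenience one-liner, not a rabbitmqctl feature):

 $ sudo rabbitmqctl list_channels connection messages_unacknowledged | awk '$NF > 0'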

Recovering from split-brain partitioned nodes

When cluster members lose connectivity with each other they can become partitioned (split-brain). You can check for partitioned hosts with the following command:

$ sudo rabbitmqctl cluster_status
Cluster status of node rabbit@cloudcontrol1003 ...
[{nodes,[{disc,[rabbit@cloudcontrol1003,rabbit@cloudcontrol1004]}]},
 {running_nodes,[rabbit@cloudcontrol1003]},                          # this line should have both hosts listed
 {cluster_name,<<"rabbit@cloudcontrol1003.wikimedia.org">>},
 {partitions,[{rabbit@cloudcontrol1003,[rabbit@cloudcontrol1004]}]}, # this line should NOT have any hosts listed
 {alarms,[{rabbit@cloudcontrol1003,[]}]}]

When this happens you will typically see log messages like the following in /var/log/rabbitmq/rabbit@<hostname>.log:

=ERROR REPORT==== 2-Dec-2019::00:08:08 ===
Channel error on connection <0.27123.2916> (208.80.154.23:52790 -> 208.80.154.23:5672, vhost: '/', user: 'nova'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_f48d171794a340' in vhost '/'"

To recover the cluster you will need to restart rabbitmq:

sudo systemctl restart rabbitmq-server
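
After the restart, confirm that the partition has healed: the partitions list should be empty again and all hosts should be back under running_nodes.

sudo rabbitmqctl cluster_status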


Failed to consume message - access to vhost refused

When this error happens on any of the openstack components:

 Failed to consume message from queue: Connection.open: (541) INTERNAL_ERROR - access to vhost '/' refused for user 'nova': vhost '/' is down: amqp.exceptions.InternalError: Connection.open: (541) INTERNAL_ERROR - access to vhost '/' refused for user 'nova': vhost '/' is down

You can try restarting the app, taking the node out of and back into the cluster, and restarting the vhost:

 sudo rabbitmqctl stop_app
 sudo rabbitmqctl reset
 sudo rabbitmqctl start_app
 sudo rabbitmqctl restart_vhost
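
Afterwards, a quick check that the vhost is back and that queues are being served again (standard rabbitmqctl listings):

 sudo rabbitmqctl list_vhosts
 sudo rabbitmqctl list_queues -p / name messages | head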