Portal:Cloud VPS/Admin/Rabbitmq

Revision as of 12:47, 13 May 2022 by imported>Nskaggs

Many OpenStack services communicate with one another via rabbitmq. For example:

- Nova services relay messages to nova-conductor via rabbitmq, and nova-conductor marshals database reads and writes
- When an instance is created via nova-api, nova-api passes a rabbitmq message to nova-scheduler, which then schedules a VM (again via a rabbit message) on a nova-compute node
- nova-scheduler assesses capacity of compute nodes via rabbitmq messages
- designate-sink subscribes to rabbitmq notifications in order to detect and respond to VM creation/deletion
- etc.

When VM creation is failing, very often the issue is with rabbitmq. Typically rabbit can be restarted with minimal harm, which will prompt all clients to reconnect. Restart each rabbit service one at a time and wait several minutes for it to stabilize before restarting the next.

root@cloudcontrol1003:~# systemctl restart rabbitmq-server
root@cloudcontrol1004:~# systemctl restart rabbitmq-server
root@cloudcontrol1005:~# systemctl restart rabbitmq-server
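The one-at-a-time restart described above can be sketched as a small loop. This is only an illustration: running the commands over ssh, the HOSTS list, the SETTLE_SECONDS wait, and the DRY_RUN switch are all assumptions and not part of the documented procedure. By default the sketch only prints what it would do.

```shell
# Rolling restart sketch (hypothetical helper, not an official tool).
# HOSTS, SETTLE_SECONDS and DRY_RUN are assumed names; adjust to the real cluster.
HOSTS="${HOSTS:-cloudcontrol1003 cloudcontrol1004 cloudcontrol1005}"
SETTLE_SECONDS="${SETTLE_SECONDS:-300}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

rolling_restart() {
    local host
    for host in $HOSTS; do
        # Restart one node, then give clients time to reconnect
        # before touching the next node.
        run ssh "$host" sudo systemctl restart rabbitmq-server
        run sleep "$SETTLE_SECONDS"
    done
}

# Usage: rolling_restart            (dry run, prints the plan)
#        DRY_RUN=0 rolling_restart  (actually restarts each node)
```

A fixed sleep is the simplest way to "wait several minutes"; checking the cluster status between restarts would be a stricter alternative.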

Operations

  • Create a rabbitmq user (for example, for a new OpenStack service) and grant it access privileges:
root@cloudcontrolXXXX:~# rabbitmqctl add_user username password
root@cloudcontrolXXXX:~# rabbitmqctl set_permissions "username" ".*" ".*" ".*"

Creating the user on one of the rabbit nodes should be enough; it will be replicated to the others.
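To avoid an error from adding a user that already exists, or to confirm from another node that a user is present, the output of rabbitmqctl list_users can be checked first. The helper below is a minimal sketch; the function name user_exists is hypothetical, and it assumes the user name is the first tab-separated column of each output line.

```shell
# Hypothetical helper: succeeds if the given user appears in
# `rabbitmqctl list_users` output supplied on stdin.
# Assumes the user name is the first tab-separated field on each line;
# informational lines without a matching first field are ignored.
user_exists() {
    awk -F'\t' -v u="$1" '$1 == u { found = 1 } END { exit !found }'
}

# Usage:
#   sudo rabbitmqctl list_users | user_exists nova && echo "nova is present"
```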

HA setup

For redundancy we use a cluster of two rabbitmq servers in a primary/secondary relationship. Some documentation about how this is set up can be found here. Most of the pieces of this are puppetized, but when standing up a new pair a couple of manual steps are needed.

On the secondary host (where the primary host is cloudcontrol1003):

 root@cloudcontrol1004:~# rabbitmqctl stop_app
 root@cloudcontrol1004:~# rabbitmqctl join_cluster rabbit@cloudcontrol1003
 root@cloudcontrol1004:~# rabbitmqctl start_app
 root@cloudcontrol1004:~# rabbitmqctl set_policy ha-all '^(?!amq\.).*' '{"ha-mode": "all"}'

Troubleshooting

Lists

 $ sudo rabbitmqctl list_exchanges
 $ sudo rabbitmqctl list_channels
 $ sudo rabbitmqctl list_connections
 $ sudo rabbitmqctl list_consumers
 $ sudo rabbitmqctl list_queues
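When looking through queues, one pattern worth spotting is a queue that holds messages but has no consumers. The filter below is a sketch under stated assumptions: the function name is hypothetical, and it expects tab-separated name, messages, and consumers columns, as produced by asking list_queues for those three items.

```shell
# Hypothetical filter: print queues that hold messages but have no consumers.
# stdin: tab-separated rows of "name<TAB>messages<TAB>consumers".
# Lines whose second column is not a number (banners, headers) are skipped.
queues_without_consumers() {
    awk -F'\t' '$2 ~ /^[0-9]+$/ && $3 == "0" && $2 + 0 > 0 { print $1 "\t" $2 }'
}

# Usage:
#   sudo rabbitmqctl list_queues name messages consumers | queues_without_consumers
```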

Logs

 /var/log/rabbitmq/rabbit@<hostname>.log

Check local server health

 $ sudo rabbitmqctl status

Checking cluster health

If cluster_status hangs, check for stuck processes.

 $ sudo rabbitmqctl cluster_status
 Cluster status of node rabbit@cloudcontrol1003 ...
 [{nodes,[{disc,[rabbit@cloudcontrol1003,rabbit@cloudcontrol1004]}]},
  {running_nodes,[rabbit@cloudcontrol1004,rabbit@cloudcontrol1003]},
  {cluster_name,<<"rabbit@cloudcontrol1003.wikimedia.org">>},
  {partitions,[]},
  {alarms,[{rabbit@cloudcontrol1004,[]},{rabbit@cloudcontrol1003,[]}]}]
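For scripted checks, the partitions entry in this output can be tested directly. The sketch below assumes the Erlang-term output format shown above (other rabbitmq versions may print cluster status differently), and the function name is hypothetical.

```shell
# Hypothetical helper: reads `rabbitmqctl cluster_status` output on stdin.
# Returns 0 if partitions are reported, 1 if the partitions list is empty,
# and 2 if no partitions entry is found at all (unknown format or error).
cluster_is_partitioned() {
    local status
    status=$(cat)
    if ! printf '%s\n' "$status" | grep -qF '{partitions,'; then
        return 2
    fi
    # An empty list means healthy, so the result of the match is negated.
    ! printf '%s\n' "$status" | grep -qF '{partitions,[]}'
}

# Usage:
#   if sudo rabbitmqctl cluster_status | cluster_is_partitioned; then
#       echo "partition detected"
#   fi
```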

Viewing stuck/suspicious processes

Note: Suspicious processes are not always a problem. However, a large number of suspicious processes that does not decrease over time usually indicates a larger issue.

 $ sudo rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'
 2019-07-23 21:34:54 There are 2247 processes.
 2019-07-23 21:34:54 Investigated 0 processes this round, 5000ms to go.
 ...
 2019-07-23 21:34:58 Investigated 0 processes this round, 500ms to go.
 2019-07-23 21:34:59 Found 0 suspicious processes.
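Because the trend matters more than a single reading, it can help to pull just the final count out of this output and compare it across runs. A minimal sketch, assuming the "Found N suspicious processes." line format shown above; the function name is hypothetical.

```shell
# Hypothetical helper: extract the number from the final
# "Found N suspicious processes." line of maybe_stuck output on stdin.
# Prints nothing if no such line is present.
suspicious_count() {
    sed -n 's/.*Found \([0-9][0-9]*\) suspicious processes.*/\1/p' | tail -n 1
}

# Usage:
#   sudo rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().' | suspicious_count
```

Running this a few minutes apart and comparing the numbers gives a rough view of whether the count is decreasing.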

Viewing unacknowledged messages

 $ sudo rabbitmqctl list_channels connection messages_unacknowledged
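On a busy cluster most channels report zero unacknowledged messages, so filtering and sorting the output makes the outliers easier to see. This is a sketch that assumes tab-separated output with the connection first and the count second, matching the column order requested above; the function name and the threshold argument are hypothetical.

```shell
# Hypothetical filter: show channels with at least $1 unacknowledged
# messages (default 1), highest count first.
# stdin: tab-separated rows of "connection<TAB>messages_unacknowledged".
channels_with_unacked() {
    awk -F'\t' -v min="${1:-1}" \
        '$2 ~ /^[0-9]+$/ && $2 + 0 >= min + 0 { print $2 "\t" $1 }' | sort -rn
}

# Usage:
#   sudo rabbitmqctl list_channels connection messages_unacknowledged | channels_with_unacked 10
```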

Recovering from split-brain partitioned nodes

When cluster members lose connectivity with each other they can become partitioned (split-brain). You can check for partitioned hosts with the following command:

$ sudo rabbitmqctl cluster_status
Cluster status of node rabbit@cloudcontrol1003 ...
[{nodes,[{disc,[rabbit@cloudcontrol1003,rabbit@cloudcontrol1004]}]},
 {running_nodes,[rabbit@cloudcontrol1003]},                          # this line should have both hosts listed
 {cluster_name,<<"rabbit@cloudcontrol1003.wikimedia.org">>},
 {partitions,[{rabbit@cloudcontrol1003,[rabbit@cloudcontrol1004]}]}, # this line should NOT have any hosts listed
 {alarms,[{rabbit@cloudcontrol1003,[]}]}]

When this happens you will typically see messages like the following in /var/log/rabbitmq/rabbit@<hostname>.log:

=ERROR REPORT==== 2-Dec-2019::00:08:08 ===
Channel error on connection <0.27123.2916> (208.80.154.23:52790 -> 208.80.154.23:5672, vhost: '/', user: 'nova'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_f48d171794a340' in vhost '/'"
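To get a rough sense of how often this error is occurring, the matching lines in the log can be counted. A minimal sketch, assuming the message wording shown in the sample above; the function name is hypothetical.

```shell
# Hypothetical helper: count "channel exception not_found" lines in a
# rabbitmq log file. $1 is the path to the log file.
count_missing_exchange_errors() {
    grep -c "caused a channel exception not_found" "$1"
}

# Usage:
#   count_missing_exchange_errors "/var/log/rabbitmq/rabbit@$(hostname).log"
```

Note that grep -c prints 0 and returns a non-zero exit status when nothing matches.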

To recover the cluster, restart rabbitmq:

sudo systemctl restart rabbitmq-server
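After the restart it can take a while for the nodes to rejoin. The sketch below polls a status command until the partitions list is empty or a retry limit is reached. It assumes the Erlang-term cluster_status output shown earlier on this page; the function name, attempt count, and delay are hypothetical.

```shell
# Hypothetical helper: poll until cluster status reports no partitions.
# $1: maximum attempts, $2: seconds between attempts,
# remaining arguments: the command that prints cluster status.
wait_for_healthy_cluster() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@" 2>/dev/null | grep -qF '{partitions,[]}'; then
            echo "no partitions reported after $i attempt(s)"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "still partitioned or unreachable after $attempts attempt(s)" >&2
    return 1
}

# Usage:
#   wait_for_healthy_cluster 30 10 sudo rabbitmqctl cluster_status
```

An empty partitions list alone does not prove that every node is running, so the running_nodes line is still worth checking by eye.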


Failed to consume message - access to vhost refused

When this error happens on any of the openstack components:

 Failed to consume message from queue: Connection.open: (541) INTERNAL_ERROR - access to vhost '/' refused for user 'nova': vhost '/' is down: amqp.exceptions.InternalError: Connection.open: (541) INTERNAL_ERROR - access to vhost '/' refused for user 'nova': vhost '/' is down

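You can try stopping the app, resetting the node (which takes it out of the cluster), starting the app again, and restarting the vhost. Note that rabbitmqctl reset also wipes the node's local state, so the node may need to rejoin the cluster as described under HA setup: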

 sudo rabbitmqctl stop_app
 sudo rabbitmqctl reset
 sudo rabbitmqctl start_app
 sudo rabbitmqctl restart_vhost