Portal:Cloud VPS/Admin/Rabbitmq
Many OpenStack services communicate with one another via rabbitmq. For example:
- Nova services relay messages to nova-conductor via rabbitmq, and nova-conductor marshals database reads and writes
- When an instance is created via nova-api, nova-api passes a rabbitmq message to nova-scheduler, which then schedules a VM (again via a rabbit message) on a nova-compute node
- nova-scheduler assesses the capacity of compute nodes via rabbitmq messages
- designate-sink subscribes to rabbitmq notifications in order to detect and respond to VM creation/deletion
- etc.
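To see which queues this traffic flows through, you can list them on a cloudcontrol node. The names in the filter below (conductor, scheduler, notifications) are the usual defaults for nova and designate; the exact set depends on the deployment:

$ sudo rabbitmqctl list_queues name messages | grep -E 'conductor|scheduler|notifications'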
When VM creation is failing, the issue is very often with rabbitmq. Typically rabbit can be restarted with minimal harm, which will prompt all clients to reconnect:
root@cloudcontrol1003:~# service rabbitmq-server restart
root@cloudcontrol1004:~# service rabbitmq-server restart
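After the restarts you can confirm that clients are reconnecting by listing connections; the state column should show mostly 'running':

root@cloudcontrol1003:~# rabbitmqctl list_connections user peer_host state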
Operations
- Create a rabbitmq user (for example, for a new openstack service) and grant access privileges:
root@cloudcontrolXXXX:~# rabbitmqctl add_user username password
root@cloudcontrolXXXX:~# rabbitmqctl set_permissions "username" ".*" ".*" ".*"
It should be enough to create the user on one of the rabbit nodes; it will be replicated to the others.
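To verify, list users and the new user's permissions on another node (substitute the actual username):

root@cloudcontrolXXXX:~# rabbitmqctl list_users
root@cloudcontrolXXXX:~# rabbitmqctl list_user_permissions "username"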
HA setup
For redundancy we use a cluster of two rabbitmq servers in a primary/secondary relationship. Some documentation about how this is set up can be found here. Most of this setup is puppetized, but when standing up a new pair a couple of manual steps are needed.
On the secondary host (where the primary host is cloudcontrol1003):
root@cloudcontrol1004:~# rabbitmqctl stop_app
root@cloudcontrol1004:~# rabbitmqctl join_cluster rabbit@cloudcontrol1003
root@cloudcontrol1004:~# rabbitmqctl start_app
root@cloudcontrol1004:~# rabbitmqctl set_policy ha-all '^(?!amq\.).*' '{"ha-mode": "all"}'
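Once joined, both nodes should report the same cluster membership, and the ha-all policy should be visible from either node:

root@cloudcontrol1004:~# rabbitmqctl cluster_status
root@cloudcontrol1004:~# rabbitmqctl list_policies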
Troubleshooting
Lists
$ sudo rabbitmqctl list_exchanges
$ sudo rabbitmqctl list_channels
$ sudo rabbitmqctl list_connections
$ sudo rabbitmqctl list_consumers
$ sudo rabbitmqctl list_queues
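Each list command takes optional column names to narrow the output; for example, to see only queue depth and consumer count per queue:

$ sudo rabbitmqctl list_queues name messages consumers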
Logs
/var/log/rabbitmq/rabbit@<hostname>.log
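To follow the log on the local node (this assumes the node name uses the short hostname, which is the rabbitmq default):

$ sudo tail -f /var/log/rabbitmq/rabbit@$(hostname -s).log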
Check local server health
$ sudo rabbitmqctl status
Checking cluster health
If cluster_status hangs, check for stuck processes (see the next section).
$ sudo rabbitmqctl cluster_status
Cluster status of node rabbit@cloudcontrol1003 ...
[{nodes,[{disc,[rabbit@cloudcontrol1003,rabbit@cloudcontrol1004]}]},
 {running_nodes,[rabbit@cloudcontrol1004,rabbit@cloudcontrol1003]},
 {cluster_name,<<"rabbit@cloudcontrol1003.wikimedia.org">>},
 {partitions,[]},
 {alarms,[{rabbit@cloudcontrol1004,[]},{rabbit@cloudcontrol1003,[]}]}]
Viewing stuck/suspicious processes
Note: Suspicious processes are not always a problem. However, if you find a large number of suspicious processes that are not decreasing, this usually indicates a larger issue.
$ sudo rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'
2019-07-23 21:34:54 There are 2247 processes.
2019-07-23 21:34:54 Investigated 0 processes this round, 5000ms to go.
...
2019-07-23 21:34:58 Investigated 0 processes this round, 500ms to go.
2019-07-23 21:34:59 Found 0 suspicious processes.
Viewing unacknowledged messages
$ sudo rabbitmqctl list_channels connection messages_unacknowledged
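The same counter can also be broken down per queue, which helps identify which consumer is failing to acknowledge messages:

$ sudo rabbitmqctl list_queues name messages_unacknowledged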
Recovering from split-brain partitioned nodes
When cluster members lose connectivity with each other they can become partitioned (split-brain). You can check for partitioned hosts with the following command:
$ sudo rabbitmqctl cluster_status
Cluster status of node rabbit@cloudcontrol1003 ...
[{nodes,[{disc,[rabbit@cloudcontrol1003,rabbit@cloudcontrol1004]}]},
 {running_nodes,[rabbit@cloudcontrol1003]},                           # this line should have both hosts listed
 {cluster_name,<<"rabbit@cloudcontrol1003.wikimedia.org">>},
 {partitions,[{rabbit@cloudcontrol1003,[rabbit@cloudcontrol1004]}]},  # this line should NOT have any hosts listed
 {alarms,[{rabbit@cloudcontrol1003,[]}]}]
When this happens you will typically see log messages like the following in /var/log/rabbitmq/rabbit@<hostname>.log:
=ERROR REPORT==== 2-Dec-2019::00:08:08 ===
Channel error on connection <0.27123.2916> (208.80.154.23:52790 -> 208.80.154.23:5672, vhost: '/', user: 'nova'), channel 1:
operation basic.publish caused a channel exception not_found: "no exchange 'reply_f48d171794a340' in vhost '/'"
To recover the cluster you will need to restart rabbitmq:
sudo systemctl restart rabbitmq-server
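After the restart, cluster_status should once again list both hosts under running_nodes and show an empty partitions list:

$ sudo rabbitmqctl cluster_status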
Failed to consume message - access to vhost refused
When this error happens on any of the openstack components:
Failed to consume message from queue: Connection.open: (541) INTERNAL_ERROR - access to vhost '/' refused for user 'nova': vhost '/' is down: amqp.exceptions.InternalError: Connection.open: (541) INTERNAL_ERROR - access to vhost '/' refused for user 'nova': vhost '/' is down
You can try resetting the node (stopping the app, taking the node out of the cluster and back in, and starting it again) and then restarting the vhost:
sudo rabbitmqctl stop_app
sudo rabbitmqctl reset
sudo rabbitmqctl start_app
sudo rabbitmqctl restart_vhost
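If the recovery worked, the vhost should be listed again and the openstack services should re-register their consumers once they reconnect:

sudo rabbitmqctl list_vhosts
sudo rabbitmqctl list_consumers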