Portal:Cloud VPS/Admin/Rabbitmq

Many OpenStack services communicate with one another via rabbitmq. For example:

- Nova services relay messages to nova-conductor via rabbitmq, and nova-conductor marshals database reads and writes.
- When an instance is created via nova-api, nova-api passes a rabbitmq message to nova-scheduler, which then schedules a VM (again via a rabbit message) on a nova-compute node.
- nova-scheduler assesses the capacity of compute nodes via rabbitmq messages.
- designate-sink subscribes to rabbitmq notifications in order to detect and respond to VM creation/deletion.
- etc.
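
These message flows show up as queues and consumers on the rabbitmq hosts. As a rough illustration (the filter is only a sketch; queue names vary by deployment and OpenStack version):

 $ sudo rabbitmqctl list_queues name consumers | grep -E 'nova|designate'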

When VM creation is failing, the problem is very often with rabbitmq. Rabbit can typically be restarted with minimal harm, and the restart will prompt all clients to reconnect:

 root@cloudcontrol1003:~# service rabbitmq-server restart
 root@cloudcontrol1004:~# service rabbitmq-server restart
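
After a restart it is worth confirming that clients are reconnecting. A quick sanity check (not an exhaustive test; output format varies by rabbitmq version):

 root@cloudcontrol1003:~# rabbitmqctl cluster_status
 root@cloudcontrol1003:~# rabbitmqctl list_connections peer_host state | sort | uniq -c | sort -rn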

HA setup

For redundancy we use a cluster of two rabbitmq servers in a primary/secondary relationship. Some documentation about how this is set up can be found [| here]. Most of this is puppetized, but a couple of manual steps are needed when standing up a new pair.

On the secondary host (where the primary host is cloudcontrol1003):

 root@cloudcontrol1004:~# rabbitmqctl stop_app
 root@cloudcontrol1004:~# rabbitmqctl join_cluster rabbit@cloudcontrol1003
 root@cloudcontrol1004:~# rabbitmqctl start_app
 root@cloudcontrol1004:~# rabbitmqctl set_policy ha-all '^(?!amq\.).*' '{"ha-mode": "all"}'
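
Once the steps above have run, the cluster membership and the ha-all policy can be double-checked (a quick sanity check, not part of the documented procedure):

 root@cloudcontrol1004:~# rabbitmqctl cluster_status
 root@cloudcontrol1004:~# rabbitmqctl list_policies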

Troubleshooting

Lists

 $ sudo rabbitmqctl list_exchanges
 $ sudo rabbitmqctl list_channels
 $ sudo rabbitmqctl list_connections
 $ sudo rabbitmqctl list_consumers
 $ sudo rabbitmqctl list_queues
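
Most of the list commands accept column names, which keeps the output manageable. For example, to surface the deepest queues (a sketch; assumes queue names contain no whitespace):

 $ sudo rabbitmqctl list_queues name messages consumers | sort -k2 -rn | head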

Logs

 /var/log/rabbitmq/rabbit@<hostname>.log
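
The node name is normally rabbit@ followed by the short hostname, so on the local host the log can usually be followed with (a sketch):

 $ sudo tail -f /var/log/rabbitmq/rabbit@$(hostname -s).log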

Check local server health

 $ sudo rabbitmqctl status
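
The full status output is long; the memory, file descriptor and alarm sections are usually the interesting parts (the grep pattern is just a sketch):

 $ sudo rabbitmqctl status | grep -E -A2 'alarms|file_descriptors|vm_memory'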

Checking cluster health

If cluster_status hangs, check for stuck processes (see Viewing stuck/suspicious processes below).

 $ sudo rabbitmqctl cluster_status
 Cluster status of node rabbit@cloudcontrol1003 ...
 [{nodes,[{disc,[rabbit@cloudcontrol1003,rabbit@cloudcontrol1004]}]},
  {running_nodes,[rabbit@cloudcontrol1004,rabbit@cloudcontrol1003]},
  {cluster_name,<<"rabbit@cloudcontrol1003.wikimedia.org">>},
  {partitions,[]},
  {alarms,[{rabbit@cloudcontrol1004,[]},{rabbit@cloudcontrol1003,[]}]}]
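
In a healthy cluster both nodes are listed under running_nodes and the partitions list is empty. A non-empty partitions entry can be spotted quickly with (a sketch):

 $ sudo rabbitmqctl cluster_status | grep -A1 partitions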

Viewing stuck/suspicious processes

Note: Suspicious processes are not always a problem. However, a large number of suspicious processes that is not decreasing usually indicates a larger issue.

 $ sudo rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'
 2019-07-23 21:34:54 There are 2247 processes.
 2019-07-23 21:34:54 Investigated 0 processes this round, 5000ms to go.
 ...
 2019-07-23 21:34:58 Investigated 0 processes this round, 500ms to go.
 2019-07-23 21:34:59 Found 0 suspicious processes.

Viewing unacknowledged messages

 $ sudo rabbitmqctl list_channels connection messages_unacknowledged
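
To see which connections are holding the most unacknowledged messages, the output can be re-ordered by the count column (a sketch; fields are tab-separated):

 $ sudo rabbitmqctl list_channels connection messages_unacknowledged | awk -F'\t' '{print $2 "\t" $1}' | sort -rn | head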