[BUG] The RabbitMQ deployment - problem with dns #1530

Closed · ar3ndt opened this issue Aug 6, 2020 · 4 comments

ar3ndt (Contributor) commented Aug 6, 2020

Describe the bug
There is an issue with cluster formation: the two RabbitMQ nodes do not join a single cluster (each pod reports only itself under `running_nodes`).

To Reproduce
Steps to reproduce the behavior:

  1. Deploy the cluster with the RabbitMQ application.
  2. Run `rabbitmqctl cluster_status` in each of the two pods, as shown below.
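
The exact commands, formatted for readability:

```sh
kubectl exec -it -n=queue rabbitmq-cluster-0 -- rabbitmqctl cluster_status
kubectl exec -it -n=queue rabbitmq-cluster-1 -- rabbitmqctl cluster_status
```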

Expected behavior
The RabbitMQ deployment is clustered.
The command result shows 2 running nodes.
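
For a quick check, a small loop like the following (a sketch; it only greps the `running_nodes` line of the status output) should report both members on each pod:

```sh
for pod in rabbitmq-cluster-0 rabbitmq-cluster-1; do
  # Each member is expected to list both nodes under running_nodes.
  kubectl exec -n=queue "$pod" -- rabbitmqctl cluster_status | grep running_nodes
done
```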

Additional context
Actual command result:

[ec2-user@ec2-15-236-60-122 ~]$ kubectl exec -it -n=queue rabbitmq-cluster-0 -- rabbitmqctl cluster_status
Cluster status of node rabbit@10.244.3.16 ...
[{nodes,[{disc,['rabbit@10.244.3.16']}]},
 {running_nodes,['rabbit@10.244.3.16']},
 {cluster_name,<<"rabbit@rabbitmq-cluster-0.rabbitmq-cluster.queue.svc.cluster.local">>},
 {partitions,[]},
 {alarms,[{'rabbit@10.244.3.16',[]}]}]
[ec2-user@ec2-15-236-60-122 ~]$ kubectl exec -it -n=queue rabbitmq-cluster-1 -- rabbitmqctl cluster_status
Cluster status of node rabbit@10.244.2.15 ...
[{nodes,[{disc,['rabbit@10.244.1.10','rabbit@10.244.1.14',
                'rabbit@10.244.2.11','rabbit@10.244.2.15',
                'rabbit@10.244.3.12','rabbit@10.244.3.8']}]},
 {running_nodes,['rabbit@10.244.2.15']},
 {cluster_name,<<"rabbit@rabbitmq-cluster-0.rabbitmq-cluster.queue.svc.cluster.local">>},
 {partitions,[]},
 {alarms,[{'rabbit@10.244.2.15',[]}]}]
[ec2-user@ec2-15-236-60-122 ~]$ kubectl get pods -n=queue -o wide
NAME                 READY   STATUS    RESTARTS   AGE     IP            NODE                                                 NOMINATED NODE   READINESS GATES
rabbitmq-cluster-0   1/1     Running   0          4h12m   10.244.3.16   ec2-15-236-95-83.eu-west-3.compute.amazonaws.com     <none>           <none>
rabbitmq-cluster-1   1/1     Running   0          4h13m   10.244.2.15   ec2-15-236-203-152.eu-west-3.compute.amazonaws.com   <none>           <none>
[ec2-user@ec2-15-236-60-122 ~]$ kubectl logs -n=queue rabbitmq-cluster-0

  ##  ##
  ##  ##      RabbitMQ 3.7.10. Copyright (C) 2007-2018 Pivotal Software, Inc.
  ##########  Licensed under the MPL.  See http://www.rabbitmq.com/
  ######  ##
  ##########  Logs: <stdout>

              Starting broker...
2020-06-29 10:00:35.281 [info] <0.211.0>
 Starting RabbitMQ 3.7.10 on Erlang 21.2.3
 Copyright (C) 2007-2018 Pivotal Software, Inc.
 Licensed under the MPL.  See http://www.rabbitmq.com/
2020-06-29 10:00:35.288 [info] <0.211.0>
 node           : rabbit@10.244.3.16
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.conf
 cookie hash    : GRtFqfufy0A8wfweQSYTgA==
 log(s)         : <stdout>
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@10.244.3.16
2020-06-29 10:00:36.824 [info] <0.219.0> Memory high watermark set to 1864 MiB (1955262464 bytes) of 3729 MiB (3910524928 bytes) total
2020-06-29 10:00:36.828 [info] <0.221.0> Enabling free disk space monitoring
2020-06-29 10:00:36.828 [info] <0.221.0> Disk free limit set to 50MB
2020-06-29 10:00:36.832 [info] <0.224.0> Limiting to approx 1048476 file handles (943626 sockets)
2020-06-29 10:00:36.832 [info] <0.225.0> FHC read buffering:  OFF
2020-06-29 10:00:36.832 [info] <0.225.0> FHC write buffering: ON
2020-06-29 10:00:36.833 [info] <0.211.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@10.244.3.16 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2020-06-29 10:00:36.833 [info] <0.211.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2020-06-29 10:00:36.833 [info] <0.211.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2020-06-29 10:00:36.833 [info] <0.211.0> Peer discovery backend does not support locking, falling back to randomized delay
2020-06-29 10:00:36.833 [info] <0.211.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2020-06-29 10:00:36.852 [info] <0.211.0> k8s endpoint listing returned nodes not yet ready: 10.244.2.15, 10.244.3.16
2020-06-29 10:00:36.852 [info] <0.211.0> All discovered existing cluster peers:
2020-06-29 10:00:36.852 [info] <0.211.0> Discovered no peer nodes to cluster with
2020-06-29 10:00:36.854 [info] <0.43.0> Application mnesia exited with reason: stopped
2020-06-29 10:00:36.918 [info] <0.211.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-29 10:00:36.943 [info] <0.211.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-29 10:00:36.968 [info] <0.211.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-29 10:00:36.968 [info] <0.211.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping registration.
2020-06-29 10:00:36.969 [info] <0.211.0> Priority queues enabled, real BQ is rabbit_variable_queue
2020-06-29 10:00:36.971 [info] <0.396.0> Starting rabbit_node_monitor
2020-06-29 10:00:36.994 [info] <0.211.0> message_store upgrades: 1 to apply
2020-06-29 10:00:36.994 [info] <0.211.0> message_store upgrades: Applying rabbit_variable_queue:move_messages_to_vhost_store
2020-06-29 10:00:36.995 [info] <0.211.0> message_store upgrades: No durable queues found. Skipping message store migration
2020-06-29 10:00:36.995 [info] <0.211.0> message_store upgrades: Removing the old message store data
2020-06-29 10:00:36.995 [info] <0.211.0> message_store upgrades: All upgrades applied successfully
2020-06-29 10:00:37.020 [info] <0.211.0> Management plugin: using rates mode 'basic'
2020-06-29 10:00:37.021 [info] <0.211.0> Adding vhost '/'
2020-06-29 10:00:37.032 [info] <0.436.0> Making sure data directory '/var/lib/rabbitmq/mnesia/rabbit@10.244.3.16/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L' for vhost '/' exists
2020-06-29 10:00:37.036 [info] <0.436.0> Starting message stores for vhost '/'
2020-06-29 10:00:37.036 [info] <0.440.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_transient": using rabbit_msg_store_ets_index to provide index
2020-06-29 10:00:37.037 [info] <0.436.0> Started message store of type transient for vhost '/'
2020-06-29 10:00:37.037 [info] <0.443.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": using rabbit_msg_store_ets_index to provide index
2020-06-29 10:00:37.037 [warning] <0.443.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": rebuilding indices from scratch
2020-06-29 10:00:37.038 [info] <0.436.0> Started message store of type persistent for vhost '/'
2020-06-29 10:00:37.039 [info] <0.211.0> Creating user 'guest'
2020-06-29 10:00:37.040 [info] <0.211.0> Setting user tags for user 'guest' to [administrator]
2020-06-29 10:00:37.042 [info] <0.211.0> Setting permissions for 'guest' in '/' to '.*', '.*', '.*'
2020-06-29 10:00:37.044 [warning] <0.467.0> Setting Ranch options together with socket options is deprecated. Please use the new map syntax that allows specifying socket options separately from other options.
2020-06-29 10:00:37.045 [info] <0.481.0> started TCP listener on [::]:5672
2020-06-29 10:00:37.048 [info] <0.211.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@10.244.3.16'
2020-06-29 10:00:37.050 [info] <0.211.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@10.244.3.16'
2020-06-29 10:00:37.051 [info] <0.495.0> Peer discovery: enabling node cleanup (will only log warnings). Check interval: 30 seconds.
2020-06-29 10:00:37.077 [info] <0.545.0> Management plugin: HTTP (non-TLS) listener started on port 15672
2020-06-29 10:00:37.077 [info] <0.651.0> Statistics database started.
2020-06-29 10:00:37.156 [info] <0.8.0> Server startup complete; 5 plugins started.
 * rabbitmq_management
 * rabbitmq_web_dispatch
 * rabbitmq_management_agent
 * rabbitmq_peer_discovery_k8s
 * rabbitmq_peer_discovery_common
 completed with 5 plugins.
[ec2-user@ec2-15-236-60-122 ~]$ kubectl logs -n=queue rabbitmq-cluster-1

  ##  ##
  ##  ##      RabbitMQ 3.7.10. Copyright (C) 2007-2018 Pivotal Software, Inc.
  ##########  Licensed under the MPL.  See http://www.rabbitmq.com/
  ######  ##
  ##########  Logs: <stdout>

              Starting broker...
2020-06-29 09:59:35.287 [info] <0.211.0>
 Starting RabbitMQ 3.7.10 on Erlang 21.2.3
 Copyright (C) 2007-2018 Pivotal Software, Inc.
 Licensed under the MPL.  See http://www.rabbitmq.com/
2020-06-29 09:59:35.292 [info] <0.211.0>
 node           : rabbit@10.244.2.15
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.conf
 cookie hash    : GRtFqfufy0A8wfweQSYTgA==
 log(s)         : <stdout>
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@10.244.2.15
2020-06-29 09:59:36.555 [info] <0.219.0> Memory high watermark set to 1864 MiB (1955258368 bytes) of 3729 MiB (3910516736 bytes) total
2020-06-29 09:59:36.558 [info] <0.221.0> Enabling free disk space monitoring
2020-06-29 09:59:36.559 [info] <0.221.0> Disk free limit set to 50MB
2020-06-29 09:59:36.562 [info] <0.224.0> Limiting to approx 1048476 file handles (943626 sockets)
2020-06-29 09:59:36.562 [info] <0.225.0> FHC read buffering:  OFF
2020-06-29 09:59:36.562 [info] <0.225.0> FHC write buffering: ON
2020-06-29 09:59:36.562 [info] <0.211.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@10.244.2.15 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2020-06-29 09:59:36.562 [info] <0.211.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2020-06-29 09:59:36.563 [info] <0.211.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2020-06-29 09:59:36.563 [info] <0.211.0> Peer discovery backend does not support locking, falling back to randomized delay
2020-06-29 09:59:36.563 [info] <0.211.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2020-06-29 09:59:36.581 [info] <0.211.0> k8s endpoint listing returned nodes not yet ready: 10.244.2.15
2020-06-29 09:59:36.581 [info] <0.211.0> All discovered existing cluster peers: rabbit@10.244.1.14
2020-06-29 09:59:36.581 [info] <0.211.0> Peer nodes we can cluster with: rabbit@10.244.1.14
2020-06-29 09:59:36.588 [info] <0.211.0> Node 'rabbit@10.244.1.14' selected for auto-clustering
2020-06-29 09:59:48.634 [info] <0.211.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-29 09:59:48.833 [info] <0.211.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-29 09:59:48.861 [info] <0.211.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-29 09:59:48.867 [info] <0.211.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@10.244.2.15'
2020-06-29 09:59:48.872 [info] <0.211.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@10.244.2.15'
2020-06-29 09:59:48.873 [info] <0.211.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping registration.
2020-06-29 09:59:48.874 [info] <0.211.0> Priority queues enabled, real BQ is rabbit_variable_queue
2020-06-29 09:59:48.877 [info] <0.424.0> Starting rabbit_node_monitor
2020-06-29 09:59:48.901 [info] <0.211.0> message_store upgrades: 1 to apply
2020-06-29 09:59:48.901 [info] <0.211.0> message_store upgrades: Applying rabbit_variable_queue:move_messages_to_vhost_store
2020-06-29 09:59:48.901 [info] <0.211.0> message_store upgrades: No durable queues found. Skipping message store migration
2020-06-29 09:59:48.901 [info] <0.211.0> message_store upgrades: Removing the old message store data
2020-06-29 09:59:48.902 [info] <0.211.0> message_store upgrades: All upgrades applied successfully
2020-06-29 09:59:48.927 [info] <0.211.0> Management plugin: using rates mode 'basic'
2020-06-29 09:59:48.930 [info] <0.456.0> Making sure data directory '/var/lib/rabbitmq/mnesia/rabbit@10.244.2.15/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L' for vhost '/' exists
2020-06-29 09:59:48.933 [info] <0.456.0> Starting message stores for vhost '/'
2020-06-29 09:59:48.933 [info] <0.460.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_transient": using rabbit_msg_store_ets_index to provide index
2020-06-29 09:59:48.934 [info] <0.456.0> Started message store of type transient for vhost '/'
2020-06-29 09:59:48.934 [info] <0.463.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": using rabbit_msg_store_ets_index to provide index
2020-06-29 09:59:48.935 [warning] <0.463.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": rebuilding indices from scratch
2020-06-29 09:59:48.936 [info] <0.456.0> Started message store of type persistent for vhost '/'
2020-06-29 09:59:48.938 [warning] <0.484.0> Setting Ranch options together with socket options is deprecated. Please use the new map syntax that allows specifying socket options separately from other options.
2020-06-29 09:59:48.939 [info] <0.498.0> started TCP listener on [::]:5672
2020-06-29 09:59:48.940 [info] <0.211.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@10.244.2.15'
2020-06-29 09:59:48.941 [info] <0.211.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@10.244.2.15'
2020-06-29 09:59:48.941 [info] <0.424.0> rabbit on node 'rabbit@10.244.1.14' up
2020-06-29 09:59:48.942 [info] <0.506.0> Peer discovery: enabling node cleanup (will only log warnings). Check interval: 30 seconds.
2020-06-29 09:59:48.973 [info] <0.557.0> Management plugin: HTTP (non-TLS) listener started on port 15672
2020-06-29 09:59:48.973 [info] <0.663.0> Statistics database started.
 completed with 5 plugins.
2020-06-29 09:59:49.058 [info] <0.8.0> Server startup complete; 5 plugins started.
 * rabbitmq_management
 * rabbitmq_web_dispatch
 * rabbitmq_management_agent
 * rabbitmq_peer_discovery_k8s
 * rabbitmq_peer_discovery_common
2020-06-29 10:00:18.948 [info] <0.506.0> k8s endpoint listing returned nodes not yet ready: 10.244.2.15
2020-06-29 10:00:18.948 [warning] <0.506.0> Peer discovery: node rabbit@10.244.1.10 is unreachable
2020-06-29 10:00:18.949 [warning] <0.506.0> Peer discovery: node rabbit@10.244.2.11 is unreachable
2020-06-29 10:00:18.949 [warning] <0.506.0> Peer discovery: node rabbit@10.244.3.12 is unreachable
2020-06-29 10:00:18.949 [warning] <0.506.0> Peer discovery: node rabbit@10.244.3.8 is unreachable
2020-06-29 10:00:21.464 [info] <0.424.0> rabbit on node 'rabbit@10.244.1.14' down
2020-06-29 10:00:21.468 [info] <0.424.0> Keeping rabbit@10.244.1.14 listeners: the node is already back
2020-06-29 10:00:21.493 [info] <0.424.0> node 'rabbit@10.244.1.14' down: connection_closed
2020-06-29 10:00:48.976 [info] <0.506.0> k8s endpoint listing returned nodes not yet ready: 10.244.3.16
2020-06-29 10:00:48.976 [warning] <0.506.0> Peer discovery: node rabbit@10.244.1.10 is unreachable
2020-06-29 10:00:48.977 [warning] <0.506.0> Peer discovery: node rabbit@10.244.1.14 is unreachable
2020-06-29 10:00:48.977 [warning] <0.506.0> Peer discovery: node rabbit@10.244.2.11 is unreachable
2020-06-29 10:00:48.977 [warning] <0.506.0> Peer discovery: node rabbit@10.244.3.12 is unreachable
2020-06-29 10:00:48.977 [warning] <0.506.0> Peer discovery: node rabbit@10.244.3.8 is unreachable
2020-06-29 10:01:18.947 [warning] <0.506.0> Peer discovery: node rabbit@10.244.1.10 is unreachable
2020-06-29 10:01:18.947 [warning] <0.506.0> Peer discovery: node rabbit@10.244.1.14 is unreachable
2020-06-29 10:01:18.947 [warning] <0.506.0> Peer discovery: node rabbit@10.244.2.11 is unreachable
2020-06-29 10:01:18.947 [warning] <0.506.0> Peer discovery: node rabbit@10.244.3.12 is unreachable
2020-06-29 10:01:18.947 [warning] <0.506.0> Peer discovery: node rabbit@10.244.3.8 is unreachable
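
Since the node names above are pod IPs rather than the DNS names that appear in the `cluster_name`, a DNS sanity check can be run from inside a pod (a diagnostic sketch; the service and namespace names are taken from the `cluster_name` above, and `getent` is assumed to be available in the image):

```sh
# Verify that the per-pod records of the headless service resolve from inside a pod.
kubectl exec -n=queue rabbitmq-cluster-0 -- \
  getent hosts rabbitmq-cluster-0.rabbitmq-cluster.queue.svc.cluster.local
kubectl exec -n=queue rabbitmq-cluster-0 -- \
  getent hosts rabbitmq-cluster-1.rabbitmq-cluster.queue.svc.cluster.local
```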
ar3ndt (Contributor, Author) commented Aug 6, 2020

Some analysis is available in the already closed issue:
#1395

przemyslavic (Collaborator) commented:

I can't reproduce the problem on the current develop branch. It's quite possible that updating Kubernetes and the network plugins fixed it.

plirglo (Contributor) commented Oct 16, 2020

Please run the tests to verify whether the Kubernetes upgrade solved this issue, and/or confirm the current status with @przemyslavic.

przemyslavic (Collaborator) commented:

It looks like the problem was noticed in version 0.7 or 0.8. Since then we have updated both Kubernetes and RabbitMQ, and the problem can no longer be reproduced.
