
[BUG] The RabbitMQ deployment - clustering is not working properly #1395

Closed
przemyslavic opened this issue Jun 29, 2020 · 8 comments

przemyslavic commented Jun 29, 2020

Describe the bug
There is an issue with cluster formation.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy the cluster with the RabbitMQ application
  2. Run kubectl exec -it -n=queue rabbitmq-cluster-0 -- rabbitmqctl cluster_status and kubectl exec -it -n=queue rabbitmq-cluster-1 -- rabbitmqctl cluster_status

Expected behavior
The RabbitMQ deployment is clustered.
The command result shows 2 running nodes.

Additional context
Actual command result:

[ec2-user@ec2-15-236-60-122 ~]$ kubectl exec -it -n=queue rabbitmq-cluster-0 -- rabbitmqctl cluster_status
Cluster status of node rabbit@10.244.3.16 ...
[{nodes,[{disc,['rabbit@10.244.3.16']}]},
 {running_nodes,['rabbit@10.244.3.16']},
 {cluster_name,<<"rabbit@rabbitmq-cluster-0.rabbitmq-cluster.queue.svc.cluster.local">>},
 {partitions,[]},
 {alarms,[{'rabbit@10.244.3.16',[]}]}]
[ec2-user@ec2-15-236-60-122 ~]$ kubectl exec -it -n=queue rabbitmq-cluster-1 -- rabbitmqctl cluster_status
Cluster status of node rabbit@10.244.2.15 ...
[{nodes,[{disc,['rabbit@10.244.1.10','rabbit@10.244.1.14',
                'rabbit@10.244.2.11','rabbit@10.244.2.15',
                'rabbit@10.244.3.12','rabbit@10.244.3.8']}]},
 {running_nodes,['rabbit@10.244.2.15']},
 {cluster_name,<<"rabbit@rabbitmq-cluster-0.rabbitmq-cluster.queue.svc.cluster.local">>},
 {partitions,[]},
 {alarms,[{'rabbit@10.244.2.15',[]}]}]
[ec2-user@ec2-15-236-60-122 ~]$ kubectl get pods -n=queue -o wide
NAME                 READY   STATUS    RESTARTS   AGE     IP            NODE                                                 NOMINATED NODE   READINESS GATES
rabbitmq-cluster-0   1/1     Running   0          4h12m   10.244.3.16   ec2-15-236-95-83.eu-west-3.compute.amazonaws.com     <none>           <none>
rabbitmq-cluster-1   1/1     Running   0          4h13m   10.244.2.15   ec2-15-236-203-152.eu-west-3.compute.amazonaws.com   <none>           <none>
[ec2-user@ec2-15-236-60-122 ~]$ kubectl logs -n=queue rabbitmq-cluster-0

  ##  ##
  ##  ##      RabbitMQ 3.7.10. Copyright (C) 2007-2018 Pivotal Software, Inc.
  ##########  Licensed under the MPL.  See http://www.rabbitmq.com/
  ######  ##
  ##########  Logs: <stdout>

              Starting broker...
2020-06-29 10:00:35.281 [info] <0.211.0>
 Starting RabbitMQ 3.7.10 on Erlang 21.2.3
 Copyright (C) 2007-2018 Pivotal Software, Inc.
 Licensed under the MPL.  See http://www.rabbitmq.com/
2020-06-29 10:00:35.288 [info] <0.211.0>
 node           : rabbit@10.244.3.16
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.conf
 cookie hash    : GRtFqfufy0A8wfweQSYTgA==
 log(s)         : <stdout>
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@10.244.3.16
2020-06-29 10:00:36.824 [info] <0.219.0> Memory high watermark set to 1864 MiB (1955262464 bytes) of 3729 MiB (3910524928 bytes) total
2020-06-29 10:00:36.828 [info] <0.221.0> Enabling free disk space monitoring
2020-06-29 10:00:36.828 [info] <0.221.0> Disk free limit set to 50MB
2020-06-29 10:00:36.832 [info] <0.224.0> Limiting to approx 1048476 file handles (943626 sockets)
2020-06-29 10:00:36.832 [info] <0.225.0> FHC read buffering:  OFF
2020-06-29 10:00:36.832 [info] <0.225.0> FHC write buffering: ON
2020-06-29 10:00:36.833 [info] <0.211.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@10.244.3.16 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2020-06-29 10:00:36.833 [info] <0.211.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2020-06-29 10:00:36.833 [info] <0.211.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2020-06-29 10:00:36.833 [info] <0.211.0> Peer discovery backend does not support locking, falling back to randomized delay
2020-06-29 10:00:36.833 [info] <0.211.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2020-06-29 10:00:36.852 [info] <0.211.0> k8s endpoint listing returned nodes not yet ready: 10.244.2.15, 10.244.3.16
2020-06-29 10:00:36.852 [info] <0.211.0> All discovered existing cluster peers:
2020-06-29 10:00:36.852 [info] <0.211.0> Discovered no peer nodes to cluster with
2020-06-29 10:00:36.854 [info] <0.43.0> Application mnesia exited with reason: stopped
2020-06-29 10:00:36.918 [info] <0.211.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-29 10:00:36.943 [info] <0.211.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-29 10:00:36.968 [info] <0.211.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-29 10:00:36.968 [info] <0.211.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping registration.
2020-06-29 10:00:36.969 [info] <0.211.0> Priority queues enabled, real BQ is rabbit_variable_queue
2020-06-29 10:00:36.971 [info] <0.396.0> Starting rabbit_node_monitor
2020-06-29 10:00:36.994 [info] <0.211.0> message_store upgrades: 1 to apply
2020-06-29 10:00:36.994 [info] <0.211.0> message_store upgrades: Applying rabbit_variable_queue:move_messages_to_vhost_store
2020-06-29 10:00:36.995 [info] <0.211.0> message_store upgrades: No durable queues found. Skipping message store migration
2020-06-29 10:00:36.995 [info] <0.211.0> message_store upgrades: Removing the old message store data
2020-06-29 10:00:36.995 [info] <0.211.0> message_store upgrades: All upgrades applied successfully
2020-06-29 10:00:37.020 [info] <0.211.0> Management plugin: using rates mode 'basic'
2020-06-29 10:00:37.021 [info] <0.211.0> Adding vhost '/'
2020-06-29 10:00:37.032 [info] <0.436.0> Making sure data directory '/var/lib/rabbitmq/mnesia/rabbit@10.244.3.16/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L' for vhost '/' exists
2020-06-29 10:00:37.036 [info] <0.436.0> Starting message stores for vhost '/'
2020-06-29 10:00:37.036 [info] <0.440.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_transient": using rabbit_msg_store_ets_index to provide index
2020-06-29 10:00:37.037 [info] <0.436.0> Started message store of type transient for vhost '/'
2020-06-29 10:00:37.037 [info] <0.443.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": using rabbit_msg_store_ets_index to provide index
2020-06-29 10:00:37.037 [warning] <0.443.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": rebuilding indices from scratch
2020-06-29 10:00:37.038 [info] <0.436.0> Started message store of type persistent for vhost '/'
2020-06-29 10:00:37.039 [info] <0.211.0> Creating user 'guest'
2020-06-29 10:00:37.040 [info] <0.211.0> Setting user tags for user 'guest' to [administrator]
2020-06-29 10:00:37.042 [info] <0.211.0> Setting permissions for 'guest' in '/' to '.*', '.*', '.*'
2020-06-29 10:00:37.044 [warning] <0.467.0> Setting Ranch options together with socket options is deprecated. Please use the new map syntax that allows specifying socket options separately from other options.
2020-06-29 10:00:37.045 [info] <0.481.0> started TCP listener on [::]:5672
2020-06-29 10:00:37.048 [info] <0.211.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@10.244.3.16'
2020-06-29 10:00:37.050 [info] <0.211.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@10.244.3.16'
2020-06-29 10:00:37.051 [info] <0.495.0> Peer discovery: enabling node cleanup (will only log warnings). Check interval: 30 seconds.
2020-06-29 10:00:37.077 [info] <0.545.0> Management plugin: HTTP (non-TLS) listener started on port 15672
2020-06-29 10:00:37.077 [info] <0.651.0> Statistics database started.
2020-06-29 10:00:37.156 [info] <0.8.0> Server startup complete; 5 plugins started.
 * rabbitmq_management
 * rabbitmq_web_dispatch
 * rabbitmq_management_agent
 * rabbitmq_peer_discovery_k8s
 * rabbitmq_peer_discovery_common
 completed with 5 plugins.
[ec2-user@ec2-15-236-60-122 ~]$ kubectl logs -n=queue rabbitmq-cluster-1

  ##  ##
  ##  ##      RabbitMQ 3.7.10. Copyright (C) 2007-2018 Pivotal Software, Inc.
  ##########  Licensed under the MPL.  See http://www.rabbitmq.com/
  ######  ##
  ##########  Logs: <stdout>

              Starting broker...
2020-06-29 09:59:35.287 [info] <0.211.0>
 Starting RabbitMQ 3.7.10 on Erlang 21.2.3
 Copyright (C) 2007-2018 Pivotal Software, Inc.
 Licensed under the MPL.  See http://www.rabbitmq.com/
2020-06-29 09:59:35.292 [info] <0.211.0>
 node           : rabbit@10.244.2.15
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.conf
 cookie hash    : GRtFqfufy0A8wfweQSYTgA==
 log(s)         : <stdout>
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@10.244.2.15
2020-06-29 09:59:36.555 [info] <0.219.0> Memory high watermark set to 1864 MiB (1955258368 bytes) of 3729 MiB (3910516736 bytes) total
2020-06-29 09:59:36.558 [info] <0.221.0> Enabling free disk space monitoring
2020-06-29 09:59:36.559 [info] <0.221.0> Disk free limit set to 50MB
2020-06-29 09:59:36.562 [info] <0.224.0> Limiting to approx 1048476 file handles (943626 sockets)
2020-06-29 09:59:36.562 [info] <0.225.0> FHC read buffering:  OFF
2020-06-29 09:59:36.562 [info] <0.225.0> FHC write buffering: ON
2020-06-29 09:59:36.562 [info] <0.211.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@10.244.2.15 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2020-06-29 09:59:36.562 [info] <0.211.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2020-06-29 09:59:36.563 [info] <0.211.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2020-06-29 09:59:36.563 [info] <0.211.0> Peer discovery backend does not support locking, falling back to randomized delay
2020-06-29 09:59:36.563 [info] <0.211.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2020-06-29 09:59:36.581 [info] <0.211.0> k8s endpoint listing returned nodes not yet ready: 10.244.2.15
2020-06-29 09:59:36.581 [info] <0.211.0> All discovered existing cluster peers: rabbit@10.244.1.14
2020-06-29 09:59:36.581 [info] <0.211.0> Peer nodes we can cluster with: rabbit@10.244.1.14
2020-06-29 09:59:36.588 [info] <0.211.0> Node 'rabbit@10.244.1.14' selected for auto-clustering
2020-06-29 09:59:48.634 [info] <0.211.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-29 09:59:48.833 [info] <0.211.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-29 09:59:48.861 [info] <0.211.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-29 09:59:48.867 [info] <0.211.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@10.244.2.15'
2020-06-29 09:59:48.872 [info] <0.211.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@10.244.2.15'
2020-06-29 09:59:48.873 [info] <0.211.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping registration.
2020-06-29 09:59:48.874 [info] <0.211.0> Priority queues enabled, real BQ is rabbit_variable_queue
2020-06-29 09:59:48.877 [info] <0.424.0> Starting rabbit_node_monitor
2020-06-29 09:59:48.901 [info] <0.211.0> message_store upgrades: 1 to apply
2020-06-29 09:59:48.901 [info] <0.211.0> message_store upgrades: Applying rabbit_variable_queue:move_messages_to_vhost_store
2020-06-29 09:59:48.901 [info] <0.211.0> message_store upgrades: No durable queues found. Skipping message store migration
2020-06-29 09:59:48.901 [info] <0.211.0> message_store upgrades: Removing the old message store data
2020-06-29 09:59:48.902 [info] <0.211.0> message_store upgrades: All upgrades applied successfully
2020-06-29 09:59:48.927 [info] <0.211.0> Management plugin: using rates mode 'basic'
2020-06-29 09:59:48.930 [info] <0.456.0> Making sure data directory '/var/lib/rabbitmq/mnesia/rabbit@10.244.2.15/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L' for vhost '/' exists
2020-06-29 09:59:48.933 [info] <0.456.0> Starting message stores for vhost '/'
2020-06-29 09:59:48.933 [info] <0.460.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_transient": using rabbit_msg_store_ets_index to provide index
2020-06-29 09:59:48.934 [info] <0.456.0> Started message store of type transient for vhost '/'
2020-06-29 09:59:48.934 [info] <0.463.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": using rabbit_msg_store_ets_index to provide index
2020-06-29 09:59:48.935 [warning] <0.463.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": rebuilding indices from scratch
2020-06-29 09:59:48.936 [info] <0.456.0> Started message store of type persistent for vhost '/'
2020-06-29 09:59:48.938 [warning] <0.484.0> Setting Ranch options together with socket options is deprecated. Please use the new map syntax that allows specifying socket options separately from other options.
2020-06-29 09:59:48.939 [info] <0.498.0> started TCP listener on [::]:5672
2020-06-29 09:59:48.940 [info] <0.211.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@10.244.2.15'
2020-06-29 09:59:48.941 [info] <0.211.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@10.244.2.15'
2020-06-29 09:59:48.941 [info] <0.424.0> rabbit on node 'rabbit@10.244.1.14' up
2020-06-29 09:59:48.942 [info] <0.506.0> Peer discovery: enabling node cleanup (will only log warnings). Check interval: 30 seconds.
2020-06-29 09:59:48.973 [info] <0.557.0> Management plugin: HTTP (non-TLS) listener started on port 15672
2020-06-29 09:59:48.973 [info] <0.663.0> Statistics database started.
 completed with 5 plugins.
2020-06-29 09:59:49.058 [info] <0.8.0> Server startup complete; 5 plugins started.
 * rabbitmq_management
 * rabbitmq_web_dispatch
 * rabbitmq_management_agent
 * rabbitmq_peer_discovery_k8s
 * rabbitmq_peer_discovery_common
2020-06-29 10:00:18.948 [info] <0.506.0> k8s endpoint listing returned nodes not yet ready: 10.244.2.15
2020-06-29 10:00:18.948 [warning] <0.506.0> Peer discovery: node rabbit@10.244.1.10 is unreachable
2020-06-29 10:00:18.949 [warning] <0.506.0> Peer discovery: node rabbit@10.244.2.11 is unreachable
2020-06-29 10:00:18.949 [warning] <0.506.0> Peer discovery: node rabbit@10.244.3.12 is unreachable
2020-06-29 10:00:18.949 [warning] <0.506.0> Peer discovery: node rabbit@10.244.3.8 is unreachable
2020-06-29 10:00:21.464 [info] <0.424.0> rabbit on node 'rabbit@10.244.1.14' down
2020-06-29 10:00:21.468 [info] <0.424.0> Keeping rabbit@10.244.1.14 listeners: the node is already back
2020-06-29 10:00:21.493 [info] <0.424.0> node 'rabbit@10.244.1.14' down: connection_closed
2020-06-29 10:00:48.976 [info] <0.506.0> k8s endpoint listing returned nodes not yet ready: 10.244.3.16
2020-06-29 10:00:48.976 [warning] <0.506.0> Peer discovery: node rabbit@10.244.1.10 is unreachable
2020-06-29 10:00:48.977 [warning] <0.506.0> Peer discovery: node rabbit@10.244.1.14 is unreachable
2020-06-29 10:00:48.977 [warning] <0.506.0> Peer discovery: node rabbit@10.244.2.11 is unreachable
2020-06-29 10:00:48.977 [warning] <0.506.0> Peer discovery: node rabbit@10.244.3.12 is unreachable
2020-06-29 10:00:48.977 [warning] <0.506.0> Peer discovery: node rabbit@10.244.3.8 is unreachable
2020-06-29 10:01:18.947 [warning] <0.506.0> Peer discovery: node rabbit@10.244.1.10 is unreachable
2020-06-29 10:01:18.947 [warning] <0.506.0> Peer discovery: node rabbit@10.244.1.14 is unreachable
2020-06-29 10:01:18.947 [warning] <0.506.0> Peer discovery: node rabbit@10.244.2.11 is unreachable
2020-06-29 10:01:18.947 [warning] <0.506.0> Peer discovery: node rabbit@10.244.3.12 is unreachable
2020-06-29 10:01:18.947 [warning] <0.506.0> Peer discovery: node rabbit@10.244.3.8 is unreachable
@przemyslavic przemyslavic changed the title [BUG] The RabbitMQ deployment is not working properly after upgrading Kubernetes to 1.17.7 [BUG] The RabbitMQ deployment - clustering is not working properly Jun 29, 2020

przemyslavic commented Jun 29, 2020

Performed more testing.
Changed the config map setting cluster_formation.k8s.address_type from ip to hostname, but it didn't help.
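For reference, a minimal sketch of the settings involved (the config file path is taken from the pod logs above; the key values, in particular hostname_suffix, are assumptions inferred from the cluster name and may not match what the ConfigMap actually renders). With address_type = hostname the node names rabbit@<hostname><suffix> must resolve via DNS, which is why a headless service is usually required:

# Inspect the rabbitmq.conf actually mounted into the pods (path from the logs above):
kubectl -n queue exec rabbitmq-cluster-0 -- cat /etc/rabbitmq/rabbitmq.conf
#
# Expected shape of the relevant keys for the 'hostname' variant
# (hostname_suffix is inferred from the cluster name above, not confirmed):
#   cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
#   cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
#   cluster_formation.k8s.address_type = hostname
#   cluster_formation.k8s.hostname_suffix = .rabbitmq-cluster.queue.svc.cluster.local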

[ec2-user@ec2-15-236-180-239 ~]$ kubectl exec -it -n=queue rabbitmq-cluster-0 -- rabbitmqctl cluster_status
Cluster status of node rabbit@10.244.3.2 ...
[{nodes,[{disc,['rabbit@10.244.3.2']}]},
 {running_nodes,['rabbit@10.244.3.2']},
 {cluster_name,<<"rabbit@rabbitmq-cluster-0.rabbitmq-cluster.queue.svc.cluster.local">>},
 {partitions,[]},
 {alarms,[{'rabbit@10.244.3.2',[]}]}]
[ec2-user@ec2-15-236-180-239 ~]$ kubectl exec -it -n=queue rabbitmq-cluster-1 -- rabbitmqctl cluster_status
Cluster status of node rabbit@10.244.2.3 ...
[{nodes,[{disc,['rabbit@10.244.2.3']}]},
 {running_nodes,['rabbit@10.244.2.3']},
 {cluster_name,<<"rabbit@rabbitmq-cluster-1.rabbitmq-cluster.queue.svc.cluster.local">>},
 {partitions,[]},
 {alarms,[{'rabbit@10.244.2.3',[]}]}]

Logs from the second pod:

2020-06-29 18:31:03.723 [info] <0.211.0> All discovered existing cluster peers: rabbit@rabbitmq-cluster-0
2020-06-29 18:31:03.723 [info] <0.211.0> Peer nodes we can cluster with: rabbit@rabbitmq-cluster-0
2020-06-29 18:31:03.737 [warning] <0.211.0> Could not auto-cluster with node rabbit@rabbitmq-cluster-0: {badrpc,nodedown}
2020-06-29 18:31:03.737 [warning] <0.211.0> Could not successfully contact any node of: rabbit@rabbitmq-cluster-0 (as in Erlang distribution). Starting as a blank standalone node...
2020-06-29 18:31:03.739 [info] <0.43.0> Application mnesia exited with reason: stopped

Could not successfully contact any node of: rabbit@rabbitmq-cluster-0 (as in Erlang distribution). Starting as a blank standalone node...
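Two checks from the failing node could narrow this down (a sketch: the FQDN follows the StatefulSet/service naming visible in the logs, and net_adm:ping/1 is plain Erlang that returns pong when distribution works):

# Does the peer's DNS name resolve from inside the failing pod?
kubectl -n queue exec rabbitmq-cluster-1 -- getent hosts rabbitmq-cluster-0.rabbitmq-cluster.queue.svc.cluster.local

# Can this node reach the peer over Erlang distribution (epmd on 4369, dist on 25672)?
kubectl -n queue exec rabbitmq-cluster-1 -- rabbitmqctl eval "net_adm:ping('rabbit@rabbitmq-cluster-0')."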

rafzei commented Jul 9, 2020

I got a similar error during an upgrade from 0.4.4 to 0.7.

2020-07-09 11:56:27.864 [info] <0.221.0> Enabling free disk space monitoring
2020-07-09 11:56:27.864 [info] <0.221.0> Disk free limit set to 50MB
2020-07-09 11:56:27.867 [info] <0.224.0> Limiting to approx 1048476 file handles (943626 sockets)
2020-07-09 11:56:27.867 [info] <0.225.0> FHC read buffering:  OFF
2020-07-09 11:56:27.867 [info] <0.225.0> FHC write buffering: ON
2020-07-09 11:56:27.868 [info] <0.211.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@10.244.2.24 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2020-07-09 11:56:27.868 [info] <0.211.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2020-07-09 11:56:27.868 [info] <0.211.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2020-07-09 11:56:27.868 [info] <0.211.0> Peer discovery backend does not support locking, falling back to randomized delay
2020-07-09 11:56:27.869 [info] <0.211.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2020-07-09 11:56:35.871 [info] <0.211.0> Failed to get nodes from k8s - {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},
                 {inet,[inet],nxdomain}]}
2020-07-09 11:56:35.872 [error] <0.210.0> CRASH REPORT Process <0.210.0> with 0 neighbours exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n                 {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164 in application_master:init/4 line 138
2020-07-09 11:56:35.872 [info] <0.43.0> Application rabbit exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n                 {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,\"{failed_connect,[{to_address,{\\"kubernetes.default.svc.cluster.local\\",443}},\n                 {inet,[inet],nxdomain}]}\"}},[{rabbit_mnesia,init_from_config,0,[{file,\"src/rabbit_mnesia.erl\"},{line,164}]},{rabbit_mnesia,init_with_lock,3,[{file,\"src/rabbit_mnesia.erl\"},{line,144}]},{rabbit_mnesia,init,0,[{file,\"src/rabbit_mnesia.erl\"},{line,111}]},{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,run_step,2,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,'-run_boot_steps/1-lc$^0/1-0-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit_boot_steps,run_boot_steps,1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit,start,2,[{file,\"src/rabbit.erl\"},{line,815}]}]}}}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,"{failed_connect,[{to_address,{\"kubernetes.defau

ar3ndt commented Jul 14, 2020

Not a real fix for this issue, but along the way we decided to bump the RabbitMQ version from 3.7.10 to the latest stable one, 3.8.3.
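One way to apply such a bump manually on a running cluster (a sketch only; the StatefulSet and container names are assumptions, and in the actual fix the version is changed in the deployment configuration):

kubectl -n queue set image statefulset/rabbitmq-cluster rabbitmq=rabbitmq:3.8.3
kubectl -n queue rollout status statefulset/rabbitmq-cluster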

jetalone85 commented

Additionally, there is an option to add a headless service, like: https://github.com/helm/charts/blob/master/stable/rabbitmq-ha/templates/service-discovery.yaml and an initContainer, like: https://github.com/helm/charts/blob/master/stable/rabbitmq-ha/templates/statefulset.yaml#L68:L116
I tested this service discovery approach previously and it worked well with image 3.8.3.
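Roughly what such a headless discovery service looks like (a sketch modelled on the linked chart template; the name, labels and selector are assumptions). publishNotReadyAddresses matters here because the logs above show peers being filtered out as "not yet ready":

kubectl apply -n queue -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: rabbitmq-cluster-discovery
  labels:
    app: rabbitmq-cluster
spec:
  clusterIP: None                  # headless: a stable DNS record per pod
  publishNotReadyAddresses: true   # expose pods before they pass readiness
  selector:
    app: rabbitmq-cluster
  ports:
    - name: epmd
      port: 4369
    - name: amqp
      port: 5672
    - name: clustering
      port: 25672
EOF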

ar3ndt commented Jul 16, 2020

The headless service has been tested and it does not fix the issue.
As a workaround, I suggest adding additional RabbitMQ pod restarts to the upgrade procedure.
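For example (a sketch; assumes kubectl >= 1.15 for rollout restart, and the StatefulSet name seen in the outputs above):

kubectl -n queue rollout restart statefulset/rabbitmq-cluster
kubectl -n queue rollout status statefulset/rabbitmq-cluster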

@mkyc mkyc modified the milestones: 0.7.1, S20200729 Jul 17, 2020
@przemyslavic przemyslavic self-assigned this Jul 22, 2020
przemyslavic commented

@ar3ndt There are still issues with the RabbitMQ deployment (noticed on the AWS/RedHat/flannel environment).

[ec2-user@ec2-xx-xx-xx-xx ~]$ kubectl get pods -n=queue
NAME                 READY   STATUS             RESTARTS   AGE
rabbitmq-cluster-0   0/1     CrashLoopBackOff   13         50m
[ec2-user@ec2-xx-xx-xx-xx ~]$ kubectl logs -n=queue rabbitmq-cluster-0

  ##  ##
  ##  ##      RabbitMQ 3.7.10. Copyright (C) 2007-2018 Pivotal Software, Inc.
  ##########  Licensed under the MPL.  See http://www.rabbitmq.com/
  ######  ##
  ##########  Logs: <stdout>

              Starting broker...
2020-07-22 11:30:56.324 [info] <0.211.0>
 Starting RabbitMQ 3.7.10 on Erlang 21.2.3
 Copyright (C) 2007-2018 Pivotal Software, Inc.
 Licensed under the MPL.  See http://www.rabbitmq.com/
2020-07-22 11:30:56.329 [info] <0.211.0>
 node           : rabbit@10.244.3.2
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.conf
 cookie hash    : 1kc1w/O0syvbjByXT8iwmQ==
 log(s)         : <stdout>
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@10.244.3.2
2020-07-22 11:30:58.091 [info] <0.219.0> Memory high watermark set to 1864 MiB (1955262464 bytes) of 3729 MiB (3910524928 bytes) total
2020-07-22 11:30:58.096 [info] <0.221.0> Enabling free disk space monitoring
2020-07-22 11:30:58.096 [info] <0.221.0> Disk free limit set to 50MB
2020-07-22 11:30:58.099 [info] <0.224.0> Limiting to approx 1048476 file handles (943626 sockets)
2020-07-22 11:30:58.100 [info] <0.225.0> FHC read buffering:  OFF
2020-07-22 11:30:58.100 [info] <0.225.0> FHC write buffering: ON
2020-07-22 11:30:58.100 [info] <0.211.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@10.244.3.2 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2020-07-22 11:30:58.101 [info] <0.211.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2020-07-22 11:30:58.101 [info] <0.211.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2020-07-22 11:30:58.101 [info] <0.211.0> Peer discovery backend does not support locking, falling back to randomized delay
2020-07-22 11:30:58.101 [info] <0.211.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2020-07-22 11:31:06.103 [info] <0.211.0> Failed to get nodes from k8s - {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},
                 {inet,[inet],nxdomain}]}
2020-07-22 11:31:06.104 [error] <0.210.0> CRASH REPORT Process <0.210.0> with 0 neighbours exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n
  {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164 in application_master:init/4 line 138
2020-07-22 11:31:06.104 [info] <0.43.0> Application rabbit exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n                 {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,\"{failed_connect,[{to_address,{\\"kubernetes.default.svc.cluster.local\\",443}},\n
               {inet,[inet],nxdomain}]}\"}},[{rabbit_mnesia,init_from_config,0,[{file,\"src/rabbit_mnesia.erl\"},{line,164}]},{rabbit_mnesia,init_with_lock,3,[{file,\"src/rabbit_mnesia.erl\"},{line,144}]},{rabbit_mnesia,init,0,[{file,\"src/rabbit_mnesia.erl\"},{line,111}]},{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,run_step,2,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,'-run_boot_steps/1-lc$^0/1-0-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit_boot_steps,run_boot_steps,1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit,start,2,[{file,\"src/rabbit.erl\"},{line,815}]}]}}}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,"{failed_connect,[{to_address,{\"kubernetes.defau

Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done

Maybe it's related to #1072?

@toszo toszo modified the milestones: S20200729, S20200813 Jul 30, 2020
@ar3ndt ar3ndt closed this as completed Aug 6, 2020

ar3ndt commented Aug 6, 2020

The main reason RabbitMQ sometimes fails to start is a problem connecting to kubernetes.default.svc.cluster.local. It looks like a DNS issue, but the root cause has not been found yet. A new task will be created to track it in the backlog (meanwhile the version will be bumped and a workaround will be added to the upgrade procedure).
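A few checks that could narrow down the DNS angle (a sketch; assumes the image ships getent and that cluster DNS carries the conventional k8s-app=kube-dns label):

# Can the pod resolve the API server name used by peer discovery?
kubectl -n queue exec rabbitmq-cluster-0 -- getent hosts kubernetes.default.svc.cluster.local
# Which resolver does the pod actually use?
kubectl -n queue exec rabbitmq-cluster-0 -- cat /etc/resolv.conf
# Is cluster DNS itself healthy?
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide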

ar3ndt added a commit that referenced this issue Aug 6, 2020
bump rabbitmq version from 3.7.10 to 3.8.3 #1395
@przemyslavic przemyslavic reopened this Aug 7, 2020
ar3ndt added a commit that referenced this issue Aug 10, 2020
Workaround restart rabbitmq pods during patching #1395
ar3ndt added a commit that referenced this issue Aug 10, 2020
przemyslavic commented

The deployment of version 3.8.3 has been tested. Automated tests have been adjusted to the new version.

@mkyc mkyc closed this as completed Aug 11, 2020
rafzei added a commit that referenced this issue Aug 13, 2020
* Initialized test status table

* Added next sections of test status

Refactored status table a bit, added next lines, added next section with descriptions.

* Upgrade cluster section filled

* All sections filled

* Add missing tests

* Move CNS proposition design doc to GH.

* fixed formatting

* Etcd encryption feature refactor for deployment and upgrades (#1427)

* kubernetes_master: etcd encryption simplification and refactor

* upgrade: refactor of upgrade-kubeadm-config.yml (proper yaml parsing)

* upgrade: adding etcd encryption patching procedure

* upgrade-master.yml: small coding style improvement (highlight fix)

* upgrade: enabling patching of the kubeadm config

* fact naming improvements

Co-authored-by: to-bar <46519524+to-bar@users.noreply.github.com>

* patch-kubeadm-config.yml: skipping unnecessary kubectl apply

Co-authored-by: to-bar <46519524+to-bar@users.noreply.github.com>

* Bumping AzureCLI to fix SP secrets with special characters.

* Added Changelog entry.

* Change move to copy build dir during an upgrade (#1429)

* Change move to copy build dir during an upgrade
* Got rid of unused backup_temp_dir

* Update to logging

- log piping for stderr.
- custom colors for different log levels
- mapping some cases of log warnings and errors from Terraform and Ansible

* helm documentation #896

* Progress:

- simplified piping

* Fix K8s upgrade: 'kubeadm upgrade apply' hangs (#1431)

* Clean up and optimize K8s upgrades

* Patch only kubeadm-config ConfigMap

* Downgrade CoreDNS to K8s built-in version before 'kubeadm upgrade apply'

* Deploy customized CoreDNS after K8s is upgraded to the latest version

* Update changelog

* Wait for API resources to propagate

* Rename vendor in VSCode recommendations (#1438)

Vendor moved owner of mauve.terraform repository to HashiCorp (https://marketplace.visualstudio.com/items?itemName=HashiCorp.terraform)

* Fix issue with Vault and Kubernetes Calico/Canal communication (#1434)

* Add vault namespace and fixes related to connection issue

* Add default policy for default namespace

* Remove service endpoint, execute certificate part if enabled, setting protocol correctly in Vault Helm chart

* Add possibility to configure manually Vault endpoint

* Added changelog.

* add howto links for helm doc

* Update Changelog for #1438 (#1460)

* Update Changelog

* Update Changelog - add PR number

* bump rabbitmq version from 3.7.10 to 3.8.3 #1395

* Changes in documentation after creating fix for calico and canal (#1459)

* Changes after creating fix for calico and canal

* Update changelog

* Got rid of pipe and grep (#1472)

* Assert that current version is upgradeable #1474 (#1476)

* Assert that upgrade from current version is supported #1474

* Update core/src/epicli/data/common/ansible/playbooks/roles/upgrade/tasks/kubernetes.yml

Co-authored-by: to-bar <46519524+to-bar@users.noreply.github.com>

* Add docker_version variable support (#1477)

* add docker_version variable support
* Docker installation - 2 tasks merged into 1 to speed up the deployment
* Remove two useless packages from docker installation

Co-authored-by: Grzegorz Dajuk <grzegorz.dajuk@zipzero.com>

* Kubernetes HA upgrades (#1456)

* epicli/upgrade: reusing existing shared-config + cleanups

* upgrade: k8s HA upgrades minimal implementation

* upgrade: kubernetes cleanup and refactor

* Apply suggestions from code review

Co-authored-by: to-bar <46519524+to-bar@users.noreply.github.com>

* upgrade: removing unneeded kubeconfig from k8s nodes (security fix)

* upgrade: statefulset patching refactor

* upgrade: cleanups and refactor for logs

* Make deployment manifest tasks more generic

* Improve detecting CNI plugin

* AnsibleVarsGenerator.py: fixing regression issue introduced during upgrade refactor

* Apply suggestions from code review

Co-authored-by: to-bar <46519524+to-bar@users.noreply.github.com>

* upgrade: statefulset patching refactor

- patching all containers (fix)
- patching init containers also (fix)
- removing include_tasks statements (speedup)

* Ensure settings for backward compatibility

* Revert "Ensure settings for backward compatibility"

This reverts commit 5c9cdb6.

* AnsibleInventoryUpgrade.py: merging shared-config with defaults

* Adding changelog entry

* Revert "AnsibleVarsGenerator.py: fixing regression issue introducted during upgrade refactor"

This reverts commit c38eb9d.

* Revert "epicli/upgrade: reusing existing shared-config + cleanups"

This reverts commit e5957c5.

* AnsibleVarsGenerator.py: adding nicer way to handle shared config

Co-authored-by: to-bar <46519524+to-bar@users.noreply.github.com>

* Fix upgrade of flannel to v0.12.0 (#1484)

* Readme and changelog update (#1493)

Readme and changelog update

* Fixing broken offline CentOS 7.8 installation (#1498)

* repository: adding the missing centos-logos package

* updating 0.7.1 changelog

* repository/centos-7: restoring alphabetical order

* Add modularization-approaches.md design document

* Kibana config always points its elasticsearch.hosts to a "logging" VM (#1347) (#1483)

* Bump elliptic from 6.5.0 to 6.5.3 in /examples/keycloak/implicit/react

Bumps [elliptic](https://github.com/indutny/elliptic) from 6.5.0 to 6.5.3.
- [Release notes](https://github.com/indutny/elliptic/releases)
- [Commits](indutny/elliptic@v6.5.0...v6.5.3)

Signed-off-by: dependabot[bot] <support@github.com>

* Bump elliptic in /examples/keycloak/authorization/react

Bumps [elliptic](https://github.com/indutny/elliptic) from 6.5.0 to 6.5.3.
- [Release notes](https://github.com/indutny/elliptic/releases)
- [Commits](indutny/elliptic@v6.5.0...v6.5.3)

Signed-off-by: dependabot[bot] <support@github.com>

* Always setting hostname on all nodes of the cluster (on-prem fix) (#1509)

* common: always setting hostname on all nodes of the cluster (on-prem fix)

* updating 0.7.1 changelog

* Workaround restart rabbitmq pods during patching #1395

* add missing changelog entry

* Upgrade Kubernetes to v1.18.6 (#1501)

* Upgrade k8s-dashboard to v2.0.3 (#1516)

* fix due to review

* Dashboard unavailability, network fix for Flannel and Canal #1394 (#1519)

* additional defaults for kafka config

* fixes after review, remove redundant code

* Named demo configuration the same as generated one

* Added deletion step description

* Added a note related to versions for upgrades

* Fixed syntax errors

* Added prerequisites section in upgrade doc

* Added key encoding troubleshooting info

* Test fixes for RabbitMQ 3.8.3 (#1533)

* fix missing variable image rabbitmq

* Add Kubernetes Dashboard to COMPONENTS.md (#1546)

* Update CHANGELOG-0.7.md

Minor changes to changelog before release.

* CHANGELOG-0.7.md update v0.7.1 release date (#1552)

* Increment version string to 0.7.1 (#1554)

Co-authored-by: Mateusz Kyc <mateusz.kyc@gmail.com>
Co-authored-by: Mateusz Kyc <mkyc@users.noreply.github.com>
Co-authored-by: Michał Opala <sk4zuzu@gmail.com>
Co-authored-by: to-bar <46519524+to-bar@users.noreply.github.com>
Co-authored-by: Luuk van Venrooij <luukvanvenrooij84@gmail.com>
Co-authored-by: Tomasz Arendt <tomasz.arendt@pl.abb.com>
Co-authored-by: Marcin Pyrka <pyrka.marcin@gmail.com>
Co-authored-by: erzetpe <erzetpe@gmail.com>
Co-authored-by: Luuk van Venrooij <11056665+seriva@users.noreply.github.com>
Co-authored-by: ar3ndt <tomasz.arendt@gmail.com>
Co-authored-by: Grzegorz Dajuk <grzegorz@dajuk.net>
Co-authored-by: Grzegorz Dajuk <grzegorz.dajuk@zipzero.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: TolikT <tolikt@users.noreply.github.com>
Co-authored-by: przemyslavic <43173646+przemyslavic@users.noreply.github.com>