
[BUG] The RabbitMQ deployment - clustering is not working properly #1395

Closed
przemyslavic opened this issue Jun 29, 2020 · 8 comments

przemyslavic commented Jun 29, 2020

Describe the bug
There is an issue with cluster formation.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy the cluster with the RabbitMQ application
  2. Run kubectl exec -it -n=queue rabbitmq-cluster-0 -- rabbitmqctl cluster_status and kubectl exec -it -n=queue rabbitmq-cluster-1 -- rabbitmqctl cluster_status

Expected behavior
The RabbitMQ deployment is clustered.
The command result shows 2 running nodes.

Additional context
Actual command result:

[ec2-user@ec2-15-236-60-122 ~]$ kubectl exec -it -n=queue rabbitmq-cluster-0 -- rabbitmqctl cluster_status
Cluster status of node rabbit@10.244.3.16 ...
[{nodes,[{disc,['rabbit@10.244.3.16']}]},
 {running_nodes,['rabbit@10.244.3.16']},
 {cluster_name,<<"rabbit@rabbitmq-cluster-0.rabbitmq-cluster.queue.svc.cluster.local">>},
 {partitions,[]},
 {alarms,[{'rabbit@10.244.3.16',[]}]}]
[ec2-user@ec2-15-236-60-122 ~]$ kubectl exec -it -n=queue rabbitmq-cluster-1 -- rabbitmqctl cluster_status
Cluster status of node rabbit@10.244.2.15 ...
[{nodes,[{disc,['rabbit@10.244.1.10','rabbit@10.244.1.14',
                'rabbit@10.244.2.11','rabbit@10.244.2.15',
                'rabbit@10.244.3.12','rabbit@10.244.3.8']}]},
 {running_nodes,['rabbit@10.244.2.15']},
 {cluster_name,<<"rabbit@rabbitmq-cluster-0.rabbitmq-cluster.queue.svc.cluster.local">>},
 {partitions,[]},
 {alarms,[{'rabbit@10.244.2.15',[]}]}]
[ec2-user@ec2-15-236-60-122 ~]$ kubectl get pods -n=queue -o wide
NAME                 READY   STATUS    RESTARTS   AGE     IP            NODE                                                 NOMINATED NODE   READINESS GATES
rabbitmq-cluster-0   1/1     Running   0          4h12m   10.244.3.16   ec2-15-236-95-83.eu-west-3.compute.amazonaws.com     <none>           <none>
rabbitmq-cluster-1   1/1     Running   0          4h13m   10.244.2.15   ec2-15-236-203-152.eu-west-3.compute.amazonaws.com   <none>           <none>
[ec2-user@ec2-15-236-60-122 ~]$ kubectl logs -n=queue rabbitmq-cluster-0

  ##  ##
  ##  ##      RabbitMQ 3.7.10. Copyright (C) 2007-2018 Pivotal Software, Inc.
  ##########  Licensed under the MPL.  See http://www.rabbitmq.com/
  ######  ##
  ##########  Logs: <stdout>

              Starting broker...
2020-06-29 10:00:35.281 [info] <0.211.0>
 Starting RabbitMQ 3.7.10 on Erlang 21.2.3
 Copyright (C) 2007-2018 Pivotal Software, Inc.
 Licensed under the MPL.  See http://www.rabbitmq.com/
2020-06-29 10:00:35.288 [info] <0.211.0>
 node           : rabbit@10.244.3.16
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.conf
 cookie hash    : GRtFqfufy0A8wfweQSYTgA==
 log(s)         : <stdout>
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@10.244.3.16
2020-06-29 10:00:36.824 [info] <0.219.0> Memory high watermark set to 1864 MiB (1955262464 bytes) of 3729 MiB (3910524928 bytes) total
2020-06-29 10:00:36.828 [info] <0.221.0> Enabling free disk space monitoring
2020-06-29 10:00:36.828 [info] <0.221.0> Disk free limit set to 50MB
2020-06-29 10:00:36.832 [info] <0.224.0> Limiting to approx 1048476 file handles (943626 sockets)
2020-06-29 10:00:36.832 [info] <0.225.0> FHC read buffering:  OFF
2020-06-29 10:00:36.832 [info] <0.225.0> FHC write buffering: ON
2020-06-29 10:00:36.833 [info] <0.211.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@10.244.3.16 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2020-06-29 10:00:36.833 [info] <0.211.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2020-06-29 10:00:36.833 [info] <0.211.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2020-06-29 10:00:36.833 [info] <0.211.0> Peer discovery backend does not support locking, falling back to randomized delay
2020-06-29 10:00:36.833 [info] <0.211.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2020-06-29 10:00:36.852 [info] <0.211.0> k8s endpoint listing returned nodes not yet ready: 10.244.2.15, 10.244.3.16
2020-06-29 10:00:36.852 [info] <0.211.0> All discovered existing cluster peers:
2020-06-29 10:00:36.852 [info] <0.211.0> Discovered no peer nodes to cluster with
2020-06-29 10:00:36.854 [info] <0.43.0> Application mnesia exited with reason: stopped
2020-06-29 10:00:36.918 [info] <0.211.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-29 10:00:36.943 [info] <0.211.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-29 10:00:36.968 [info] <0.211.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-29 10:00:36.968 [info] <0.211.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping registration.
2020-06-29 10:00:36.969 [info] <0.211.0> Priority queues enabled, real BQ is rabbit_variable_queue
2020-06-29 10:00:36.971 [info] <0.396.0> Starting rabbit_node_monitor
2020-06-29 10:00:36.994 [info] <0.211.0> message_store upgrades: 1 to apply
2020-06-29 10:00:36.994 [info] <0.211.0> message_store upgrades: Applying rabbit_variable_queue:move_messages_to_vhost_store
2020-06-29 10:00:36.995 [info] <0.211.0> message_store upgrades: No durable queues found. Skipping message store migration
2020-06-29 10:00:36.995 [info] <0.211.0> message_store upgrades: Removing the old message store data
2020-06-29 10:00:36.995 [info] <0.211.0> message_store upgrades: All upgrades applied successfully
2020-06-29 10:00:37.020 [info] <0.211.0> Management plugin: using rates mode 'basic'
2020-06-29 10:00:37.021 [info] <0.211.0> Adding vhost '/'
2020-06-29 10:00:37.032 [info] <0.436.0> Making sure data directory '/var/lib/rabbitmq/mnesia/rabbit@10.244.3.16/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L' for vhost '/' exists
2020-06-29 10:00:37.036 [info] <0.436.0> Starting message stores for vhost '/'
2020-06-29 10:00:37.036 [info] <0.440.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_transient": using rabbit_msg_store_ets_index to provide index
2020-06-29 10:00:37.037 [info] <0.436.0> Started message store of type transient for vhost '/'
2020-06-29 10:00:37.037 [info] <0.443.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": using rabbit_msg_store_ets_index to provide index
2020-06-29 10:00:37.037 [warning] <0.443.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": rebuilding indices from scratch
2020-06-29 10:00:37.038 [info] <0.436.0> Started message store of type persistent for vhost '/'
2020-06-29 10:00:37.039 [info] <0.211.0> Creating user 'guest'
2020-06-29 10:00:37.040 [info] <0.211.0> Setting user tags for user 'guest' to [administrator]
2020-06-29 10:00:37.042 [info] <0.211.0> Setting permissions for 'guest' in '/' to '.*', '.*', '.*'
2020-06-29 10:00:37.044 [warning] <0.467.0> Setting Ranch options together with socket options is deprecated. Please use the new map syntax that allows specifying socket options separately from other options.
2020-06-29 10:00:37.045 [info] <0.481.0> started TCP listener on [::]:5672
2020-06-29 10:00:37.048 [info] <0.211.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@10.244.3.16'
2020-06-29 10:00:37.050 [info] <0.211.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@10.244.3.16'
2020-06-29 10:00:37.051 [info] <0.495.0> Peer discovery: enabling node cleanup (will only log warnings). Check interval: 30 seconds.
2020-06-29 10:00:37.077 [info] <0.545.0> Management plugin: HTTP (non-TLS) listener started on port 15672
2020-06-29 10:00:37.077 [info] <0.651.0> Statistics database started.
2020-06-29 10:00:37.156 [info] <0.8.0> Server startup complete; 5 plugins started.
 * rabbitmq_management
 * rabbitmq_web_dispatch
 * rabbitmq_management_agent
 * rabbitmq_peer_discovery_k8s
 * rabbitmq_peer_discovery_common
 completed with 5 plugins.
[ec2-user@ec2-15-236-60-122 ~]$ kubectl logs -n=queue rabbitmq-cluster-1

  ##  ##
  ##  ##      RabbitMQ 3.7.10. Copyright (C) 2007-2018 Pivotal Software, Inc.
  ##########  Licensed under the MPL.  See http://www.rabbitmq.com/
  ######  ##
  ##########  Logs: <stdout>

              Starting broker...
2020-06-29 09:59:35.287 [info] <0.211.0>
 Starting RabbitMQ 3.7.10 on Erlang 21.2.3
 Copyright (C) 2007-2018 Pivotal Software, Inc.
 Licensed under the MPL.  See http://www.rabbitmq.com/
2020-06-29 09:59:35.292 [info] <0.211.0>
 node           : rabbit@10.244.2.15
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.conf
 cookie hash    : GRtFqfufy0A8wfweQSYTgA==
 log(s)         : <stdout>
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@10.244.2.15
2020-06-29 09:59:36.555 [info] <0.219.0> Memory high watermark set to 1864 MiB (1955258368 bytes) of 3729 MiB (3910516736 bytes) total
2020-06-29 09:59:36.558 [info] <0.221.0> Enabling free disk space monitoring
2020-06-29 09:59:36.559 [info] <0.221.0> Disk free limit set to 50MB
2020-06-29 09:59:36.562 [info] <0.224.0> Limiting to approx 1048476 file handles (943626 sockets)
2020-06-29 09:59:36.562 [info] <0.225.0> FHC read buffering:  OFF
2020-06-29 09:59:36.562 [info] <0.225.0> FHC write buffering: ON
2020-06-29 09:59:36.562 [info] <0.211.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@10.244.2.15 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2020-06-29 09:59:36.562 [info] <0.211.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2020-06-29 09:59:36.563 [info] <0.211.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2020-06-29 09:59:36.563 [info] <0.211.0> Peer discovery backend does not support locking, falling back to randomized delay
2020-06-29 09:59:36.563 [info] <0.211.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2020-06-29 09:59:36.581 [info] <0.211.0> k8s endpoint listing returned nodes not yet ready: 10.244.2.15
2020-06-29 09:59:36.581 [info] <0.211.0> All discovered existing cluster peers: rabbit@10.244.1.14
2020-06-29 09:59:36.581 [info] <0.211.0> Peer nodes we can cluster with: rabbit@10.244.1.14
2020-06-29 09:59:36.588 [info] <0.211.0> Node 'rabbit@10.244.1.14' selected for auto-clustering
2020-06-29 09:59:48.634 [info] <0.211.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-29 09:59:48.833 [info] <0.211.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-29 09:59:48.861 [info] <0.211.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-29 09:59:48.867 [info] <0.211.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@10.244.2.15'
2020-06-29 09:59:48.872 [info] <0.211.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@10.244.2.15'
2020-06-29 09:59:48.873 [info] <0.211.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping registration.
2020-06-29 09:59:48.874 [info] <0.211.0> Priority queues enabled, real BQ is rabbit_variable_queue
2020-06-29 09:59:48.877 [info] <0.424.0> Starting rabbit_node_monitor
2020-06-29 09:59:48.901 [info] <0.211.0> message_store upgrades: 1 to apply
2020-06-29 09:59:48.901 [info] <0.211.0> message_store upgrades: Applying rabbit_variable_queue:move_messages_to_vhost_store
2020-06-29 09:59:48.901 [info] <0.211.0> message_store upgrades: No durable queues found. Skipping message store migration
2020-06-29 09:59:48.901 [info] <0.211.0> message_store upgrades: Removing the old message store data
2020-06-29 09:59:48.902 [info] <0.211.0> message_store upgrades: All upgrades applied successfully
2020-06-29 09:59:48.927 [info] <0.211.0> Management plugin: using rates mode 'basic'
2020-06-29 09:59:48.930 [info] <0.456.0> Making sure data directory '/var/lib/rabbitmq/mnesia/rabbit@10.244.2.15/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L' for vhost '/' exists
2020-06-29 09:59:48.933 [info] <0.456.0> Starting message stores for vhost '/'
2020-06-29 09:59:48.933 [info] <0.460.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_transient": using rabbit_msg_store_ets_index to provide index
2020-06-29 09:59:48.934 [info] <0.456.0> Started message store of type transient for vhost '/'
2020-06-29 09:59:48.934 [info] <0.463.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": using rabbit_msg_store_ets_index to provide index
2020-06-29 09:59:48.935 [warning] <0.463.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": rebuilding indices from scratch
2020-06-29 09:59:48.936 [info] <0.456.0> Started message store of type persistent for vhost '/'
2020-06-29 09:59:48.938 [warning] <0.484.0> Setting Ranch options together with socket options is deprecated. Please use the new map syntax that allows specifying socket options separately from other options.
2020-06-29 09:59:48.939 [info] <0.498.0> started TCP listener on [::]:5672
2020-06-29 09:59:48.940 [info] <0.211.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@10.244.2.15'
2020-06-29 09:59:48.941 [info] <0.211.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@10.244.2.15'
2020-06-29 09:59:48.941 [info] <0.424.0> rabbit on node 'rabbit@10.244.1.14' up
2020-06-29 09:59:48.942 [info] <0.506.0> Peer discovery: enabling node cleanup (will only log warnings). Check interval: 30 seconds.
2020-06-29 09:59:48.973 [info] <0.557.0> Management plugin: HTTP (non-TLS) listener started on port 15672
2020-06-29 09:59:48.973 [info] <0.663.0> Statistics database started.
 completed with 5 plugins.
2020-06-29 09:59:49.058 [info] <0.8.0> Server startup complete; 5 plugins started.
 * rabbitmq_management
 * rabbitmq_web_dispatch
 * rabbitmq_management_agent
 * rabbitmq_peer_discovery_k8s
 * rabbitmq_peer_discovery_common
2020-06-29 10:00:18.948 [info] <0.506.0> k8s endpoint listing returned nodes not yet ready: 10.244.2.15
2020-06-29 10:00:18.948 [warning] <0.506.0> Peer discovery: node rabbit@10.244.1.10 is unreachable
2020-06-29 10:00:18.949 [warning] <0.506.0> Peer discovery: node rabbit@10.244.2.11 is unreachable
2020-06-29 10:00:18.949 [warning] <0.506.0> Peer discovery: node rabbit@10.244.3.12 is unreachable
2020-06-29 10:00:18.949 [warning] <0.506.0> Peer discovery: node rabbit@10.244.3.8 is unreachable
2020-06-29 10:00:21.464 [info] <0.424.0> rabbit on node 'rabbit@10.244.1.14' down
2020-06-29 10:00:21.468 [info] <0.424.0> Keeping rabbit@10.244.1.14 listeners: the node is already back
2020-06-29 10:00:21.493 [info] <0.424.0> node 'rabbit@10.244.1.14' down: connection_closed
2020-06-29 10:00:48.976 [info] <0.506.0> k8s endpoint listing returned nodes not yet ready: 10.244.3.16
2020-06-29 10:00:48.976 [warning] <0.506.0> Peer discovery: node rabbit@10.244.1.10 is unreachable
2020-06-29 10:00:48.977 [warning] <0.506.0> Peer discovery: node rabbit@10.244.1.14 is unreachable
2020-06-29 10:00:48.977 [warning] <0.506.0> Peer discovery: node rabbit@10.244.2.11 is unreachable
2020-06-29 10:00:48.977 [warning] <0.506.0> Peer discovery: node rabbit@10.244.3.12 is unreachable
2020-06-29 10:00:48.977 [warning] <0.506.0> Peer discovery: node rabbit@10.244.3.8 is unreachable
2020-06-29 10:01:18.947 [warning] <0.506.0> Peer discovery: node rabbit@10.244.1.10 is unreachable
2020-06-29 10:01:18.947 [warning] <0.506.0> Peer discovery: node rabbit@10.244.1.14 is unreachable
2020-06-29 10:01:18.947 [warning] <0.506.0> Peer discovery: node rabbit@10.244.2.11 is unreachable
2020-06-29 10:01:18.947 [warning] <0.506.0> Peer discovery: node rabbit@10.244.3.12 is unreachable
2020-06-29 10:01:18.947 [warning] <0.506.0> Peer discovery: node rabbit@10.244.3.8 is unreachable
@przemyslavic przemyslavic changed the title [BUG] The RabbitMQ deployment is not working properly after upgrading Kubernetes to 1.17.7 [BUG] The RabbitMQ deployment - clustering is not working properly Jun 29, 2020

przemyslavic commented Jun 29, 2020

Performed more testing.
Changed the config map setting cluster_formation.k8s.address_type from ip to hostname, but it didn't help.
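For reference, a minimal sketch of the settings involved (the config file path is taken from the pod logs above; the key values, in particular hostname_suffix, are assumptions inferred from the cluster name and may not match what the ConfigMap actually renders). With address_type = hostname the node names rabbit@<hostname><suffix> must resolve via DNS, which is why a headless service is usually required:

# Inspect the rabbitmq.conf actually mounted into the pods (path from the logs above):
kubectl -n queue exec rabbitmq-cluster-0 -- cat /etc/rabbitmq/rabbitmq.conf
#
# Expected shape of the relevant keys for the 'hostname' variant
# (hostname_suffix is inferred from the cluster name above, not confirmed):
#   cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
#   cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
#   cluster_formation.k8s.address_type = hostname
#   cluster_formation.k8s.hostname_suffix = .rabbitmq-cluster.queue.svc.cluster.local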

[ec2-user@ec2-15-236-180-239 ~]$ kubectl exec -it -n=queue rabbitmq-cluster-0 -- rabbitmqctl cluster_status
Cluster status of node rabbit@10.244.3.2 ...
[{nodes,[{disc,['rabbit@10.244.3.2']}]},
 {running_nodes,['rabbit@10.244.3.2']},
 {cluster_name,<<"rabbit@rabbitmq-cluster-0.rabbitmq-cluster.queue.svc.cluster.local">>},
 {partitions,[]},
 {alarms,[{'rabbit@10.244.3.2',[]}]}]
[ec2-user@ec2-15-236-180-239 ~]$ kubectl exec -it -n=queue rabbitmq-cluster-1 -- rabbitmqctl cluster_status
Cluster status of node rabbit@10.244.2.3 ...
[{nodes,[{disc,['rabbit@10.244.2.3']}]},
 {running_nodes,['rabbit@10.244.2.3']},
 {cluster_name,<<"rabbit@rabbitmq-cluster-1.rabbitmq-cluster.queue.svc.cluster.local">>},
 {partitions,[]},
 {alarms,[{'rabbit@10.244.2.3',[]}]}]

Logs from the second pod:

2020-06-29 18:31:03.723 [info] <0.211.0> All discovered existing cluster peers: rabbit@rabbitmq-cluster-0
2020-06-29 18:31:03.723 [info] <0.211.0> Peer nodes we can cluster with: rabbit@rabbitmq-cluster-0
2020-06-29 18:31:03.737 [warning] <0.211.0> Could not auto-cluster with node rabbit@rabbitmq-cluster-0: {badrpc,nodedown}
2020-06-29 18:31:03.737 [warning] <0.211.0> Could not successfully contact any node of: rabbit@rabbitmq-cluster-0 (as in Erlang distribution). Starting as a blank standalone node...
2020-06-29 18:31:03.739 [info] <0.43.0> Application mnesia exited with reason: stopped

Could not successfully contact any node of: rabbit@rabbitmq-cluster-0 (as in Erlang distribution). Starting as a blank standalone node...
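Two checks from the failing node could narrow this down (a sketch: the FQDN follows the StatefulSet/service naming visible in the logs, and net_adm:ping/1 is plain Erlang that returns pong when distribution works):

# Does the peer's DNS name resolve from inside the failing pod?
kubectl -n queue exec rabbitmq-cluster-1 -- getent hosts rabbitmq-cluster-0.rabbitmq-cluster.queue.svc.cluster.local

# Can this node reach the peer over Erlang distribution (epmd on 4369, dist on 25672)?
kubectl -n queue exec rabbitmq-cluster-1 -- rabbitmqctl eval "net_adm:ping('rabbit@rabbitmq-cluster-0')."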

rafzei commented Jul 9, 2020

I got a similar error during an upgrade from 0.4.4 to 0.7.

2020-07-09 11:56:27.864 [info] <0.221.0> Enabling free disk space monitoring
2020-07-09 11:56:27.864 [info] <0.221.0> Disk free limit set to 50MB
2020-07-09 11:56:27.867 [info] <0.224.0> Limiting to approx 1048476 file handles (943626 sockets)
2020-07-09 11:56:27.867 [info] <0.225.0> FHC read buffering:  OFF
2020-07-09 11:56:27.867 [info] <0.225.0> FHC write buffering: ON
2020-07-09 11:56:27.868 [info] <0.211.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@10.244.2.24 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2020-07-09 11:56:27.868 [info] <0.211.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2020-07-09 11:56:27.868 [info] <0.211.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2020-07-09 11:56:27.868 [info] <0.211.0> Peer discovery backend does not support locking, falling back to randomized delay
2020-07-09 11:56:27.869 [info] <0.211.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2020-07-09 11:56:35.871 [info] <0.211.0> Failed to get nodes from k8s - {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},
                 {inet,[inet],nxdomain}]}
2020-07-09 11:56:35.872 [error] <0.210.0> CRASH REPORT Process <0.210.0> with 0 neighbours exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n                 {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164 in application_master:init/4 line 138
2020-07-09 11:56:35.872 [info] <0.43.0> Application rabbit exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n                 {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,\"{failed_connect,[{to_address,{\\"kubernetes.default.svc.cluster.local\\",443}},\n                 {inet,[inet],nxdomain}]}\"}},[{rabbit_mnesia,init_from_config,0,[{file,\"src/rabbit_mnesia.erl\"},{line,164}]},{rabbit_mnesia,init_with_lock,3,[{file,\"src/rabbit_mnesia.erl\"},{line,144}]},{rabbit_mnesia,init,0,[{file,\"src/rabbit_mnesia.erl\"},{line,111}]},{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,run_step,2,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,'-run_boot_steps/1-lc$^0/1-0-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit_boot_steps,run_boot_steps,1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit,start,2,[{file,\"src/rabbit.erl\"},{line,815}]}]}}}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,"{failed_connect,[{to_address,{\"kubernetes.defau

ar3ndt commented Jul 14, 2020

Not a real fix for this issue, but along the way we decided to bump the RabbitMQ version from 3.7.10 to the latest stable one, 3.8.3.
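One way to apply such a bump manually on a running cluster (a sketch only; the StatefulSet and container names are assumptions, and in the actual fix the version is changed in the deployment configuration):

kubectl -n queue set image statefulset/rabbitmq-cluster rabbitmq=rabbitmq:3.8.3
kubectl -n queue rollout status statefulset/rabbitmq-cluster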

jetalone85 commented

Additionally, there is an option to add a headless service, like: https://github.com/helm/charts/blob/master/stable/rabbitmq-ha/templates/service-discovery.yaml and an initContainer, like: https://github.com/helm/charts/blob/master/stable/rabbitmq-ha/templates/statefulset.yaml#L68:L116
I tested this service discovery approach previously and it worked well with image 3.8.3.
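Roughly what such a headless discovery service looks like (a sketch modelled on the linked chart template; the name, labels and selector are assumptions). publishNotReadyAddresses matters here because the logs above show peers being filtered out as "not yet ready":

kubectl apply -n queue -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: rabbitmq-cluster-discovery
  labels:
    app: rabbitmq-cluster
spec:
  clusterIP: None                  # headless: a stable DNS record per pod
  publishNotReadyAddresses: true   # expose pods before they pass readiness
  selector:
    app: rabbitmq-cluster
  ports:
    - name: epmd
      port: 4369
    - name: amqp
      port: 5672
    - name: clustering
      port: 25672
EOF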

ar3ndt commented Jul 16, 2020

The headless service has been tested and it does not fix the issue.
As a workaround, I suggest adding additional RabbitMQ pod restarts to the upgrade procedure.
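For example (a sketch; assumes kubectl >= 1.15 for rollout restart, and the StatefulSet name seen in the outputs above):

kubectl -n queue rollout restart statefulset/rabbitmq-cluster
kubectl -n queue rollout status statefulset/rabbitmq-cluster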

@mkyc mkyc modified the milestones: 0.7.1, S20200729 Jul 17, 2020
@przemyslavic przemyslavic self-assigned this Jul 22, 2020
przemyslavic commented

@ar3ndt There are still issues with the RabbitMQ deployment (noticed on the AWS/RedHat/flannel environment).

[ec2-user@ec2-xx-xx-xx-xx ~]$ kubectl get pods -n=queue
NAME                 READY   STATUS             RESTARTS   AGE
rabbitmq-cluster-0   0/1     CrashLoopBackOff   13         50m
[ec2-user@ec2-xx-xx-xx-xx ~]$ kubectl logs -n=queue rabbitmq-cluster-0

  ##  ##
  ##  ##      RabbitMQ 3.7.10. Copyright (C) 2007-2018 Pivotal Software, Inc.
  ##########  Licensed under the MPL.  See http://www.rabbitmq.com/
  ######  ##
  ##########  Logs: <stdout>

              Starting broker...
2020-07-22 11:30:56.324 [info] <0.211.0>
 Starting RabbitMQ 3.7.10 on Erlang 21.2.3
 Copyright (C) 2007-2018 Pivotal Software, Inc.
 Licensed under the MPL.  See http://www.rabbitmq.com/
2020-07-22 11:30:56.329 [info] <0.211.0>
 node           : rabbit@10.244.3.2
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.conf
 cookie hash    : 1kc1w/O0syvbjByXT8iwmQ==
 log(s)         : <stdout>
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@10.244.3.2
2020-07-22 11:30:58.091 [info] <0.219.0> Memory high watermark set to 1864 MiB (1955262464 bytes) of 3729 MiB (3910524928 bytes) total
2020-07-22 11:30:58.096 [info] <0.221.0> Enabling free disk space monitoring
2020-07-22 11:30:58.096 [info] <0.221.0> Disk free limit set to 50MB
2020-07-22 11:30:58.099 [info] <0.224.0> Limiting to approx 1048476 file handles (943626 sockets)
2020-07-22 11:30:58.100 [info] <0.225.0> FHC read buffering:  OFF
2020-07-22 11:30:58.100 [info] <0.225.0> FHC write buffering: ON
2020-07-22 11:30:58.100 [info] <0.211.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@10.244.3.2 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2020-07-22 11:30:58.101 [info] <0.211.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2020-07-22 11:30:58.101 [info] <0.211.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2020-07-22 11:30:58.101 [info] <0.211.0> Peer discovery backend does not support locking, falling back to randomized delay
2020-07-22 11:30:58.101 [info] <0.211.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2020-07-22 11:31:06.103 [info] <0.211.0> Failed to get nodes from k8s - {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},
                 {inet,[inet],nxdomain}]}
2020-07-22 11:31:06.104 [error] <0.210.0> CRASH REPORT Process <0.210.0> with 0 neighbours exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n
  {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164 in application_master:init/4 line 138
2020-07-22 11:31:06.104 [info] <0.43.0> Application rabbit exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n                 {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,\"{failed_connect,[{to_address,{\\"kubernetes.default.svc.cluster.local\\",443}},\n
               {inet,[inet],nxdomain}]}\"}},[{rabbit_mnesia,init_from_config,0,[{file,\"src/rabbit_mnesia.erl\"},{line,164}]},{rabbit_mnesia,init_with_lock,3,[{file,\"src/rabbit_mnesia.erl\"},{line,144}]},{rabbit_mnesia,init,0,[{file,\"src/rabbit_mnesia.erl\"},{line,111}]},{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,run_step,2,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,'-run_boot_steps/1-lc$^0/1-0-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit_boot_steps,run_boot_steps,1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit,start,2,[{file,\"src/rabbit.erl\"},{line,815}]}]}}}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,"{failed_connect,[{to_address,{\"kubernetes.defau

Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done

Maybe it's related to #1072?

@toszo toszo modified the milestones: S20200729, S20200813 Jul 30, 2020
@ar3ndt ar3ndt closed this as completed Aug 6, 2020

ar3ndt commented Aug 6, 2020

The main reason RabbitMQ sometimes fails to start is a problem connecting to kubernetes.default.svc.cluster.local. It looks like a DNS issue, but the root cause has not been found yet. A new task will be created to track it in the backlog (meanwhile the version will be bumped and a workaround will be added to the upgrade procedure).
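A few checks that could narrow down the DNS angle (a sketch; assumes the image ships getent and that cluster DNS carries the conventional k8s-app=kube-dns label):

# Can the pod resolve the API server name used by peer discovery?
kubectl -n queue exec rabbitmq-cluster-0 -- getent hosts kubernetes.default.svc.cluster.local
# Which resolver does the pod actually use?
kubectl -n queue exec rabbitmq-cluster-0 -- cat /etc/resolv.conf
# Is cluster DNS itself healthy?
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide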

ar3ndt added a commit that referenced this issue Aug 6, 2020
bump rabbitmq version from 3.7.10 to 3.8.3 #1395
@przemyslavic przemyslavic reopened this Aug 7, 2020
ar3ndt added a commit that referenced this issue Aug 10, 2020
Workaround restart rabbitmq pods during patching #1395
ar3ndt added a commit that referenced this issue Aug 10, 2020
przemyslavic commented

The deployment of version 3.8.3 has been tested. Automated tests have been adjusted to the new version.

@mkyc mkyc closed this as completed Aug 11, 2020
rafzei added a commit that referenced this issue Aug 13, 2020
* Initialized test status table

* Added next sections of test status

Refactored status table a bit, added next lines, added next section with descriptions.

* Upgrade cluster section filled

* All sections filled

* Add missing tests

* Move CNS proposition design doc to GH.

* fixed formatting

* Etcd encryption feature refactor for deployment and upgrades (#1427)

* kubernetes_master: etcd encryption simplification and refactor

* upgrade: refactor of upgrade-kubeadm-config.yml (proper yaml parsing)

* upgrade: adding etcd encryption patching procedure

* upgrade-master.yml: small coding style improvement (highlight fix)

* upgrade: enabling patching of the kubeadm config

* fact naming improvements

Co-authored-by: to-bar <46519524+to-bar@users.noreply.github.com>

* patch-kubeadm-config.yml: skipping unnecessary kubectl apply

Co-authored-by: to-bar <46519524+to-bar@users.noreply.github.com>

* Bumping AzureCLI to fix SP secrets with special characters.

* Added Changelog entry.

* Change move to copy build dir during an upgrade (#1429)

* Change move to copy build dir during an upgrade
* Got rid of unused backup_temp_dir

* Update to logging

- log piping for stderr.
- custom colors for different log levels
- mapping some cases of log warnings and errors from Terraform and Ansible

* helm documentation #896

* Progress:

- simplified piping

* Fix K8s upgrade: 'kubeadm upgrade apply' hangs (#1431)

* Clean up and optimize K8s upgrades

* Patch only kubeadm-config ConfigMap

* Downgrade CoreDNS to K8s built-in version before 'kubeadm upgrade apply'

* Deploy customized CoreDNS after K8s is upgraded to the latest version

* Update changelog

* Wait for API resources to propagate

* Rename vendor in VSCode recommendations (#1438)

Vendor moved owner of mauve.terraform repository to HashiCorp (https://marketplace.visualstudio.com/items?itemName=HashiCorp.terraform)

* Fix issue with Vault and Kubernetes Calico/Canal communication (#1434)

* Add vault namespace and fixes related to connection issue

* Add default policy for default namespace

* Remove service endpoint, execute certificate part if enabled, setting protocol correctly in Vault Helm chart

* Add possibility to configure manually Vault endpoint

* Added changelog.

* add howto links for helm doc

* Update Changelog for #1438 (#1460)

* Update Changelog

* Update Changelog - add PR number

* bump rabbitmq version from 3.7.10 to 3.8.3 #1395

* Changes in documentation after creating fix for calico and canal (#1459)

* Changes after creating fix for calico and canal

* Update changelog

* Got rid of pipe and grep (#1472)

* Assert that current version is upgradeable #1474 (#1476)

* Assert that upgrade from current version is supported #1474

* Update core/src/epicli/data/common/ansible/playbooks/roles/upgrade/tasks/kubernetes.yml

Co-authored-by: to-bar <46519524+to-bar@users.noreply.github.com>

* Add docker_version variable support (#1477)

* add docker_version variable support
* Docker installation - 2 tasks merged into 1 to speed up the deployment
* Remove two useless packages from docker installation

Co-authored-by: Grzegorz Dajuk <grzegorz.dajuk@zipzero.com>

* Kubernetes HA upgrades (#1456)

* epicli/upgrade: reusing existing shared-config + cleanups

* upgrade: k8s HA upgrades minimal implementation

* upgrade: kubernetes cleanup and refactor

* Apply suggestions from code review

Co-authored-by: to-bar <46519524+to-bar@users.noreply.github.com>

* upgrade: removing unneeded kubeconfig from k8s nodes (security fix)

* upgrade: statefulset patching refactor

* upgrade: cleanups and refactor for logs

* Make deployment manifest tasks more generic

* Improve detecting CNI plugin

* AnsibleVarsGenerator.py: fixing regression issue introduced during upgrade refactor

* Apply suggestions from code review

Co-authored-by: to-bar <46519524+to-bar@users.noreply.github.com>

* upgrade: statefulset patching refactor

- patching all containers (fix)
- patching init containers also (fix)
- removing include_tasks statements (speedup)

* Ensure settings for backward compatibility

* Revert "Ensure settings for backward compatibility"

This reverts commit 5c9cdb6.

* AnsibleInventoryUpgrade.py: merging shared-config with defaults

* Adding changelog entry

* Revert "AnsibleVarsGenerator.py: fixing regression issue introducted during upgrade refactor"

This reverts commit c38eb9d.

* Revert "epicli/upgrade: reusing existing shared-config + cleanups"

This reverts commit e5957c5.

* AnsibleVarsGenerator.py: adding nicer way to handle shared config

Co-authored-by: to-bar <46519524+to-bar@users.noreply.github.com>

* Fix upgrade of flannel to v0.12.0 (#1484)

* Readme and changelog update (#1493)

Readme and changelog update

* Fixing broken offline CentOS 7.8 installation (#1498)

* repository: adding the missing centos-logos package

* updating 0.7.1 changelog

* repository/centos-7: restoring alphabetical order

* Add modularization-approaches.md design document

* Kibana config always points its elasticsearch.hosts to a "logging" VM (#1347) (#1483)

* Bump elliptic from 6.5.0 to 6.5.3 in /examples/keycloak/implicit/react

Bumps [elliptic](https://github.com/indutny/elliptic) from 6.5.0 to 6.5.3.
- [Release notes](https://github.com/indutny/elliptic/releases)
- [Commits](indutny/elliptic@v6.5.0...v6.5.3)

Signed-off-by: dependabot[bot] <support@github.com>

* Bump elliptic in /examples/keycloak/authorization/react

Bumps [elliptic](https://github.com/indutny/elliptic) from 6.5.0 to 6.5.3.
- [Release notes](https://github.com/indutny/elliptic/releases)
- [Commits](indutny/elliptic@v6.5.0...v6.5.3)

Signed-off-by: dependabot[bot] <support@github.com>

* Always setting hostname on all nodes of the cluster (on-prem fix) (#1509)

* common: always setting hostname on all nodes of the cluster (on-prem fix)

* updating 0.7.1 changelog

* Workaround restart rabbitmq pods during patching #1395

* add missing changelog entry

* Upgrade Kubernetes to v1.18.6 (#1501)

* Upgrade k8s-dashboard to v2.0.3 (#1516)

* fix due to review

* Dashboard unavailability, network fix for Flannel and Canal #1394 (#1519)

* additional defaults for kafka config

* fixes after review, remove redundant code

* Named demo configuration the same as generated one

* Added deletion step description

* Added a note related to versions for upgrades

* Fixed syntax errors

* Added prerequisites section in upgrade doc

* Added key encoding troubleshooting info

* Test fixes for RabbitMQ 3.8.3 (#1533)

* fix missing variable image rabbitmq

* Add Kubernetes Dashboard to COMPONENTS.md (#1546)

* Update CHANGELOG-0.7.md

Minor changes to changelog before release.

* CHANGELOG-0.7.md update v0.7.1 release date (#1552)

* Increment version string to 0.7.1 (#1554)

Co-authored-by: Mateusz Kyc <mateusz.kyc@gmail.com>
Co-authored-by: Mateusz Kyc <mkyc@users.noreply.github.com>
Co-authored-by: Michał Opala <sk4zuzu@gmail.com>
Co-authored-by: to-bar <46519524+to-bar@users.noreply.github.com>
Co-authored-by: Luuk van Venrooij <luukvanvenrooij84@gmail.com>
Co-authored-by: Tomasz Arendt <tomasz.arendt@pl.abb.com>
Co-authored-by: Marcin Pyrka <pyrka.marcin@gmail.com>
Co-authored-by: erzetpe <erzetpe@gmail.com>
Co-authored-by: Luuk van Venrooij <11056665+seriva@users.noreply.github.com>
Co-authored-by: ar3ndt <tomasz.arendt@gmail.com>
Co-authored-by: Grzegorz Dajuk <grzegorz@dajuk.net>
Co-authored-by: Grzegorz Dajuk <grzegorz.dajuk@zipzero.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: TolikT <tolikt@users.noreply.github.com>
Co-authored-by: przemyslavic <43173646+przemyslavic@users.noreply.github.com>