[BUG] The RabbitMQ deployment - problem with dns #1530

Closed · ar3ndt opened this issue Aug 6, 2020 · 4 comments

ar3ndt (Contributor) commented Aug 6, 2020

Describe the bug
There is an issue with cluster formation: the two RabbitMQ nodes do not join a single cluster (each pod reports only itself under `running_nodes`).

To Reproduce
Steps to reproduce the behavior:

  1. Deploy the cluster with the RabbitMQ application.
  2. Run `rabbitmqctl cluster_status` in each of the two pods, as shown below.
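
The exact commands, formatted for readability:

```sh
kubectl exec -it -n=queue rabbitmq-cluster-0 -- rabbitmqctl cluster_status
kubectl exec -it -n=queue rabbitmq-cluster-1 -- rabbitmqctl cluster_status
```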

Expected behavior
The RabbitMQ deployment is clustered.
The command result shows 2 running nodes.
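
For a quick check, a small loop like the following (a sketch; it only greps the `running_nodes` line of the status output) should report both members on each pod:

```sh
for pod in rabbitmq-cluster-0 rabbitmq-cluster-1; do
  # Each member is expected to list both nodes under running_nodes.
  kubectl exec -n=queue "$pod" -- rabbitmqctl cluster_status | grep running_nodes
done
```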

Additional context
Actual command result:

[ec2-user@ec2-15-236-60-122 ~]$ kubectl exec -it -n=queue rabbitmq-cluster-0 -- rabbitmqctl cluster_status
Cluster status of node rabbit@10.244.3.16 ...
[{nodes,[{disc,['rabbit@10.244.3.16']}]},
 {running_nodes,['rabbit@10.244.3.16']},
 {cluster_name,<<"rabbit@rabbitmq-cluster-0.rabbitmq-cluster.queue.svc.cluster.local">>},
 {partitions,[]},
 {alarms,[{'rabbit@10.244.3.16',[]}]}]
[ec2-user@ec2-15-236-60-122 ~]$ kubectl exec -it -n=queue rabbitmq-cluster-1 -- rabbitmqctl cluster_status
Cluster status of node rabbit@10.244.2.15 ...
[{nodes,[{disc,['rabbit@10.244.1.10','rabbit@10.244.1.14',
                'rabbit@10.244.2.11','rabbit@10.244.2.15',
                'rabbit@10.244.3.12','rabbit@10.244.3.8']}]},
 {running_nodes,['rabbit@10.244.2.15']},
 {cluster_name,<<"rabbit@rabbitmq-cluster-0.rabbitmq-cluster.queue.svc.cluster.local">>},
 {partitions,[]},
 {alarms,[{'rabbit@10.244.2.15',[]}]}]
[ec2-user@ec2-15-236-60-122 ~]$ kubectl get pods -n=queue -o wide
NAME                 READY   STATUS    RESTARTS   AGE     IP            NODE                                                 NOMINATED NODE   READINESS GATES
rabbitmq-cluster-0   1/1     Running   0          4h12m   10.244.3.16   ec2-15-236-95-83.eu-west-3.compute.amazonaws.com     <none>           <none>
rabbitmq-cluster-1   1/1     Running   0          4h13m   10.244.2.15   ec2-15-236-203-152.eu-west-3.compute.amazonaws.com   <none>           <none>
[ec2-user@ec2-15-236-60-122 ~]$ kubectl logs -n=queue rabbitmq-cluster-0

  ##  ##
  ##  ##      RabbitMQ 3.7.10. Copyright (C) 2007-2018 Pivotal Software, Inc.
  ##########  Licensed under the MPL.  See http://www.rabbitmq.com/
  ######  ##
  ##########  Logs: <stdout>

              Starting broker...
2020-06-29 10:00:35.281 [info] <0.211.0>
 Starting RabbitMQ 3.7.10 on Erlang 21.2.3
 Copyright (C) 2007-2018 Pivotal Software, Inc.
 Licensed under the MPL.  See http://www.rabbitmq.com/
2020-06-29 10:00:35.288 [info] <0.211.0>
 node           : rabbit@10.244.3.16
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.conf
 cookie hash    : GRtFqfufy0A8wfweQSYTgA==
 log(s)         : <stdout>
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@10.244.3.16
2020-06-29 10:00:36.824 [info] <0.219.0> Memory high watermark set to 1864 MiB (1955262464 bytes) of 3729 MiB (3910524928 bytes) total
2020-06-29 10:00:36.828 [info] <0.221.0> Enabling free disk space monitoring
2020-06-29 10:00:36.828 [info] <0.221.0> Disk free limit set to 50MB
2020-06-29 10:00:36.832 [info] <0.224.0> Limiting to approx 1048476 file handles (943626 sockets)
2020-06-29 10:00:36.832 [info] <0.225.0> FHC read buffering:  OFF
2020-06-29 10:00:36.832 [info] <0.225.0> FHC write buffering: ON
2020-06-29 10:00:36.833 [info] <0.211.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@10.244.3.16 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2020-06-29 10:00:36.833 [info] <0.211.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2020-06-29 10:00:36.833 [info] <0.211.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2020-06-29 10:00:36.833 [info] <0.211.0> Peer discovery backend does not support locking, falling back to randomized delay
2020-06-29 10:00:36.833 [info] <0.211.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2020-06-29 10:00:36.852 [info] <0.211.0> k8s endpoint listing returned nodes not yet ready: 10.244.2.15, 10.244.3.16
2020-06-29 10:00:36.852 [info] <0.211.0> All discovered existing cluster peers:
2020-06-29 10:00:36.852 [info] <0.211.0> Discovered no peer nodes to cluster with
2020-06-29 10:00:36.854 [info] <0.43.0> Application mnesia exited with reason: stopped
2020-06-29 10:00:36.918 [info] <0.211.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-29 10:00:36.943 [info] <0.211.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-29 10:00:36.968 [info] <0.211.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-29 10:00:36.968 [info] <0.211.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping registration.
2020-06-29 10:00:36.969 [info] <0.211.0> Priority queues enabled, real BQ is rabbit_variable_queue
2020-06-29 10:00:36.971 [info] <0.396.0> Starting rabbit_node_monitor
2020-06-29 10:00:36.994 [info] <0.211.0> message_store upgrades: 1 to apply
2020-06-29 10:00:36.994 [info] <0.211.0> message_store upgrades: Applying rabbit_variable_queue:move_messages_to_vhost_store
2020-06-29 10:00:36.995 [info] <0.211.0> message_store upgrades: No durable queues found. Skipping message store migration
2020-06-29 10:00:36.995 [info] <0.211.0> message_store upgrades: Removing the old message store data
2020-06-29 10:00:36.995 [info] <0.211.0> message_store upgrades: All upgrades applied successfully
2020-06-29 10:00:37.020 [info] <0.211.0> Management plugin: using rates mode 'basic'
2020-06-29 10:00:37.021 [info] <0.211.0> Adding vhost '/'
2020-06-29 10:00:37.032 [info] <0.436.0> Making sure data directory '/var/lib/rabbitmq/mnesia/rabbit@10.244.3.16/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L' for vhost '/' exists
2020-06-29 10:00:37.036 [info] <0.436.0> Starting message stores for vhost '/'
2020-06-29 10:00:37.036 [info] <0.440.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_transient": using rabbit_msg_store_ets_index to provide index
2020-06-29 10:00:37.037 [info] <0.436.0> Started message store of type transient for vhost '/'
2020-06-29 10:00:37.037 [info] <0.443.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": using rabbit_msg_store_ets_index to provide index
2020-06-29 10:00:37.037 [warning] <0.443.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": rebuilding indices from scratch
2020-06-29 10:00:37.038 [info] <0.436.0> Started message store of type persistent for vhost '/'
2020-06-29 10:00:37.039 [info] <0.211.0> Creating user 'guest'
2020-06-29 10:00:37.040 [info] <0.211.0> Setting user tags for user 'guest' to [administrator]
2020-06-29 10:00:37.042 [info] <0.211.0> Setting permissions for 'guest' in '/' to '.*', '.*', '.*'
2020-06-29 10:00:37.044 [warning] <0.467.0> Setting Ranch options together with socket options is deprecated. Please use the new map syntax that allows specifying socket options separately from other options.
2020-06-29 10:00:37.045 [info] <0.481.0> started TCP listener on [::]:5672
2020-06-29 10:00:37.048 [info] <0.211.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@10.244.3.16'
2020-06-29 10:00:37.050 [info] <0.211.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@10.244.3.16'
2020-06-29 10:00:37.051 [info] <0.495.0> Peer discovery: enabling node cleanup (will only log warnings). Check interval: 30 seconds.
2020-06-29 10:00:37.077 [info] <0.545.0> Management plugin: HTTP (non-TLS) listener started on port 15672
2020-06-29 10:00:37.077 [info] <0.651.0> Statistics database started.
2020-06-29 10:00:37.156 [info] <0.8.0> Server startup complete; 5 plugins started.
 * rabbitmq_management
 * rabbitmq_web_dispatch
 * rabbitmq_management_agent
 * rabbitmq_peer_discovery_k8s
 * rabbitmq_peer_discovery_common
 completed with 5 plugins.
[ec2-user@ec2-15-236-60-122 ~]$ kubectl logs -n=queue rabbitmq-cluster-1

  ##  ##
  ##  ##      RabbitMQ 3.7.10. Copyright (C) 2007-2018 Pivotal Software, Inc.
  ##########  Licensed under the MPL.  See http://www.rabbitmq.com/
  ######  ##
  ##########  Logs: <stdout>

              Starting broker...
2020-06-29 09:59:35.287 [info] <0.211.0>
 Starting RabbitMQ 3.7.10 on Erlang 21.2.3
 Copyright (C) 2007-2018 Pivotal Software, Inc.
 Licensed under the MPL.  See http://www.rabbitmq.com/
2020-06-29 09:59:35.292 [info] <0.211.0>
 node           : rabbit@10.244.2.15
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.conf
 cookie hash    : GRtFqfufy0A8wfweQSYTgA==
 log(s)         : <stdout>
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@10.244.2.15
2020-06-29 09:59:36.555 [info] <0.219.0> Memory high watermark set to 1864 MiB (1955258368 bytes) of 3729 MiB (3910516736 bytes) total
2020-06-29 09:59:36.558 [info] <0.221.0> Enabling free disk space monitoring
2020-06-29 09:59:36.559 [info] <0.221.0> Disk free limit set to 50MB
2020-06-29 09:59:36.562 [info] <0.224.0> Limiting to approx 1048476 file handles (943626 sockets)
2020-06-29 09:59:36.562 [info] <0.225.0> FHC read buffering:  OFF
2020-06-29 09:59:36.562 [info] <0.225.0> FHC write buffering: ON
2020-06-29 09:59:36.562 [info] <0.211.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@10.244.2.15 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2020-06-29 09:59:36.562 [info] <0.211.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2020-06-29 09:59:36.563 [info] <0.211.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2020-06-29 09:59:36.563 [info] <0.211.0> Peer discovery backend does not support locking, falling back to randomized delay
2020-06-29 09:59:36.563 [info] <0.211.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2020-06-29 09:59:36.581 [info] <0.211.0> k8s endpoint listing returned nodes not yet ready: 10.244.2.15
2020-06-29 09:59:36.581 [info] <0.211.0> All discovered existing cluster peers: rabbit@10.244.1.14
2020-06-29 09:59:36.581 [info] <0.211.0> Peer nodes we can cluster with: rabbit@10.244.1.14
2020-06-29 09:59:36.588 [info] <0.211.0> Node 'rabbit@10.244.1.14' selected for auto-clustering
2020-06-29 09:59:48.634 [info] <0.211.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-29 09:59:48.833 [info] <0.211.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-29 09:59:48.861 [info] <0.211.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-29 09:59:48.867 [info] <0.211.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@10.244.2.15'
2020-06-29 09:59:48.872 [info] <0.211.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@10.244.2.15'
2020-06-29 09:59:48.873 [info] <0.211.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping registration.
2020-06-29 09:59:48.874 [info] <0.211.0> Priority queues enabled, real BQ is rabbit_variable_queue
2020-06-29 09:59:48.877 [info] <0.424.0> Starting rabbit_node_monitor
2020-06-29 09:59:48.901 [info] <0.211.0> message_store upgrades: 1 to apply
2020-06-29 09:59:48.901 [info] <0.211.0> message_store upgrades: Applying rabbit_variable_queue:move_messages_to_vhost_store
2020-06-29 09:59:48.901 [info] <0.211.0> message_store upgrades: No durable queues found. Skipping message store migration
2020-06-29 09:59:48.901 [info] <0.211.0> message_store upgrades: Removing the old message store data
2020-06-29 09:59:48.902 [info] <0.211.0> message_store upgrades: All upgrades applied successfully
2020-06-29 09:59:48.927 [info] <0.211.0> Management plugin: using rates mode 'basic'
2020-06-29 09:59:48.930 [info] <0.456.0> Making sure data directory '/var/lib/rabbitmq/mnesia/rabbit@10.244.2.15/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L' for vhost '/' exists
2020-06-29 09:59:48.933 [info] <0.456.0> Starting message stores for vhost '/'
2020-06-29 09:59:48.933 [info] <0.460.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_transient": using rabbit_msg_store_ets_index to provide index
2020-06-29 09:59:48.934 [info] <0.456.0> Started message store of type transient for vhost '/'
2020-06-29 09:59:48.934 [info] <0.463.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": using rabbit_msg_store_ets_index to provide index
2020-06-29 09:59:48.935 [warning] <0.463.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": rebuilding indices from scratch
2020-06-29 09:59:48.936 [info] <0.456.0> Started message store of type persistent for vhost '/'
2020-06-29 09:59:48.938 [warning] <0.484.0> Setting Ranch options together with socket options is deprecated. Please use the new map syntax that allows specifying socket options separately from other options.
2020-06-29 09:59:48.939 [info] <0.498.0> started TCP listener on [::]:5672
2020-06-29 09:59:48.940 [info] <0.211.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@10.244.2.15'
2020-06-29 09:59:48.941 [info] <0.211.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@10.244.2.15'
2020-06-29 09:59:48.941 [info] <0.424.0> rabbit on node 'rabbit@10.244.1.14' up
2020-06-29 09:59:48.942 [info] <0.506.0> Peer discovery: enabling node cleanup (will only log warnings). Check interval: 30 seconds.
2020-06-29 09:59:48.973 [info] <0.557.0> Management plugin: HTTP (non-TLS) listener started on port 15672
2020-06-29 09:59:48.973 [info] <0.663.0> Statistics database started.
 completed with 5 plugins.
2020-06-29 09:59:49.058 [info] <0.8.0> Server startup complete; 5 plugins started.
 * rabbitmq_management
 * rabbitmq_web_dispatch
 * rabbitmq_management_agent
 * rabbitmq_peer_discovery_k8s
 * rabbitmq_peer_discovery_common
2020-06-29 10:00:18.948 [info] <0.506.0> k8s endpoint listing returned nodes not yet ready: 10.244.2.15
2020-06-29 10:00:18.948 [warning] <0.506.0> Peer discovery: node rabbit@10.244.1.10 is unreachable
2020-06-29 10:00:18.949 [warning] <0.506.0> Peer discovery: node rabbit@10.244.2.11 is unreachable
2020-06-29 10:00:18.949 [warning] <0.506.0> Peer discovery: node rabbit@10.244.3.12 is unreachable
2020-06-29 10:00:18.949 [warning] <0.506.0> Peer discovery: node rabbit@10.244.3.8 is unreachable
2020-06-29 10:00:21.464 [info] <0.424.0> rabbit on node 'rabbit@10.244.1.14' down
2020-06-29 10:00:21.468 [info] <0.424.0> Keeping rabbit@10.244.1.14 listeners: the node is already back
2020-06-29 10:00:21.493 [info] <0.424.0> node 'rabbit@10.244.1.14' down: connection_closed
2020-06-29 10:00:48.976 [info] <0.506.0> k8s endpoint listing returned nodes not yet ready: 10.244.3.16
2020-06-29 10:00:48.976 [warning] <0.506.0> Peer discovery: node rabbit@10.244.1.10 is unreachable
2020-06-29 10:00:48.977 [warning] <0.506.0> Peer discovery: node rabbit@10.244.1.14 is unreachable
2020-06-29 10:00:48.977 [warning] <0.506.0> Peer discovery: node rabbit@10.244.2.11 is unreachable
2020-06-29 10:00:48.977 [warning] <0.506.0> Peer discovery: node rabbit@10.244.3.12 is unreachable
2020-06-29 10:00:48.977 [warning] <0.506.0> Peer discovery: node rabbit@10.244.3.8 is unreachable
2020-06-29 10:01:18.947 [warning] <0.506.0> Peer discovery: node rabbit@10.244.1.10 is unreachable
2020-06-29 10:01:18.947 [warning] <0.506.0> Peer discovery: node rabbit@10.244.1.14 is unreachable
2020-06-29 10:01:18.947 [warning] <0.506.0> Peer discovery: node rabbit@10.244.2.11 is unreachable
2020-06-29 10:01:18.947 [warning] <0.506.0> Peer discovery: node rabbit@10.244.3.12 is unreachable
2020-06-29 10:01:18.947 [warning] <0.506.0> Peer discovery: node rabbit@10.244.3.8 is unreachable
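
Since the node names above are pod IPs rather than the DNS names that appear in the `cluster_name`, a DNS sanity check can be run from inside a pod (a diagnostic sketch; the service and namespace names are taken from the `cluster_name` above, and `getent` is assumed to be available in the image):

```sh
# Verify that the per-pod records of the headless service resolve from inside a pod.
kubectl exec -n=queue rabbitmq-cluster-0 -- \
  getent hosts rabbitmq-cluster-0.rabbitmq-cluster.queue.svc.cluster.local
kubectl exec -n=queue rabbitmq-cluster-0 -- \
  getent hosts rabbitmq-cluster-1.rabbitmq-cluster.queue.svc.cluster.local
```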
ar3ndt (Contributor, Author) commented Aug 6, 2020

Some analysis is available in the already closed issue:
#1395

przemyslavic (Collaborator) commented:

I can't reproduce the problem on the current develop branch. It's quite possible that updating Kubernetes and the network plugins fixed it.

plirglo (Contributor) commented Oct 16, 2020

Please run the tests to verify whether the Kubernetes upgrade solved this issue, and/or confirm the current status with @przemyslavic.

przemyslavic (Collaborator) commented:

It looks like the problem was noticed in version 0.7 or 0.8. Since then we have updated both Kubernetes and RabbitMQ, and the problem can no longer be reproduced.
