
Consul connection error on port 8300 #14464

Closed
HalacliSeda opened this issue Sep 2, 2022 · 21 comments

@HalacliSeda

Hello,

I use Consul 1.13.1.
I have two servers (as an example): 10.10.10.1 and 10.10.10.2.
I set up a Consul server on both.
The consul.json files are nearly identical on both; only bind_addr differs:
{
  "bind_addr": "10.10.10.1",
  "client_addr": "0.0.0.0",
  "datacenter": "datacenter-01",
  "bootstrap_expect": 3,
  "data_dir": "/var/lib/consul",
  "encrypt": "",
  "disable_update_check": true,
  "server": true,
  "ui": true,
  "rejoin_after_leave": true,
  "retry_join": ["10.10.10.1","10.10.10.2","......."],
  "acl": {
    "enabled": true,
    "default_policy": "deny",
    "tokens": {
      "agent": ""
    }
  }
}
{
  "bind_addr": "10.10.10.2",
  "client_addr": "0.0.0.0",
  "datacenter": "datacenter-01",
  "bootstrap_expect": 3,
  "data_dir": "/var/lib/consul",
  "encrypt": "",
  "disable_update_check": true,
  "server": true,
  "ui": true,
  "rejoin_after_leave": true,
  "retry_join": ["10.10.10.1","10.10.10.2","......."],
  "acl": {
    "enabled": true,
    "default_policy": "deny",
    "tokens": {
      "agent": ""
    }
  }
}

The consul members output looks like this:

Node  Address          Status  Type    Build   Protocol  DC             Partition  Segment
ha1   10.10.10.1:8301  alive   server  1.13.1  2         datacenter-01  default
ha2   10.10.10.2:8301  alive   server  1.13.1  2         datacenter-01  default

But I get the following error on both servers:
[WARN] agent: [core]grpc: addrConn.createTransport failed to connect to {ha1:8300 ha1.compute.internal 0 }. Err: connection error: desc = "transport: Error while dialing dial tcp 10.10.10.2:0->10.10.10.1:8300: operation was canceled". Reconnecting...

Port 8300 is used by the Consul server RPC on both servers. I checked the port with telnet and there is no problem:
telnet 10.10.10.1 8300
Trying 10.10.10.1...
Connected to 10.10.10.1.
Escape character is '^]'.

I did not get this error with Consul 1.12.1. Is this a bug in Consul 1.13.1?

Thanks,
Seda

@Serg2294

I have the same issue.

@Din-He

Din-He commented Oct 18, 2022

I ran into the same problem when deploying a Consul cluster with Helm on Kubernetes.

@obourdon

obourdon commented Oct 18, 2022

Same here with consul 1.11.4

Seen in the changelog:
Fixed in 1.12.1:
rpc: Adds a deadline to client RPC calls, so that streams will no longer hang indefinitely in unstable network conditions. [GH-8504] [GH-11500]
Fixed in 1.12.3:
deps: Update go-grpc/grpc, resolving connection memory leak [GH-13051]

Not sure, though, that these are related.

@obourdon

Just to be more complete on this: after seeing a lot of these errors, even a call to localhost:8500 just fails.

@obourdon

Upgrading to 1.12.3 did not help; the same errors occur across the cluster of Consul servers + clients.

@obourdon

obourdon commented Oct 19, 2022

On my side, and contrary to @HalacliSeda, 1.12.1 also seems to have the issue, even if less frequently.

@obourdon

Also tried latest 1.13.2 with same results :-(

@quinndiggitypolymath

Yeah, seeing these as well, intermittently - as @obourdon mentioned, this seems to be related to the timeouts/aborts that were recently added; my prior clusters don't experience these disconnects.

All in all, the functionality of the clusters logging these messages isn't otherwise affected, so this seems to be due to overly aggressive timeouts - there was a recent refactor around RPC timeouts plus the addition of limits.rpc_client_timeout (defaulting to 60s): #14965

hopefully easing the timeouts resolves these errors
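For reference, raising that limit in the agent configuration would look roughly like the sketch below (a hedged example only: rpc_client_timeout exists only in releases that include #14965, and "90s" is an illustrative value, not a recommendation):

{
  "limits": {
    "rpc_client_timeout": "90s"
  }
}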

@Din-He

Din-He commented Oct 21, 2022

@quinndiggitypolymath If I understand you correctly, this [WARN] message does not affect normal operation of the cluster, right?

@obourdon

obourdon commented Oct 21, 2022

@quinndiggitypolymath many thanks for this very valuable information.

However, there are cases where, after quite a while, even accessing port 8500 locally just fails, as mentioned here.

Furthermore, this does not seem "recent", as the list of impacted versions seems to prove.

Could you please explain in more detail what you meant by "easing the timeouts resolves these errors"?
Is there some configuration we can set to avoid these errors? Like increasing limits.rpc_client_timeout to 120 or 180 seconds?
What would be the (other) impact(s)/risk(s) of doing so?

Many thanks again

@quinndiggitypolymath

@Din-He, at least this particular "operation was canceled" message on its own doesn't seem to indicate a specific problem (to me); I am seeing the same message being logged, and still have functional clusters (in terms of service resolution/mesh network traffic flow/key-value/distributed locks, etc) - for me, nomad is still able to schedule services, and those services are functioning correctly, vault is operational, etc.

@obourdon, that sounds like the messages may be a symptom of another issue (or multiple issues) - consul has a lot of areas where things can break if not configured exactly right, and the logging could be better in some spots when debugging. Without knowing what your configuration is like, I would recommend adjusting the logging level https://developer.hashicorp.com/consul/docs/agent/config/config-files#log_level to debug (or trace, if you need more verbosity; remember to return to info or warn, as the log volume can be enormous) to see if that shakes out any specific errors. Be sure to double check that the process isn't being restarted/stopped/stalling/crashing under whatever means it is being run. Ensure the underlying storage volume has enough throughput/IOPS, and that all required traffic can be sent/received through the network https://developer.hashicorp.com/consul/docs/install/ports If you are utilizing containers, ensure that consul isn't listening only on 127.0.0.1 (unless you have an arrangement set up to make that work through DNAT, etc). Check that the cluster is healthy https://developer.hashicorp.com/consul/api-docs/status#get-raft-leader and recover if not https://learn.hashicorp.com/tutorials/consul/recovery-outage If you are (hopefully you are) using encryption https://developer.hashicorp.com/consul/docs/agent/config/config-files#encrypt and mTLS https://developer.hashicorp.com/consul/docs/agent/config/config-files#tls_defaults_verify_outgoing (or just TLS https://developer.hashicorp.com/consul/docs/agent/config/config-files#tls_defaults_cert_file ) for everything, ensure that your certificate chains are proper and pass verification https://developer.hashicorp.com/consul/docs/agent/config/config-files#tls_defaults_verify_incoming
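As a quick illustration of the health checks above, a minimal sketch, assuming the agent HTTP API is reachable on localhost:8500 (add an X-Consul-Token header / CONSUL_HTTP_TOKEN if your ACL setup requires it):

# Confirm the cluster currently has a Raft leader
curl http://localhost:8500/v1/status/leader
# expected: an address such as "10.10.10.1:8300"

# Cross-check the Raft peer set (needs a token with operator:read when ACLs are enforced)
consul operator raft list-peers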

Furthermore this does not seem "recent" as the list of impacted versions seems to to prove

Hashicorp (I am not affiliated) supports the last 2 releases of consul https://support.hashicorp.com/hc/en-us/articles/360021185113-Support-Period-and-End-of-Life-EOL-Policy so I've been on 1.11 due to Connect CA changes that broke my arrangement (requiring federated clusters to share a common Vault cluster is bunk https://developer.hashicorp.com/consul/docs/connect/ca/vault#rootpkipath but Cluster Peering https://developer.hashicorp.com/consul/docs/connect/cluster-peering effectively replaces the per-datacenter Vault arrangement + is how I would have preferred Connect to work in the first place 🥲 ). I have not seen this error on 1.11 or before, but as the 1.14 beta is out, I will need to be on 1.12 or above soon; I am nearly done moving fully to 1.13 (or 1.14 to utilize the Peering Service Mesh setup, once that matures a little more), so I haven't as thoroughly evaluated this particular error with a production workload on the versions in between.

what you meant by easing the timeouts resolves these errors

Essentially, if the limit is being hit, slightly increase that limit (test/record metrics before + after); if you have a particularly slow request, where hitting 60s causes it to abort sometimes but it needs only ~5s more (for whatever reason, say running on an ARM device with slow storage), a jump to 90s might handle that. Going overboard with that can be bad; 60s is the status quo without overriding.

What would be the (other) impact(s)/risk(s) of doing so ?

Increased resource usage, more sockets, more memory, more load, etc; under failure modes it could have cascading effects, all those sorts of things, on top of it taking longer to know something is wrong (if the request won't actually ever succeed, failing faster would allow retries + potentially freeing up resources). As with any change, measure before and after, and refine; if it needs 65s, 90s is overkill in that scenario, so reduce and measure again
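One way to do that before/after measurement, sketched under the assumption that the agent HTTP API is on localhost:8500 and $CONSUL_HTTP_TOKEN holds a token with agent:read (only needed when ACLs are enforced):

# Snapshot agent/RPC metrics before and after changing the timeout, then compare
curl -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" http://localhost:8500/v1/agent/metrics > metrics-before.json
# ... apply the config change and reload/restart the agent ...
curl -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" http://localhost:8500/v1/agent/metrics > metrics-after.json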

@Din-He

Din-He commented Oct 21, 2022

@quinndiggitypolymath thank you very much! Haha

@Din-He

Din-He commented Oct 21, 2022

A new question for everyone.
I deployed Consul on Kubernetes with Helm (for convenience, the Kubernetes cluster has only one master node). In the Helm values.yaml I enabled Consul's ACLs and gossip encryption:

gossipEncryption:
  autoGenerate: true
acls:
  manageSystemACLs: true

The deployment works and all pods in Kubernetes are healthy. It automatically generated some tokens, as shown in the screenshot below.
(screenshot of the generated tokens)
Then I used the global-management token in my Spring Boot application to register a microservice with the Consul cluster I deployed, and it returned an error:
token with AccessorID '00000000-0000-0000-0000-000000000002' lacks permission 'service:write' on "demo20221017"
demo20221017 is my service name. The message seems to say that the token with AccessorID ...002 lacks write permission, but I am not using that token at all; I am using the global-management token. Does anyone know what is going on here? Thanks for any answers.

@quinndiggitypolymath

quinndiggitypolymath commented Oct 21, 2022

@Din-He, token with AccessorID '00000000-0000-0000-0000-000000000002' is the default anonymous token ( https://developer.hashicorp.com/consul/docs/security/acl/acl-tokens#anonymous-token ) , meaning the node itself is trying to service:write on demo20221017 without a token being provided; double check that you are setting: https://developer.hashicorp.com/consul/docs/agent/config/config-files#acl_tokens
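As an illustration, the node's agent configuration would then carry that token roughly like this sketch (the value is only a placeholder for whatever token secret you actually create for the node):

{
  "acl": {
    "enabled": true,
    "tokens": {
      "agent": "<node-token-secret>"
    }
  }
}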

You will need a token for the node, and a policy attached to it; your policy may look along the lines of:

service "demo20221017" {
  policy = "write"
}
service "demo20221017-sidecar-proxy" {
  policy = "write"
}
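If it helps, a minimal sketch of wiring that up with the Consul CLI, assuming the policy above is saved as demo20221017-policy.hcl (a hypothetical file name) and your environment already has a token allowed to manage ACLs (e.g. the bootstrap/global-management token):

consul acl policy create -name "demo20221017-service" -rules @demo20221017-policy.hcl
consul acl token create -description "demo20221017 service token" -policy-name "demo20221017-service"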

but refer to the following for specifics:

@ncode
Contributor

ncode commented Dec 21, 2022

I'm also having the same issue with 1.14.2. I'm playing with the rpc_client_timeout, but no luck so far.

@jkirschner-hashicorp
Contributor

so I've been on 1.11 due to Connect CA changes that broke my arrangement (requiring federated clusters to share a common Vault cluster is bunk)

@quinndiggitypolymath : Can you share more about the Connect CA change that made your 1 Vault cluster : 1 Consul cluster setup stop working? I had thought that WAN federated Consul clusters could use different Vault clusters. And if they can't in your experience, that's something I'm interested in following up on. It would be preferable from a latency and resilience perspective to have a Vault cluster in the same region as the Consul cluster it acts as the Connect CA for.

@obourdon

obourdon commented Feb 6, 2023

Is this somehow related to issue #10603?

@obourdon

obourdon commented Feb 6, 2023

Seems like migrating to consul 1.14.4 fixes this issue on my side

@tunguyen9889

Seems like migrating to consul 1.14.4 fixes this issue on my side

Yes, I confirmed 1.14.4 fixed this warning message.

@obourdon

obourdon commented Feb 7, 2023

In fact, after one night of operations, it is drastically reduced but still present. It went down from 100-150 occurrences/hour to 1 or 2 every 2-3 hours (the previously installed version was 1.14.3).

@david-yu
Contributor

david-yu commented Feb 7, 2023

Thanks @obourdon. This does seem to be a dupe of #10603, which was just closed. Please note that this still occurs on agent startup, which is why you likely still see this issue; that is tracked here: #15821. I'll go ahead and close this issue, as there is now a separate issue tracking the agent startup WARN logs.

@david-yu david-yu closed this as completed Feb 7, 2023