etcd container failing to communicate on some port in a different (Docker Swarm node) subnet, but works in the same subnet #10494

Closed
samar51 opened this issue Feb 21, 2019 · 5 comments


samar51 commented Feb 21, 2019

Please read https://github.com/etcd-io/etcd/blob/master/Documentation/reporting_bugs.md.
While deploying etcd containers with Docker on nodes in different subnets in different datacentres, I get the error below.

2019-02-21 16:52:42.506314 I | raft: b8b747c74aaea686 is starting a new election at term 928
2019-02-21 16:52:42.506344 I | raft: b8b747c74aaea686 became candidate at term 929
2019-02-21 16:52:42.506353 I | raft: b8b747c74aaea686 received MsgVoteResp from b8b747c74aaea686 at term 929
2019-02-21 16:52:42.506361 I | raft: b8b747c74aaea686 [logterm: 1, index: 3] sent MsgVote request to b3504381e8ba3cb at term 929
2019-02-21 16:52:42.506367 I | raft: b8b747c74aaea686 [logterm: 1, index: 3] sent MsgVote request to f572fdfc5cb68406 at term 929
2019-02-21 16:52:43.158372 W | rafthttp: health check for peer b3504381e8ba3cb could not connect: dial tcp 10.0.2.81:2380: i/o timeout
2019-02-21 16:52:43.159658 W | rafthttp: health check for peer f572fdfc5cb68406 could not connect: dial tcp 10.0.2.83:2380: i/o timeout

docker version:
[user-docker@f1cloud2201 ~]$ docker version
Client:
Version: 17.12.0-ce
API version: 1.35
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:10:14 2017
OS/Arch: linux/amd64

Server:
Engine:
Version: 17.12.0-ce
API version: 1.35 (minimum version 1.12)
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:12:46 2017
OS/Arch: linux/amd64
Experimental: true

I followed all the Docker Swarm port prerequisites.
telnet and nc are also working.

etcd compose file:

version: '3'
services:
  etcd01:
    image: quay.io/coreos/etcd
    ports:
      - 2379
      - 4001
      - 2380
    networks:
      - dbs1
    volumes:
      - etcd01:/etcd_data
    deploy:
      placement:
        constraints:
          - node.labels.type20 == etcd01
      replicas: 1
    command:
      - /usr/local/bin/etcd
      - -name
      - etcd01
      - --data-dir
      - /etcd_data
      - -advertise-client-urls
      - http://etcd01:2379,http://etcd01:4001,http://127.0.0.1:2379
      - -listen-client-urls
      - http://0.0.0.0:2379,http://0.0.0.0:4001
      - -initial-advertise-peer-urls
      - http://etcd01:2380
      - -listen-peer-urls
      - http://0.0.0.0:2380
      - -initial-cluster
      - etcd01=http://etcd01:2380,etcd02=http://etcd02:2380,etcd03=http://etcd03:2380

  etcd02:
    image: quay.io/coreos/etcd
    ports:

    command:
      - /usr/local/bin/etcd
      - -name
      - etcd03
      - --data-dir
      - /etcd_data
      - -advertise-client-urls
      - http://etcd03:2379,http://etcd03:4001,http://127.0.0.1:2379
      - -listen-client-urls
      - http://0.0.0.0:2379,http://0.0.0.0:4001
      - -initial-advertise-peer-urls
      - http://etcd03:2380
      - -listen-peer-urls
      - http://0.0.0.0:2380
      - -initial-cluster
      - etcd01=http://etcd01:2380,etcd02=http://etcd02:2380,etcd03=http://etcd03:2380

volumes:
  etcd01:
  etcd02:
  etcd03:

networks:
  dbs1:
    external: true

@hexfusion
Contributor

While deploying docker etcd container on nodes on the different subnet in different datacentre.

2019-02-21 16:52:43.159658 W | rafthttp: health check for peer f572fdfc5cb68406 could not connect: dial tcp 10.0.2.83:2380: i/o timeout

I assume you have verified this is routable? Strip all of this back and you have a networking problem, or at best latency issues, in which case the message is expected. Exec into one of these containers and see if you can connect to the other. If yes, then you might need to look at tuning etcd to deal with the latencies [1][2]. Check out your etcd_network_peer_round_trip.. latencies and election metrics. My guess is you are seeing heavy leader elections, which destabilize the cluster. If you're going to run etcd across data centers like this, you need to understand how to tune it, as the defaults generally aren't going to cover this use case. Focus on --heartbeat-interval and --election-timeout.

[1] https://github.com/etcd-io/etcd/blob/master/Documentation/tuning.md#time-parameters
[2] https://github.com/etcd-io/etcd/blob/master/Documentation/faq.md#does-etcd-work-in-cross-region-or-cross-data-center-deployments
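
A minimal sketch of how the two flags relate to a measured round-trip time, for anyone working through [1]. It assumes my reading of the tuning guide: heartbeat roughly equal to the peer RTT, election timeout about ten times that, capped at 50000 ms. The 100 ms floor (etcd's default heartbeat) and the helper name `suggest_timing` are assumptions made for illustration, not etcd API.

```python
def suggest_timing(rtt_ms: float) -> dict:
    """Suggest etcd timing values (in ms) from a measured peer round-trip time."""
    # Heartbeat roughly at the RTT; assumption: never go below the 100 ms default.
    heartbeat_ms = max(100, round(rtt_ms))
    # Election timeout about 10x the heartbeat, capped at 50000 ms.
    election_ms = min(50000, 10 * heartbeat_ms)
    return {"heartbeat_ms": heartbeat_ms, "election_ms": election_ms}


def as_flags(timing: dict) -> str:
    """Render the suggestion as etcd command-line flags."""
    return (f"--heartbeat-interval={timing['heartbeat_ms']} "
            f"--election-timeout={timing['election_ms']}")
```

With a sub-millisecond RTT the sketch simply returns the defaults, so if peers still time out at that latency the cause is more likely connectivity than tuning.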

@samar51
Author

samar51 commented Feb 22, 2019

Hi Sam,

Thanks for responding.
I checked the document and made all the necessary changes, but still no luck; the issue persists.

Changes made, tried three times with the values below:
--heartbeat-interval=100 --election-timeout=500 (default)
--heartbeat-interval=80 --election-timeout=500 (average heartbeat interval)
--heartbeat-interval=20 --election-timeout=500 (tried a lower heartbeat interval because latency is very low)
/ # ping -c 5 10.0.2.103
PING 10.0.2.103 (10.0.2.103): 56 data bytes
64 bytes from 10.0.2.103: seq=0 ttl=64 time=0.052 ms
64 bytes from 10.0.2.103: seq=1 ttl=64 time=0.056 ms
64 bytes from 10.0.2.103: seq=2 ttl=64 time=0.062 ms
64 bytes from 10.0.2.103: seq=3 ttl=64 time=0.052 ms
64 bytes from 10.0.2.103: seq=4 ttl=64 time=0.060 ms

--- 10.0.2.103 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0.052/0.056/0.062 ms
/ # ping -c 5 10.0.2.99
PING 10.0.2.99 (10.0.2.99): 56 data bytes
64 bytes from 10.0.2.99: seq=0 ttl=64 time=0.051 ms
64 bytes from 10.0.2.99: seq=1 ttl=64 time=0.048 ms
64 bytes from 10.0.2.99: seq=2 ttl=64 time=0.048 ms
64 bytes from 10.0.2.99: seq=3 ttl=64 time=0.051 ms
64 bytes from 10.0.2.99: seq=4 ttl=64 time=0.047 ms

--- 10.0.2.99 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0.047/0.049/0.051 ms
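
Note that ping only exercises ICMP, while the failing health checks dial TCP port 2380. A minimal sketch of a TCP-level check that could be run from inside one of the containers follows; it assumes Python is available there (the etcd image may not ship it), and `can_dial` is a hypothetical helper name.

```python
import socket


def can_dial(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, `can_dial("etcd02", 2380)` from the etcd01 container tests the same peer port that the rafthttp health check reports as timing out.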


etcd health status from different etcd containers:

etcdctl cluster-health

cluster may be unhealthy: failed to list members
Error: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:4001: getsockopt: connection refused
; error #1: client: endpoint http://127.0.0.1:2379 exceeded header timeout
error #0: dial tcp 127.0.0.1:4001: getsockopt: connection refused
error #1: client: endpoint http://127.0.0.1:2379 exceeded header timeout

/ # etcdctl cluster-health
member b3504381e8ba3cb is healthy: got healthy result from http://etcd02:2379
member b8b747c74aaea686 is unreachable: no available published client urls
member f572fdfc5cb68406 is healthy: got healthy result from http://etcd03:2379
cluster is degraded
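
For context on the "degraded" result: Raft needs a majority of members to make progress. A small sketch of that arithmetic, using hypothetical helper names:

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def has_quorum(members: int, reachable: int) -> bool:
    """True if the reachable members can still elect a leader and commit writes."""
    return reachable >= quorum(members)
```

With three members the quorum is two, so the two healthy members above can still form a majority, while a single isolated member cannot win an election on its own.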


error logs:

[user-docker@f1cloud2201 ~]$ docker logs 638ad344a6c7
2019-02-22 07:03:45.582035 I | etcdmain: etcd Version: 3.3.8
2019-02-22 07:03:45.582099 I | etcdmain: Git SHA: 33245c6
2019-02-22 07:03:45.582103 I | etcdmain: Go Version: go1.9.7
2019-02-22 07:03:45.582107 I | etcdmain: Go OS/Arch: linux/amd64
2019-02-22 07:03:45.582111 I | etcdmain: setting maximum number of CPUs to 8, total number of available CPUs is 8
2019-02-22 07:03:45.582255 I | embed: listening for peers on http://0.0.0.0:2380
2019-02-22 07:03:45.582300 I | embed: listening for client requests on 0.0.0.0:2379
2019-02-22 07:03:45.593008 W | pkg/netutil: failed resolving host etcd01:2380 (lookup etcd01 on 127.0.0.11:53: no such host); retrying in 1s
2019-02-22 07:03:46.593459 I | pkg/netutil: resolving etcd01:2380 to 10.0.2.101:2380
2019-02-22 07:03:46.593718 I | pkg/netutil: resolving etcd01:2380 to 10.0.2.101:2380
2019-02-22 07:03:47.599106 I | etcdserver: name = etcd01
2019-02-22 07:03:47.599127 I | etcdserver: data dir = /etcd_data
2019-02-22 07:03:47.599132 I | etcdserver: member dir = /etcd_data/member
2019-02-22 07:03:47.599135 I | etcdserver: heartbeat = 100ms
2019-02-22 07:03:47.599138 I | etcdserver: election = 5000ms
2019-02-22 07:03:47.599140 I | etcdserver: snapshot count = 100000
2019-02-22 07:03:47.599161 I | etcdserver: advertise client URLs = http://etcd01:2379
2019-02-22 07:03:47.599171 I | etcdserver: initial advertise peer URLs = http://etcd01:2380
2019-02-22 07:03:47.599187 I | etcdserver: initial cluster = etcd01=http://etcd01:2380,etcd02=http://etcd02:2380,etcd03=http://etcd03:2380
2019-02-22 07:03:47.603068 I | etcdserver: starting member b8b747c74aaea686 in cluster a86cb9d32082dbec
2019-02-22 07:03:47.603097 I | raft: b8b747c74aaea686 became follower at term 0
2019-02-22 07:03:47.603106 I | raft: newRaft b8b747c74aaea686 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2019-02-22 07:03:47.603110 I | raft: b8b747c74aaea686 became follower at term 1
2019-02-22 07:03:47.608531 W | auth: simple token is not cryptographically signed
2019-02-22 07:03:47.610933 I | rafthttp: starting peer b3504381e8ba3cb...
2019-02-22 07:03:47.610975 I | rafthttp: started HTTP pipelining with peer b3504381e8ba3cb
2019-02-22 07:03:47.611552 I | rafthttp: started streaming with peer b3504381e8ba3cb (writer)
2019-02-22 07:03:47.611832 I | rafthttp: started streaming with peer b3504381e8ba3cb (writer)
2019-02-22 07:03:47.612426 I | rafthttp: started peer b3504381e8ba3cb
2019-02-22 07:03:47.612448 I | rafthttp: added peer b3504381e8ba3cb
2019-02-22 07:03:47.612460 I | rafthttp: starting peer f572fdfc5cb68406...
2019-02-22 07:03:47.612473 I | rafthttp: started HTTP pipelining with peer f572fdfc5cb68406
2019-02-22 07:03:47.612494 I | rafthttp: started streaming with peer b3504381e8ba3cb (stream MsgApp v2 reader)
2019-02-22 07:03:47.612568 I | rafthttp: started streaming with peer b3504381e8ba3cb (stream Message reader)
2019-02-22 07:03:47.613408 I | rafthttp: started streaming with peer f572fdfc5cb68406 (writer)
2019-02-22 07:03:47.613708 I | rafthttp: started streaming with peer f572fdfc5cb68406 (writer)
2019-02-22 07:03:47.614702 I | rafthttp: started peer f572fdfc5cb68406
2019-02-22 07:03:47.614720 I | rafthttp: added peer f572fdfc5cb68406
2019-02-22 07:03:47.614730 I | rafthttp: started streaming with peer f572fdfc5cb68406 (stream MsgApp v2 reader)
2019-02-22 07:03:47.614816 I | rafthttp: started streaming with peer f572fdfc5cb68406 (stream Message reader)
2019-02-22 07:03:47.614908 I | etcdserver: starting server... [version: 3.3.8, cluster version: to_be_decided]
2019-02-22 07:03:47.615468 I | etcdserver/membership: added member b3504381e8ba3cb [http://etcd02:2380] to cluster a86cb9d32082dbec
2019-02-22 07:03:47.615593 I | etcdserver/membership: added member b8b747c74aaea686 [http://etcd01:2380] to cluster a86cb9d32082dbec
2019-02-22 07:03:47.615662 I | etcdserver/membership: added member f572fdfc5cb68406 [http://etcd03:2380] to cluster a86cb9d32082dbec
2019-02-22 07:03:52.612602 W | rafthttp: health check for peer b3504381e8ba3cb could not connect:
2019-02-22 07:03:52.614825 W | rafthttp: health check for peer f572fdfc5cb68406 could not connect:
2019-02-22 07:03:57.503343 I | raft: b8b747c74aaea686 is starting a new election at term 1
2019-02-22 07:03:57.503381 I | raft: b8b747c74aaea686 became candidate at term 2
2019-02-22 07:03:57.503401 I | raft: b8b747c74aaea686 received MsgVoteResp from b8b747c74aaea686 at term 2
2019-02-22 07:03:57.503410 I | raft: b8b747c74aaea686 [logterm: 1, index: 3] sent MsgVote request to f572fdfc5cb68406 at term 2
2019-02-22 07:03:57.503426 I | raft: b8b747c74aaea686 [logterm: 1, index: 3] sent MsgVote request to b3504381e8ba3cb at term 2
2019-02-22 07:03:57.612724 W | rafthttp: health check for peer b3504381e8ba3cb could not connect: dial tcp 10.0.2.103:2380: i/o timeout
2019-02-22 07:03:57.614903 W | rafthttp: health check for peer f572fdfc5cb68406 could not connect:
2019-02-22 07:04:02.612828 W | rafthttp: health check for peer b3504381e8ba3cb could not connect: dial tcp 10.0.2.103:2380: i/o timeout
2019-02-22 07:04:02.615003 W | rafthttp: health check for peer f572fdfc5cb68406 could not connect: dial tcp 10.0.2.99:2380: i/o timeout
2019-02-22 07:04:02.615298 E | etcdserver: publish error: etcdserver: request timed out
2019-02-22 07:04:07.303301 I | raft: b8b747c74aaea686 is starting a new election at term 2
2019-02-22 07:04:07.303330 I | raft: b8b747c74aaea686 became candidate at term 3
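
To summarize which peers and addresses the health checks fail against, a minimal sketch that scans log text like the above. The regular expression is an assumption based only on the line format shown in this issue, and `failed_dials` is a hypothetical helper.

```python
import re
from collections import Counter

# Matches lines such as:
#   rafthttp: health check for peer <id> could not connect: dial tcp <ip>:<port>: i/o timeout
PATTERN = re.compile(
    r"health check for peer (\w+) could not connect: "
    r"dial tcp ([\d.]+):(\d+): i/o timeout"
)


def failed_dials(log_text: str) -> Counter:
    """Count timed-out dials per (peer id, address) pair in the given log text."""
    return Counter(
        (m.group(1), f"{m.group(2)}:{m.group(3)}")
        for m in PATTERN.finditer(log_text)
    )
```

Lines where the reason after "could not connect:" is empty are skipped, since they carry no address to report.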

@samar51
Author

samar51 commented Feb 22, 2019

Adding: this issue occurs while bootstrapping the etcd cluster.

@samar51
Author

samar51 commented Feb 22, 2019

Hi Sam,

Thanks for the support.
There is some other issue with the Docker service; etcd is fine. I logged into the container and checked with tcpdump.

@samar51 samar51 closed this as completed Feb 22, 2019
@hexfusion
Contributor

Glad you figured it out @samar51
