Pod Connectivity is broken randomly #721

spikewang · 2019-11-14T19:20:58Z

Pod connectivity is broken with EKS in the region us-west-2 (Oregon).

Connectivity from other pods is broken for one etcd pod. To isolate the problem further, I removed the etcd service and am pinging the etcd pods directly from the source pod.

Source pods:
orchestrator-us-west-8-5db22211e2e90e0db2d1f856-orchestratcsl76 1/1 Running 0 16h 172.16.0.137 ip-172-16-0-216

Destination pods:
etcd-cluster-5db355dbee30e565b6e1459d-69hdw2gqxr 1/1 Running 0 172.16.0.85 ip-172-16-0-111.us-west-2.compute.internal
etcd-cluster-5db355dbee30e565b6e1459d-fpr4h7g547 1/1 Running 0 172.16.0.71 ip-172-16-0-56.us-west-2.compute.internal
etcd-cluster-5db355dbee30e565b6e1459d-pft5tsbd4k 1/1 Running 0 172.16.0.176 ip-172-16-0-216.us-west-2.compute.internal

Ping from source pods:

ping 172.16.0.85 (works)
ping 172.16.0.71 (works)
ping 172.16.0.176 (fails)

A packet capture on the node shows an ICMP time exceeded error:

ip-172-16-0-216.us-west-2.compute.internal Ready 21d v1.12.7 172.16.0.216

sh-4.2# tcpdump -ni eni1daa9b475a7 icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eni1daa9b475a7, link-type EN10MB (Ethernet), capture size 262144 bytes

--> WORKING case:

17:54:40.882449 IP 172.16.0.137 > 172.16.0.85: ICMP echo request, id 56411, seq 0, length 64
17:54:40.887135 IP 172.16.0.85 > 172.16.0.137: ICMP echo reply, id 56411, seq 0, length 64
17:54:41.887705 IP 172.16.0.137 > 172.16.0.85: ICMP echo request, id 56411, seq 1, length 64
17:54:41.888421 IP 172.16.0.85 > 172.16.0.137: ICMP echo reply, id 56411, seq 1, length 64
17:54:45.300603 IP 172.16.0.137 > 172.16.0.71: ICMP echo request, id 56667, seq 0, length 64
17:54:45.301375 IP 172.16.0.71 > 172.16.0.137: ICMP echo reply, id 56667, seq 0, length 64
17:54:46.301119 IP 172.16.0.137 > 172.16.0.71: ICMP echo request, id 56667, seq 1, length 64
17:54:46.301925 IP 172.16.0.71 > 172.16.0.137: ICMP echo reply, id 56667, seq 1, length 64

--> FAILED case:

17:54:50.225198 IP 172.16.0.137 > 172.16.0.176: ICMP echo request, id 56923, seq 0, length 64
17:54:50.232979 IP 172.16.0.216 > 172.16.0.137: ICMP time exceeded in-transit, length 92
17:54:51.225334 IP 172.16.0.137 > 172.16.0.176: ICMP echo request, id 56923, seq 1, length 64
17:54:51.237460 IP 172.16.0.216 > 172.16.0.137: ICMP time exceeded in-transit, length 92
17:54:52.225519 IP 172.16.0.137 > 172.16.0.176: ICMP echo request, id 56923, seq 2, length 64
17:54:52.234741 IP 172.16.0.216 > 172.16.0.137: ICMP time exceeded in-transit, length 92

Any hints here? Should I collect a CNI tech support dump?

mogren · 2019-11-14T19:24:20Z

@spikewang Hi, what version of the CNI are you using? v1.5.4 had an issue with ip rule, #641.

spikewang · 2019-11-14T20:59:03Z

Hi @mogren, thanks for the quick reply. Yes, I am aware of that issue with v1.5.4, and we already downgraded the CNI from 1.5.4 to 1.5.3 on all our clusters last week. However, those pods were created a while back.

mogren · 2019-11-14T22:08:19Z

@spikewang Yes, that is the ip rule issue. The missing rules for existing pods will not be re-created. If you really don't want to restart the nodes, you will have to add those rules back manually on each node. First, check which pods are running on each node and what IPs they have. Then ssh to the node and run ip rule. The missing rules look like:

512:	from all to <pod IP> lookup main

To add the rule for one of your pods that was created with v1.5.4, run:

sudo ip rule add to <missing IP> lookup main priority 512
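
If there are many pods to check, the comparison can be scripted. The following is only a rough sketch, not something shipped with the CNI: missing_rule_cmds is a hypothetical helper name, and it assumes ip rule prints each pod rule in the "from all to <pod IP> lookup main" form shown above. It reads saved ip rule output from a file and only prints the commands for pod IPs that have no such rule; it changes nothing on the node.

```shell
# missing_rule_cmds RULES_FILE POD_IP...
# Hypothetical helper: RULES_FILE holds saved `ip rule` output from the node.
# For each pod IP without a "to <pod IP> lookup main" rule, print the
# `ip rule add` command that would restore it. Nothing is executed.
missing_rule_cmds() {
  local rules_file=$1
  shift
  local ip
  for ip in "$@"; do
    # -F matches a fixed string, so dots in the IP are literal. The space
    # before "lookup" keeps 172.16.0.17 from matching 172.16.0.176.
    if ! grep -qF "to ${ip} lookup main" "$rules_file"; then
      echo "sudo ip rule add to ${ip} lookup main priority 512"
    fi
  done
}
```

One possible way to use it: on the node, save the rules with ip rule > /tmp/rules.txt, then run missing_rule_cmds /tmp/rules.txt 172.16.0.176 with the pod IPs for that node. The pod IPs can be listed with kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node name>. Pods that use host networking report the node's own IP and most likely should be left out. Review the printed commands before running any of them.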

spikewang · 2019-11-14T22:11:44Z

I see. Cool, I appreciate the clarification. I will try it out!

spikewang closed this as completed Nov 14, 2019
