Pod Connectivity is broken randomly #721

Closed
spikewang opened this issue Nov 14, 2019 · 4 comments

Comments

@spikewang

Pod connectivity is broken on EKS in region us-west-2 (Oregon).

Connectivity between pods is broken for one etcd pod. To isolate the problem further, I removed the etcd service and am pinging the etcd pods directly from the source pod.

Source pods:
orchestrator-us-west-8-5db22211e2e90e0db2d1f856-orchestratcsl76   1/1     Running   0          16h     172.16.0.137   ip-172-16-0-216

Destination pods:
etcd-cluster-5db355dbee30e565b6e1459d-69hdw2gqxr                  1/1     Running   0   172.16.0.85    ip-172-16-0-111.us-west-2.compute.internal  
etcd-cluster-5db355dbee30e565b6e1459d-fpr4h7g547                  1/1     Running   0   172.16.0.71    ip-172-16-0-56.us-west-2.compute.internal    
etcd-cluster-5db355dbee30e565b6e1459d-pft5tsbd4k                  1/1     Running   0    172.16.0.176   ip-172-16-0-216.us-west-2.compute.internal

Ping from source pods:

ping 172.16.0.85 (works)
ping 172.16.0.71 (works)
ping 172.16.0.176 (fails)

A packet capture on the node shows ICMP time exceeded errors:

ip-172-16-0-216.us-west-2.compute.internal   Ready         21d    v1.12.7   172.16.0.216

sh-4.2# tcpdump -ni eni1daa9b475a7 icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eni1daa9b475a7, link-type EN10MB (Ethernet), capture size 262144 bytes

--> WORKING case:

17:54:40.882449 IP 172.16.0.137 > 172.16.0.85: ICMP echo request, id 56411, seq 0, length 64
17:54:40.887135 IP 172.16.0.85 > 172.16.0.137: ICMP echo reply, id 56411, seq 0, length 64
17:54:41.887705 IP 172.16.0.137 > 172.16.0.85: ICMP echo request, id 56411, seq 1, length 64
17:54:41.888421 IP 172.16.0.85 > 172.16.0.137: ICMP echo reply, id 56411, seq 1, length 64
17:54:45.300603 IP 172.16.0.137 > 172.16.0.71: ICMP echo request, id 56667, seq 0, length 64
17:54:45.301375 IP 172.16.0.71 > 172.16.0.137: ICMP echo reply, id 56667, seq 0, length 64
17:54:46.301119 IP 172.16.0.137 > 172.16.0.71: ICMP echo request, id 56667, seq 1, length 64
17:54:46.301925 IP 172.16.0.71 > 172.16.0.137: ICMP echo reply, id 56667, seq 1, length 64

--> FAILED case:

17:54:50.225198 IP 172.16.0.137 > 172.16.0.176: ICMP echo request, id 56923, seq 0, length 64
17:54:50.232979 IP 172.16.0.216 > 172.16.0.137: ICMP time exceeded in-transit, length 92
17:54:51.225334 IP 172.16.0.137 > 172.16.0.176: ICMP echo request, id 56923, seq 1, length 64
17:54:51.237460 IP 172.16.0.216 > 172.16.0.137: ICMP time exceeded in-transit, length 92
17:54:52.225519 IP 172.16.0.137 > 172.16.0.176: ICMP echo request, id 56923, seq 2, length 64
17:54:52.234741 IP 172.16.0.216 > 172.16.0.137: ICMP time exceeded in-transit, length 92
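
The difference between the two cases can also be summarized mechanically. The following is a minimal sketch, not part of the original report: a hypothetical shell helper, unanswered_pings, that reads tcpdump text output in the one-line format shown above and compares the number of ICMP echo requests with the number of echo replies for each source/destination pair. The function name and output wording are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: report source -> destination pairs whose ICMP echo
# requests outnumber the echo replies seen in tcpdump's one-line output.
# Example (capture 20 packets, then summarize):
#   tcpdump -ni eni1daa9b475a7 -c 20 icmp | unanswered_pings
unanswered_pings() {
    awk '
        $6 == "ICMP" && $7 == "echo" {
            src = $3
            dst = $5
            sub(/:$/, "", dst)                 # strip the trailing colon
            if ($8 == "request,")
                req[src " -> " dst]++
            else if ($8 == "reply,")
                rep[dst " -> " src]++          # count the reply against the request direction
        }
        END {
            for (pair in req) {
                missing = req[pair] - (rep[pair] + 0)
                if (missing > 0)
                    printf "%s: %d of %d echo requests unanswered\n", pair, missing, req[pair]
            }
        }
    ' | sort
}
```

Lines such as "ICMP time exceeded in-transit" are ignored by the helper; they still need to be read by hand to see which hop dropped the packet.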

Any hints? Should I collect a CNI tech-support dump?

@mogren
Contributor

mogren commented Nov 14, 2019

@spikewang Hi, what version of the CNI are you using? v1.5.4 had an issue with ip rule, #641.

@spikewang
Author

Hi @mogren, thanks for the quick reply. Yes, I am aware of that issue with v1.5.4, and we already downgraded the CNI from 1.5.4 to 1.5.3 on all our clusters last week. However, those pods were created a while back...

@mogren
Contributor

mogren commented Nov 14, 2019

@spikewang Yes, that is the ip rule issue. The missing rules for existing pods will not be re-created. If you really don't want to restart the nodes, you will have to manually add those rules back on each node. First, check which pods are running on each node and what IPs they have. Then ssh to the node and run ip rule. The missing rules look like:

512:	from all to <pod IP> lookup main 

To add the rule for one of your pods that was created with v1.5.4, do:

sudo ip rule add to <missing IP> lookup main priority 512
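
The steps above can be sketched as a small shell helper. This is only an illustration under stated assumptions, not an official tool: the function name missing_rule_cmds is hypothetical, the pod IPs are passed in by hand, and the priority 512 is taken from the rule shown above. It reads the output of ip rule on stdin and prints, without executing anything, one add command for each pod IP that has no matching rule.

```shell
#!/bin/sh
# Hypothetical helper: print an "ip rule add" command for every pod IP that
# has no "from all to <pod IP> lookup main" rule in the `ip rule` output
# supplied on stdin. Nothing is changed on the node; commands are only printed.
# Usage on the node:
#   ip rule | missing_rule_cmds 172.16.0.176 172.16.0.137
missing_rule_cmds() {
    rules=$(cat)    # full `ip rule` output
    for ip in "$@"; do
        # Fixed-string match; the surrounding text keeps 172.16.0.17 from
        # matching a rule for 172.16.0.176.
        if ! printf '%s\n' "$rules" | grep -qF "from all to $ip lookup main"; then
            echo "sudo ip rule add to $ip lookup main priority 512"
        fi
    done
}
```

One way to collect the pod IPs for a node, assuming kubectl access, is kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node name>. Pods that use host networking report the node's own IP and should be left out. Review the printed commands before running them.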

@spikewang
Author

I see. Cool, I appreciate the clarification. I will try it out!
