Upgrade CNI version broke pod-to-pod communication within the same worker node #641

rimaulana · 2019-10-02T16:45:36Z

After upgrading the CNI version from v1.5.1-rc1 to v1.5.4, we are seeing issue where pod was unable to communicate with other pod on the same worker node. We have the following schema

CoreDNS pod on eth0
Kibana pod on eth0
App1 on eth1
App2 on eth2

What we are seeing is that DNS query from App1 and App2 failed with no server found when we tried it using dig command

dig @CoreDNS-ip amazonaws.com

Meanwhile, executing the same command from Kibana pod, the worker node and pod on a different worker node works as expected.

When collecting the logs using https://github.com/nithu0115/eks-logs-collector, we found out that CoreDNS IP was not found anywhere on the output of the ip rule show command. I would expect for each IP address of a pod running on the worker node it should have at least this associated rule on the ip rule

512: from all to POD_IP lookup main

However, we do not see one for the CoreDNS pod IP. Therefore, we believe that this is an issue with the CNI plugin unable to rebuild the rule after upgrade. There is an internal issue open for this if you want to get the collected logs

MartiUK · 2019-10-02T18:21:02Z

Downgrading to v1.5.3 resolved this issue on (EKS) k8s v1.14 with CoreDNS v1.3.1.
Required node reboots first.

mogren · 2019-10-03T04:38:51Z

Glad you found a work-around (rebooting the nodes), but I'll keep trying to reproduce this.

igor-pinchuk · 2019-10-03T14:43:42Z

Facing the same issue.
Downgrading to v1.5.3 with following nodes rebooting helped.

ueokande · 2019-10-03T22:20:11Z

We encountered the issue with Kubernetes 1.13 (eks.4) and amazon-vpc-cni-k8s v1.5.4. Its not only on CoreDNS, but also inter-pods communication.

It occurs immediately after cluster created. We just repaired by restarting pods (release and reassign an IP address on the pod):

$ kubectl delete pod --all
$ kubectl delete pod -nkube-system --all

dmarkey · 2019-10-04T16:10:26Z

I've been tearing my hair out all day after upgrading a cluster. Please change https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html to suggest v1.5.3 and not v1.5.4 as to not break more clusters until it's verified that this bug is fixed.

mogren · 2019-10-04T16:26:21Z

@dmarkey None of the three minor changes between v1.5.3 and v1.5.4 has anything to do with routes, so I suspect there is some other existing issue that we have not been able to reproduce yet. Does rebooting the nodes without downgrading not fix the issue?

We have seen related issues with routes when using Calico, but they are the same on v1.5.3 and v1.5.4. Still investigating this.

angelichorsey · 2019-10-04T16:40:08Z

This is a sysctl fix, no?

net.bridge.bridge-nf-call-ip6tables=1
net.bridge.bridge-nf-call-iptables=1
net.bridge.bridge-nf-call-arptables=1

If you don't have these then the docker bridge can't talk back to itself.

https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/#network-plugin-requirements

nithu0115 · 2019-10-04T16:56:54Z

@dmarkey are you seeing missing rule from routing table database ? Could you elaborate more on the issue you are running into ?

schahal · 2019-10-04T22:02:41Z

Can we please update https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/release-1.5/config/v1.5/aws-k8s-cni.yaml to be 1.5.3 until 1.5.4 is fully vetted? We are running into the same issue and want default to be the working version.

dmarkey · 2019-10-04T23:09:51Z

The main issue was around 10% of pods not being able to talk to other pods, like coredns, and therefore couldn't resolve and/or connect to dependent services. They could however connect to services on the internet.

I also noticed that for the problematic pods. Their IP was missing from the node ifconfig output. I assume they would need an interface added that would be visible on the host?

dmarkey · 2019-10-04T23:15:58Z

I have powered up the cluster twice from scratch with ~200 pods with 1.5.3 and it has come up flawlessly.

With 1.5.4 about 20% of pods couldn't find dependencies either by not being able resolve their address(services in the same namespace mostly), or couldn't reach the dependency at all. I must have powered up the ASG about 10 times to try to troubleshoot the situation.

mogren · 2019-10-04T23:32:17Z

@dmarkey Thanks for the update, will keep testing this. @schahal I have reverted config/v1.5/aws-k8s-cni.yaml to point to v1.5.3 for now.

mogren · 2019-10-04T23:39:08Z

@dmarkey Could you please send me log output from https://github.com/awslabs/amazon-eks-ami/tree/master/log-collector-script ? (Either mogren at amazon.com or c.m in the Kubernetes slack)

dmarkey · 2019-10-04T23:49:14Z

Do you mean with 1.5.3 or 1.5.4? I'm afraid this cluster is in active use (although not classed as "production") so I cant easily revert without causing at least some disruption. Either way I don't have access until AM Irish time Monday.

mogren · 2019-10-04T23:59:25Z

@dmarkey Logs from a node where you see the communication issue, so v1.5.4. If you could get that next week I'd be very thankful. Sorry to cause bother on a Friday evening! 🙂

mogren · 2019-10-14T21:16:30Z

I have still not been able to reproduce this issue, and I have not gotten any logs showing errors in the CNI, but I have seen a lot of errors in the CoreDNS logs. If anyone can reliably reproduce the issue, or find a missing route or iptable rule, I'd be happy to know more.

ayosec · 2019-10-15T03:35:56Z

We had a similar problem today, with 1.5.4.

Yesterday, we changed the configuration of the deployment to set AWS_VPC_K8S_CNI_LOGLEVEL=INFO, so the aws-node-* pods were restarted. We checked if it was able to assign IP address to new pods, and everything was working as expected.

Today, we updated some deployments, and then we started to see 504 Gateway Timeout errors in some requests.

After some investigation we found that the ingress controller was not able to connect to the pods when they were in the same node. The pod (with IP 10.200.254.228) was accessible from ingress controllers in other nodes.

We discarded a bug in the ingress controller because even a ping was not possible:

# nsenter -t 1558 -n ping -c 2 10.200.254.228
PING 10.200.254.228 (10.200.254.228) 56(84) bytes of data.

--- 10.200.254.228 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1001ms

(1558 is the PID of the ingress controller).

The ping worked from the host network.

After more investigation, we found an issue in the IP rules:

# ip rule show
0:	from all lookup local 
512:	from all to 10.200.211.143 lookup main 
512:	from all to 10.200.204.145 lookup main 
512:	from all to 10.200.212.149 lookup main 
512:	from all to 10.200.206.165 lookup main 
512:	from all to 10.200.236.131 lookup main 
512:	from all to 10.200.202.149 lookup main 
512:	from all to 10.200.220.69 lookup main 
512:	from all to 10.200.223.122 lookup main 
512:	from all to 10.200.212.190 lookup main 
512:	from all to 10.200.206.240 lookup main 
1024:	from all fwmark 0x80/0x80 lookup main 
1536:	from 10.200.222.108 to 10.200.0.0/16 lookup 2 
1536:	from 10.200.254.228 to 10.200.0.0/16 lookup 3 
1536:	from 10.200.221.230 to 10.200.0.0/16 lookup 3 
1536:	from 10.200.211.143 to 10.200.0.0/16 lookup 3 
1536:	from 10.200.204.145 to 10.200.0.0/16 lookup 3 
1536:	from 10.200.212.149 to 10.200.0.0/16 lookup 2 
1536:	from 10.200.206.165 to 10.200.0.0/16 lookup 3 
1536:	from 10.200.236.131 to 10.200.0.0/16 lookup 2 
1536:	from 10.200.202.149 to 10.200.0.0/16 lookup 2 
1536:	from 10.200.220.69 to 10.200.0.0/16 lookup 2 
1536:	from 10.200.223.122 to 10.200.0.0/16 lookup 2 
1536:	from 10.200.206.240 to 10.200.0.0/16 lookup 2 
32766:	from all lookup main 
32767:	from all lookup default

In the previous list, you can see that 10.200.254.228 is missing in the from all to ... rules.

We added it manually:

# ip rule add from all to 10.200.254.228 lookup main

And the issue was fixed.

We checked the logs, and the only error related to 10.200.254.228 is the following (in plugin.log):

2019-10-14T03:55:21.684Z [INFO]	Received CNI del request: ContainerID(ae40e6b983f6f3cb21753559ed9eb10eb7e7a341ce3a9afe975078d65d9002ec) Netns(/proc/23768/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=staging;K8S_POD_NAME=redacted-58948849cf-bjlfb;K8S_POD_INFRA_CONTAINER_ID=ae40e6b983f6f3cb21753559ed9eb10eb7e7a341ce3a9afe975078d65d9002ec) Path(/opt/cni/bin) argsStdinData({"cniVersion":"0.3.1","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-10-14T03:55:21.688Z [ERROR]	Failed to delete toContainer rule for 10.200.254.228/32 err no such file or directory
2019-10-14T03:55:21.688Z [INFO]	Delete Rule List By Src [{10.200.254.228 ffffffff}]
2019-10-14T03:55:21.688Z [INFO]	Remove current list [[ip rule 1536: from 10.200.254.228/32 table 3]]
2019-10-14T03:55:21.688Z [INFO]	Delete fromContainer rule for 10.200.254.228/32 in table 3

mogren · 2019-10-15T21:35:10Z

@ayosec Thanks a lot for the helpful details!

Magizhchi · 2019-10-15T22:35:15Z

We are facing the same issue, pod to pod communication is intermittently going down, restarting the pods brings it back up.

We followed the suggestion above to downgrade to 1.5.3 and restart the node which worked for us.

So maybe there is some issue with v1.5.4

ueokande · 2019-10-24T21:40:36Z

Today, we created a new EKS cluster, and amazon-k8s-cni:v1.5.3 is deployed.
Our cluster is now fine!

mprenditore · 2019-10-25T12:54:47Z

Faced the same issue. Upgraded from 1.5.3 to 1.5.4 started to create some issues, a lot of 504.
Reverting back to 1.5.3 wasn't enough, we needed to restart all the cluster nodes in order to be back on fully functionality. Probably a full restart with 1.5.4 could have worked too based on what other people said here that there are no huge changes. But even in that case, the upgrade to 1.5.3 from 1.2.1 didn't created any issue.

mogren · 2019-10-29T03:58:32Z

Please try the v1.5.5 release candidate if you need g4, m5dn, r5dn or Kubernetes 1.16 support.

aws/amazon-vpc-cni-k8s#641

daviddelucca · 2019-11-08T20:31:59Z

@MartiUK How did you downgrade amazon-k8s-cni? Could you show me the steps, please?

chadlwilson · 2019-11-09T06:21:16Z

@daviddelucca Replacing region below with whatever is appropriate for you...

kubectl set image daemonset.apps/aws-node \
  -n kube-system \
  aws-node=602401143452.dkr.ecr.ap-southeast-1.amazonaws.com/amazon-k8s-cni:v1.5.3

And then it seems restarting all pods at minimum is required. Some seem to have restarted all nodes (which would restart the pods by side effect), but it's unclear if that's really required.

daviddelucca · 2019-11-13T15:10:22Z

@chadlwilson thank you very much

mogren · 2019-11-13T22:13:08Z

v1.5.5 is released with a revert of the commit that caused issues. Resolving this issue.

wadey · 2019-11-26T03:19:56Z

Unless I'm misunderstanding, it looks like v1.6.0-rc4 also has the problematic commit. Can we get a v1.6.0-rc5 with the fix there as well?

eladazary · 2019-11-26T16:37:57Z

I'm facing this issue since yesterday with CNI 1.5.5, I've tried to downgrade to 1.5.3 and 1.5.5 but with no success.
It looks like the /etc/cni/net.d/10-aws.conflist file gets created only when using CNI v1.5.1.

Errors from ipamd.log:
Starting L-IPAMD v1.5.5 ...
2019-11-26T16:31:57.105Z [INFO] Testing communication with server
2019-11-26T16:32:27.106Z [INFO] Failed to communicate with K8S Server. Please check instance security groups or http proxy setting
2019-11-26T16:32:27.106Z [ERROR] Failed to create client: error communicating with apiserver: Get https://172.20.0.1:443/version?timeout=32s: dial tcp 172.20.0.1:443: i/o timeout

I saw that after I've upgraded to CNI 1.5.5 again the file /etc/cni/10-aws.conflist got created, maybe is something with the path kubelet is looking for the cni file?

Nodes are in Ready status but all pods are in ContainerCreating state.

Do you have any idea why does it happen?

mogren · 2019-11-26T21:10:55Z

@wadey The issue is not in v1.6.0-rc4, there we solved it in another way, see #688. This is a better solution since if we return an error when we try to delete a pod that was never created, kubelet will retry 10 times trying to delete something that doesn't exist before giving up.

mogren · 2019-11-26T21:16:31Z

@eladazary The error you are seeing is unrelated to this issue. Starting with v1.5.3, we don't make the node active until ipamd can talk to the API server. If permissions are not correct and ipamd (aws-node pods) can't talk to the API server or to the EC2 control plane, it can't attach IPs to the nodes and then pods will never get IPs and become active.

Make sure that the worker nodes are configured correctly. The logs for ipamd should tell you what the issue is, they can be found in /var/log/aws-routed-eni/ on the node.

More about worker nodes: https://docs.aws.amazon.com/eks/latest/userguide/launch-workers.html

itsLucario · 2021-01-12T20:31:46Z

Similar issue came up with 1.7.5 on upgrading from 1.6.1. Around 10% of the pods are able to communicate with each other and others are failing.

Even downgrading to 1.6.1 didn't work until we restated the nodes. Can someone brief the cause and the status of the solution for this?

jayanthvn · 2021-01-12T20:36:31Z

Hi @itsLucario

When you upgraded was it just an image update or you reapplied the config (https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.7.5/config/v1.7/aws-k8s-cni.yaml) ?

itsLucario · 2021-01-13T03:16:34Z

@jayanthvn I have applied the exact config yaml which you have shared.
Also since we are using CNI custom networking. Once the daemonset is updated we run:

kubectl set env daemonset aws-node -n kube-system AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true

Edit:
While updating CNI itself if I set the AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true in the container env then the upgrade happens seamlessly.

I think docs must be updated mentioning if custom configuration is there then update manifests respectively before upgrading.

jayanthvn · 2021-01-19T15:28:26Z

Hi @itsLucario

Yes that makes sense and thanks for checking. Even I was suspecting that is what is happening hence wanted to know how you upgraded. Can you please open an issue for documentation? I can take care of it.

Thanks.

mogren added bug needs investigation labels Oct 2, 2019

mogren pushed a commit to mogren/amazon-vpc-cni-k8s that referenced this issue Oct 4, 2019

Revert to v1.5.3 while root causing issues reported in aws#641

7b2f702

mogren pushed a commit to mogren/amazon-vpc-cni-k8s that referenced this issue Oct 4, 2019

Revert to v1.5.3 image until aws#641 is root caused

535939a

mogren mentioned this issue Oct 4, 2019

Revert to v1.5.3 image until #641 is root caused #644

Merged

mogren pushed a commit that referenced this issue Oct 4, 2019

Revert to v1.5.3 image until #641 is root caused

fb2fd20

abstrask mentioned this issue Oct 7, 2019

EKS: More specific VPC CNI reference, to avoid buggy AWS CNI dfds/infrastructure-modules#57

Merged

artem-belov mentioned this issue Oct 10, 2019

Update aws cni plugin networkservicemesh/networkservicemesh#1754

Closed

longatrover mentioned this issue Oct 15, 2019

Branch release-1.5, the aws-k8s-cni.yaml still pointing to amazon-k8s-cni v1.5.3 instead of v1.5.4 #658

Closed

mogren added the priority/P0 Highest priority. Someone needs to actively work on this. label Oct 15, 2019

mogren mentioned this issue Oct 15, 2019

Revert back to v1.5.4 #660

Merged

mogren mentioned this issue Oct 29, 2019

Cannot run metrics helper with IAM roles mapped to k8s service accounts #663

Closed

mogren removed the needs investigation label Oct 29, 2019

Timvissers mentioned this issue Oct 30, 2019

Random / Sporadic 502 gateway timeouts kubernetes/ingress-nginx#4433

Closed

This was referenced Nov 1, 2019

I'm getting timeouts outbound from inside the container, works fine from the instance #682

Closed

Routes missing for some IP:s #705

Closed

nusnewob added a commit to uktrade/terraform-module-eks-base that referenced this issue Nov 8, 2019

stick AWS CNI to 1.5.3, bug reported in 1.5.4

f97ff82

aws/amazon-vpc-cni-k8s#641

etungsten mentioned this issue Nov 12, 2019

INSTALL: Update recommended CNI plugin version bottlerocket-os/bottlerocket#507

Merged

mogren closed this as completed Nov 13, 2019

mogren mentioned this issue Nov 14, 2019

Pod Connectivity is broken randomly #721

Closed

mogren mentioned this issue Dec 3, 2019

Not returning error when unassign called on pod without IP #740

Closed

mogren pushed a commit to mogren/amazon-vpc-cni-k8s that referenced this issue Dec 15, 2019

Revert back to v1.5.3 to mitigate aws#641

1b6e793

mogren mentioned this issue Dec 18, 2019

podIP duplicate with hostIP #709

Closed

jaypipes pushed a commit that referenced this issue Dec 19, 2019

Revert back to v1.5.3 to mitigate #641

5ded925

mogren mentioned this issue Jan 22, 2020

Issue #282 regression in master -- Pods stuck in ContainerCreating if created/delete while the aws-node on the same instance is (re)deploying #601

Closed

itsLucario mentioned this issue Mar 13, 2021

Updated important note point with CNI env set awsdocs/amazon-eks-user-guide#334

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Upgrade CNI version broke pod-to-pod communication within the same worker node #641

Upgrade CNI version broke pod-to-pod communication within the same worker node #641

rimaulana commented Oct 2, 2019 •

edited

Loading

MartiUK commented Oct 2, 2019 •

edited

Loading

mogren commented Oct 3, 2019

igor-pinchuk commented Oct 3, 2019 •

edited

Loading

ueokande commented Oct 3, 2019

dmarkey commented Oct 4, 2019

mogren commented Oct 4, 2019

angelichorsey commented Oct 4, 2019 •

edited

Loading

nithu0115 commented Oct 4, 2019

schahal commented Oct 4, 2019

dmarkey commented Oct 4, 2019

dmarkey commented Oct 4, 2019

mogren commented Oct 4, 2019

mogren commented Oct 4, 2019

dmarkey commented Oct 4, 2019 •

edited

Loading

mogren commented Oct 4, 2019

mogren commented Oct 14, 2019

ayosec commented Oct 15, 2019

mogren commented Oct 15, 2019

Magizhchi commented Oct 15, 2019

ueokande commented Oct 24, 2019

mprenditore commented Oct 25, 2019

mogren commented Oct 29, 2019

daviddelucca commented Nov 8, 2019

chadlwilson commented Nov 9, 2019

daviddelucca commented Nov 13, 2019

mogren commented Nov 13, 2019

wadey commented Nov 26, 2019

eladazary commented Nov 26, 2019

mogren commented Nov 26, 2019

mogren commented Nov 26, 2019

itsLucario commented Jan 12, 2021

jayanthvn commented Jan 12, 2021

itsLucario commented Jan 13, 2021 •

edited

Loading

jayanthvn commented Jan 19, 2021

Upgrade CNI version broke pod-to-pod communication within the same worker node #641

Upgrade CNI version broke pod-to-pod communication within the same worker node #641

Comments

rimaulana commented Oct 2, 2019 • edited Loading

MartiUK commented Oct 2, 2019 • edited Loading

mogren commented Oct 3, 2019

igor-pinchuk commented Oct 3, 2019 • edited Loading

ueokande commented Oct 3, 2019

dmarkey commented Oct 4, 2019

mogren commented Oct 4, 2019

angelichorsey commented Oct 4, 2019 • edited Loading

nithu0115 commented Oct 4, 2019

schahal commented Oct 4, 2019

dmarkey commented Oct 4, 2019

dmarkey commented Oct 4, 2019

mogren commented Oct 4, 2019

mogren commented Oct 4, 2019

dmarkey commented Oct 4, 2019 • edited Loading

mogren commented Oct 4, 2019

mogren commented Oct 14, 2019

ayosec commented Oct 15, 2019

mogren commented Oct 15, 2019

Magizhchi commented Oct 15, 2019

ueokande commented Oct 24, 2019

mprenditore commented Oct 25, 2019

mogren commented Oct 29, 2019

daviddelucca commented Nov 8, 2019

chadlwilson commented Nov 9, 2019

daviddelucca commented Nov 13, 2019

mogren commented Nov 13, 2019

wadey commented Nov 26, 2019

eladazary commented Nov 26, 2019

mogren commented Nov 26, 2019

mogren commented Nov 26, 2019

itsLucario commented Jan 12, 2021

jayanthvn commented Jan 12, 2021

itsLucario commented Jan 13, 2021 • edited Loading

jayanthvn commented Jan 19, 2021

rimaulana commented Oct 2, 2019 •

edited

Loading

MartiUK commented Oct 2, 2019 •

edited

Loading

igor-pinchuk commented Oct 3, 2019 •

edited

Loading

angelichorsey commented Oct 4, 2019 •

edited

Loading

dmarkey commented Oct 4, 2019 •

edited

Loading

itsLucario commented Jan 13, 2021 •

edited

Loading