Upgrade CNI version broke pod-to-pod communication within the same worker node #641
Comments
Downgrading to v1.5.3 resolved this issue on (EKS) k8s v1.14 with CoreDNS v1.3.1. |
Glad you found a work-around (rebooting the nodes), but I'll keep trying to reproduce this. |
Facing the same issue. |
We encountered the issue with Kubernetes 1.13 (eks.4) and amazon-vpc-cni-k8s v1.5.4. It's not only CoreDNS; inter-pod communication is affected as well. It occurs immediately after the cluster is created. We repaired it by restarting the pods (releasing and reassigning the pod IP addresses):
$ kubectl delete pod --all
$ kubectl delete pod -nkube-system --all |
I've been tearing my hair out all day after upgrading a cluster. Please change https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html to suggest v1.5.3 instead of v1.5.4, so as not to break more clusters until it's verified that this bug is fixed. |
@dmarkey None of the three minor changes between v1.5.3 and v1.5.4 has anything to do with routes, so I suspect there is some other existing issue that we have not been able to reproduce yet. Does rebooting the nodes without downgrading not fix the issue? We have seen related issues with routes when using Calico, but they are the same on v1.5.3 and v1.5.4. Still investigating this. |
This is a sysctl fix, no? If you don't have these settings, the docker bridge can't talk back to itself. |
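The specific settings are not spelled out here; as a rough illustration only (an assumption, not a confirmed fix for this issue), suggestions like this usually mean the bridge netfilter sysctls:

```sh
# Assumed example of the sysctls such comments usually refer to; not verified
# as the settings meant above, and not confirmed to apply to the AWS VPC CNI.
sudo sysctl net.bridge.bridge-nf-call-iptables=1
sudo sysctl net.bridge.bridge-nf-call-ip6tables=1

# Persist across reboots (file name is illustrative):
echo "net.bridge.bridge-nf-call-iptables = 1" | sudo tee /etc/sysctl.d/99-kubernetes-cni.conf
sudo sysctl --system
```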
@dmarkey are you seeing a missing rule in the routing table database? Could you elaborate more on the issue you are running into? |
Can we please update https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/release-1.5/config/v1.5/aws-k8s-cni.yaml to point back to v1.5.3? |
The main issue was that around 10% of pods were not able to talk to other pods, like coredns, and therefore couldn't resolve and/or connect to dependent services. They could, however, connect to services on the internet. I also noticed that, for the problematic pods, their IP was missing from the node's ip rule list. |
I have powered up the cluster twice from scratch with ~200 pods on v1.5.3 and have not seen the issue. With v1.5.4, it comes back. |
@dmarkey Thanks for the update, will keep testing this. @schahal I have reverted config/v1.5/aws-k8s-cni.yaml to point to v1.5.3 for now. |
@dmarkey Could you please send me log output from https://github.com/awslabs/amazon-eks-ami/tree/master/log-collector-script ? (Either mogren at amazon.com or c.m in the Kubernetes slack) |
Do you mean with 1.5.3 or 1.5.4? I'm afraid this cluster is in active use (although not classed as "production"), so I can't easily revert without causing at least some disruption. Either way, I don't have access until AM Irish time Monday. |
@dmarkey Logs from a node where you see the communication issue, so v1.5.4. If you could get that next week I'd be very thankful. Sorry to cause bother on a Friday evening! 🙂 |
I have still not been able to reproduce this issue, and I have not gotten any logs showing errors in the CNI, but I have seen a lot of errors in the CoreDNS logs. If anyone can reliably reproduce the issue, or find a missing route or iptables rule, I'd be happy to know more. |
We had a similar problem today, with 1.5.4. Yesterday, we changed the configuration of the deployment. Today, we updated some deployments, and then we started to see errors. After some investigation we found that the ingress controller was not able to connect to pods when they were on the same node. We ruled out a bug in the ingress controller because even a ping to the pod's IP was not possible. (The ping did work from the host network.) After more investigation, we found an issue in the IP rules: the rule for the affected pod's IP was missing from the list. We added the rule manually, and the issue was fixed. We checked the logs, and found only one error related to it. |
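As a minimal sketch of the check and manual workaround described above (the pod IP is a placeholder, and the rule format is the one quoted later in this issue):

```sh
# Run on the affected worker node; 10.0.1.23 is a placeholder pod IP.
POD_IP=10.0.1.23

# The CNI normally installs a rule like "512: from all to <pod IP> lookup main".
ip rule show | grep "to ${POD_IP}"

# If the rule is missing, re-add it manually as a temporary workaround.
sudo ip rule add from all to ${POD_IP}/32 table main priority 512
```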
@ayosec Thanks a lot for the helpful details! |
We are facing the same issue: pod-to-pod communication intermittently goes down, and restarting the pods brings it back up. We followed the suggestion above to downgrade to 1.5.3 and restart the node, which worked for us. So maybe there is some issue with v1.5.4. |
Today, we created a new EKS cluster, and amazon-k8s-cni:v1.5.3 is deployed. |
Faced the same issue. Upgrading from 1.5.3 to 1.5.4 started to cause problems, with a lot of 504s. |
Please try the v1.5.5 release candidate if you need g4, m5dn, r5dn or Kubernetes 1.16 support. |
@MartiUK How did you downgrade amazon-k8s-cni? Could you show me the steps, please? |
@daviddelucca Replacing the region below with whatever is appropriate for you...
kubectl set image daemonset.apps/aws-node \
  -n kube-system \
  aws-node=602401143452.dkr.ecr.ap-southeast-1.amazonaws.com/amazon-k8s-cni:v1.5.3
And then it seems restarting all pods at minimum is required. Some seem to have restarted all nodes (which would restart the pods as a side effect), but it's unclear if that's really required. |
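To verify the downgrade and recycle pods afterwards, something along these lines should work (a sketch with standard kubectl commands; deleting pods is disruptive, so pick namespaces deliberately):

```sh
# Wait for the downgraded aws-node daemonset to finish rolling out.
kubectl -n kube-system rollout status daemonset/aws-node

# Confirm which image the daemonset is now running.
kubectl -n kube-system get daemonset aws-node \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Recycle workload pods so the downgraded CNI re-wires them
# (one namespace shown; repeat per namespace as needed).
kubectl -n default delete pod --all
```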
@chadlwilson thank you very much |
v1.5.5 is released with a revert of the commit that caused issues. Resolving this issue. |
Unless I'm misunderstanding, it looks like |
I've been facing this issue since yesterday with CNI 1.5.5. I've tried downgrading to 1.5.3 and 1.5.5, but with no success. There are errors in ipamd.log. I noticed that only after I upgraded to CNI 1.5.5 again did the file /etc/cni/10-aws.conflist get created; maybe it is something with the path kubelet looks in for the CNI config file? Nodes are in Ready status, but all pods are stuck in ContainerCreating. Do you have any idea why this happens? |
@eladazary The error you are seeing is unrelated to this issue. Starting with v1.5.3, we don't make the node active until ipamd can talk to the API server. If permissions are not correct and ipamd (the aws-node pods) can't talk to the API server or to the EC2 control plane, it can't attach IPs to the nodes, so pods will never get IPs and become active. Make sure that the worker nodes are configured correctly. The logs for ipamd should tell you what the issue is; they can be found on the worker node. More about worker nodes: https://docs.aws.amazon.com/eks/latest/userguide/launch-workers.html |
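For reference, a small sketch of pulling those logs on a worker node; the /var/log/aws-routed-eni/ path is the plugin's usual default and is assumed here rather than stated in the comment:

```sh
# On the affected worker node (log path assumed to be the default):
sudo tail -n 200 /var/log/aws-routed-eni/ipamd.log
sudo tail -n 200 /var/log/aws-routed-eni/plugin.log
```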
A similar issue came up with 1.7.5 on upgrading from 1.6.1. Around 10% of the pods are able to communicate with each other and the others are failing. Even downgrading to 1.6.1 didn't work until we restarted the nodes. Can someone explain the cause and the status of the fix? |
Hi @itsLucario, when you upgraded, was it just an image update or did you reapply the config (https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.7.5/config/v1.7/aws-k8s-cni.yaml)? |
@jayanthvn I have applied the exact config yaml which you shared.
Edit: I think the docs should be updated to mention that if there is custom configuration, the manifests should be updated accordingly before upgrading. |
Hi @itsLucario Yes, that makes sense, and thanks for checking. I suspected that was what was happening, hence I wanted to know how you upgraded. Can you please open an issue for the documentation? I can take care of it. Thanks. |
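One way to avoid losing custom settings when reapplying the stock manifest, sketched with standard kubectl commands (WARM_IP_TARGET is only a hypothetical example of a customized variable):

```sh
# Save the running daemonset, including any customized env vars
# (for example WARM_IP_TARGET), before applying the release manifest.
kubectl -n kube-system get daemonset aws-node -o yaml > aws-node-before-upgrade.yaml

# Apply the new manifest, then diff and re-apply any custom configuration
# that the stock manifest overwrote.
kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.7.5/config/v1.7/aws-k8s-cni.yaml
kubectl -n kube-system get daemonset aws-node -o yaml | diff aws-node-before-upgrade.yaml - || true
```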
After upgrading the CNI version from v1.5.1-rc1 to v1.5.4, we are seeing an issue where a pod is unable to communicate with another pod on the same worker node. We have the following setup:
CoreDNS pod on eth0
Kibana pod on eth0
App1 on eth1
App2 on eth2
What we are seeing is that DNS queries from App1 and App2 fail with "no server found" when we try them using the dig command:
dig @CoreDNS-ip amazonaws.com
Meanwhile, executing the same command from the Kibana pod, from the worker node, and from a pod on a different worker node works as expected.
When collecting the logs using https://github.com/nithu0115/eks-logs-collector, we found that the CoreDNS IP was not found anywhere in the output of the ip rule show command. I would expect each IP address of a pod running on the worker node to have at least this associated rule in the ip rule list:
512: from all to POD_IP lookup main
However, we do not see one for the CoreDNS pod IP. Therefore, we believe this is an issue with the CNI plugin being unable to rebuild the rule after the upgrade. There is an internal issue open for this if you want the collected logs.
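A rough way to spot affected pods along these lines (node name and file path are placeholders; host-network pods will show up as false positives and can be ignored):

```sh
# From a machine with cluster access: collect the pod IPs on the node.
NODE_NAME=ip-10-0-1-10.ec2.internal   # placeholder node name
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=${NODE_NAME} \
  | awk 'NR>1 {print $7}' > pod-ips.txt

# Copy pod-ips.txt to the worker node, then check each IP for the expected rule.
while read -r ip; do
  ip rule show | grep -q "to ${ip} " || echo "no 'from all to ${ip} lookup main' rule"
done < pod-ips.txt
```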