Some pods are getting stuck without external network #1070
Hi @michalzxc, which version of the CNI are you using? Also, can you please run aws-cni-support.sh on the node having the issue and share the logs with us? WARM_ENI_TARGET=1 should be sufficient; MIN_IP_TARGET and WARM_IP_TARGET are probably not needed. Thanks.
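For anyone else needing to collect these diagnostics, a rough sketch of running the log-collection script on the affected node follows; the download URL and output path are assumptions and may differ between CNI versions:

```bash
# Run on the affected node (not inside a pod). The script location is an
# assumption; newer CNI versions ship it inside the aws-node image.
curl -sLO https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/master/scripts/aws-cni-support.sh
sudo bash aws-cni-support.sh

# The collected logs land in a tarball under /var/log
# (e.g. eks_i-<instance-id>_<timestamp>.tar.gz) that can be attached here.
ls /var/log/eks_i-*
```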
I've been having the same issue with v1.6.x; I tried all versions of this series and still get new pods without a working network from time to time. Currently running v1.6.3. In my case, the stuck pods can't access CoreDNS or other pods either, and restarting the container has no effect; the only fix I have found is to delete the affected pod. I wonder if this could be related to the SNAT changes in 1.6? Or maybe IP cooldown?
Hi @rochacon, can you please run aws-cni-support.sh on the impacted node and share the logs with us? Please also share how many pods you had when the issue occurred. Thanks.
Result of aws-cni-support.sh: https://drive.google.com/file/d/1_Mo1XnXqgRdT3VAtMFCcVVZaGVqHaM02/view?usp=sharing
Other one:
https://drive.google.com/file/d/1Y8OGleFJqlb-DecnEI-XT92jMXbPhrnT/view?usp=sharing
In both cases above, deleting the aws-node pod solves the issue for these pods: the network appears the moment the new aws-node pod starts.
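As a hedged sketch of that workaround, something like the following restarts aws-node on just the impacted node; the k8s-app=aws-node label is the one used by the standard DaemonSet manifest (verify it matches your install), and NODE_NAME is a placeholder:

```bash
# Delete the aws-node pod running on the impacted node; the DaemonSet
# immediately recreates it, and the stuck pods recover once it is running.
kubectl -n kube-system get pods -l k8s-app=aws-node \
  --field-selector spec.nodeName=NODE_NAME -o name \
  | xargs kubectl -n kube-system delete
```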
We were previously using 1.5.3 and upgraded to 1.6.3 because of this issue (it started happening out of nowhere, so we thought maybe an AWS API had changed somewhere and we needed a newer aws-node, etc.).
We had been using aws-node 1.5.3 for over a year without this issue. Recently we reduced some resource requests, so possibly it is a result of having more pods (and therefore more IPs) per node?
Hi @michalzxc, thanks for sharing the logs. It seems the CNI plugin hasn't received an ADD request, and the kubelet logs show the errors below:
I don't see the OCI error; I downgraded Docker and am trying with an older version.
Stuck pods: https://drive.google.com/file/d/1YBhpDxO7EClzpdoY7iSnWpZbVkVW6xmF/view?usp=sharing
Downgraded to a VM image that was fully stable a month ago.
I can connect to the stuck pods from their own node, but I can't connect from other nodes. Any advice on where to look, or on how aws-cni sets up routing/iptables/etc.?
The stuck pod is:
When I check routing tables 2 and 3, the third one is empty:
With the route added manually, the pod is reachable from other nodes and unstuck. Why is aws-node failing to set up this routing, and how can it be fixed in a permanent, automatic way?
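For anyone debugging the same symptom, the routing that the CNI programs can be inspected with plain ip commands; the table numbers below simply mirror the tables 2 and 3 mentioned above and will vary per host:

```bash
# Policy rules: each pod IP on a secondary ENI should have a
# "from <pod IP> lookup <table>" rule.
ip rule list

# Per-ENI route tables; an empty table 3 matches the symptom described above.
ip route show table 2
ip route show table 3

# Host-side veth routes for pods live in the main table.
ip route show table main
```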
I am considering writing a cron script that populates the routing tables on the VMs, but I hope for a better solution on the aws-cni side :)
This is what I am running every minute in cron:
It seems to be doing the trick.
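The script itself was not captured in this thread; purely as an illustration, a reconciliation along these lines is sketched below. It assumes secondary ENIs share the primary interface's gateway and that eth1 maps to route table 2, eth2 to table 3 (as observed above); it is a stopgap rather than a fix:

```bash
#!/usr/bin/env bash
# Hypothetical stopgap: restore a default route in any empty per-ENI route table.
set -euo pipefail

# Gateway of the primary interface; assumes secondary ENIs share its subnet.
gateway=$(ip route show default | awk '/dev eth0/ {print $3; exit}')
[ -n "$gateway" ] || { echo "no default route on eth0" >&2; exit 1; }

for path in /sys/class/net/eth*; do
  [ -e "$path" ] || continue            # no secondary ENIs attached
  name=$(basename "$path")
  [ "$name" = "eth0" ] && continue      # primary ENI uses the main table
  table=$(( ${name#eth} + 1 ))          # eth1 -> table 2, eth2 -> table 3, ...
  if [ -z "$(ip route show table "$table")" ]; then
    echo "route table $table for $name is empty, restoring default route via $gateway"
    ip route add default via "$gateway" dev "$name" table "$table"
  fi
done
```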
Hey @mogren, could you take a look at this one as well?
We are facing the same issue with CNI v1.5.5 and EKS v1.14.9-eks-f459c0. Route table 3 was empty. Updating to CNI v1.6.3 and rebooting (!) the EKS node helped for now.
From the attached logs (eks_i-0477eb0d8ac0edef0_2020-07-08_0940-UTC_0.6.2):
This will result in the same issue as #1094, as you can see below:
@michalzxc Thank you for reporting these issues. We are actively working on the fix. Sorry for the delayed response.
The changes in #1177 fix this issue.
@mogren thank you for this fix! Is there a planned release date for it? Will it be backported to the 1.6 series, or will upgrading to 1.7 be required?
@rochacon We are planning to get v1.7.2 out before the end of next week, if all testing continues to go well. We don't plan to backport this to the v1.6.x branch; too much has changed in the code base, so an upgrade will be required.
OK, thank you.
Our pods use vault-agent, so it is very easy to see when one gets stuck without network by grepping for pods like:
In the logs I see connection issues:
Some pods get stuck like that: when deploying 1500 pods, on average 17 get stuck.
Often half or more of the pods on a node are fine, while the others get stuck.
Single node:
When I delete the aws-node pod from the corresponding node, they all recover instantly once the new one starts.
Hoping it might improve things, I tried adding MINIMUM_IP_TARGET=30 and WARM_IP_TARGET=10 to aws-node. Later I also added a "sleep" to aws-node; maybe it is a placebo, but it seems that when a new node is created by the autoscaler, it is better if pods are scheduled before aws-node starts.
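For reference, a hedged example of how such variables can be applied to the DaemonSet; the values simply mirror the ones above, and the exact set of supported variables depends on the CNI version in use:

```bash
# Set the IP warm-pool knobs on the aws-node DaemonSet; this triggers a
# rolling restart of the aws-node pods.
kubectl -n kube-system set env daemonset aws-node \
  MINIMUM_IP_TARGET=30 WARM_IP_TARGET=10

# Verify the result:
kubectl -n kube-system describe daemonset aws-node \
  | grep -E 'MINIMUM_IP_TARGET|WARM_IP_TARGET'
```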
It "seems" much better, but even now it is sometimes stucking:
Over there background-76ff6b85cb-4l5z5 was stuck until I killed previous aws-node-s5jzc pod, 30s later since new one and it worked
https://drive.google.com/drive/folders/1dfjrcBOaUc3cE7gEfcLZHlpBh_Lv2oGZ?usp=sharing
More logs:
Describe when stuck:
Plugin logs:
This time it actually got the IP, but only after 11 minutes; most of the time it is not that lucky (there were only 6 pods on this node, so with MINIMUM_IP_TARGET=30 all the IPs were already attached to the VM, waiting to be used).
The problem started last week. We had previously been using aws-node 1.5.x for months, but after the issues started we upgraded to the current version last Friday, which didn't help.
We have over 100 nodes. This is a cluster where developers create their personal environments, so as a result of such high churn (sometimes a thousand pods per day) the issue happens every couple of days, and sometimes even twice a day during heavy activity, on random node(s). It seems like an edge-case issue: roughly 98% of pods are fine, but with heavy activity it becomes a problem. When creating 20 environments (around 1500 pods in total), it is almost certain that 5-20 pods will get stuck.
To be clear, it doesn't only happen during heavy activity; in some rare cases it happened when a single pod was deployed to a node. But deploying tons of pods seems to be an effective way to check whether the issue is still there or has been fixed. Yesterday at 5 PM I deployed 1500 pods four times without any issue, but today I got a single stuck pod again. Maybe it is some issue on the AWS side, with some endpoint that aws-node uses?