Race condition between CNI plugin install and aws-k8s-agent startup #282
I started looking into the teardown failures as fixing those would likely avoid pods getting stuck. In plugins/routed-eni/cni.go, del() always attempts to talk to ipamd and causes a failure if ipamd isn't responding. Since setting up the host veth is one of the last steps done by add(), del() can use the existence of the host veth as a quick check before attempting to talk to ipamd. That way, in cases where setup failed, teardown can return success quickly regardless of whether ipamd is listening.
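To illustrate the idea, here is a minimal Go sketch of that quick check. The helper names, signatures, and the way the host veth name is passed in are assumptions for illustration, not the plugin's actual code:

```go
// A minimal sketch of the quick check suggested above, not the plugin's actual
// code. Helper names and signatures are illustrative assumptions.
package plugin

import "net"

// hostVethExists reports whether the host-side veth interface is present.
func hostVethExists(name string) bool {
	_, err := net.InterfaceByName(name)
	return err == nil
}

// del is a simplified teardown entry point.
func del(hostVethName string, teardownViaIpamd func() error) error {
	// add() creates the host veth as one of its last steps, so if it is
	// missing, setup never completed and there is nothing to clean up.
	if !hostVethExists(hostVethName) {
		return nil // succeed quickly without contacting ipamd
	}
	// Otherwise perform the normal teardown, which talks to ipamd.
	return teardownViaIpamd()
}
```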
Was unable to try the quick veth check fix, punting to v1.5.
I'm at the same employer that @kc8apf was at and am handling this work now. When I do a
Running the
I am running into the exact same problem when using cluster autoscaling.
Administering a cluster through Argo, but I'm seeing the same error on my pods. Trying to autoscale r5.large spot instances.
I'm in the same situation as @kc8apf. We are using EKS + Cluster Autoscaler to run our GitLab CI runners. The cluster version is 1.13 eks.2 and the CNI version is 1.5.0. Sometimes a pod gets stuck in the ContainerCreating stage and its description looks like:
This only happens when a new node scale-up is triggered by new jobs. After that, the node isn't able to take any pods because of this CNI failure. I compared the successful node's ipamd log (successful_ipamd.log.2019-07-02-18.log) with the failed node's. An interesting finding is that, on the successful node, pod IP assignment happened after all 30 IPs had been added. However, on the failed node, the pod tried to get an IP (description not accurate) before the ENI was even added! Starting from line 58, it tried 12 times to get an IP but failed, apparently because the CNI was not ready. After 12 attempts it gave up, and only then did the CNI get the chance to add the 30 IPs to the IP pool. Any thoughts on how to coordinate the startup sequence? Thanks.
A solution could be to run two containers in the daemonset. However, there is no coordination between them in the Calico install-cni script: https://github.com/projectcalico/cni-plugin/blob/master/k8s-install/scripts/install-cni.sh
I made a naive implementation of the wait mechanism (meaning I added a sleep 10s) here: According to my tests this solves the problem. My tests consist of launching 100 pods at once via Airflow, having the cluster scale from 3 to 15 machines as a result, and seeing if the pods succeed. The version without the sleep fails consistently; with the sleep added it succeeds (at least the two times I tested).
With the above changes applied, I now see the following message:
This is happening to us as well. CNI version: 602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni:v1.5.0
Is there a fix coming down the pipeline for this? It's legitimately affecting production workloads right now.
I have the same problem! CNI version: 602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni:v1.5.0
Wait for the ipamd health check to be SERVING before copying in the CNI binary and config file. #282
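For illustration, here is a minimal Go sketch of that approach: poll ipamd's gRPC health endpoint and only copy the CNI binary and config into place once it reports SERVING. The address (127.0.0.1:50051), the timeouts, and the assumption that ipamd exposes the standard grpc_health_v1 service are mine, not values taken from the project:

```go
// A minimal sketch of waiting for ipamd to report SERVING before installing
// the CNI binary and config. Address, service name, and timeouts are assumed.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// waitForIpamd polls the gRPC health endpoint at addr until it reports SERVING
// or the overall timeout expires.
func waitForIpamd(addr string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		conn, err := grpc.DialContext(ctx, addr, grpc.WithInsecure(), grpc.WithBlock())
		if err == nil {
			resp, herr := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
			conn.Close()
			if herr == nil && resp.Status == healthpb.HealthCheckResponse_SERVING {
				cancel()
				return nil
			}
		}
		cancel()
		time.Sleep(time.Second)
	}
	return fmt.Errorf("ipamd at %s did not report SERVING within %s", addr, timeout)
}

func main() {
	// 127.0.0.1:50051 is an assumed ipamd address for this sketch.
	if err := waitForIpamd("127.0.0.1:50051", 2*time.Minute); err != nil {
		log.Fatal(err)
	}
	// Only now install the CNI binary and config, so kubelet cannot consider
	// the network ready while ipamd is still starting up.
}
```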
I am having the same problem right now. I just migrated from KOPS, which did not have this problem. At this point, my EKS cluster cannot provide the same functionality. That is:
@Mewzyk This is an issue with the CNI provider, not so much EKS, although this CNI is the default on EKS. A much better test would be to install this CNI on Kops and see if the same error occurs.
@Mewzyk I just made a release candidate with a potential fix for this issue. It's not yet a final release and we are still doing testing, but if you would like to test the build, the instructions are here: https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.5.1-rc1
@mogren Working so far. Thanks for the patch.
Thanks for the update. A note: the recent v1.5.1 release does not have the startup fix! I will make a v1.5.2 soon with the quick config fix plus the changes currently in v1.5.1-rc1.
I believe one of the reasons more people are reporting this issue now than before is that:
@mogren Since the fix for the Kubernetes regression is in kubelet, would it be possible to upgrade the version in the AMI to 1.13.8 as a stop-gap until the control plane can also be upgraded so that it's in sync? (See aws/containers-roadmap/issues/411)
@edmorley Hey, thanks a lot for the tip. We can definitely bump the kubelet version in the worker node AMI before the control plane gets updated. That said, we will try to update both versions, since some users have Prometheus configured to alert on mismatched versions.
We created a custom AMI with kubelet version 1.13.8 and we can confirm it fixes this issue. Also, the control plane is now on 1.13.8; is the new AMI coming anytime soon?
I can confirm another configuration that works, as of today (2019-08-13).
I posted a job with a specific affinity and toleration to target an ASG that had 0 nodes in it. Cluster autoscaler added a node to the ASG. When the node became ready, there was a little delay, but no
@mattmi88 Ditto!
Fixed for new nodes in v1.5.3 and merged back to master.
I encountered this issue just this morning with a build off the tip of amazon-vpc-cni-k8s
What I found is that while an
@mogren I am facing a similar issue when a new pod gets scheduled to a node.
EKS: 1.17
@tibin-mfl This is a very old issue. Please file a new issue, and can you please share the logs?
Thanks.
My employer runs GitLab CI using the Kubernetes executor, which runs each CI job as a separate pod. Since the workload is bursty, cluster-autoscaler is also enabled. When certain CI stages begin, 100+ pods are created rapidly and a large number of them are unschedulable until the autoscaler has spun up enough new nodes. GitLab CI will wait up to 10m for the pod to start before assuming the job failed.
Occasionally, one of these pods will hit the 10m timeout with the following events according to kubectl describe po:
So, the pod failed during setup due to the CNI plugin failing to connect to aws-k8s-agent. Looking at install-aws.sh, the CNI plugin is installed before aws-k8s-agent starts. kubelet assumes that CNI is ready as soon as the plugin is installed. If there are pods waiting to be scheduled, there is a narrow chance that a pod will be scheduled and CNI will fail because aws-k8s-agent hasn't started responding yet.
Unfortunately, the CNI plugin seems to also report a failure during the sandbox cleanup (again due to connection refused when trying to do CNI teardown) which prevents the pod from being rescheduled or retried until the container is manually removed.
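One way to address that teardown half of the problem, sketched below as a rough illustration (function names, the passed-in release callback, and timeouts are assumptions, not the plugin's actual code), is to treat an unreachable ipamd as best-effort during CNI DEL so kubelet can still clean up the sandbox and reschedule the pod:

```go
// Illustrative sketch only: tolerate an unreachable ipamd during CNI DEL so a
// failed sandbox can still be cleaned up and the pod rescheduled.
package plugin

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

func cmdDel(ipamdAddr string, release func(*grpc.ClientConn) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, ipamdAddr, grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		// Connection refused here usually means aws-k8s-agent never started
		// (so setup failed too) and there is nothing to release; returning
		// success lets kubelet remove the sandbox instead of retrying forever.
		log.Printf("ipamd unreachable during teardown: %v; returning success", err)
		return nil
	}
	defer conn.Close()
	return release(conn)
}
```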