Issue #282 regression in master -- Pods stuck in ContainerCreating if created/deleted while the aws-node on the same instance is (re)deploying #601
@lgarrett-isp can you test this against 1.5.3? It might be a p0 if it occurs on that version since that's presently the default version deployed with new clusters (1.14.6 for me).
@lgarrett-isp Could you try with the v1.5.5 release? This commit is relevant: b0b2fc1 (Note: Edited to say v1.5.5 instead of v1.5.4, since that release had a serious issue #641)
We're standing up a new cluster today and I was going to test this, but I realized the instance type we'll be using is not included until v1.6. I'll exercise deployments on 1.6.0RC2 as part of my normal work to see if I can reproduce the behavior there--if not, I'll try to find some time to test against some different instance types.
@mogren Please try to avoid 1.5.4 as it is a known version with bugs that have been patched in 1.5.5. If needed, please try either 1.5.5 or 1.5.3.
@lgarrett-isp Did this work with v1.6.0-rc5? Pods should retry and not get stuck, even if ipamd is restarting.
It's likely this error is related to kubernetes/kubernetes#79398, since aws-node can take a few seconds to come back up after a restart. Have you seen this issue since?
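A quick way to observe that restart window, as a sketch (assumes the standard CNI manifest, which labels the DaemonSet Pods with k8s-app=aws-node; adjust the selector if your manifest differs):

```sh
# Watch each node's aws-node Pod cycle through terminating/starting during a rollout;
# Pods scheduled onto a node inside that gap are the ones at risk.
kubectl -n kube-system get pods -l k8s-app=aws-node -o wide -w
```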
@lgarrett-isp Could you try and reproduce this using
Please reopen if this is still an issue. |
We are also facing this issue on 1.6.1, with EKS version 1.14.9.
It looks like #282 was closed with the assumption that v1.5.3 resolves the behavior; however, builds against the tip of master are showing the exact same behavior (so it seems reasonable that either the issue still exists in v1.5.3 or the next release has a regression of this behavior).
Copied/added more detail from comment at the bottom of #282:
I encountered this (#282) issue just this morning with a build off the tip of the amazon-vpc-cni-k8s master branch. Config:
I am unfamiliar with the inner workings of the CNI but am slowly working my way through this repo--however, based on observing the behavior of the components involved in a variety of situations, I have high confidence that this issue specifically occurs when Pods are removed from or added to a node at the exact time that the aws-node Pod on the instance is (re)deploying. As best I can tell, the window in which this can happen is small (per instance), so failure is not guaranteed--but there are several cases where this can happen that make it very risky for our production use-cases:

- aws-node Pods re-deploying -- the deployment strategy updates one Pod/Node at a time, so only Pods being added to/removed from the instance where aws-node is currently updating are at risk at any given time

I am unable to reproduce this bug on "stable" clusters where no Pod scheduling activity is taking place while aws-node is (re)deploying, so I believe this is specifically about Pods being removed from or added to a Node during the exact time that the aws-node Pod on that instance is being deployed. There is a related (but so far non-impacting) error log that also shows up while I am creating these situations to test (number 1 below)--the original issue as reported in #282 is number 2 below.

1. The related error log appears whenever aws-node is re-deployed without any other Pod scheduling activity. It does not appear to affect IP allocation or any other behavior so far, other than kubelet requesting "Delete Network" once per minute for Pods that no longer exist and the CNI and kubelet reporting the error.
2. The original behavior reported in #282: Pods stuck in ContainerCreating.
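For anyone trying to reproduce this, here is a minimal sketch of the scenario described above (assumptions: kubectl 1.15+ for rollout restart, the DaemonSet is named aws-node in kube-system as in the standard manifest, and churn-test is just a throwaway deployment name):

```sh
# Start a rolling update of the CNI DaemonSet (updates one node at a time by default).
kubectl -n kube-system rollout restart daemonset/aws-node

# While the rollout is in progress, churn Pods so that some are created/deleted on the
# node whose aws-node Pod is currently restarting. churn-test is an arbitrary test deployment.
kubectl create deployment churn-test --image=nginx
kubectl scale deployment churn-test --replicas=10
kubectl scale deployment churn-test --replicas=0
kubectl scale deployment churn-test --replicas=10

# After the rollout settles, look for Pods wedged in ContainerCreating.
kubectl -n kube-system rollout status daemonset/aws-node
kubectl get pods -o wide | grep ContainerCreating
```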