Issue #282 regression in master -- Pods stuck in ContainerCreating if created/deleted while the aws-node on the same instance is (re)deploying #601

Closed
lgarrett-isp opened this issue Aug 27, 2019 · 9 comments


lgarrett-isp commented Aug 27, 2019

It looks like #282 was closed with the assumption that v1.5.3 resolves the behavior; however, builds against the tip of master show the exact same behavior, so it seems reasonable that either the issue still exists in v1.5.3 or the next release has regressed.

Copied/added more detail from my comment at the bottom of #282:
I encountered this (#282) issue just this morning with a build off the tip of the amazon-vpc-cni-k8s master branch. Config:

  • EKS K8S control plane is v1.13.8-eks-a977ba
  • EKS kubelets on the worker nodes are v1.13.8-eks-cd3eb0
  • aws-cni v1.5.3 does not support our instance type, even though a PR to add it was checked into master before the release, so I pulled master, built our own image, and updated the aws-node daemonset to use that image. I worry that there may be a regression in master if #282 (Race condition between CNI plugin install and aws-k8s-agent startup) was root-caused and solved in v1.5.3.

I am unfamiliar with the inner workings of the CNI but am slowly working my way through this repo. Based on observing the behavior of the components involved in a variety of situations, I have high confidence that this issue specifically occurs when Pods are removed from or added to a node at the exact time that the aws-node Pod on that instance is (re)deploying. As best I can tell, the window of potential failure per instance is small, so failure is not guaranteed, but there are several cases where this can happen that make it very risky for our production use cases:

  1. Instance scale-up for any reason: as the Node comes up, Pods may be scheduled to the instance before aws-node is fully initialized.
  2. CNI upgrades/deployments cause the aws-node Pods to re-deploy: the rolling update replaces the aws-node Pod on one Node at a time, so at any given moment only Pods being added to or removed from the instance whose aws-node is currently updating are at risk (see the sketch after this list).
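
To make that window concrete, here is a minimal sketch of how to watch it on a single node during an upgrade. The node name is a placeholder, and it assumes aws-node runs as a DaemonSet in kube-system with the default RollingUpdate strategy:

```bash
# Sketch only: observe the per-node rollout window described above.
# NODE is a placeholder; adjust for your cluster.
NODE=ip-10-0-1-23.ec2.internal

# The DaemonSet rolling-update strategy (maxUnavailable defaults to 1) is why
# only one node's aws-node pod is down at any given time during an upgrade.
kubectl -n kube-system get daemonset aws-node -o jsonpath='{.spec.updateStrategy}{"\n"}'

# Watch pods on that node while the upgrade rolls through; anything scheduled
# to or removed from NODE during the aws-node Terminating->Running gap is
# inside the failure window.
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=$NODE -w
```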

I am unable to reproduce this bug on "stable" clusters where no Pod scheduling activity is taking place while aws-node is (re)deploying, so I believe this is specifically about Pods being removed from or added to a Node during the exact time that the aws-node Pod on that instance is being deployed. There is a related (but so far non-impacting) error that also shows up while I am creating these situations to test (number 1 below); the original issue as reported in #282 is number 2 below.

  1. Pods removed from the specific Node on which the aws-node deployment is currently in progress can hit a race condition: kubelet reaches out to the CNI to release the IP, but ipamD and the plugin have no reference to the deleted Pod's containers or IP and report failure for DeleteNetwork. This continues indefinitely (or is at least still going hours later) until aws-node is re-deployed without any other Pod scheduling activity. So far it does not appear to affect IP allocation or any other behavior, other than kubelet requesting "Delete Network" once per minute for Pods that no longer exist and the CNI and kubelet reporting the error.
  2. The original error reported in #282 (Race condition between CNI plugin install and aws-k8s-agent startup): Pods that try to spin up on the instance while the aws-node Pod for that specific Node is re-deploying sometimes get stuck permanently in "ContainerCreating" with the following error (kubectl delete pod immediately remedies it; see the sketch after the error output):
failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "7af9a6ac803344de0bb5fcd79c3852107cf473ce8f89a6c8b8cf08cd99dad226" network for pod "datadog-kube-state-metrics-cc4669b55-vtzz9": NetworkPlugin cni failed to set up pod "datadog-kube-state-metrics-cc4669b55-vtzz9_default" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "7af9a6ac803344de0bb5fcd79c3852107cf473ce8f89a6c8b8cf08cd99dad226" network for pod "datadog-kube-state-metrics-cc4669b55-vtzz9": NetworkPlugin cni failed to teardown pod "datadog-kube-state-metrics-cc4669b55-vtzz9_default" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"
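
A rough sketch of how to spot both symptoms. The pod name is the one from the log above; the ipamd introspection port is an assumption and may differ by release:

```bash
# Sketch: spot and clear pods stuck in ContainerCreating (item 2 above).
kubectl get pods --all-namespaces | grep ContainerCreating

# Workaround noted above: deleting the stuck pod lets it be rescheduled cleanly.
kubectl -n default delete pod datadog-kube-state-metrics-cc4669b55-vtzz9

# For item 1, compare what kubelet keeps trying to delete with what ipamd
# thinks is allocated. Run on the affected node; 127.0.0.1:61679 is assumed to
# be the agent's local introspection listener and may differ by version.
curl -s http://127.0.0.1:61679/v1/pods | python -m json.tool
```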
mogren self-assigned this on Sep 3, 2019
mogren added the priority/P1 label (Must be staffed and worked currently or soon. Is a candidate for next release) on Sep 3, 2019
mogren added this to the v1.6 milestone on Sep 3, 2019
@InAnimaTe

@lgarrett-isp can you test this against 1.5.3? It might be a P0 if it occurs on that version, since that's presently the default version deployed with new clusters (1.14.6 for me).
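
For anyone running that test, a hedged sketch; the ECR registry below is the us-west-2 example, so adjust for your region:

```bash
# Sketch: check which CNI image a cluster is actually running before testing.
kubectl -n kube-system describe daemonset aws-node | grep -i image

# Pin v1.5.3 for the test (example registry/region; use the ECR repository
# that matches your cluster's region).
kubectl -n kube-system set image daemonset/aws-node \
  aws-node=602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.5.3
```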


mogren commented Oct 2, 2019

@lgarrett-isp Could you try with the v1.5.5 release? This commit is relevant: b0b2fc1

(Note: Edited to say v1.5.5 instead of v1.5.4, since that release had a serious issue #641)
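
A sketch of that upgrade, assuming the release manifest lives under config/ in the tag; the exact path has moved between releases, so check the v1.5.5 release notes for the canonical URL:

```bash
# Sketch: apply the v1.5.5 release manifest and let it roll one node at a time.
kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.5.5/config/v1.5/aws-k8s-cni.yaml
kubectl -n kube-system rollout status daemonset/aws-node
```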

@lgarrett-isp
Author

We're standing up a new cluster today and I was going to test this but I realized the instance type we'll be using is not included until v1.6. I'll exercise deployments on 1.6.0RC2 as part of my normal work to see if I can reproduce the behavior there--if not, I'll try to find some time to test against some different instance types.


droidls commented Jan 19, 2020

@mogren Please avoid 1.5.4, as it has known bugs that were patched in 1.5.5. If needed, please try either 1.5.5 or 1.5.3.


mogren commented Jan 22, 2020

@lgarrett-isp Did this work with v1.6.0-rc5? Pods should retry and not get stuck, even if ipamd is restarting.
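
For reference, a minimal sketch of the repro this issue describes, under the assumption that the aws-node pods carry the k8s-app=aws-node label; the node name and test pod are placeholders:

```bash
# Sketch of the repro: churn pods on one node while its aws-node pod restarts.
NODE=ip-10-0-1-23.ec2.internal

# Kill the aws-node pod on that node (the DaemonSet recreates it).
kubectl -n kube-system delete pod \
  "$(kubectl -n kube-system get pods -l k8s-app=aws-node \
      --field-selector spec.nodeName=$NODE -o jsonpath='{.items[0].metadata.name}')"

# Immediately create a pod pinned to the same node and watch whether it retries
# until ipamd is back (expected with v1.6.0-rc5) or sticks in ContainerCreating.
kubectl run cni-race-test --image=busybox --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"'"$NODE"'"}}' -- sleep 60
kubectl get pod cni-race-test -w
```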


mogren commented Apr 15, 2020

It's likely this error is related to kubernetes/kubernetes#79398, since aws-node can take a few seconds to come up after a restart. Have you seen this issue since?
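
A small sketch to measure that gap from the affected node, using the gRPC port that appears in the error from the original report:

```bash
# Sketch: run on the worker node to time how long the ipamd gRPC endpoint
# (127.0.0.1:50051, the port in the error above) refuses connections while
# aws-node restarts.
while true; do
  if nc -z -w 1 127.0.0.1 50051; then state=up; else state=down; fi
  echo "$(date -u +%H:%M:%S) ipamd grpc ${state}"
  sleep 1
done
```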


mogren commented Apr 22, 2020

@lgarrett-isp Could you try and reproduce this using v1.6.1?


mogren commented Jun 3, 2020

Please reopen if this is still an issue.

mogren closed this as completed on Jun 3, 2020
@nandeeshb09

We are facing this issue on 1.6.1 as well, and our EKS version is 1.14.9.
Is there any fix for this issue?
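
If you can still reproduce it, a sketch of the diagnostics that usually help; the log paths are the CNI's defaults on the worker node, and the support script location may vary by CNI version:

```bash
# Sketch: logs to attach when reporting this. Paths may differ if the
# CNI log-file environment variables are overridden.
sudo tail -n 200 /var/log/aws-routed-eni/ipamd.log
sudo tail -n 200 /var/log/aws-routed-eni/plugin.log

# The bundled support script collects ipamd state, ENI/IP allocations, and
# system info into one archive.
sudo bash /opt/cni/bin/aws-cni-support.sh
```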
