Issue #282 regression in master -- Pods stuck in ContainerCreating if created/deleted while the aws-node on the same instance is (re)deploying #601

Closed
lgarrett-isp opened this issue Aug 27, 2019 · 9 comments


lgarrett-isp commented Aug 27, 2019

It looks like #282 was closed with the assumption that v1.5.3 resolves the behavior; however, builds against the tip of master show the exact same behavior, so it seems reasonable that either the issue still exists in v1.5.3 or the next release has regressed.

Copied/added more detail from my comment at the bottom of #282:
I encountered this (#282) issue just this morning with a build off the tip of the amazon-vpc-cni-k8s master branch. Config:

  • EKS K8S control plane is v1.13.8-eks-a977ba
  • EKS kubelets on the worker nodes are v1.13.8-eks-cd3eb0
  • aws-cni v1.5.3 does not support our instance type, even though a PR to add it was checked into master before the release, so I pulled master, built our own image, and updated the aws-node daemonset to use that image. I worry that there may be a regression in master if #282 (Race condition between CNI plugin install and aws-k8s-agent startup) was root-caused and solved in v1.5.3.

I am unfamiliar with the inner workings of the CNI but am slowly working my way through this repo. Based on observing the behavior of the components involved in a variety of situations, I have high confidence that this issue specifically occurs when Pods are removed from or added to a node at the exact time that the aws-node Pod on that instance is (re)deploying. As best I can tell, the window of potential failure per instance is small, so failure is not guaranteed, but there are several cases where this can happen that make it very risky for our production use cases:

  1. Instance scale-up for any reason: as the Node comes up, Pods may be scheduled to the instance before aws-node is fully initialized.
  2. CNI upgrades/deployments cause the aws-node Pods to re-deploy: the rolling update replaces the aws-node Pod on one Node at a time, so at any given moment only Pods being added to or removed from the instance whose aws-node is currently updating are at risk (see the sketch after this list).
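
To make that window concrete, here is a minimal sketch of how to watch it on a single node during an upgrade. The node name is a placeholder, and it assumes aws-node runs as a DaemonSet in kube-system with the default RollingUpdate strategy:

```bash
# Sketch only: observe the per-node rollout window described above.
# NODE is a placeholder; adjust for your cluster.
NODE=ip-10-0-1-23.ec2.internal

# The DaemonSet rolling-update strategy (maxUnavailable defaults to 1) is why
# only one node's aws-node pod is down at any given time during an upgrade.
kubectl -n kube-system get daemonset aws-node -o jsonpath='{.spec.updateStrategy}{"\n"}'

# Watch pods on that node while the upgrade rolls through; anything scheduled
# to or removed from NODE during the aws-node Terminating->Running gap is
# inside the failure window.
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=$NODE -w
```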

I am unable to reproduce this bug on "stable" clusters where no Pod scheduling activity is taking place while aws-node is (re)deploying, so I believe this is specifically about Pods being removed from or added to a Node during the exact time that the aws-node Pod on that instance is being deployed. There is a related (but so far non-impacting) error that also shows up while I am creating these situations to test (number 1 below); the original issue as reported in #282 is number 2 below.

  1. Pods removed from the specific Node on which the aws-node deployment is currently in progress can hit a race condition: kubelet reaches out to the CNI to release the IP, but ipamD and the plugin have no reference to the deleted Pod's containers or IP and report failure for DeleteNetwork. This continues indefinitely (or is at least still going hours later) until aws-node is re-deployed without any other Pod scheduling activity. So far it does not appear to affect IP allocation or any other behavior, other than kubelet requesting "Delete Network" once per minute for Pods that no longer exist and the CNI and kubelet reporting the error.
  2. The original error reported in #282 (Race condition between CNI plugin install and aws-k8s-agent startup): Pods that try to spin up on the instance while the aws-node Pod for that specific Node is re-deploying sometimes get stuck permanently in "ContainerCreating" with the following error (kubectl delete pod immediately remedies it; see the sketch after the error output):
failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "7af9a6ac803344de0bb5fcd79c3852107cf473ce8f89a6c8b8cf08cd99dad226" network for pod "datadog-kube-state-metrics-cc4669b55-vtzz9": NetworkPlugin cni failed to set up pod "datadog-kube-state-metrics-cc4669b55-vtzz9_default" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "7af9a6ac803344de0bb5fcd79c3852107cf473ce8f89a6c8b8cf08cd99dad226" network for pod "datadog-kube-state-metrics-cc4669b55-vtzz9": NetworkPlugin cni failed to teardown pod "datadog-kube-state-metrics-cc4669b55-vtzz9_default" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"
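
A rough sketch of how to spot both symptoms. The pod name is the one from the log above; the ipamd introspection port is an assumption and may differ by release:

```bash
# Sketch: spot and clear pods stuck in ContainerCreating (item 2 above).
kubectl get pods --all-namespaces | grep ContainerCreating

# Workaround noted above: deleting the stuck pod lets it be rescheduled cleanly.
kubectl -n default delete pod datadog-kube-state-metrics-cc4669b55-vtzz9

# For item 1, compare what kubelet keeps trying to delete with what ipamd
# thinks is allocated. Run on the affected node; 127.0.0.1:61679 is assumed to
# be the agent's local introspection listener and may differ by version.
curl -s http://127.0.0.1:61679/v1/pods | python -m json.tool
```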
mogren self-assigned this on Sep 3, 2019
mogren added the priority/P1 label (Must be staffed and worked currently or soon. Is a candidate for next release) on Sep 3, 2019
mogren added this to the v1.6 milestone on Sep 3, 2019
@InAnimaTe

@lgarrett-isp can you test this against 1.5.3? It might be a P0 if it occurs on that version, since that's presently the default version deployed with new clusters (1.14.6 for me).
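
For anyone running that test, a hedged sketch; the ECR registry below is the us-west-2 example, so adjust for your region:

```bash
# Sketch: check which CNI image a cluster is actually running before testing.
kubectl -n kube-system describe daemonset aws-node | grep -i image

# Pin v1.5.3 for the test (example registry/region; use the ECR repository
# that matches your cluster's region).
kubectl -n kube-system set image daemonset/aws-node \
  aws-node=602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.5.3
```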


mogren commented Oct 2, 2019

@lgarrett-isp Could you try with the v1.5.5 release? This commit is relevant: b0b2fc1

(Note: Edited to say v1.5.5 instead of v1.5.4, since that release had a serious issue #641)
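
A sketch of that upgrade, assuming the release manifest lives under config/ in the tag; the exact path has moved between releases, so check the v1.5.5 release notes for the canonical URL:

```bash
# Sketch: apply the v1.5.5 release manifest and let it roll one node at a time.
kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.5.5/config/v1.5/aws-k8s-cni.yaml
kubectl -n kube-system rollout status daemonset/aws-node
```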

@lgarrett-isp
Author

We're standing up a new cluster today and I was going to test this but I realized the instance type we'll be using is not included until v1.6. I'll exercise deployments on 1.6.0RC2 as part of my normal work to see if I can reproduce the behavior there--if not, I'll try to find some time to test against some different instance types.


droidls commented Jan 19, 2020

@mogren Please avoid 1.5.4, as it has known bugs that were patched in 1.5.5. If needed, please try either 1.5.5 or 1.5.3.


mogren commented Jan 22, 2020

@lgarrett-isp Did this work with v1.6.0-rc5? Pods should retry and not get stuck, even if ipamd is restarting.
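
For reference, a minimal sketch of the repro this issue describes, under the assumption that the aws-node pods carry the k8s-app=aws-node label; the node name and test pod are placeholders:

```bash
# Sketch of the repro: churn pods on one node while its aws-node pod restarts.
NODE=ip-10-0-1-23.ec2.internal

# Kill the aws-node pod on that node (the DaemonSet recreates it).
kubectl -n kube-system delete pod \
  "$(kubectl -n kube-system get pods -l k8s-app=aws-node \
      --field-selector spec.nodeName=$NODE -o jsonpath='{.items[0].metadata.name}')"

# Immediately create a pod pinned to the same node and watch whether it retries
# until ipamd is back (expected with v1.6.0-rc5) or sticks in ContainerCreating.
kubectl run cni-race-test --image=busybox --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"'"$NODE"'"}}' -- sleep 60
kubectl get pod cni-race-test -w
```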


mogren commented Apr 15, 2020

It's likely this error is related to kubernetes/kubernetes#79398, since aws-node can take a few seconds to come up after a restart. Have you seen this issue since?
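
A small sketch to measure that gap from the affected node, using the gRPC port that appears in the error from the original report:

```bash
# Sketch: run on the worker node to time how long the ipamd gRPC endpoint
# (127.0.0.1:50051, the port in the error above) refuses connections while
# aws-node restarts.
while true; do
  if nc -z -w 1 127.0.0.1 50051; then state=up; else state=down; fi
  echo "$(date -u +%H:%M:%S) ipamd grpc ${state}"
  sleep 1
done
```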


mogren commented Apr 22, 2020

@lgarrett-isp Could you try and reproduce this using v1.6.1?


mogren commented Jun 3, 2020

Please reopen if this is still an issue.

mogren closed this as completed on Jun 3, 2020
@nandeeshb09

We are facing this issue on 1.6.1 as well, and our EKS version is 1.14.9.
Is there any fix for this issue?
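
If you can still reproduce it, a sketch of the diagnostics that usually help; the log paths are the CNI's defaults on the worker node, and the support script location may vary by CNI version:

```bash
# Sketch: logs to attach when reporting this. Paths may differ if the
# CNI log-file environment variables are overridden.
sudo tail -n 200 /var/log/aws-routed-eni/ipamd.log
sudo tail -n 200 /var/log/aws-routed-eni/plugin.log

# The bundled support script collects ipamd state, ENI/IP allocations, and
# system info into one archive.
sudo bash /opt/cni/bin/aws-cni-support.sh
```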
