Race condition between CNI plugin install and aws-k8s-agent startup #282
I started looking into the teardown failures as fixing those would likely avoid pods getting stuck. In plugins/routed-eni/cni.go, del() always attempts to talk to ipamd and causes a failure if ipamd isn't responding. Since setting up the host veth is one of the last steps done by add(), del() can use the existence of the host veth as a quick check before attempting to talk to ipamd. That way, in cases where setup failed, teardown can return success quickly regardless of whether ipamd is listening.
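To illustrate the idea, here is a minimal Go sketch of that quick check. The helper names, signatures, and the way the host veth name is passed in are assumptions for illustration, not the plugin's actual code:

```go
// A minimal sketch of the quick check suggested above, not the plugin's actual
// code. Helper names and signatures are illustrative assumptions.
package plugin

import "net"

// hostVethExists reports whether the host-side veth interface is present.
func hostVethExists(name string) bool {
	_, err := net.InterfaceByName(name)
	return err == nil
}

// del is a simplified teardown entry point.
func del(hostVethName string, teardownViaIpamd func() error) error {
	// add() creates the host veth as one of its last steps, so if it is
	// missing, setup never completed and there is nothing to clean up.
	if !hostVethExists(hostVethName) {
		return nil // succeed quickly without contacting ipamd
	}
	// Otherwise perform the normal teardown, which talks to ipamd.
	return teardownViaIpamd()
}
```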
Was unable to try the quick veth check fix, punting to v1.5.
I'm at the same employer that @kc8apf was at and am handling this work now. When I do a
Running the
I am running into the exact same problem when using cluster autoscaling.
Administering a cluster through Argo, but I'm seeing the same error on my pods. Trying to autoscale r5.large spot instances.
I'm in the same situation as @kc8apf. We are using EKS + Cluster Autoscaler to run our GitLab CI runners. The cluster version is 1.13 eks.2 and the CNI version is 1.5.0. Sometimes a pod gets stuck in the ContainerCreating stage and its description looks like:
This only happens when a new node scale-up is triggered by new jobs. After that, the node isn't able to take any pods because of this CNI failure. I compared the successful node's ipamd log (successful_ipamd.log.2019-07-02-18.log) with the failed node's. An interesting finding is that, on the successful node, pod IP assignment happened after all 30 IPs had been added. However, on the failed node, the pod tried to get an IP (description not accurate) before the ENI was even added! Starting from line 58, it tried 12 times to get an IP but failed, apparently because the CNI was not ready. After 12 attempts it gave up, and only then did the CNI get the chance to add the 30 IPs to the IP pool. Any thoughts on how to coordinate the startup sequence? Thanks.
A solution could be to run two containers in the daemonset. However, there is no coordination between them in the Calico install-cni script: https://github.com/projectcalico/cni-plugin/blob/master/k8s-install/scripts/install-cni.sh
I made a naive implementation of the wait mechanism (meaning I added a sleep 10s) here: According to my tests this solves the problem. My tests consist of launching 100 pods at once via Airflow, having the cluster scale from 3 to 15 machines as a result, and seeing if the pods succeed. The version without the sleep fails consistently; with the sleep added it succeeds (at least the two times I tested).
With the above changes applied, I now see the following message:
This is happening to us as well. CNI version: 602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni:v1.5.0
Is there a fix coming down the pipeline for this? It's legitimately affecting production workloads right now.
I have the same problem! CNI version: 602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni:v1.5.0
Wait for the ipamd health check to be SERVING before copying in the CNI binary and config file. #282
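For illustration, here is a minimal Go sketch of that approach: poll ipamd's gRPC health endpoint and only copy the CNI binary and config into place once it reports SERVING. The address (127.0.0.1:50051), the timeouts, and the assumption that ipamd exposes the standard grpc_health_v1 service are mine, not values taken from the project:

```go
// A minimal sketch of waiting for ipamd to report SERVING before installing
// the CNI binary and config. Address, service name, and timeouts are assumed.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// waitForIpamd polls the gRPC health endpoint at addr until it reports SERVING
// or the overall timeout expires.
func waitForIpamd(addr string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		conn, err := grpc.DialContext(ctx, addr, grpc.WithInsecure(), grpc.WithBlock())
		if err == nil {
			resp, herr := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
			conn.Close()
			if herr == nil && resp.Status == healthpb.HealthCheckResponse_SERVING {
				cancel()
				return nil
			}
		}
		cancel()
		time.Sleep(time.Second)
	}
	return fmt.Errorf("ipamd at %s did not report SERVING within %s", addr, timeout)
}

func main() {
	// 127.0.0.1:50051 is an assumed ipamd address for this sketch.
	if err := waitForIpamd("127.0.0.1:50051", 2*time.Minute); err != nil {
		log.Fatal(err)
	}
	// Only now install the CNI binary and config, so kubelet cannot consider
	// the network ready while ipamd is still starting up.
}
```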
I am having the same problem right now. I just migrated from KOPS, which did not have this problem. At this point, my EKS cluster cannot provide the same functionality. That is:
@Mewzyk This is an issue with the CNI provider, not so much EKS, although this CNI is the default on EKS. A much better test would be to install this CNI on Kops and see if the same error occurs.
@Mewzyk I just made a release candidate with a potential fix for this issue. It's not yet a final release and we are still doing testing, but if you would like to test the build, the instructions are here: https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.5.1-rc1
@mogren Working so far. Thanks for the patch.
Thanks for the update. A note: the recent v1.5.1 release does not have the startup fix! I will make a v1.5.2 soon with the quick config fix plus the changes currently in v1.5.1-rc1.
I believe one of the reasons more people are reporting this issue now than before is that:
@mogren Since the fix for the Kubernetes regression is in kubelet, would it be possible to upgrade the version in the AMI to 1.13.8 as a stop-gap until the control plane can also be upgraded so that it's in sync? (See aws/containers-roadmap/issues/411)
@edmorley Hey, thanks a lot for the tip. We can definitely bump the kubelet version in the worker node AMI before the control plane gets updated. That said, we will try to update both versions, since some users have Prometheus configured to alert on mismatched versions.
We created a custom AMI with kubelet version 1.13.8 and we can confirm it fixes this issue. Also, the control plane is now on 1.13.8; is the new AMI coming anytime soon?
I can confirm another configuration that works, as of today (2019-08-13).
I posted a job with a specific affinity and toleration to target an ASG that had 0 nodes in it. Cluster autoscaler added a node to the ASG. When the node became ready, there was a little delay, but no
@mattmi88 Ditto!
Fixed for new nodes in v1.5.3 and merged back to master.
I encountered this issue just this morning with a build off the tip of amazon-vpc-cni-k8s
What I found is that while an
@mogren I am facing a similar issue when a new pod gets scheduled to a node.
EKS: 1.17
@tibin-mfl This is a very old issue. Please file a new issue, and can you please share the logs?
Thanks.
My employer runs GitLab CI using the Kubernetes executor, which runs each CI job as a separate pod. Since the workload is bursty, cluster-autoscaler is also enabled. When certain CI stages begin, 100+ pods are created rapidly and a large number of them are unschedulable until the autoscaler has spun up enough new nodes. GitLab CI will wait up to 10m for the pod to start before assuming the job failed.
Occasionally, one of these pods will hit the 10m timeout with the following events according to kubectl describe po:
So, the pod failed during setup due to the CNI plugin failing to connect to aws-k8s-agent. Looking at install-aws.sh, the CNI plugin is installed before aws-k8s-agent starts. kubelet assumes that CNI is ready as soon as the plugin is installed. If there are pods waiting to be scheduled, there is a narrow chance that a pod will be scheduled and CNI will fail because aws-k8s-agent hasn't started responding yet.
Unfortunately, the CNI plugin seems to also report a failure during the sandbox cleanup (again due to connection refused when trying to do CNI teardown) which prevents the pod from being rescheduled or retried until the container is manually removed.
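One way to address that teardown half of the problem, sketched below as a rough illustration (function names, the passed-in release callback, and timeouts are assumptions, not the plugin's actual code), is to treat an unreachable ipamd as best-effort during CNI DEL so kubelet can still clean up the sandbox and reschedule the pod:

```go
// Illustrative sketch only: tolerate an unreachable ipamd during CNI DEL so a
// failed sandbox can still be cleaned up and the pod rescheduled.
package plugin

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

func cmdDel(ipamdAddr string, release func(*grpc.ClientConn) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, ipamdAddr, grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		// Connection refused here usually means aws-k8s-agent never started
		// (so setup failed too) and there is nothing to release; returning
		// success lets kubelet remove the sandbox instead of retrying forever.
		log.Printf("ipamd unreachable during teardown: %v; returning success", err)
		return nil
	}
	defer conn.Close()
	return release(conn)
}
```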