
networkPlugin cni failed to teardown pod "traefik-6d4b5f9c9f-7sfg7_default" network: invalid version "": the version is empty] #1412

Closed
nodesocket opened this issue Mar 25, 2021 · 14 comments


@nodesocket commented Mar 25, 2021

Just upgraded our EKS cluster to Kubernetes version 1.19. The worker nodes are instance type t3a.xlarge. When re-deploying Traefik via Helm, we get the following error on the Traefik pods as well as on a pod named storeconfig-job-1-x9cc5.

storeconfig-job-1-x9cc5                   0/1     ContainerCreating   0          6m1s
traefik-6d4b5f9c9f-7sfg7                  0/1     ContainerCreating   0          6m3s
traefik-6d4b5f9c9f-dhb9v                  0/1     ContainerCreating   0          6m3s
traefik-6d4b5f9c9f-ffbvs                  0/1     ContainerCreating   0          6m3s

Warning FailedCreatePodSandBox 3m38s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "bd85f8205cf2b59a5dc0230f82c24aba121487f802b17519528897839b2b8290" network for pod "traefik-6d4b5f9c9f-7sfg7": networkPlugin cni failed to set up pod "traefik-6d4b5f9c9f-7sfg7_default" network: add cmd: failed to assign an IP address to container, failed to clean up sandbox container "bd85f8205cf2b59a5dc0230f82c24aba121487f802b17519528897839b2b8290" network for pod "traefik-6d4b5f9c9f-7sfg7": networkPlugin cni failed to teardown pod "traefik-6d4b5f9c9f-7sfg7_default" network: invalid version "": the version is empty]

Any ideas? Deploying Traefik via Helm worked previously on the old Kubernetes version.

@nodesocket (Author) commented Mar 26, 2021

Updating my Kubernetes worker nodes to the latest AWS EKS 1.19 AMI and then rebuilding the worker nodes fixed the issue. So I'm assuming the problem was baked into the EKS 1.19 AMI I was using previously.

@mgoltzsche commented Mar 26, 2021

Ah, it turns out that after CNI was changed to require the cniVersion field to be set to a proper version, a newer CNI release changed this again so that a default cniVersion is used when the field is empty, for backward compatibility. That's probably why updating the node VM image fixed it.
That said, the default EKS CNI configuration should still specify the cniVersion field explicitly.
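To illustrate the behavior change, here is a minimal Python sketch (not the actual Go code in the CNI libraries, just an assumed approximation of the validation logic) of why an absent or empty cniVersion produces this error on older releases, while newer releases fall back to a default:

```python
import json

# Hypothetical default applied by newer CNI releases for backward compatibility.
DEFAULT_CNI_VERSION = "0.1.0"

def check_cni_version(conflist_json: str, fallback_to_default: bool = False) -> str:
    """Roughly mimic CNI conflist version validation.

    Older behavior (fallback_to_default=False): an absent or empty
    cniVersion is rejected with 'the version is empty'.
    Newer behavior (fallback_to_default=True): an empty cniVersion
    is replaced by a default version instead of failing.
    """
    conf = json.loads(conflist_json)
    version = conf.get("cniVersion", "")
    if version == "":
        if fallback_to_default:
            return DEFAULT_CNI_VERSION
        return 'invalid version "": the version is empty'
    return version

# A conflist without cniVersion, like the one shipped with older aws-vpc-cni:
broken = '{"name": "aws-cni", "plugins": [{"type": "aws-cni"}]}'
fixed = '{"cniVersion": "0.3.1", "name": "aws-cni", "plugins": [{"type": "aws-cni"}]}'

print(check_cni_version(broken))        # invalid version "": the version is empty
print(check_cni_version(broken, True))  # 0.1.0
print(check_cni_version(fixed))         # 0.3.1
```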

@jayanthvn (Contributor)

Hi @nodesocket

I would like to review the kubelet logs; could you please share the log dump with me (varavaj@amazon.com)? You can generate it by running this script on the instance: sudo bash /opt/cni/bin/aws-cni-support.sh. Also, which CNI version are you using?

Thanks.

@nodesocket (Author) commented Mar 26, 2021

@jayanthvn and @mgoltzsche I don't have the previous EKS EC2 worker node instances anymore. I recreated them using the latest EKS AMI, amazon-eks-node-1.19-v20210322, dated March 23, 2021 at 3:35:53 PM UTC-5. Also, for good measure, I changed the EKS EC2 worker instance type from t3.xlarge to r5.large at the same time. This resolved the following error:

 networkPlugin cni failed to teardown pod "traefik-6d4b5f9c9f-7sfg7_default" network: invalid version "": the version is empty]

Previously, when things were broken with the above error, I was using the AMI amazon-eks-node-1.19-v20210208, dated February 8, 2021 at 6:10:23 PM UTC-6.

@jayanthvn (Contributor)

Thanks @nodesocket

We will try to repro this. Can you please share the EKS and CNI versions you were running prior to the 1.19 upgrade? As far as I know, a cluster upgrade shouldn't upgrade CNI, since the manifests are applied only on new cluster creation. Did you upgrade CNI after upgrading to 1.19?

jayanthvn self-assigned this Mar 31, 2021
@nodesocket (Author) commented Mar 31, 2021

@jayanthvn I don't even know how to upgrade CNI. The previous Kubernetes version we were running on EKS was 1.15, and I upgraded the EKS master versions one by one: 1.16, 1.17, 1.18, and finally 1.19. Then I re-created the Kubernetes worker EC2 instances using the 1.19 AMI from earlier, amazon-eks-node-1.19-v20210208, dated February 8, 2021.

That resulted in the error/issue. Then I upgraded the Kubernetes workers to the latest 1.19 AMI, amazon-eks-node-1.19-v20210322, dated March 23, 2021 at 3:35:53 PM, and changed the instance size from t3.xlarge to r5.large. That resolved the issue.

@jayanthvn (Contributor)

Thanks @nodesocket, I will try to repro this.

@nodesocket (Author)

@jayanthvn I just edited/added more to my original reply above; want to make sure you see it.

@nodesocket (Author) commented Apr 6, 2021

@jayanthvn I just looked at the logs from this cluster, and even though all the pods are running successfully, I am seeing network aws-cni/aws-cni: invalid version "": the version is empty logged constantly. I think this indicates there is still an underlying issue. Happy to help you debug this.

Apr 6 17:46:46 ip-192-168-10-130 messages E0406 22:46:46.108166    4728 cni.go:387] Error deleting default_de-novo-5f7d9cbd4c-27nd4/ce11fcd43909c0c37c655ad81db1670897d266452ae43e8443b642f3878dd4ea from network aws-cni/aws-cni: invalid version "": the version is empty
Apr 6 17:46:46 ip-192-168-10-130 messages E0406 22:46:46.108701    4728 remote_runtime.go:140] StopPodSandbox "ce11fcd43909c0c37c655ad81db1670897d266452ae43e8443b642f3878dd4ea" from runtime service failed: rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "de-novo-5f7d9cbd4c-27nd4_default" network: invalid version "": the version is empty
Apr 6 17:46:46 ip-192-168-10-130 messages E0406 22:46:46.108723    4728 kuberuntime_gc.go:176] Failed to stop sandbox "ce11fcd43909c0c37c655ad81db1670897d266452ae43e8443b642f3878dd4ea" before removing: rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "de-novo-5f7d9cbd4c-27nd4_default" network: invalid version "": the version is empty
Apr 6 17:46:46 ip-192-168-10-130 messages E0406 22:46:46.112298    4728 kuberuntime_gc.go:176] Failed to stop sandbox "9e4a8d92e95007b94b3fb1e190d0bc8766a5f1c1c7c3acb2b37915e9a10ef151" before removing: rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "traefik-d6c55588-lqhwf_default" network: invalid version "": the version is empty
Apr 6 17:46:46 ip-192-168-10-130 messages E0406 22:46:46.119489    4728 remote_runtime.go:140] StopPodSandbox "778d918c73286c103abbb9c8a4ebba53d946f4786cfa7823b8e9569ea7f872ad" from runtime service failed: rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "de-novo-8645bd8f6d-z2d7c_default" network: invalid version "": the version is empty
Apr 6 17:46:46 ip-192-168-10-130 messages W0406 22:46:46.121024    4728 cni.go:333] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "35b84883a6bfc2fc51c4fb2eb6ce1d96cb5fca3e8ca3086ed6d33c91bfb84680"
Apr 6 17:46:46 ip-192-168-10-130 messages E0406 22:46:46.126290    4728 remote_runtime.go:140] StopPodSandbox "c85f237e471998106b9edcbd77db5e9149d20e3e09e6a1d67084644bf100ad30" from runtime service failed: rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "traefik-6d4b5f9c9f-cm2w4_default" network: invalid version "": the version is empty
Apr 6 17:46:46 ip-192-168-10-130 messages E0406 22:46:46.129651    4728 kuberuntime_gc.go:176] Failed to stop sandbox "8038b56f619a3aac0b4183589e767e679c7a45a45eed7678c7b13998a08f33ad" before removing: rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "traefik-6d4b5f9c9f-hrq7m_default" network: invalid version "": the version is empty
Apr 6 17:46:46 ip-192-168-10-130 messages E0406 22:46:46.132430    4728 cni.go:387] Error deleting default_api-7446cbcdb4-d4vn7/8489f5b892189864e8296c1d2011ecb76a4576ccec8d26c2e80d6dd90d0d2299 from network aws-cni/aws-cni: invalid version "": the version is empty
Apr 6 17:46:46 ip-192-168-10-130 messages E0406 22:46:46.132931    4728 kuberuntime_gc.go:176] Failed to stop sandbox "8489f5b892189864e8296c1d2011ecb76a4576ccec8d26c2e80d6dd90d0d2299" before removing: rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "api-7446cbcdb4-d4vn7_default" network: invalid version "": the version is empty

@jayanthvn (Contributor)

Can you please email me (varavaj@amazon.com) the logs generated by the script sudo bash /opt/cni/bin/aws-cni-support.sh?

Thanks.

@jayanthvn (Contributor)

I looked into Justin's cluster.

The CNI version in the cluster is 1.5.0, and from k8s 1.16 onwards cniVersion must be set (ref #604), but it is missing from the CNI spec on the nodes. The fix that adds cniVersion to the CNI spec (#605) is in v1.5.4 onwards (https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.5.4). Hence the invalid version errors:

sh-4.2$ kubectl describe daemonset aws-node -n kube-system | grep Image | cut -d "/" -f 2
amazon-k8s-cni:v1.5.0

CNI spec -

{
  "name": "aws-cni",
  "plugins": [
    {
      "name": "aws-cni",
      "type": "aws-cni",
      "vethPrefix": "eni"
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true},
      "snat": true
    }
  ]
}
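For comparison, here is a sketch of the same conflist with cniVersion set explicitly, which is roughly what the fixed releases generate (the exact version string "0.3.1" is an assumption; check the release notes of your aws-vpc-cni version):

```json
{
  "cniVersion": "0.3.1",
  "name": "aws-cni",
  "plugins": [
    {
      "name": "aws-cni",
      "type": "aws-cni",
      "vethPrefix": "eni"
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true},
      "snat": true
    }
  ]
}
```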

@jayanthvn (Contributor)

Also, updating the cluster will not update the add-ons. Please refer to this:

"Amazon EKS doesn't modify any of your Kubernetes add-ons when you update a cluster. After updating your cluster, we recommend that you update your add-ons to the versions listed in the following table for the new Kubernetes version that you're updating to. Steps to accomplish this are included in the update procedures."

https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html

@nodesocket (Author)

This was the problem, and it was fixed by manually adding the CNI add-on and selecting the latest version. It would have been great if there were some sort of indication in the web console about running an incompatible CNI version when upgrading Kubernetes versions, along with how to fix it. I feel like lots of users are going to get bitten by this if they upgrade old EKS clusters.


@jayanthvn (Contributor) commented Apr 7, 2021

Glad it got fixed :) Yes, we will take the feedback and see how we can get the documentation/console warnings updated. Closing the issue now; please do reach out if you need any more info.
