IPAM connectivity failed when upgrading from v1.5.5 to 1.6.0 #872
@laghao How was the CNI updated? If only the image tag was updated, the issue could be that the required …
In your logs I see …
What is your …
Tried to increase …
Not because the liveness probe failed …
I updated the CNI using the aws-vpc-cni Helm chart. The direct upgrade looks broken somehow.
I tried using the Helm chart to upgrade from v1.5.5 to v1.6.0 and it took my …
@laghao what's your kubelet version on worker nodes? If you are using the EKS AMI to launch your worker nodes, can you give us the AMI ID as well?
Hi everyone, I am upgrading from v1.5.5 to v1.6.0 and the CNI pod fails to start:
Hi @hahasheminejad! Is there any chance you might be able to run the …
I got the same problem today after updating using eksctl
Suddenly, new containers could not start (timeouts from the CNI), so I created a new node group. The new nodes would not become Ready, and the AWS CNI logged this error: "timed out waiting for IPAM daemon to start."
I figured out my issue; hopefully this will help someone else if they find it via Google. The upgrade … I fixed this by removing and re-adding the iamserviceaccount using eksctl.
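The fix described above can be sketched as follows. This is a minimal sketch, assuming eksctl access to the cluster; the cluster name is a parameter, and "aws-node" in kube-system is the service account the CNI uses.

```shell
# Sketch: recreate the CNI's iamserviceaccount with eksctl (assumes eksctl
# is installed and configured; cluster name is a placeholder argument).
recreate_cni_serviceaccount() {
    cluster="$1"
    # Remove the stale iamserviceaccount first...
    eksctl delete iamserviceaccount --cluster "$cluster" \
        --namespace kube-system --name aws-node
    # ...then recreate it with the managed CNI policy attached.
    eksctl create iamserviceaccount --cluster "$cluster" \
        --namespace kube-system --name aws-node \
        --attach-policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy \
        --approve --override-existing-serviceaccounts
}
# Usage: recreate_cni_serviceaccount my-cluster
```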
I have reported this to eksctl here.
@hahasheminejad I noticed in your logs that your worker node was completely overwhelmed and pods were constantly getting OOM-killed:
Did you see this issue on other nodes as well?
Hi, we are facing the same issue while upgrading from Kubernetes 1.13 (eks.9) to Kubernetes 1.14 (eks.9), using CNI v1.6.1 (from CNI v1.5.5), with the dockershim mount in place. We tried the following: removed and recreated the service account (the SA was initially created by eksctl). Logs: … Please let us know if there is any workaround, or when the fix is expected.
Hello, as with the comment above, we are also seeing the same issue updating vpc-cni from v1.5.5 to v1.6.1. We have 4 clusters (which are theoretically all configured the same way), all on v1.15.11-eks-af3caf. DNS and kube-proxy versions are up to date across all 4 clusters, in line with the table in the official AWS guide. The CNI VPC plugin was updated successfully on 3 clusters. In the last cluster, the DaemonSet rolled out successfully to 6 of 7 nodes; on the last node the pod crash-looped due to failing health checks. I bounced it and it crash-looped again.
There are other workloads already scheduled on this node, so I had to roll back to v1.5.5 in this cluster only. I'm looking at resources and attempting to triage, and may raise this with AWS Support separately, but I'm adding it here for more information on this issue occurring in general and to keep the issue fresh.
Thanks for reporting the issue @njgibbon! Did you run the … Also, if rolling back, would v1.5.7 be an option?
I've faced similar issues after upgrading to EKS 1.16, upgrading the VPC CNI plugin to 1.6.1, and the latest kube-proxy, 1.16.8.
After troubleshooting this with AWS Support, rolling back to our previous EKS 1.15 configuration, i.e. AWS VPC CNI plugin 1.5.7 and kube-proxy 1.15.11, worked for me on EKS 1.16. Please note that terminating your existing EC2 instances might (or will?) be needed in order to get back to a running state. Out of the 1.16 upgrade "prerequisites", the only mandatory one, if you were already on 1.15, is to make sure you have converted all YAML files to the new (v1) API versions. No more betas. https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html#1-16-prequisites You might want to hold off on any other changes until AWS communicates further on this issue.
For kube-proxy on 1.16, make sure that …
This was very hard to track down, but @mogren's comment was what solved it for me. My cluster was created ~2 years ago, and I tried to downgrade the CNI plugin back to 1.5.x, but that also didn't solve the problem. I had to manually edit my … I think it'd be great to mention that in the upgrade guide.
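A sketch of the kind of manual edit described above, assuming the problem was a flag that older clusters still passed to kube-proxy but that was removed in Kubernetes 1.16 (the `--resource-container` flag is my assumption here; mogren's comment is truncated). With cluster access you would round-trip the DaemonSet manifest through a filter like this:

```shell
# Sketch: remove the dropped --resource-container flag from a saved copy
# of the kube-proxy DaemonSet manifest. With cluster access:
#   kubectl -n kube-system get daemonset kube-proxy -o yaml > kube-proxy.yaml
#   strip_removed_flag kube-proxy.yaml > fixed.yaml
#   kubectl apply -f fixed.yaml
strip_removed_flag() {
    # Delete any manifest line passing --resource-container to kube-proxy.
    sed '/--resource-container/d' "$1"
}
```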
@brianstorti We've updated the doc in awsdocs/amazon-eks-user-guide#125; it should go live soon.
@mogren Hi, I experienced almost the same issue as @njgibbon. I am running multiple clusters, but the upgrade only failed on one node in one cluster. I sent the result of running … Hope it helps.
@spacebarley Hi! Thanks for the logs, they made it clear that you ran into another issue:
The subnet is out of IPs. First, since you were running the v1.5.x CNI earlier, check for leaked ENIs in your account. They will be marked as Available (blue dot) in the AWS Console, and have a tag, …
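A query like the following can surface leaked ENIs from the command line. This is a minimal sketch, assuming a configured AWS CLI; it only prints the command rather than running it, since describe calls need live credentials. (The exact tag key the CNI puts on its ENIs is elided in the comment above, so this filters on status only.)

```shell
# Sketch: build the AWS CLI command that lists ENIs stuck in the
# "available" state (candidates for leaked CNI ENIs).
leaked_eni_query() {
    echo "aws ec2 describe-network-interfaces" \
         "--filters Name=status,Values=available" \
         "--query NetworkInterfaces[].NetworkInterfaceId" \
         "--output text"
}
# Execute it for real with: sh -c "$(leaked_eni_query)"
```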
Closing this issue since it has turned into a bucket of multiple upgrade issues. The things we have seen so far:
Please open a new issue if you find any new problems.
* add configurable timeout for ipamd startup: adds a configurable timeout to the aws-k8s-agent (ipamd) startup in the entrypoint.sh script. Increases the default timeout from ~30 seconds to 60 seconds. Users can set the IPAMD_TIMEOUT_SECONDS environment variable to change the timeout. Related: #625, #865, #872
* This is a local gRPC call, so just try every 1 second indefinitely. Since we have a liveness probe, we can rely on it to kill the pod.

Co-authored-by: Claes Mogren <mogren@amazon.com>
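The retry logic the PR describes can be sketched as a tiny shell loop: poll a local health check once per second, forever, and let the liveness probe restart the pod if ipamd never becomes healthy. The health-check command shown in the comment is illustrative, not the literal entrypoint.sh code.

```shell
# Sketch: retry the given health-check command every second until it
# succeeds; the pod's liveness probe handles the "never succeeds" case.
wait_for_ipamd() {
    until "$@"; do
        echo "Waiting for ipamd health check to pass..."
        sleep 1
    done
}
# Illustrative usage against a local gRPC health endpoint:
#   wait_for_ipamd ./grpc-health-probe -addr 127.0.0.1:50051
```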
* d938e5e (Jayanth Varavani, 2020-07-01): Json o/p for logs from entrypoint.sh
* 2d20308 (Nathan Prabhu, 2020-06-29): bugfix: make metrics-helper docker logging statement multi-arch compatible
* bf9ded3 (Claes Mogren, 2020-06-27): Use install command instead of cp
* e3b7dbb (Gyuho Lee, 2020-06-29): scripts/lib: bump up tester to v1.4.0
* c369480 (Claes Mogren, 2020-06-28): Some refresh cleanups
* 8c266e9 (Claes Mogren, 2020-06-28): Run staticcheck and clean up
* 8dfc5b1 (Jayanth Varavani, 2020-06-28): Fix integration test script for code pipeline (aws#1062)
* 52306be (Murcherla, 2020-06-24): minor nits, fast follow up to PR 903
* 4ddd248 (Claes Mogren, 2020-06-14): Add bandwidth plugin
* 6d35fda (Robert Sheehy, 2020-05-22): Chain interface to other CNI plugins
* 30f98bd (Penugonda, 2020-06-25): removed custom networking default vars, introspection var
* aa8b818 (Penugonda, 2020-06-24): updated manifest configs with default env vars
* a073d66 (Nithish Murcherla, 2020-06-24): refresh subnet/CIDR information every 30 seconds and update ip rules to map pods (aws#903)
* a0da387 (Claes Mogren, 2020-06-24): Default to random-fully (aws#1048)
* 9fea153 (Claes Mogren, 2020-06-14): Update probe settings: reduce readiness probe startup delay, increase liveness polling period, reduce shutdown grace period to 10 seconds
* ad7df34 (Jay Pipes, 2020-06-24): Remove timeout for ipamd startup (aws#874); related: aws#625, aws#865, aws#872
* 1af40d2 (Jayanth Varavani, 2020-06-19): Changelog and config file changes for v1.6.3
* 14d5135 (Ari Becker, 2020-06-17): Generated the different configurations
* 00395cb (Ari Becker, 2020-06-16): Fix discovery RBAC issues in Kubernetes 1.17
* 7e224af (Gyuho Lee, 2020-06-15): scripts/lib/aws: bump up tester to v1.3.9, with improvements to the log fetcher and MNG deletion when the metrics server is installed
* 36286ba (Claes Mogren, 2020-06-15): Remove Printf and format test (aws#1027)
* af54066 (Gyuho Lee, 2020-06-13): scripts/lib/aws: tester v1.3.6, enable color outputs (aws#1025); falls back to plain text if $TERM is unsupported
* 6d52e1b (jayanthvn, 2020-06-12): Added a warning message if delete on termination is set to false for the primary ENI (aws#1024)
I updated my EKS cluster to 1.15.10 and that worked.
Then I tried to update the CNI from v1.5.5 to v1.6.0 on my test nodes (2). As it's a DaemonSet, one aws-node pod was running and the other had the following error:
I deleted the pod, but it still fails with the same error:
More details:
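When an aws-node pod crash-loops like this, a few kubectl checks help with triage. A minimal sketch, assuming kubectl access to the cluster; the pod name is a placeholder argument.

```shell
# Sketch: gather state and logs for a crash-looping aws-node pod.
aws_node_diag() {
    # All CNI pods and which node each runs on.
    kubectl -n kube-system get pods -l k8s-app=aws-node -o wide
    # Events and probe failures for the broken pod.
    kubectl -n kube-system describe pod "$1"
    # --previous shows logs from the crashed container instance.
    kubectl -n kube-system logs "$1" --previous
}
# Usage: aws_node_diag aws-node-xxxxx
```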