v1.16.3 High CPU Usage on Medium Instance Size #2807
Great debugging, this looks like a bug in the rapid-scaling logic: https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L660 The code keeps trying to assign IPs until the desired state is met, but if no more ENIs can be attached to the node, this for loop spins indefinitely. This is an unfortunate miss. While we fix this, I recommend downgrading to …
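The failure mode described above can be sketched as follows. This is a simplified model, not the actual ipamd code: the `datastore` type, its fields, and the `tryAllocate` helper are invented for illustration.

```go
package main

import "fmt"

// datastore is a toy stand-in for ipamd's IP datastore: it tracks how
// many IPs are assigned against a hard capacity cap.
type datastore struct {
	assigned, capacity, warmTarget int
}

// short reports how many more IPs are wanted to satisfy the warm pool.
func (d *datastore) short() int {
	want := d.warmTarget - (d.capacity - d.assigned)
	if want < 0 {
		return 0
	}
	return want
}

// tryAllocate attaches another IP if the instance limit allows it,
// returning false once the instance cannot take any more.
func (d *datastore) tryAllocate() bool {
	if d.capacity >= 6 { // instance ENI/IP limit reached
		return false
	}
	d.capacity++
	return true
}

func main() {
	// All capacity already assigned, but the warm pool still wants 2 IPs.
	d := &datastore{assigned: 6, capacity: 6, warmTarget: 2}

	// Buggy shape: `for d.short() > 0 { d.tryAllocate() }` spins forever
	// once tryAllocate can no longer make progress. The fix is to bail
	// out of the loop when no progress is possible.
	for d.short() > 0 {
		if !d.tryAllocate() {
			fmt.Println("instance ENI/IP limit reached; backing off")
			break // without this, the loop livelocks and burns CPU
		}
	}
}
```

Without the `break`, the datastore keeps reporting "too low" while allocation keeps failing, which matches the 100% CPU and constant log spew reported in this issue.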
Thanks for the quick response!
No additional ENI attached
@aballman I did a bit more digging and have a fix to prevent the livelock, but what is curious is why the datastore still returns "too low" in your case. When all ENIs are attached and all IPs (including warm IPs) are assigned to those ENIs, max pods should be hit and the datastore should stop trying to allocate IPs. It should actually happen sooner than max pods, since some pods run in the host networking namespace. For …
My clusters are set up with Karpenter and have a lot of autoscaling happening. I don't have any medium instances at the moment. I'll see if I can get one launched and report back.
The …
Well, that is definitely wrong. An instance cannot support more pods than it has IPs available to assign to them. We should track that issue separately; there are two options: …
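The ceiling being discussed comes from the instance's ENI and per-ENI IPv4 limits. A minimal sketch of the standard EKS max-pods formula follows; the t3.medium limits (3 ENIs, 6 IPv4 addresses each) and the m7i.2xlarge limits (4 ENIs, 15 each) are AWS's published instance values, while the helper itself is purely illustrative:

```go
package main

import "fmt"

// maxPods computes the ENI-based pod ceiling used by the standard EKS
// AMI formula. One IP on each ENI is the ENI's own primary address and
// cannot go to a pod; the +2 accounts for pods that run in the host
// network namespace and need no VPC IP.
func maxPods(enis, ipsPerENI int) int {
	return enis*(ipsPerENI-1) + 2
}

func main() {
	// t3.medium: 3 ENIs, 6 IPv4 addresses per ENI
	fmt.Println(maxPods(3, 6)) // prints 17
	// m7i.2xlarge: 4 ENIs, 15 IPv4 addresses per ENI
	fmt.Println(maxPods(4, 15)) // prints 58
}
```

If a node advertises a max-pods value above this ceiling, the CNI can be asked for more IPs than the instance can ever provide, which is the mismatch suspected in this comment.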
I'm using Bottlerocket, would that be the same repo to report the issue to?
I'd like to report that we are also seeing this on v1.16.3, right after an upgrade to EKS 1.26 and AWS VPC CNI 1.16.3. It's a night and day difference, with the exact same error. The arrow notes the moment we upgraded. The VPC CNI pods were restarted and the CPU usage seems to have "normalized" by itself; I can't explain what changed aside from the pods no longer being in CrashLoopBackOff:
This leads me to conclude this issue is a bit... intermittent? I'm more inclined to believe some pods were moved to different nodes, which reduced the number of IPs needed and resolved the issue above. We use the default EKS AMI and a bog-standard installation of the VPC CNI; our only configuration is resource-related: `{"resources":{"limits":{"memory":"128Mi"},"requests":{"cpu":"5m","memory":"50Mi"}}}`

We'll attempt the downgrade and report back.
@aballman ah, good catch, so I am not sure if https://github.com/bottlerocket-os/bottlerocket/issues is the right repo to report this issue or not. Would you mind filing an AWS support case? That will definitely get it routed to the right team, albeit a bit more slowly than the convenience of GitHub Issues.
@Angelin01 which instance type are you seeing this on? And are you using Bottlerocket as well? As a side note, #2810 should merge soon and be released in …
@jdn5126 We do not use Bottlerocket. I also talked to another team that just created a test EKS cluster today on 1.28, not using SpotInst Ocean and instead using AWS node groups; same issue.
@Angelin01 Got it, thanks for confirming. Seems there are more edge cases than just max pods, so we'll have to push to get v1.16.4 out sooner.
For reference, these nodes have the following number of pods, respectively:
Two of these are VERY close to the limit here, so it may indeed be related to max pods. But yes, I'd still verify other cases; as I said, a brand new cluster experienced the same issue. I don't know if you have a "yank" mechanism, but I'd pull the 1.16.3 release.
Confirmed downgrade to v1.16.2 fixes the issue:
@jdn5126 I appreciate the fast responses, I hope the fix doesn't prove to be too problematic. I also apologize for the comment spam.
@Angelin01 happy to help, this is an unfortunate regression that we need to make sure we have coverage for in the future. We do not have a "yank" mechanism, just documentation recommending against updating, or updating with caution, and then hopefully a quick turnaround in releasing a patched version.

@aballman I spoke briefly with the EKS Nodes team, and it looks like the incorrect max pods is a Bottlerocket issue, since they set …
@jdn5126 Thanks, the follow-up is much appreciated. I'll file an issue there.
We have just bumped into this issue. We run a fleet of 100+ EKS clusters, and we upgraded the CNI plugin to the latest version in about a dozen of them. Initially we didn't see any issue, but after a couple of days we noticed that our logging ingestion (and cost) had gone through the roof (we have centralized logging for …). We released a hot-patch to pin the version of the AWS CNI plugin to …. We run the AWS-provided EKS AMI, not Bottlerocket.
For those, like us, using the AWS EKS add-ons to manage this sort of component, this feels like a sub-par solution. I think that operators who hand over the responsibility of managing these components to AWS expect a quick fix process for identified issues like this one.
Mid-March can be about two weeks away - there's still potential for …
@PerGon Yep, we fully understand and are working on a patch release as soon as possible, so it will be before March 15th. Pulling a release from the field is a bad idea for many reasons, so our primary mechanisms for alerting are EKS console notices and release notes: https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.16.3
Understandable. Hence the suggestion:
Releasing something like a … But I understand a lot of other things could be in the way of doing that. I was just thinking out loud, trying to help reduce the chance that others would face the same issue by upgrading to the latest version.
Yes, we are targeting a quick release.
Can confirm rolling back to … fixes the issue.

For those using terraform-eks-module, the following configuration will pin the VPC-CNI EKS add-on:

```hcl
cluster_addons = {
  vpc-cni = {
    # https://github.com/aws/amazon-vpc-cni-k8s/issues/2807
    addon_version = "v1.16.2-eksbuild.1"
    # most_recent = true
  }
}
```
VPC CNI … The EKS Managed Addon should be available in all regions within the next 48 hours.
I am going to close this issue now that …
@jeremyruffell it takes ~24-48 hours after the GitHub release for the EKS Managed Addon to be available in all regions, with …
What happened:

I updated to v1.16.3. Specifically on `medium` instance sizes, CPU usage by the `aws-node` pod is maxed out.

On a `t3.medium` (I have also seen this happen on a `c6g.medium`):

On an `m7i.2xlarge`:

Attach logs

This log is spewing constantly in `/var/log/aws-routed-eni/ipamd.log` on the `medium`; other instance sizes show a typical volume of logs.

What you expected to happen:

Typical CPU usage for `medium` instances.

How to reproduce it (as minimally and precisely as possible):

Spin up a `medium` size node in an EKS cluster with the v1.16.3 CNI. In my case it's a `t3.medium`.

Anything else we need to know?:

Environment:

- Kubernetes version (use `kubectl version`): `v1.29.0-eks-c417bb3`
- CNI version: `v1.16.3`
- OS (e.g. `cat /etc/os-release`): `Bottlerocket OS 1.19.1 (aws-k8s-1.29)`
- Kernel (e.g. `uname -a`): `6.1.72`