[EKS] Increased pod density on smaller instance types #138
@tabern could you please elaborate a bit on what this feature brings? Right now the number of pods on a single node is limited by the number of ENIs and secondary IP addresses the instance type supports.
That's exactly the problem. What if we want to run 30 very small pods on a t3.small?
@max-rocket-internet gotcha. Does it mean instances will get more IPs/ENIs, or are changes coming to the CNI?
It means we need to run a different CNI that is not limited by the number of IPs. Currently it's more or less a DIY endeavour, but it would be great to have a supported CNI from AWS for this use 🙂
Yeah, running weave-net (and overriding the pods-per-node limitations) isn't much of an additional maintenance burden, but it would have been nice to have that available by default.
Any idea how exactly you are going to proceed with this one?
Sorry it's been a bit of time without a lot of information. We're committed to enabling this feature and will be wrapping this into the next generation VPC CNI plugin. Please let us know what you think on #398
The comment by @mikestef9 on #398 refers to this issue for updates regarding the specific issue of pod density. Since there has been no update on this issue in over a year, could someone from the EKS team give us an update?
We are working on integrating with an upcoming VPC feature that will allow many more IP addresses to be attached per instance type. For example, a t3.medium will go from allowing 15 IPs per instance to 240, a 1500% increase. No timeline to share, but it is a high priority for the team.
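For reference, that figure follows from the published ENI limits, assuming they stay the same: a t3.medium supports 3 ENIs with 6 IPv4 addresses each, leaving 15 addresses usable for pods as secondary IPs. If each of those secondary slots instead holds a /28 prefix of 16 addresses, that gives 15 × 16 = 240 pod IPs, i.e. a 16x (1500%) increase.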
@mikestef9 hi! Will pod density be increased for bigger instance types as well?
It will be a 1500% increase in IP addresses on every instance type. However, I don't feel that matters on larger instance types. For example, a c5.4xl today supports 234 IP addresses for pods. Which particular instance type are you using?
We are using m5.xlarge and still have enough resources to schedule additional pods, but we are out of free IPs.
Got it. I consider "smaller" to mean any instance type 2xl and below. In this case, m5.xlarge will go from supporting 56 IPs to 896, which will be more than enough to consume all instance resources by pods.
Pods can be very very small 😉 But nevertheless, this is a great step.
Just to be clear, this is 16x the IPs while still using IPv4? Whereas longer term, for huge numbers of IPs etc., it's expected that EKS will shift to IPv6 instead?
Exactly. The same upcoming EC2/VPC feature that will allow us to increase IPv4s per instance will also allow us to allocate a /80 IPv6 address block per instance. That's what we will leverage for IPv6 support, which is a top priority for us in 2021.
@mikestef9 Sounds awesome. I'm currently evaluating EKS and the current pod limitation is a blocker for our workload. Could you please share an approximate release date? Thanks.
@mikestef9, is it possible to optionally take sig-scalability's defined thresholds into account and limit the max pods per node on a managed nodegroup to min(110, 10*#cores)? Reference: https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md
What problem are you trying to solve by having us change to that formula? You think 110 is too high for an instance type like m5.large? This feature is targeted at such users of m5.large, where the previous limit of 29 was too low. The max pods formula for MNG now caps the recommended value at 110 pods for instance types with fewer than 30 vCPUs and 250 pods for anything larger.
This is based on internal testing done by our scalability team. However, it's impossible to simulate all possible combinations of real world workloads. As a best practice, you should be setting resource requests/limits on your pods. The point is that IP addresses are no longer the limiting factor for pods per node when using prefix assignment.
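As a rough sketch (this is not the official max pods calculator, just the ENI math plus the 110/250 caps described above; the m5.large numbers in the example are its published ENI limits):

```bash
# Rough sketch: recommended max pods with prefix delegation.
# Standard ENI math, with each secondary IP slot expanded to a /28 prefix
# (16 addresses), capped at 110 (<30 vCPUs) or 250 (>=30 vCPUs).
recommended_max_pods() {
  local enis=$1 ips_per_eni=$2 vcpus=$3
  local ip_capacity=$(( enis * (ips_per_eni - 1) * 16 + 2 ))
  local cap=110
  (( vcpus >= 30 )) && cap=250
  echo $(( ip_capacity < cap ? ip_capacity : cap ))
}

recommended_max_pods 3 10 2   # m5.large: 3 ENIs x 10 IPs, 2 vCPUs -> 110
```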
I understand that this feature solves the issue of too few pods being allowed on a node that can potentially handle more. Depending on the type of workloads, the opposite may also be needed, i.e. setting max pods on the node to less than what the IP/ENI limit would impose. Setting maximums like the 110 and 250 is a good start, but it would be much better if it were a nodegroup setting that one can use to self-restrict nodes to a lower number. We do set requests/limits per pod, but running at high pod densities leaves few resources to be shared by burstable workloads. For example, some Java apps need the extra resource buffer to scale up as opposed to out. When there are too many of these on a single node, memory pressure causes pods to get evicted from the node. While this is normal behavior, the startup time of such pods is not the best, so we'd rather prevent such occurrences as much as possible.
Understood, that makes sense. Today, you can override the max pods setting when using managed node groups, but it requires extra effort. You need to use a launch template, specify the EKS AMI ID as a "custom" image ID in the LT, and then manually call the bootstrap script yourself in the user data, along these lines:
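(A minimal sketch, not a drop-in configuration: the cluster name "my-cluster" and the max pods value are placeholders, and depending on your setup you may also need to pass the API server endpoint and cluster CA to bootstrap.sh.)

```bash
#!/bin/bash
# Launch template user data for a "custom" EKS-optimized AMI used with a
# managed node group. Cluster name and --max-pods value are placeholders.
set -ex
/etc/eks/bootstrap.sh my-cluster \
  --use-max-pods false \
  --kubelet-extra-args '--max-pods=110'
```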
I think it's a valid feature request to expose max pods directly through the MNG API, can you open a separate containers roadmap issue with that request? Side note - this will be much easier with native Bottlerocket support in managed node groups #950, which is coming soon. You'll simply need to add the following in the launch template user data (no need to set the image ID in LT)
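(A sketch of what that Bottlerocket user data could look like; it uses Bottlerocket's TOML settings format, and the max-pods value here is only a placeholder.)

```toml
# Bottlerocket launch template user data (TOML); max-pods value is a placeholder.
[settings.kubernetes]
max-pods = 110
```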
Request submitted: #1492 Thanks!
@sstoyanovucsd terraform-aws-modules/terraform-aws-eks has a working pattern that doesn't require a custom AMI and terraform-aws-modules/terraform-aws-eks#1433 shows how to optimise this as well as set other bootstrap.sh options.
Blog is out that dives into this feature in more detail https://aws.amazon.com/blogs/containers/amazon-vpc-cni-increases-pods-per-node-limits/ |
Thanks @mikestef9 !
@gpothier have you updated the kubelet args to override the defaults? |
@stevehipwell No I haven't, but according to the blog post @mikestef9 linked, the MNG should take care of that:
Or did I misunderstand something? |
@gpothier sorry I hadn't read the blog post, I'll leave this one to @mikestef9. |
@mikestef9 what happens when we're using custom networking and ENI prefixes with the official AMI? We manually set the kubelet max pods value ourselves.
@mikestef9 could you also confirm that the other EKS ecosystem components work correctly with prefix delegation enabled?
@stevehipwell I tested AWS Load Balancer Controller v2.2 on a cluster with prefix delegation enabled.
Support for prefix delegation was in v2.2.2 of LB controller https://github.com/kubernetes-sigs/aws-load-balancer-controller/releases/tag/v2.2.2 @gpothier are you specifying an image id in a launch template used with the managed node group? @stevehipwell all of the VPC CNI settings that may affect max pods are taken into account, including custom networking |
@mikestef9 I didn't create the launch template explicitly, so I didn't specify an image id myself, but the launch template does exist and its image id is ami-0bb07d9c8d6ca41e8. The cluster and node group were created by terraform, using the terraform-aws-eks module. |
I'm not very familiar with the Terraform EKS module. But if it is creating a launch template and specifying an image id (even if it's the official EKS AMI image id), that's considered a custom AMI to managed node groups, and the max pods override won't be set. |
Thanks @mikestef9 this is actually the answer I needed to my above question. |
Thanks a lot @mikestef9. As far as I can tell, the launch templates were created by the MNG, not by terraform.
Also, in the AWS console, the node groups' launch templates appear to have been created by the MNG: the Created by field says "arn:aws:sts::015328124252:assumed-role/AWSServiceRoleForAmazonEKSNodegroup/EKS". |
Hi @mikestef9, do you think you could give me a pointer on how to troubleshoot the Managed Node Group not updating the max pods per node configuration? Given that as far as I can tell I meet all the requirements, in particular the launch template is the one created by the MNG so I don't have control over it, I am a bit at a loss. |
Do you have multiple instance types specified in the managed node group? If so, MNG uses the minimum value calculated across all instance types. So if you have a non-Nitro instance like m4.2xlarge, for example, the node group will use 58 as the max pods value.
Thanks @mikestef9, that was it! Although all the existing instances were indeed Nitro (t3.medium), the allowed instances included non-Nitro ones. I recreated the MNG allowing only t3.medium and t3.small instances and the pod limit is now 110. This raises a question though: shouldn't the max pods per node property be set independently for each node, according to the node's capacity?
Glad to hear it. Managed node groups must specify the max pods value as part of the launch template that we create behind the scenes for each node group. That launch template is associated with an autoscaling group that we also create. The autoscaling group gets assigned the list of desired instance types, but there is no way to know ahead of time which instance type the ASG will spin up. So to be safe, we pick the lowest value of all instance types in the list. |
Wouldn't it be much better to determine the max pod value during bootstrapping of the node (i.e. in the bootstrap.sh)? In this case it would work with different node types, because each node type could get the appropriate max pod value. |
The recommended value of max pods is a function of instance type and the version/configuration of VPC CNI running on the cluster. At the time of bootstrapping, we don't have a way to determine the latter. We can't make a call to the API server (k get ds aws-node) and retrieve the CNI version/settings because calls from there will not be authenticated until the aws-auth config map is updated first. |
I see. Thank you for the explanation. |
@mikestef9 it looks like the AMI bootstrap hasn't been updated to work correctly with this change, and if used on a small instance it could cause resource issues for the kubelet.
@mikestef9 related to my comment above, how come EKS has decided to go over the Kubernetes large clusters guide recommendation of a maximum of 110 pods per node?
Tell us about your request
All instance types using the VPC CNI plugin should support at least the Kubernetes recommended pods per node limits.
Which service(s) is this request for?
EKS
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
Today, the max number of pods that can run on worker nodes using the VPC CNI plugin is limited by the number of ENIs and secondary IPv4 addresses the instance supports. This number is lower if you are using CNI custom networking, which removes the primary ENI for use by pods. VPC CNI should support at least the Kubernetes recommended pods per node thresholds, regardless of networking mode. Not supporting these maximums means nodes may run out of IP addresses before CPU/memory is fully utilized.
Are you currently working around this issue?
Using larger instance types, or adding more nodes to a cluster that aren't fully utilized.
Additional context
Take the m5.2xlarge for example, which has 8 vCPUs. Based on Kubernetes recommended limits of pods per node of min(110, 10*#cores), this instance type should support 80 pods. However when using custom networking today, it only supports 44 pods.
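For reference, assuming the published m5.2xlarge limits of 4 ENIs with 15 IPv4 addresses each: the standard VPC CNI limit works out to 4 × (15 − 1) + 2 = 58 pods, and with custom networking, which gives up the primary ENI, 3 × (15 − 1) + 2 = 44 pods, versus the recommended min(110, 10 × 8) = 80.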
Edit: The feature is released: https://aws.amazon.com/blogs/containers/amazon-vpc-cni-increases-pods-per-node-limits/