
[EKS] Increased pod density on smaller instance types #138

Closed
tabern opened this issue Jan 30, 2019 · 83 comments
Labels: EKS Amazon Elastic Kubernetes Service

@tabern
Contributor

tabern commented Jan 30, 2019

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request
All instance types using the VPC CNI plugin should support at least the Kubernetes recommended pods per node limits.

Which service(s) is this request for?
EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
Today, the max number of pods that can run on worker nodes using the VPC CNI plugin is limited by the number of ENIs and secondary IPv4 addresses the instance supports. This number is lower if you are using CNI custom networking, which removes the primary ENI for use by pods. VPC CNI should support at least the Kubernetes recommended pods per node thresholds, regardless of networking mode. Not supporting these maximums means nodes may run out of IP addresses before CPU/memory is fully utilized.

Are you currently working around this issue?
Using larger instance types, or adding more nodes to a cluster that aren't fully utilized.

Additional context
Take the m5.2xlarge for example, which has 8 vCPUs. Based on the Kubernetes recommended limit of min(110, 10*#cores) pods per node, this instance type should support 80 pods. However, when using custom networking today, it only supports 44 pods.
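
For reference, a rough sketch of the arithmetic behind these numbers, assuming the usual ENI-based max pods formula (ENIs x (IPv4 addresses per ENI - 1) + 2) and m5.2xlarge limits of 4 ENIs with 15 IPv4 addresses each:

# m5.2xlarge: assumed 4 ENIs, 15 IPv4 addresses per ENI
echo $(( 4 * (15 - 1) + 2 ))        # 58 max pods with the default VPC CNI configuration
# With CNI custom networking, the primary ENI is no longer available for pods:
echo $(( (4 - 1) * (15 - 1) + 2 ))  # 44 max pods
# Kubernetes guidance for an 8 vCPU node: min(110, 10 * 8) = 80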

Edit Feature is released: https://aws.amazon.com/blogs/containers/amazon-vpc-cni-increases-pods-per-node-limits/

@tabern tabern added the EKS Amazon Elastic Kubernetes Service label Jan 30, 2019
@ghost

ghost commented Jan 31, 2019

@tabern could you please elaborate a bit on what this feature brings?

Right now the number of pods on a single node is limited by the --max-pods flag in kubelet, which for EKS is calculated based on the max number of IP addresses the instance can have. This comes from the AWS CNI driver logic of providing an IP address per pod from the VPC subnet. So for r4.16xl it is 737 pods.
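
(For context, the 737 figure follows the same ENI-based arithmetic sketched above, assuming r4.16xlarge supports 15 ENIs with 50 IPv4 addresses each:)

echo $(( 15 * (50 - 1) + 2 ))  # 737 max pods for r4.16xlarge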

@max-rocket-internet

which for EKS is calculated based on the max number of IP addresses the instance can have

That's exactly the problem. What if we want to run 30 very small pods on a t.small?

@ghost

ghost commented Feb 4, 2019

@max-rocket-internet gotcha. Does it mean instances will get more IPs/ENIs, or that changes are coming to the CNI?

@max-rocket-internet

It means we need to run a different CNI that is not limited by the number of IPs. Currently it is more or less a DIY endeavour, but it would be great to have a supported CNI from AWS for this use case 🙂

@laverya

laverya commented Feb 14, 2019

Yeah, running weave-net (and overriding the pods-per-node limitations) isn't much of an additional maintenance burden but it would have been nice to have that available by default.

@lgg42

lgg42 commented Feb 20, 2019

Any idea how exactly you are going to proceed with this one?
It seems very similar to #71.

@tabern
Contributor Author

tabern commented Jul 4, 2019

Sorry it's been a bit of time without a lot of information. We're committed to enabling this feature and will be wrapping this into the next generation VPC CNI plugin.

Please let us know what you think on #398

@tabern tabern closed this as completed Jul 4, 2019
@mikestef9 mikestef9 changed the title High-density pod scheduling [EKS] Increased pod density on smaller instance types Apr 28, 2020
@mikestef9 mikestef9 reopened this Apr 28, 2020
@mikestef9 mikestef9 self-assigned this Apr 28, 2020
@gitnik

gitnik commented Nov 2, 2020

The comment by @mikestef9 on #398 refers to this issue for updates regarding the specific issue of pod-density. Since there has been no update on this issue in over a year, could someone from the EKS team give us an update?

@mikestef9
Contributor

We are working on integrating with an upcoming VPC feature that will allow many more IP addresses to be attached per instance type. For example, a t3.medium will go from allowing 15 IPs per instance, to 240, a 1500% increase. No timeline to share, but it is a high priority for the team.
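
A rough sketch of where those numbers come from, assuming a t3.medium exposes 3 ENIs with 6 IPv4 slots each and that prefix assignment places a /28 prefix (16 addresses) in each secondary slot:

# The primary address of each ENI is not available for pods:
echo $(( 3 * (6 - 1) ))       # 15 pod IPs per instance today
# With /28 prefix assignment, each secondary slot holds 16 addresses:
echo $(( 3 * (6 - 1) * 16 ))  # 240 pod IPs per instance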

@bambooiris

@mikestef9 hi! Will pod density be increased for bigger instance types as well?
This is very important because we are thinking of switching to a different CNI plugin, but if you increase the IP address count any time soon we will stay with the AWS CNI :)

@mikestef9
Contributor

It will be a 1500% increase in IP addresses on every instance type. However, I don't feel that matters on larger instance types. For example, a c5.4xl today supports 234 IP addresses for pods. Which particular instance type are you using?

@bambooiris

We are using m5.xlarge and still have enough resources to schedule additional pods, but we are out of free IPs.

@mikestef9
Contributor

Got it. I consider "smaller" to mean any instance type 2xl and below. In this case, m5.xlarge will go from supporting 56 IPs to 896, which will be more than enough for pods to consume all instance resources.

@billinghamj

Pods can be very very small 😉 But nevertheless, this is a great step

@billinghamj

Just to get clarity: this is 16x the IPs while still using IPv4? Whereas longer term, for huge numbers of IPs etc., it's expected that EKS will shift to IPv6 instead?

@mikestef9
Contributor

Exactly. The same upcoming EC2/VPC feature that will allow us to increase IPv4s per instance will also allow us to allocate a /80 IPv6 address block per instance. That's what we will leverage for IPv6 support, which is a top priority for us in 2021.

@davidroth

We are working on integrating with an upcoming VPC feature that will allow many more IP addresses to be attached per instance type. For example, a t3.medium will go from allowing 15 IPs per instance, to 240, a 1500% increase. No timeline to share, but it is a high priority for the team.

@mikestef9 Sounds awesome. I'm currently evaluating EKS and the current pod limitation is a blocker for our workload. Could you please share an approximate release date? Thanks.

@sstoyanovucsd

@mikestef9, is it possible to optionally take sig-scalability's defined thresholds into account and limit the max pods per node on a managed nodegroup to min(110, 10*#cores)?

Reference: https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md

@mikestef9
Contributor

mikestef9 commented Sep 2, 2021

What problem are you trying to solve by having us change to that formula? You think 110 is too high for an instance type like m5.large? This feature is targeted at such users of m5.large where the previous limit of 29 was too low.

The max pods formula for MNG is now:

<=30 vCPUs: min(110, max IPs based on CNI settings)
>30 vCPUs: min(250, max IPs based on CNI settings)

This is based on internal testing done by our scalability team. However, it's impossible to simulate all possible combinations of real-world workloads. As a best practice, you should be setting resource requests/limits on your pods. The point is that the IP address count is no longer the limiting factor for pods per node when using prefix assignment.
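
For illustration, a minimal shell sketch of that recommendation (my paraphrase of the formula above, not the official calculator; the example inputs are assumptions):

# $1 = vCPUs of the instance type, $2 = max pods allowed by the CNI/IP settings
recommended_max_pods() {
  local vcpus=$1 cni_max=$2
  local cap=110
  if [ "$vcpus" -gt 30 ]; then cap=250; fi
  if [ "$cni_max" -lt "$cap" ]; then echo "$cni_max"; else echo "$cap"; fi
}
recommended_max_pods 2 29    # e.g. m5.large without prefix delegation -> 29
recommended_max_pods 2 432   # e.g. m5.large with prefix delegation (assumed IP limit) -> 110
recommended_max_pods 64 737  # e.g. r4.16xlarge -> 250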

@sstoyanovucsd

I understand that this feature solves the issue of too few pods being allowed on a node that can potentially handle more. Depending on the type of workloads, the opposite may also be needed, i.e. setting max pods on the node to less than what the IP/ENI limit would impose. Setting maximums like the 110 and 250 is a good start, but it would be much better if it was a nodegroup setting that one can use to self-restrict nodes to a lower number.

We do set requests/limits per pod, but running at high pod densities leaves few resources to be shared by burstable workloads. For example, some Java apps need the extra resource buffer to scale up as opposed to out. When there are too many of these on a single node, memory pressure causes pods to get evicted from the node. While this is normal behavior, the startup time of such pods is not the best, so we'd rather prevent such occurrences as much as possible.

@mikestef9
Contributor

mikestef9 commented Sep 2, 2021

Understood, that makes sense. Today, you can override the max pods setting when using managed node groups, but it requires extra effort. You need to use a launch template, specify the EKS AMI ID as the "custom" image ID in the LT, then manually add the bootstrap script in user data, like

#!/bin/bash
set -ex
/etc/eks/bootstrap.sh my-cluster --kubelet-extra-args "--max-pods=25"

I think it's a valid feature request to expose max pods directly through the MNG API; can you open a separate containers roadmap issue with that request?

Side note - this will be much easier with native Bottlerocket support in managed node groups #950, which is coming soon. You'll simply need to add the following in the launch template user data (no need to set the image ID in LT)

[settings.kubernetes]
max-pods = 25

@sstoyanovucsd

Request submitted: #1492

Thanks!

@stevehipwell

@sstoyanovucsd terraform-aws-modules/terraform-aws-eks has a working pattern that doesn't require a custom AMI and terraform-aws-modules/terraform-aws-eks#1433 shows how to optimise this as well as set other bootstrap.sh options.

@mikestef9
Contributor

The blog post that dives into this feature in more detail is out:

https://aws.amazon.com/blogs/containers/amazon-vpc-cni-increases-pods-per-node-limits/

@gpothier

gpothier commented Sep 6, 2021

Thanks @mikestef9 !
Quick question: how do I troubleshoot the Managed Node Group not updating the max pods per node configuration? I have the 1.9.0 CNI plugin (through the addon), I added the ENABLE_PREFIX_DELEGATION and WARM_PREFIX_TARGET values to the aws-node DaemonSet, and I deleted and recreated the MNG, but my max pods per node is still 17 (on t3.medium instances).

gpothier@tadzim4:~ (⎈ |ecaligrafix-playground-eks:default)$ kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2
amazon-k8s-cni-init:v1.9.0-eksbuild.1
amazon-k8s-cni:v1.9.0-eksbuild.1
gpothier@tadzim4:~ (⎈ |ecaligrafix-playground-eks:default)$ kubectl describe daemonset -n kube-system aws-node | grep ENABLE_PREFIX_DELEGATION
      ENABLE_PREFIX_DELEGATION:            true
gpothier@tadzim4:~ (⎈ |ecaligrafix-playground-eks:default)$ kubectl describe daemonset -n kube-system aws-node | grep WARM_PREFIX_TARGET
      WARM_PREFIX_TARGET:                  1
gpothier@tadzim4:~ (⎈ |ecaligrafix-playground-eks:default)$ kubectl describe node |grep pods
  pods:                        17
  pods:                        17
  Normal  NodeAllocatableEnforced  24m                kubelet     Updated Node Allocatable limit across pods
  pods:                        17
  pods:                        17
  Normal  NodeAllocatableEnforced  24m                kubelet     Updated Node Allocatable limit across pods
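
(For anyone following along, a minimal sketch of how those DaemonSet values are typically set, assuming the standard aws-node DaemonSet in kube-system and VPC CNI 1.9+:)

kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true WARM_PREFIX_TARGET=1
# Verify the values were picked up:
kubectl describe daemonset aws-node -n kube-system | grep -E "ENABLE_PREFIX_DELEGATION|WARM_PREFIX_TARGET"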

@stevehipwell

@gpothier have you updated the kubelet args to override the defaults?

@gpothier

gpothier commented Sep 6, 2021

@stevehipwell No I haven't, but according to the blog post @mikestef9 linked, the MNG should take care of that:

As part of this launch, we’ve updated EKS managed node groups to automatically calculate and set the recommended max pod value based on instance type and VPC CNI configuration values, as long as you are using at least VPC CNI version 1.9

Or did I misunderstand something?

@stevehipwell

@gpothier sorry I hadn't read the blog post, I'll leave this one to @mikestef9.

@stevehipwell

@mikestef9 what happens when we're using custom networking and ENI prefixes with the official AMI? We manually set USE_MAX_PODS=false in the env and add --max-pods to KUBELET_EXTRA_ARGS to both be picked up by bootstrap.sh.
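
In case it helps others, a sketch of that user data pattern (cluster name and max pods value are placeholders; this assumes bootstrap.sh's --use-max-pods flag as the equivalent of setting USE_MAX_PODS in the environment):

#!/bin/bash
set -ex
# Stop bootstrap.sh from applying its own max pods value and pass an explicit kubelet flag instead
/etc/eks/bootstrap.sh my-cluster \
  --use-max-pods false \
  --kubelet-extra-args "--max-pods=110"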

@stevehipwell

@mikestef9 could you also confirm that the other EKS ecosystem components work correctly with ENABLE_PREFIX_DELEGATION set? I'm specifically thinking of the aws-load-balancer-controller, but it'd be good to know that NTH and the CSI drivers have all been tested and work correctly.

@thanhma

thanhma commented Sep 6, 2021

@stevehipwell I tested AWS Load Balancer Controller v2.2 on an ENABLE_PREFIX_DELEGATION-enabled cluster and haven't seen any problems yet.

@mikestef9
Contributor

mikestef9 commented Sep 7, 2021

Support for prefix delegation was added in v2.2.2 of the LB controller:

https://github.com/kubernetes-sigs/aws-load-balancer-controller/releases/tag/v2.2.2

@gpothier are you specifying an image id in a launch template used with the managed node group?

@stevehipwell all of the VPC CNI settings that may affect max pods are taken into account, including custom networking

@gpothier

gpothier commented Sep 7, 2021

@mikestef9 I didn't create the launch template explicitly, so I didn't specify an image id myself, but the launch template does exist and its image id is ami-0bb07d9c8d6ca41e8. The cluster and node group were created by terraform, using the terraform-aws-eks module.

@mikestef9
Contributor

I'm not very familiar with the Terraform EKS module. But if it is creating a launch template and specifying an image id (even if it's the official EKS AMI image id), that's considered a custom AMI to managed node groups, and the max pods override won't be set.

@stevehipwell

I'm not very familiar with the Terraform EKS module. But if it is creating a launch template and specifying an image id (even if it's the official EKS AMI image id), that's considered a custom AMI to managed node groups, and the max pods override won't be set.

Thanks @mikestef9 this is actually the answer I needed to my above question.

@gpothier

gpothier commented Sep 7, 2021

Thanks a lot @mikestef9. As far as I can tell, the launch templates were created by the MNG, not by terraform. The node_groups submodule of the terraform-aws-eks module has the create_launch_template option set to false by default (and I do not override it). And I checked that there is no mention of the node groups' launch templates in the terraform state (the ones that appear here are used by the NAT gateways):

gpothier@tadzim4:~/ownCloud-Caligrafix/dev/ecaligrafix/infrastructure $ terraform-1.0.3 state list |grep launch_template
aws_launch_template.nat_gateway_template[0]
aws_launch_template.nat_gateway_template[1]
gpothier@tadzim4:~/ownCloud-Caligrafix/dev/ecaligrafix/infrastructure $ 

Also, in the AWS console, the node groups' launch templates appear to have been created by the MNG: the Created by field says "arn:aws:sts::015328124252:assumed-role/AWSServiceRoleForAmazonEKSNodegroup/EKS".

@gpothier

Hi @mikestef9, do you think you could give me a pointer on how to troubleshoot the Managed Node Group not updating the max pods per node configuration? Given that, as far as I can tell, I meet all the requirements (in particular, the launch template is the one created by the MNG, so I don't have control over it), I am a bit at a loss.

@mikestef9
Contributor

Do you have multiple instance types specified in the managed node group? If so, MNG uses the minimum value calculated across all instance types. So if you have a non-Nitro instance like m4.2xlarge, for example, the node group will use 58 as the max pods value.

@gpothier

gpothier commented Sep 10, 2021

Thanks @mikestef9 that was it! Although all the existing instances were indeed Nitro (t3.medium), the allowed instances included non-nitro ones. I recreated the MNG allowing only t3.medium and t3.small instances and the pod limit is now 110.

This raises a question though: shouldn't the max pods per node property be set independently for each node, according to the node's capacity?

@mikestef9
Contributor

mikestef9 commented Sep 10, 2021

Glad to hear it. Managed node groups must specify the max pods value as part of the launch template that we create behind the scenes for each node group. That launch template is associated with an autoscaling group that we also create. The autoscaling group gets assigned the list of desired instance types, but there is no way to know ahead of time which instance type the ASG will spin up. So to be safe, we pick the lowest value of all instance types in the list.

@lwimmer

lwimmer commented Sep 13, 2021

Wouldn't it be much better to determine the max pod value during bootstrapping of the node (i.e. in the bootstrap.sh)?

In this case it would work with different node types, because each node type could get the appropriate max pod value.

@mikestef9
Contributor

mikestef9 commented Sep 13, 2021

The recommended value of max pods is a function of instance type and the version/configuration of VPC CNI running on the cluster. At the time of bootstrapping, we don't have a way to determine the latter. We can't make a call to the API server (k get ds aws-node) and retrieve the CNI version/settings because calls from there will not be authenticated until the aws-auth config map is updated first.

@lwimmer

lwimmer commented Sep 14, 2021

The recommended value of max pods is a function of instance type and the version/configuration of VPC CNI running on the cluster. At the time of bootstrapping, we don't have a way to determine the latter. We can't make a call to the API server (k get ds aws-node) and retrieve the CNI version because calls from there will not be authenticated until the aws-auth CM is updated first.

I see. Thank you for the explanation.

@stevehipwell

@mikestef9 it looks like the AMI bootstrap hasn't been updated to work correctly with this change, and if used on a small instance it could cause resource issues for kubelet.

awslabs/amazon-eks-ami#782

@stevehipwell

@mikestef9 related to my comment above, how come EKS has decided to go over the K8s large clusters guide recommendation of a maximum of 110 pods per node?
