
[EKS] Increased pod density on smaller instance types #138

Closed
tabern opened this issue Jan 30, 2019 · 83 comments
Labels: EKS Amazon Elastic Kubernetes Service

@tabern
Contributor

tabern commented Jan 30, 2019

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request
All instance types using the VPC CNI plugin should support at least the Kubernetes recommended pods per node limits.

Which service(s) is this request for?
EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
Today, the max number of pods that can run on worker nodes using the VPC CNI plugin is limited by the number of ENIs and secondary IPv4 addresses the instance supports. This number is lower if you are using CNI custom networking, which removes the primary ENI for use by pods. VPC CNI should support at least the Kubernetes recommended pods per node thresholds, regardless of networking mode. Not supporting these maximums means nodes may run out of IP addresses before CPU/memory is fully utilized.

Are you currently working around this issue?
Using larger instance types, or adding more nodes to a cluster that aren't fully utilized.

Additional context
Take the m5.2xlarge for example, which has 8 vCPUs. Based on the Kubernetes recommended limit of min(110, 10*#cores) pods per node, this instance type should support 80 pods. However, when using custom networking today, it only supports 44 pods.
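
For reference, a rough sketch of the arithmetic behind these numbers, assuming the usual ENI-based max pods formula (ENIs x (IPv4 addresses per ENI - 1) + 2) and m5.2xlarge limits of 4 ENIs with 15 IPv4 addresses each:

# m5.2xlarge: assumed 4 ENIs, 15 IPv4 addresses per ENI
echo $(( 4 * (15 - 1) + 2 ))        # 58 max pods with the default VPC CNI configuration
# With CNI custom networking, the primary ENI is no longer available for pods:
echo $(( (4 - 1) * (15 - 1) + 2 ))  # 44 max pods
# Kubernetes guidance for an 8 vCPU node: min(110, 10 * 8) = 80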

Edit Feature is released: https://aws.amazon.com/blogs/containers/amazon-vpc-cni-increases-pods-per-node-limits/

@tabern tabern added the EKS Amazon Elastic Kubernetes Service label Jan 30, 2019
@ghost

ghost commented Jan 31, 2019

@tabern could you please elaborate a bit on what this feature brings?

Right now the number of pods on a single node is limited by the --max-pods flag in kubelet, which for EKS is calculated based on the max number of IP addresses the instance can have. This comes from the AWS CNI driver logic of providing an IP address per pod from the VPC subnet. So for r4.16xl it is 737 pods.
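
(For context, the 737 figure follows the same ENI-based arithmetic sketched above, assuming r4.16xlarge supports 15 ENIs with 50 IPv4 addresses each:)

echo $(( 15 * (50 - 1) + 2 ))  # 737 max pods for r4.16xlarge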

@max-rocket-internet

which for EKS is calculated based on the max number of IP addresses the instance can have

That's exactly the problem. What if we want to run 30 very small pods on a t.small?

@ghost

ghost commented Feb 4, 2019

@max-rocket-internet gotcha. Does it mean instances will get more IPs/ENIs, or that changes are coming to the CNI?

@max-rocket-internet

It means we need to run a different CNI that is not limited by the number of IPs. Currently it is more or less a DIY endeavour, but it would be great to have a supported CNI from AWS for this use case 🙂

@laverya

laverya commented Feb 14, 2019

Yeah, running weave-net (and overriding the pods-per-node limitations) isn't much of an additional maintenance burden but it would have been nice to have that available by default.

@lgg42

lgg42 commented Feb 20, 2019

Any idea how exactly you are going to proceed with this one?
It seems very similar to #71.

@tabern
Contributor Author

tabern commented Jul 4, 2019

Sorry it's been a bit of time without a lot of information. We're committed to enabling this feature and will be wrapping this into the next generation VPC CNI plugin.

Please let us know what you think on #398

@tabern tabern closed this as completed Jul 4, 2019
@mikestef9 mikestef9 changed the title High-density pod scheduling [EKS] Increased pod density on smaller instance types Apr 28, 2020
@mikestef9 mikestef9 reopened this Apr 28, 2020
@mikestef9 mikestef9 self-assigned this Apr 28, 2020
@gitnik

gitnik commented Nov 2, 2020

The comment by @mikestef9 on #398 refers to this issue for updates regarding the specific issue of pod-density. Since there has been no update on this issue in over a year, could someone from the EKS team give us an update?

@mikestef9
Contributor

We are working on integrating with an upcoming VPC feature that will allow many more IP addresses to be attached per instance type. For example, a t3.medium will go from allowing 15 IPs per instance, to 240, a 1500% increase. No timeline to share, but it is a high priority for the team.
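
A rough sketch of where those numbers come from, assuming a t3.medium exposes 3 ENIs with 6 IPv4 slots each and that prefix assignment places a /28 prefix (16 addresses) in each secondary slot:

# The primary address of each ENI is not available for pods:
echo $(( 3 * (6 - 1) ))       # 15 pod IPs per instance today
# With /28 prefix assignment, each secondary slot holds 16 addresses:
echo $(( 3 * (6 - 1) * 16 ))  # 240 pod IPs per instance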

@bambooiris

@mikestef9 hi! Will pod density be increased for bigger instance types as well?
This is very important because we are thinking of switching to a different CNI plugin, but if you increase the IP address count any time soon we will stay with the AWS CNI :)

@mikestef9
Contributor

It will be a 1500% increase in IP addresses on every instance type. However, I don't feel that matters on larger instance types. For example, a c5.4xl today supports 234 IP addresses for pods. Which particular instance type are you using?

@bambooiris

We are using m5.xlarge and still have enough resources to schedule additional pods, but we are out of free IPs.

@mikestef9
Contributor

Got it. I consider "smaller" to mean any instance type 2xl and below. In this case, m5.xlarge will go from supporting 56 IPs to 896, which will be more than enough for pods to consume all instance resources.

@billinghamj

Pods can be very very small 😉 But nevertheless, this is a great step

@billinghamj

Just to get clarity: this is 16x the IPs while still using IPv4? Whereas longer term, for huge numbers of IPs etc., it's expected that EKS will shift to IPv6 instead?

@mikestef9
Contributor

Exactly. The same upcoming EC2/VPC feature that will allow us to increase IPv4s per instance will also allow us to allocate a /80 IPv6 address block per instance. That's what we will leverage for IPv6 support, which is a top priority for us in 2021.

@davidroth

We are working on integrating with an upcoming VPC feature that will allow many more IP addresses to be attached per instance type. For example, a t3.medium will go from allowing 15 IPs per instance, to 240, a 1500% increase. No timeline to share, but it is a high priority for the team.

@mikestef9 Sounds awesome. I'm currently evaluating EKS and the current pod limitation is a blocker for our workload. Could you please share an approximate release date? Thanks.

@sstoyanovucsd

@mikestef9, is it possible to optionally take sig-scalability's defined thresholds into account and limit the max pods per node on a managed nodegroup to min(110, 10*#cores)?

Reference: https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md

@mikestef9
Contributor

mikestef9 commented Sep 2, 2021

What problem are you trying to solve by having us change to that formula? You think 110 is too high for an instance type like m5.large? This feature is targeted at such users of m5.large where the previous limit of 29 was too low.

The max pods formula for MNG is now:

<=30 vCPUs: min(110, max IPs based on CNI settings)
>30 vCPUs: min(250, max IPs based on CNI settings)

This is based on internal testing done by our scalability team. However, it's impossible to simulate all possible combinations of real-world workloads. As a best practice, you should be setting resource requests/limits on your pods. The point is that the IP address count is no longer the limiting factor for pods per node when using prefix assignment.
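
For illustration, a minimal shell sketch of that recommendation (my paraphrase of the formula above, not the official calculator; the example inputs are assumptions):

# $1 = vCPUs of the instance type, $2 = max pods allowed by the CNI/IP settings
recommended_max_pods() {
  local vcpus=$1 cni_max=$2
  local cap=110
  if [ "$vcpus" -gt 30 ]; then cap=250; fi
  if [ "$cni_max" -lt "$cap" ]; then echo "$cni_max"; else echo "$cap"; fi
}
recommended_max_pods 2 29    # e.g. m5.large without prefix delegation -> 29
recommended_max_pods 2 432   # e.g. m5.large with prefix delegation (assumed IP limit) -> 110
recommended_max_pods 64 737  # e.g. r4.16xlarge -> 250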

@sstoyanovucsd

I understand that this feature solves the issue of too few pods being allowed on a node that can potentially handle more. Depending on the type of workloads, the opposite may also be needed, i.e. setting max pods on the node to less than what the IP/ENI limit would impose. Setting maximums like the 110 and 250 is a good start, but it would be much better if it was a nodegroup setting that one can use to self-restrict nodes to a lower number.

We do set requests/limits per pod, but running at high pod densities leaves few resources to be shared by burstable workloads. For example, some Java apps need the extra resource buffer to scale up as opposed to out. When there are too many of these on a single node, memory pressure causes pods to get evicted from the node. While this is normal behavior, the startup time of such pods is not the best, so we'd rather prevent such occurrences as much as possible.

@mikestef9
Contributor

mikestef9 commented Sep 2, 2021

Understood, that makes sense. Today, you can override the max pods setting when using managed node groups, but it requires extra effort. You need to use a launch template, specify the EKS AMI ID as the "custom" image ID in the LT, then manually add the bootstrap script in user data, like

#!/bin/bash
set -ex
/etc/eks/bootstrap.sh my-cluster --kubelet-extra-args "--max-pods=25"

I think it's a valid feature request to expose max pods directly through the MNG API; can you open a separate containers roadmap issue with that request?

Side note - this will be much easier with native Bottlerocket support in managed node groups #950, which is coming soon. You'll simply need to add the following in the launch template user data (no need to set the image ID in LT)

[settings.kubernetes]
max-pods = 25

@sstoyanovucsd

Request submitted: #1492

Thanks!

@stevehipwell

@sstoyanovucsd terraform-aws-modules/terraform-aws-eks has a working pattern that doesn't require a custom AMI and terraform-aws-modules/terraform-aws-eks#1433 shows how to optimise this as well as set other bootstrap.sh options.

@mikestef9
Contributor

The blog post that dives into this feature in more detail is out:

https://aws.amazon.com/blogs/containers/amazon-vpc-cni-increases-pods-per-node-limits/

@gpothier

gpothier commented Sep 6, 2021

Thanks @mikestef9 !
Quick question: how do I troubleshoot the Managed Node Group not updating the max pods per node configuration? I have the 1.9.0 CNI plugin (through the addon), I added the ENABLE_PREFIX_DELEGATION and WARM_PREFIX_TARGET values to the aws-node DaemonSet, and I deleted and recreated the MNG, but my max pods per node is still 17 (on t3.medium instances).

gpothier@tadzim4:~ (⎈ |ecaligrafix-playground-eks:default)$ kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2
amazon-k8s-cni-init:v1.9.0-eksbuild.1
amazon-k8s-cni:v1.9.0-eksbuild.1
gpothier@tadzim4:~ (⎈ |ecaligrafix-playground-eks:default)$ kubectl describe daemonset -n kube-system aws-node | grep ENABLE_PREFIX_DELEGATION
      ENABLE_PREFIX_DELEGATION:            true
gpothier@tadzim4:~ (⎈ |ecaligrafix-playground-eks:default)$ kubectl describe daemonset -n kube-system aws-node | grep WARM_PREFIX_TARGET
      WARM_PREFIX_TARGET:                  1
gpothier@tadzim4:~ (⎈ |ecaligrafix-playground-eks:default)$ kubectl describe node |grep pods
  pods:                        17
  pods:                        17
  Normal  NodeAllocatableEnforced  24m                kubelet     Updated Node Allocatable limit across pods
  pods:                        17
  pods:                        17
  Normal  NodeAllocatableEnforced  24m                kubelet     Updated Node Allocatable limit across pods
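
(For anyone following along, a minimal sketch of how those DaemonSet values are typically set, assuming the standard aws-node DaemonSet in kube-system and VPC CNI 1.9+:)

kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true WARM_PREFIX_TARGET=1
# Verify the values were picked up:
kubectl describe daemonset aws-node -n kube-system | grep -E "ENABLE_PREFIX_DELEGATION|WARM_PREFIX_TARGET"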

@stevehipwell

@gpothier have you updated the kubelet args to override the defaults?

@gpothier

gpothier commented Sep 6, 2021

@stevehipwell No I haven't, but according to the blog post @mikestef9 linked, the MNG should take care of that:

As part of this launch, we’ve updated EKS managed node groups to automatically calculate and set the recommended max pod value based on instance type and VPC CNI configuration values, as long as you are using at least VPC CNI version 1.9

Or did I misunderstand something?

@stevehipwell

@gpothier sorry I hadn't read the blog post, I'll leave this one to @mikestef9.

@stevehipwell

@mikestef9 what happens when we're using custom networking and ENI prefixes with the official AMI? We manually set USE_MAX_PODS=false in the env and add --max-pods to KUBELET_EXTRA_ARGS to both be picked up by bootstrap.sh.
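
In case it helps others, a sketch of that user data pattern (cluster name and max pods value are placeholders; this assumes bootstrap.sh's --use-max-pods flag as the equivalent of setting USE_MAX_PODS in the environment):

#!/bin/bash
set -ex
# Stop bootstrap.sh from applying its own max pods value and pass an explicit kubelet flag instead
/etc/eks/bootstrap.sh my-cluster \
  --use-max-pods false \
  --kubelet-extra-args "--max-pods=110"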

@stevehipwell

@mikestef9 could you also confirm that the other EKS ecosystem components work correctly with ENABLE_PREFIX_DELEGATION set? I'm specifically thinking of the aws-load-balancer-controller, but it'd be good to know that NTH and the CSI drivers have all been tested and work correctly.

@thanhma

thanhma commented Sep 6, 2021

@stevehipwell I tested AWS Load Balancer Controller v2.2 on an ENABLE_PREFIX_DELEGATION-enabled cluster and haven't seen any problems yet.

@mikestef9
Contributor

mikestef9 commented Sep 7, 2021

Support for prefix delegation was added in v2.2.2 of the LB controller:

https://github.com/kubernetes-sigs/aws-load-balancer-controller/releases/tag/v2.2.2

@gpothier are you specifying an image id in a launch template used with the managed node group?

@stevehipwell all of the VPC CNI settings that may affect max pods are taken into account, including custom networking

@gpothier

gpothier commented Sep 7, 2021

@mikestef9 I didn't create the launch template explicitly, so I didn't specify an image id myself, but the launch template does exist and its image id is ami-0bb07d9c8d6ca41e8. The cluster and node group were created by terraform, using the terraform-aws-eks module.

@mikestef9
Contributor

I'm not very familiar with the Terraform EKS module. But if it is creating a launch template and specifying an image id (even if it's the official EKS AMI image id), that's considered a custom AMI to managed node groups, and the max pods override won't be set.

@stevehipwell

I'm not very familiar with the Terraform EKS module. But if it is creating a launch template and specifying an image id (even if it's the official EKS AMI image id), that's considered a custom AMI to managed node groups, and the max pods override won't be set.

Thanks @mikestef9 this is actually the answer I needed to my above question.

@gpothier

gpothier commented Sep 7, 2021

Thanks a lot @mikestef9. As far as I can tell, the launch templates were created by the MNG, not by terraform. The node_groups submodule of the terraform-aws-eks module has the create_launch_template option set to false by default (and I do not override it). And I checked that there is no mention of the node groups' launch templates in the terraform state (the ones that appear here are used by the NAT gateways):

gpothier@tadzim4:~/ownCloud-Caligrafix/dev/ecaligrafix/infrastructure $ terraform-1.0.3 state list |grep launch_template
aws_launch_template.nat_gateway_template[0]
aws_launch_template.nat_gateway_template[1]
gpothier@tadzim4:~/ownCloud-Caligrafix/dev/ecaligrafix/infrastructure $ 

Also, in the AWS console, the node groups' launch templates appear to have been created by the MNG: the Created by field says "arn:aws:sts::015328124252:assumed-role/AWSServiceRoleForAmazonEKSNodegroup/EKS".

@gpothier

Hi @mikestef9, do you think you could give me a pointer on how to troubleshoot the Managed Node Group not updating the max pods per node configuration? Given that, as far as I can tell, I meet all the requirements (in particular, the launch template is the one created by the MNG, so I don't have control over it), I am a bit at a loss.

@mikestef9
Contributor

Do you have multiple instance types specified in the managed node group? If so, MNG uses the minimum value calculated across all instance types. So if you have a non-Nitro instance like m4.2xlarge, for example, the node group will use 58 as the max pods value.

@gpothier

gpothier commented Sep 10, 2021

Thanks @mikestef9 that was it! Although all the existing instances were indeed Nitro (t3.medium), the allowed instances included non-nitro ones. I recreated the MNG allowing only t3.medium and t3.small instances and the pod limit is now 110.

This raises a question though: shouldn't the max pods per node property be set independently for each node, according to the node's capacity?

@mikestef9
Contributor

mikestef9 commented Sep 10, 2021

Glad to hear it. Managed node groups must specify the max pods value as part of the launch template that we create behind the scenes for each node group. That launch template is associated with an autoscaling group that we also create. The autoscaling group gets assigned the list of desired instance types, but there is no way to know ahead of time which instance type the ASG will spin up. So to be safe, we pick the lowest value of all instance types in the list.

@lwimmer

lwimmer commented Sep 13, 2021

Wouldn't it be much better to determine the max pod value during bootstrapping of the node (i.e. in the bootstrap.sh)?

In this case it would work with different node types, because each node type could get the appropriate max pod value.

@mikestef9
Contributor

mikestef9 commented Sep 13, 2021

The recommended value of max pods is a function of instance type and the version/configuration of VPC CNI running on the cluster. At the time of bootstrapping, we don't have a way to determine the latter. We can't make a call to the API server (k get ds aws-node) and retrieve the CNI version/settings because calls from there will not be authenticated until the aws-auth config map is updated first.

@lwimmer

lwimmer commented Sep 14, 2021

The recommended value of max pods is a function of instance type and the version/configuration of VPC CNI running on the cluster. At the time of bootstrapping, we don't have a way to determine the latter. We can't make a call to the API server (k get ds aws-node) and retrieve the CNI version because calls from there will not be authenticated until the aws-auth CM is updated first.

I see. Thank you for the explanation.

@stevehipwell

@mikestef9 it looks like the AMI bootstrap hasn't been updated to work correctly with this change, and if used on a small instance it could cause resource issues for kubelet.

awslabs/amazon-eks-ami#782

@stevehipwell

@mikestef9 related to my comment above, how come EKS has decided to go over the K8s large clusters guide recommendation of a maximum of 110 pods per node?
