Improve Node Startup Latency #1099
Comments
💯 for removing the
Here's a timing chart that is pretty steady from a K8s 1.24 cluster with all of the enhancements mentioned above: c6a.4xlarge - i-02ca66ee7f998e438
How did you measure these startup times? And what were the startup times before implementing the changes? I am using the Terraform Karpenter example to start NVIDIA GPU nodes and am currently seeing startup times of 2-3 minutes for a node to be ready using the default EKS AMI.
The measurements were taken using some custom tooling that we'll be open sourcing very soon. You'll be able to run the tooling as a daemonset to capture timing metrics for your nodes in a standardized format like the one shown above. GPU instance types often take much longer to boot within EC2, and I have not focused on exotic instance types like bare metal and GPUs. The timings above are for c6a.4xlarge, but I have tested on m5 and c6i with similar results. I will update this issue once the timing tooling is available. Here are timings from before the optimizations: c5.xlarge - i-08f01b534e902dbf7
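While that tooling isn't available yet, a rough approximation of node-ready latency can be pulled straight from the Kubernetes API. The sketch below is not the tooling referenced above; it only covers the Kubernetes side (it misses EC2 launch/boot time before the kubelet registers the node) and assumes kubectl access and GNU date:

```bash
#!/usr/bin/env bash
# Rough node-ready latency: time from the Node object's creation to its Ready
# condition's last transition. Not the tooling referenced above; this misses
# the EC2 launch/boot time before the kubelet registers the node.
NODE="${1:?usage: $0 <node-name>}"

created=$(kubectl get node "$NODE" -o jsonpath='{.metadata.creationTimestamp}')
ready=$(kubectl get node "$NODE" \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}')

echo "node created:  $created"
echo "node ready:    $ready"
echo "ready latency: $(( $(date -d "$ready" +%s) - $(date -d "$created" +%s) ))s"
```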
Here's a gif of startups taking around 25 seconds by cherry-picking a v1.26 kubelet change to v1.24 (along with all the optimizations mentioned above and auto-scaled with Karpenter).
@bwagner5 how does this compare to Bottlerocket? Also, are we likely to see this change backported for EKS?
big +1 on getting all these improvements to BR too
I still need to do testing on BR, but at least some of the improvements will carry over, like the VPC CNI ones. Some don't apply, like the yum update change. I'll have to see if we could include container caching in BR. I'll get back to you on backporting to EKS.
@bwagner5 |
I tested it with Karpenter, which automatically hardcodes the API URL, CA bundle, and kube-dns IP as params to the EKS bootstrap.sh script. The EKS DescribeCluster call that occurs in the bootstrap.sh script shouldn't take much time though, so I suspect you can get similar results without hardcoding the params, at least for single node launches. You may run into rate limiting on the API call when doing large node scale-outs.
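For reference, a sketch of what hardcoding those parameters in user data looks like with the EKS AL2 bootstrap script; all values are placeholders, and the flag names are the ones bootstrap.sh generally documents:

```bash
# Sketch of user data that hardcodes cluster metadata so bootstrap.sh does not
# need to call eks:DescribeCluster. All values below are placeholders.
/etc/eks/bootstrap.sh my-cluster \
  --apiserver-endpoint 'https://EXAMPLE1234567890.gr7.us-west-2.eks.amazonaws.com' \
  --b64-cluster-ca 'LS0tLS1CRUdJTi...base64-encoded-CA...' \
  --dns-cluster-ip '10.100.0.10'
```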
Understood. Won't it help speed up downloading the aws-node and kube-proxy images if ECR and S3 VPC endpoints are enabled?
@kakarotbyte You may see some improvement using the endpoints, but I wouldn't expect it to be significant (I haven't tested that though). In my tests I used the AMI's cached images of aws-node and kube-proxy.
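For anyone who still wants to test the endpoint route, a sketch of the VPC endpoints typically involved in private ECR pulls is below; the region and all resource IDs are placeholders, and this was not validated as part of this issue:

```bash
# Sketch: VPC endpoints typically needed for ECR image pulls without internet
# egress (region, VPC, subnet, security group, and route table IDs are placeholders).
REGION=us-west-2
aws ec2 create-vpc-endpoint --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface --service-name "com.amazonaws.${REGION}.ecr.api" \
  --subnet-ids subnet-0123456789abcdef0 --security-group-ids sg-0123456789abcdef0
aws ec2 create-vpc-endpoint --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface --service-name "com.amazonaws.${REGION}.ecr.dkr" \
  --subnet-ids subnet-0123456789abcdef0 --security-group-ids sg-0123456789abcdef0
aws ec2 create-vpc-endpoint --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Gateway --service-name "com.amazonaws.${REGION}.s3" \
  --route-table-ids rtb-0123456789abcdef0
```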
@kakarotbyte - I've been lucky enough to have the joy of exchanging ideas and observations with @bwagner5. I can tell you - from my extensive testing - that using those VPC endpoints terminated inside the VPC doesn't give you an appreciable improvement in provisioning speed. But using a super-primitive and minimal HTTP client instead of curl/wget or the [...]
Using a custom-rolled AMI, built on a custom OS, with a custom kernel (we have really specific use-cases and latency requirements at the company I'm with), plus @bwagner5's improvements, I have done some tests. IF:
EXPERIMENT:
RESULT on [...]
FYI I've open sourced the node latency timing tool that I've been using to create the timing charts and emit metrics for my testing here: https://github.com/awslabs/node-latency-for-k8s Would love feedback on how / if this works well on other OS distributions (I've only been using the eks-optimized AL2). |
@bwagner5 - With whatever changes have already been implemented and merged/released, I'm seeing consistent "Node Ready" state at (or below) an average of ~26(ish) seconds on vanilla EKS AL2 nodes. I don't even need my own custom AMI anymore, to be honest. 👏👏 |
Dynamically provisioning nodes quickly is important so that workloads can scale out in response to demand or recover in the event of infrastructure instability. This issue will track node startup latency improvement work on the EKS Optimized AL2 AMI.
Current work:
Cache Common Startup Container Images #938
There are several container images that are commonly needed to bootstrap a node (get it to a Ready state for pods). Caching these images in the AMI removes their pull time from the startup critical path.
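As a quick sanity check, the images already baked into an AMI can be listed on a running node; a minimal sketch, assuming a containerd-based runtime:

```bash
# On a node launched from the AMI, list images already present in the
# containerd image store (k8s.io namespace); anything listed here does not
# need to be pulled during bootstrap.
sudo ctr --namespace k8s.io images ls -q | sort
```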
Disable Startup Yum Update Check #1074
There is currently a yum update run that blocks execution of the eks-bootstrap script. This update check generally results in 0 updates if you are updating AMIs frequently, but the check takes 5-8 seconds because it hydrates the yum cache. It also causes version skew across a cluster, where the same AMI ID may run different software versions depending on when it was launched, which could cause problems with rollback and node churn.
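A minimal way to see what such a check costs against a cold cache is sketched below; this only approximates the startup check, and the exact invocation run at boot may differ:

```bash
# Rough timing of a security update check against a cold yum metadata cache.
# Approximates the cost of the startup check; the boot-time invocation may differ.
sudo yum clean metadata
time sudo yum --security check-update
```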
Remove unnecessary Sleeps in the VPC CNI Initialization aws/amazon-vpc-cni-k8s#2104 (released in v1.12.0)
The VPC CNI had some unnecessary sleeps that resulted in 2-3 seconds of latency during startup. The VPC CNI is required to be fully initialized before pods can be created on a node, so the initialization process should be as fast as possible.
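A crude way to observe CNI initialization latency on a node is to watch for the VPC CNI's config file to appear; a sketch, assuming the usual conflist path and that it runs early in boot (e.g. from user data):

```bash
# Crude on-node measurement of VPC CNI initialization: poll for the CNI config
# file that aws-node writes once it is ready to assign pod IPs. The path is
# the usual VPC CNI conflist location; adjust if your setup differs.
start=$(date +%s%N)
while [ ! -f /etc/cni/net.d/10-aws.conflist ]; do
  sleep 0.05
done
echo "CNI config appeared after $(( ( $(date +%s%N) - start ) / 1000000 )) ms"
```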
Remove init-container from VPC CNI aws/amazon-vpc-cni-k8s#2137
The VPC CNI uses an init container to initialize some kernel settings related to networking that must be run as Privileged. The sequencing of the init container was resulting in some latency on startup: generally, the VPC CNI would take 9-10 seconds to fully initialize. The PR above removes the init container and runs it as a regular container in the pod. This allows some parallelization of the initialization and removes the kubelet sequencing latency, bringing the VPC CNI's full initialization time down to 4 seconds. Half of the remaining latency is the container pulls, which is solved by the caching PR above (#938). Integrating these two PRs together results in a 2 second full initialization of the VPC CNI.
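A quick way to check how the aws-node pod is structured after upgrading (i.e. whether an init container is still present); the label selector is the one the VPC CNI manifests normally use:

```bash
# Show init containers (if any) and regular containers for an aws-node pod.
POD=$(kubectl get pods -n kube-system -l k8s-app=aws-node -o name | head -n 1)
echo "init containers:    $(kubectl get "$POD" -n kube-system -o jsonpath='{.spec.initContainers[*].name}')"
echo "regular containers: $(kubectl get "$POD" -n kube-system -o jsonpath='{.spec.containers[*].name}')"
```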
Add CLUSTER_ENDPOINT parameter to the VPC CNI to avoid kube-proxy race aws/amazon-vpc-cni-k8s#2138
With all the optimizations listed above, a new concern is a race between the VPC CNI and kube-proxy. The VPC CNI uses the kubernetes service cluster IP to reach the kube-apiserver. This wasn't a concern before, since the higher latency of the VPC CNI meant kube-proxy almost always won the race to initialize. After the optimizations, the VPC CNI loses about half of the time and then hangs due to a 5 second timeout on reaching the kube-apiserver. The whole race can be avoided by passing the CLUSTER_ENDPOINT (the kube-apiserver load balancer endpoint) to the VPC CNI to use for initialization. The VPC CNI still needs to wait for kube-proxy to finish before completing the CNI plugin initialization, but the work to get to that point can be parallelized.
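If you want to try this on a CNI version that supports it, setting the variable on the aws-node daemonset looks roughly like the sketch below; the endpoint value is a placeholder for your cluster's API server URL:

```bash
# Sketch: point the VPC CNI directly at the kube-apiserver endpoint so its
# initialization does not depend on kube-proxy having programmed the
# kubernetes service cluster IP yet. Endpoint value is a placeholder.
kubectl set env daemonset/aws-node -n kube-system \
  CLUSTER_ENDPOINT="https://EXAMPLE1234567890.gr7.us-west-2.eks.amazonaws.com"
```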