Cluster autoscaler improvements for AI workloads #5170
Comments
/cc |
Thanks for creating this issue, I believe this is an important use case Cluster Autoscaler should support. Some of the pain points you're raising require fundamental changes to how CA operates, but some should already work properly with the right setup. Let me go through them one by one:
Why does CA wait for tens of minutes in this scenario? CA runs iterations of its main loop at a fixed interval, so I guess this has something to do with the coscheduling plugin keeping pods from reaching CA? This plugin isn't really compatible with CA, which is one of the reasons why it wasn't moved to the in-tree plugins; see the discussion in kubernetes/kubernetes#105802.
I think this is one of the reasons behind https://github.com/kubernetes-sigs/kueue. We should make sure it works well with CA.
Yup, no way around that without making CA understand some pods should be grouped.
I don't think this one is true. GPU nodes are evaluated based on GPU utilization, not cpu/memory utilization: autoscaler/cluster-autoscaler/core/scaledown/eligibility/eligibility.go Lines 158 to 168 in 4ff4903
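To illustrate the point about GPU nodes, here is a minimal sketch of that kind of check. It is not the code from `eligibility.go`: the `NodeUsage` type, the `Utilization` and `BelowThreshold` functions, and the choice of the larger of CPU and memory for non-GPU nodes are all assumptions made for illustration.

```go
package main

// NodeUsage is a hypothetical summary of requested vs. allocatable resources
// on one node. It is not a Cluster Autoscaler type.
type NodeUsage struct {
	CPURequested, CPUAllocatable       int64 // millicores
	MemoryRequested, MemoryAllocatable int64 // bytes
	GPURequested, GPUAllocatable       int64 // whole devices
}

// fraction returns requested/allocatable, or 0 when nothing is allocatable.
func fraction(requested, allocatable int64) float64 {
	if allocatable <= 0 {
		return 0
	}
	return float64(requested) / float64(allocatable)
}

// Utilization returns the value compared against a scale-down threshold and
// the name of the resource it was derived from. Nodes that expose GPUs are
// judged on GPU requests only; other nodes (an assumption in this sketch) are
// judged on the larger of their CPU and memory fractions.
func Utilization(n NodeUsage) (float64, string) {
	if n.GPUAllocatable > 0 {
		return fraction(n.GPURequested, n.GPUAllocatable), "gpu"
	}
	cpu := fraction(n.CPURequested, n.CPUAllocatable)
	mem := fraction(n.MemoryRequested, n.MemoryAllocatable)
	if cpu >= mem {
		return cpu, "cpu"
	}
	return mem, "memory"
}

// BelowThreshold reports whether the node's utilization is strictly below the
// given threshold, i.e. whether it would be considered for scale-down here.
func BelowThreshold(n NodeUsage, threshold float64) bool {
	value, _ := Utilization(n)
	return value < threshold
}
```

With this shape, a node with 4 of 8 GPUs requested and nearly idle CPUs reports a utilization of 0.5 from `"gpu"`, so low CPU usage alone would not make it a scale-down candidate.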
CA drains nodes one at a time, empty nodes (i.e. containing only daemonsets) are removed in bulk. Parallel drain is WIP though, this is tracked in #5079
There are multiple expanders to choose from, so yes, this can be complex (except in a managed setting where CA flags are already fine-tuned). However, even the random expander shouldn't cause overprovisioning. If that happens, it could be due to a bug in the cloudprovider-specific code.
I think the most CA-friendly way of addressing the fundamental CA compatibility problem is through some new k8s API representing a group of pods. If CA understood such an API, it could trigger a scale-up for all of them in a single go (or error out, e.g. due to lack of quota). kubernetes/enhancements#3371 looks promising, but autoscaling support hasn't been fully fleshed out yet. |
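As a rough illustration of "scale up for the whole group in a single go, or error out", the sketch below sizes one all-or-nothing request for a group of identical GPU pods. Everything in it is hypothetical: `PodGroup`, `NodesForGroup`, the error values, and the node-count quota are invented for this example and do not correspond to the API proposed in kubernetes/enhancements#3371 or to any CA code.

```go
package main

import "errors"

// PodGroup is a hypothetical description of pods that must start together.
type PodGroup struct {
	Name       string
	MinMember  int // number of pods that must all be schedulable
	GPUsPerPod int // GPUs requested by each pod
}

// Hypothetical error values for the ways a group request can fail.
var (
	ErrInvalidGroup  = errors.New("pod group must have positive MinMember and GPUsPerPod")
	ErrPodTooLarge   = errors.New("a single pod does not fit on one node")
	ErrQuotaExceeded = errors.New("pod group needs more nodes than the remaining quota allows")
)

// NodesForGroup returns how many new nodes of one shape are needed to fit the
// whole group, assuming the pods are identical and GPUs are the only
// constraint. It never returns a partial answer: if the full group cannot be
// satisfied within remainingNodeQuota, it returns an error instead.
func NodesForGroup(g PodGroup, gpusPerNode, remainingNodeQuota int) (int, error) {
	if g.MinMember <= 0 || g.GPUsPerPod <= 0 {
		return 0, ErrInvalidGroup
	}
	podsPerNode := gpusPerNode / g.GPUsPerPod
	if podsPerNode <= 0 {
		return 0, ErrPodTooLarge
	}
	// Ceiling division: the last node may be only partially filled.
	nodes := (g.MinMember + podsPerNode - 1) / podsPerNode
	if nodes > remainingNodeQuota {
		return 0, ErrQuotaExceeded
	}
	return nodes, nil
}
```

For example, 16 pods at 2 GPUs each on 8-GPU nodes need 4 nodes. With a remaining quota of 3 nodes, the call fails up front rather than provisioning 3 nodes that the group could never use.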
Some supplements:
We just proposed a KEP, kubernetes/enhancements#3521, to make scheduling switchable; I think it would also help.
This also looks like a gang-scheduling capability: we have the coscheduling plugin, and a new pod group API is being designed, see kubernetes/enhancements#3371.
And yes, k-sigs/kueue is also helpful for job queueing with limited resources. Close integration with autoscaling is one of our goals. |
We had the same issue. In our case, users deploy more than 2k Spark pods on k8s at the same time, and it takes CA about 20 minutes to scale out. As a workaround, we use another UID taken from the pod labels. |
I think instead of custom labels, CA should just use a hash of all relevant Pod fields. Today there's a comparison function: autoscaler/cluster-autoscaler/utils/utils.go Lines 64 to 101 in e8d3e9b
It should be quite straightforward to calculate a hash in a similar manner instead of relying on owner refs as hints for similarity. That'd also have an extra optimization benefit of considering identical pods as similar even if they belong to completely different controllers. |
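A minimal sketch of the hashing idea follows. It does not use the real Kubernetes Pod type or the comparison function in `utils.go`: `SchedulingFields`, `SimilarityKey`, and `GroupBySimilarity` are hypothetical names, and which fields count as "relevant" is an assumption made here for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"sort"
)

// SchedulingFields is a hypothetical, simplified stand-in for the Pod fields
// that affect where a pod can be scheduled. It is not a Kubernetes API type.
type SchedulingFields struct {
	Requests      map[string]int64  `json:"requests"`
	NodeSelector  map[string]string `json:"nodeSelector"`
	Tolerations   []string          `json:"tolerations"`
	PriorityClass string            `json:"priorityClass"`
}

// SimilarityKey returns a hex SHA-256 digest of the fields. Pods with equal
// keys can be treated as interchangeable for scale-up simulation, regardless
// of which controller owns them.
func SimilarityKey(f SchedulingFields) (string, error) {
	// Sort a copy of the tolerations so their order does not change the key
	// and the caller's slice is left untouched.
	tolerations := append([]string(nil), f.Tolerations...)
	sort.Strings(tolerations)
	f.Tolerations = tolerations

	// encoding/json writes map keys in sorted order, so equal maps encode
	// identically. Note that nil and empty maps encode differently (null vs
	// {}); a fuller version would normalize them first.
	encoded, err := json.Marshal(f)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(encoded)
	return hex.EncodeToString(sum[:]), nil
}

// GroupBySimilarity buckets pod names by their similarity key.
func GroupBySimilarity(pods map[string]SchedulingFields) (map[string][]string, error) {
	groups := make(map[string][]string)
	for name, fields := range pods {
		key, err := SimilarityKey(fields)
		if err != nil {
			return nil, err
		}
		groups[key] = append(groups[key], name)
	}
	return groups, nil
}
```

Grouping by such a key means one representative pod per group has to be checked against each node group, which is where the savings for thousands of identical pods would come from.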
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned |
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this:
Which component are you using?:
Cluster autoscaler component which scales Kubernetes cluster
Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:
We see that the current cluster autoscaler (CA) reacts to pending pods. This may not work well for AI/HPC workloads. We outline the scenarios below:
Describe the solution you'd like.:
Looking for feedback on a better solution to have holistic autoscaling for large AI/HPC workloads
Describe any alternative solutions you've considered.:
We have not encountered any alternative solution yet. Please share if you have a solution to the above shortcomings.
Additional context.:
I had opened the same issue here: kubernetes/community#6840 which will be closed.