Add GPU Optimization Proposal #1562
# Cluster Autoscaling Optimization for GPU Clusters
##### Author: Jeffwan

## Introduction
Cluster Autoscaler makes it extremely easy to scale a Kubernetes cluster in response to pod status, and at the same time it automates the cluster so that it runs efficiently.

However, there are still challenges in specific scenarios for accelerated computing clusters. Accelerated computing instances are quite different from normal instances: they are powerful, scalable instances that provide GPU-based parallel computing capabilities.

Accelerated computing nodes are also expensive (an EC2 p3.16xlarge on-demand instance costs $24.48/hr), so scaling the cluster efficiently saves a substantial amount of money. GPU nodes are well suited for computation-heavy tasks such as machine learning and HPC applications. In some cases these workloads cannot be interrupted, which makes scale-down more challenging.

Here are the problems I have found in the current upstream Cluster Autoscaler.

### Problems in Cluster Autoscaler for GPU nodes

#### Accelerator Label
GKE uses the GPU label `cloud.google.com/gke-accelerator` to detect whether a node is an accelerator node; other cloud providers have not defined such a label yet. For CA to behave correctly in a GPU cluster, this GPU label is essential.
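As a rough sketch of where such a label could live, each cloud provider could report its own label and CA could query it. The `GPULabel()` method and `IsGPUNode` helper below are illustrative assumptions tied to proposal item 1 later in this doc, not existing CA API:

```
// Sketch only: a per-provider GPU node label instead of a hard-coded GKE one.
// The GPULabel() accessor and IsGPUNode helper are hypothetical.
package cloudprovider

import apiv1 "k8s.io/api/core/v1"

// CloudProvider fragment: only the hypothetical GPU label accessor is shown.
type CloudProvider interface {
	// GPULabel returns the node label this provider uses to mark GPU nodes,
	// e.g. "cloud.google.com/gke-accelerator" on GKE.
	GPULabel() string
}

// IsGPUNode reports whether the node is an accelerator node for the given provider.
func IsGPUNode(node *apiv1.Node, provider CloudProvider) bool {
	_, found := node.Labels[provider.GPULabel()]
	return found
}
```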
#### Scaling up too much without GPU label support
Pending pods trigger a scale-up decision that increases the number of nodes in a node group. When a new node is added, CA knows it is `upcoming` and that some pods will land on that node once it boots up, so it does not trigger another scale-up for those pods. However, the device plugin only starts after the node becomes Ready, and the extended resource requested by the pods shows up in the node spec even later. CA then concludes that it was wrong: the pods for which it triggered the scale-up cannot actually be scheduled on the node it added for them (because the node does not expose the GPU resource yet, and CA does not know it will be added later). So CA goes and creates more nodes.

The problem is that the GPU node becomes Ready first, but at that point it has no allocatable GPU resources. The root cause could be one of the following:

* The device plugin is not ready.
* The device plugin is ready but has not yet sent the list of devices to the kubelet.
* The kubelet takes time to advertise the resources to the API server, so the node spec is not up to date.

Pods that request GPUs cannot be scheduled during this period. To address this issue, we either need to work with the NVIDIA folks to change the device plugin behavior upstream, or make CA mark nodes that carry the GPU label but have no allocatable GPUs as NotReady.
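As a hedged illustration of the second option, CA could flag nodes that carry the GPU label but do not yet advertise allocatable GPUs and keep treating them as not ready for scheduling purposes. The sketch below captures the idea only; `NodeHasUnreadyGPU` and its package are made up here, while `nvidia.com/gpu` is the extended resource name advertised by the NVIDIA device plugin:

```
// Sketch only: detect GPU-labeled nodes whose GPUs are not yet allocatable,
// so CA can treat them as still upcoming instead of schedulable.
package gpu

import apiv1 "k8s.io/api/core/v1"

// ResourceNvidiaGPU is the extended resource name advertised by the NVIDIA device plugin.
const ResourceNvidiaGPU = "nvidia.com/gpu"

// NodeHasUnreadyGPU returns true when the node is labeled as a GPU node but
// the device plugin has not yet made any GPUs allocatable on it.
func NodeHasUnreadyGPU(node *apiv1.Node, gpuLabel string) bool {
	if _, labeled := node.Labels[gpuLabel]; !labeled {
		return false
	}
	allocatable, found := node.Status.Allocatable[ResourceNvidiaGPU]
	return !found || allocatable.IsZero()
}
```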
#### GPU resources are not given consideration in scale-down
In every loop, CA filters scale-down candidates by low node CPU and memory utilization:

* If the CPU or memory utilization rate is high, scale-down will not be triggered even if all GPUs are idle.
> Most users add taints to their GPU node groups that prevent non-GPU pods from running there. Note that even without autoscaling, if you don't do that, your GPU job may not run at all if too many non-GPU pods are scheduled on an unprotected node with idle GPUs. If we're talking only about GPU workloads that could be spread across other nodes blocking scale-down, it can be solved by raising the utilization threshold.

> The problem with the "economic" solution of mixing GPU and non-GPU pods on the same node is that unless you very carefully control scheduling, there's always a risk that a GPU will end up idle, which is a rather expensive way of running non-GPU pods. Either way, I still don't see a problem here that couldn't be solved with just safe-to-evict and raising the utilization threshold.

> I agree with the safe-to-evict solution for training jobs. GPU resources are not even considered in the utilization calculation here, so raising the utilization threshold won't help this case (see `autoscaler/cluster-autoscaler/simulator/cluster.go`, lines 155 to 164 at 14ed6ec). I think this is a part we can improve.

> To be clear: this comment is on a section of the doc concerned with idle GPUs plus other pods using up CPU/memory above the threshold. I'm fairly certain raising the threshold will help with that. I'm not sure what kind of benefit is to be gained from adding GPU utilization into the mix, though, other than consistency?

> @seh Your argument applies to any workload, regardless of whether it uses a GPU or not. It's the autoscaler's job to kick pods around to limit resource usage. An application may not agree to be restarted, in which case it must specify that somehow. PDB is a generic Kubernetes way of telling all automation (including CA) not to touch a given pod. The annotation exists because many people requested a way to specifically prevent CA from touching a pod while allowing other automation to restart it.

> Which means GPU utilization for the node is 100%. Once we add the GPU utilization check as described in the comments above, CA will not touch that node. You'll also be able to set the threshold to 0, in which case no node will be removed if any of its GPUs are used.

> The autoscaler normally doesn't "move" pods unless it thinks they're not utilizing a machine's resources adequately. That situation does not (or should not) apply here. In my example, we have a machine with one GPU, a pod running there that requests and is using one GPU, and no other nodes advertising available GPUs. Why would an autoscaler consider this pod a candidate to move? By analogy, if a machine has one CPU and a pod there is "using" (really, requested) the whole CPU, the autoscaler wouldn't consider moving that pod. Why do we trust use of a CPU but not use of a GPU?
>
> The toleration I mentioned is a convenience provided by the "ExtendedResourceToleration" admission control plugin; we used to write them manually.

> Well, that's great news! That's what I'm asking for here. If we're going to count GPU utilization and not move pods that are utilizing the machine's resources adequately, then I have no further complaint.

> In this case the pod won't be moved. Only if you had another node with an unused GPU (e.g. a machine with 4 GPUs, 1 of them unused) might it be moved. After the change discussed here, it won't be moved in that case either. Is that clearer?

> Yes, that helps. Thank you for clarifying.
Consider the use case of machine learning inference, or training jobs that can tolerate failures: we would actually like to try moving all workloads on that particular node to other nodes, so that the user can scale down the GPU node and efficiently reduce cost.

* If the CPU or memory utilization rate is low but the GPUs are in use, scale-down will be triggered.
  If there is a distributed training task on that node, killing the task will cause the entire training job to fail. In this case, the node cannot be a scale-down candidate.
```
// CalculateUtilization calculates utilization of a node, defined as maximum of (cpu, memory) utilization.
// Per resource utilization is the sum of requests for it divided by allocatable. It also returns the individual
// cpu and memory utilization.
func CalculateUtilization(node *apiv1.Node, nodeInfo *schedulercache.NodeInfo, skipDaemonSetPods, skipMirrorPods bool) (utilInfo UtilizationInfo, err error) {
	cpu, err := calculateUtilizationOfResource(node, nodeInfo, apiv1.ResourceCPU, skipDaemonSetPods, skipMirrorPods)
	if err != nil {
		return UtilizationInfo{}, err
	}
	mem, err := calculateUtilizationOfResource(node, nodeInfo, apiv1.ResourceMemory, skipDaemonSetPods, skipMirrorPods)
	if err != nil {
		return UtilizationInfo{}, err
	}
	return UtilizationInfo{CpuUtil: cpu, MemUtil: mem, Utilization: math.Max(cpu, mem)}, nil
}
```
> Node utilization calculation logic

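For reference, here is a minimal sketch of the GPU-aware variant discussed in the review comments above and in #1367: if a node carries the GPU label, utilization would be computed from the GPU resource alone rather than max(cpu, memory), so busy CPU/memory no longer hides idle GPUs and busy GPUs keep the node off the candidate list. The name `CalculateUtilizationWithGPU` is an assumption for illustration; it reuses the helpers shown above and is not the existing CA API.

```
// Sketch only: GPU-aware utilization, reusing CalculateUtilization and
// calculateUtilizationOfResource from the snippet above. Hypothetical name.
func CalculateUtilizationWithGPU(node *apiv1.Node, nodeInfo *schedulercache.NodeInfo,
	gpuLabel string, skipDaemonSetPods, skipMirrorPods bool) (UtilizationInfo, error) {
	if _, isGPUNode := node.Labels[gpuLabel]; isGPUNode {
		// For GPU nodes, only GPU usage decides whether the node counts as underutilized.
		gpuUtil, err := calculateUtilizationOfResource(node, nodeInfo,
			apiv1.ResourceName("nvidia.com/gpu"), skipDaemonSetPods, skipMirrorPods)
		if err != nil {
			return UtilizationInfo{}, err
		}
		return UtilizationInfo{Utilization: gpuUtil}, nil
	}
	// Non-GPU nodes keep the existing max(cpu, memory) behavior.
	return CalculateUtilization(node, nodeInfo, skipDaemonSetPods, skipMirrorPods)
}
```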
### Proposed Solution
1. Either move the `utils/gpu` logic into the cloud providers, or add a new command-line option that indicates which GPU node label the cloud provider would like to use.
2. The scale-down case is trickier, because training and serving are different use cases with conflicting requirements. Fitting the GPU resource into the utilization formula does not solve this by itself; instead, I think it is better to have a flag indicating whether GPU nodes can be scaled down or not.

> Why doesn't using GPU in the resource utilization formula solve the issue? The problem I'm aware of is high CPU or memory utilization preventing scale-down of nodes with GPUs. This can be solved by changing the utilization calculation to only care about GPU, as described in #1367 (comment). Note that the utilization check is just one condition for scale-down (and one that we should ideally get rid of at some point anyway). Having low utilization doesn't mean a node will actually be deleted. We have plenty of mechanisms for preventing scale-down if you don't want a particular node removed (scale-down-disabled annotation) or a specific pod restarted (safe-to-evict: false annotation, PDB, to name just a few).

> For GPU nodes in the training case, the GPU utilization rate is not helpful for making a decision. That's why I think ignoring the CPU and memory utilization formula cannot resolve this issue.

> I looked at the comment, and creating a PDB and annotating the node is reasonable. The special thing here is that most batch/ML workloads are short-lived, so it's kind of tedious to do that if jobs are created frequently.

> I think maybe we can make the training case more generic, like a flag to indicate whether we can remove a node immediately (killing in-flight jobs) or wait until the workloads are done. It looks like this is not supported, but it is important for batch jobs.

> "Note that utilization check is just one condition for scale-down (and one that we should ideally get rid of at some point anyway)." Do you mean that CA plans to add some other triggers in the future? I think whole-cluster node utilization might be a reasonable one; the tradeoff there is scale-up speed vs. cost. Do you think this feature can be added?

> Utilization is not a trigger, it's only the first step in the scale-down decision, a precondition. The other part is the 'drain simulation' that calculates whether all pods are safe to move (whether they fit elsewhere in the cluster, whether they're owned by a controller that will restart them, whether PDBs allow moving them, all the stuff mentioned in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node). The drain part is the important one; the utilization part is there to limit the number of evictions (don't restart the pods if the gain is relatively small) and for performance reasons (avoid an expensive drain simulation if it's unlikely the node can be deleted). Regarding your use case: it seems like putting "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" on your pods is a simple solution.

> +1 to defining this per job, not per cluster. Users may want to run diverse workloads, that's what Kubernetes is for :)

> Agree. Using a label per job for fine-grained control makes sense. At the beginning, I thought about covering both cases with a cluster-level flag, which might make it a little bit confusing and somewhat contradictory.

Here is the pseudocode of the updated scale-down logic.
```
// IdleGPUNodeScaledDownEnabled - the client passes this option into CA to control the behavior

// calculate the node utilization rate

if (IdleGPUNodeScaledDownEnabled) {
    // reuse existing logic

    if (node.label[GPULabel] && node.utilization < threshold) {
        // try to move pods to other nodes if possible
    }
} else {
    // reuse existing logic

    if (node.label[GPULabel] && isGPUInUse(node)) {
        // remove the node from the scale-down candidate list if present
    }
}
```
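The review discussion above converged on controlling this per job rather than per cluster. As a rough sketch of that direction (an illustration only, not CA's actual drain code; `PodBlocksScaleDown` is a hypothetical helper), a per-pod check on the existing `cluster-autoscaler.kubernetes.io/safe-to-evict` annotation could protect in-flight training pods while still letting inference pods be moved:

```
// Sketch only: per-pod scale-down control via the existing safe-to-evict
// annotation, as suggested in the review comments above.
package main

import apiv1 "k8s.io/api/core/v1"

const safeToEvictAnnotation = "cluster-autoscaler.kubernetes.io/safe-to-evict"

// PodBlocksScaleDown returns true if the pod explicitly opts out of eviction,
// e.g. a distributed training worker that must not be interrupted.
func PodBlocksScaleDown(pod *apiv1.Pod) bool {
	return pod.Annotations[safeToEvictAnnotation] == "false"
}
```

A training Job would then set this annotation in its pod template, while inference Deployments would leave it unset so their pods stay movable during scale-down.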
### Related Issues
* #1367
* #1135
> Not really. CA will look at the template node for a node group to verify whether the node is expected to have GPUs. If so, it'll wait for them and not add surplus nodes.

> @aleksandra-malinowska Yeah, those are different cases or phases. What you mentioned are two cases:
>
> 2. The node group is a GPU family and the job requires GPUs: CA will check upcoming nodes and not add surplus nodes until the node is ready.
>
> The situation I hit is the second one. The problem is that GPUs take time to become allocatable: when the newly created node is Ready but the GPU resource is not yet available, the autoscaler thinks the pods are unschedulable and creates new nodes. This is hacked around in GKE but not in other cloud providers yet.

> This shouldn't happen if the GPU resource of the nodes (or the node template, if a group is empty) is set correctly. EDIT: do you mean that without the label, a node with unready GPUs can masquerade as a non-GPU node?
>
> To be fair, I think there were successful user reports of applying the GKE label to make it work elsewhere ;) But I agree it's not perfect, we could probably move it to a constant in the cloud provider or something.

> No. As you said, this shouldn't happen. I mean the case where a user schedules GPU workloads on a non-GPU cluster; CA already avoids scaling up in that case because the pods cannot fit into that kind of node group.
>
> Yes, definitely, I think all cloud providers can hack it this way, but each has its own GPU label. Either support a custom GPU label from arguments or move it into the cloud providers (preferred).