Add GPU Optimization Proposal #1562

Closed · wants to merge 1 commit

cluster-autoscaler/proposals/gpu_optimization.md (88 additions, 0 deletions)
# Cluster Autoscaling Optimization for GPU clusters
##### Author: Jeffwan

## Introduction
Cluster Autoscaler makes it extremely easy to scale a Kubernetes cluster in response to pod status. At the same time, it automates the cluster so that it runs efficiently.

However, there are still challenges in specific scenarios for accelerator computing clusters. Accelerator computing instances are quite different from normal instances: they are powerful, scalable instances that provide GPU-based parallel computing capabilities.

Accelerator computing nodes are expensive (an EC2 p3.16xlarge on-demand instance costs $24.48/hr), so scaling the cluster efficiently can save a substantial amount of money. GPU nodes are well suited for tasks with heavy computation needs such as machine learning and HPC applications. In some cases these workloads cannot be interrupted, which makes scale-down more challenging.

Here are the problems I have found in the current upstream Cluster Autoscaler.

### Problems in Cluster Autoscaler for GPU nodes

#### Accelerator Label
GKE uses the GPU label `cloud.google.com/gke-accelerator` to detect whether a node is an accelerator node; other cloud providers have not defined such a label yet. For CA to behave correctly in a GPU cluster, this GPU label is essential.
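
A minimal sketch of why this matters, assuming the GKE label above and an illustrative helper name (this is not the actual CA source):

```
import apiv1 "k8s.io/api/core/v1"

// gkeAcceleratorLabel is the label GKE applies to accelerator nodes; other
// providers currently have no agreed-upon equivalent.
const gkeAcceleratorLabel = "cloud.google.com/gke-accelerator"

// isGPUNode reports whether a node is labeled as an accelerator node.
// Without such a label, CA cannot tell GPU nodes apart from ordinary nodes.
func isGPUNode(node *apiv1.Node) bool {
	_, found := node.Labels[gkeAcceleratorLabel]
	return found
}
```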

#### Scaling up too much without GPU label support
Pending pods trigger a scale-up decision that brings up a number of nodes in a node group. When a new node is added, CA knows it is `upcoming` and that some pods will land on that node once it boots up, so it won't trigger another scale-up for those pods. However, the device plugin only starts after the node becomes Ready, and the extended resource required by the pods shows up in the node spec even later. CA will then conclude that it was wrong and that the pods for which it triggered the scale-up can't actually go to the node it added for them (because the node doesn't have the special resource yet and CA doesn't know it will be added later), so CA will go and create more nodes.

Contributor:

> CA will see that it was wrong and the pods for which it triggered scale-up can't actually go to a node it added for them (because it doesn't have the special resource yet and CA doesn't know it will be added later). So CA will go and create more nodes.

Not really: CA will look at the template node for a node group to verify whether the node is expected to have GPUs. If so, it'll wait for them and not add surplus nodes.

Member Author (@Jeffwan, Jan 8, 2019):

@aleksandra-malinowska Yeah, those are different cases or phases. What you mentioned covers two cases:

1. The node group is not a GPU family but the job requires GPU; scaling up is not helpful.
2. The node group is a GPU family and the job requires GPU; CA checks upcoming nodes and doesn't add surplus nodes until the node is ready.

The situation I hit is the second one. The problem is that GPUs take time to become allocatable: the newly created node is Ready, but the GPU resource is not, so the autoscaler thinks the pods are unschedulable and creates new nodes. This is hacked around in GKE but not in other cloud providers yet.

Contributor (@aleksandra-malinowska, Jan 8, 2019):

> nodeGroup is not GPU family but job requires GPU, scaling up is not helpful.

This shouldn't happen if the GPU resource of the nodes (or the node template, if a group is empty) is set correctly. EDIT: do you mean that without the label, a node with an unready GPU can masquerade as a non-GPU node?

> This is hacked in GKE but not other cloud providers yet.

To be fair, I think there were successful user reports of applying the GKE label to make it work elsewhere ;) But I agree it's not perfect; we could probably move it to a constant in the cloud provider or something.

Member Author (@Jeffwan, Jan 9, 2019):

> This shouldn't happen if the GPU resource of the nodes (or the node template, if a group is empty) is set correctly. EDIT: do you mean that without the label, a node with an unready GPU can masquerade as a non-GPU node?

No. As you said, this shouldn't happen. I mean the case where a user schedules GPU workloads on a non-GPU cluster; CA already avoids scaling up there because the pods cannot fit into that kind of node group.

> To be fair, I think there were successful user reports of applying the GKE label to make it work elsewhere ;) But I agree it's not perfect; we could probably move it to a constant in the cloud provider or something.

Yes, definitely. I think all cloud providers can hack it this way, but each has its own GPU label. Either support a custom GPU label as a command-line argument, or move it into the cloud providers (preferred).


The problem is that the GPU node becomes Ready first, but at that point it has no allocatable GPU resources. The root cause could be:

* The device plugin is not ready.
* The device plugin is ready but has not yet sent the list of devices to the kubelet.
* The kubelet takes time to advertise the resources to the API server, so the node spec is not up to date.

Pods that request GPU cannot be scheduled during this period. To address this issue, we either need to work with the NVIDIA folks to change the device plugin behavior upstream, or make CA mark nodes that have the GPU label but no allocatable GPU resources as NotReady.
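
A minimal sketch of the second option, assuming the `nvidia.com/gpu` extended resource name and illustrative helper names (not the actual CA implementation):

```
import apiv1 "k8s.io/api/core/v1"

const (
	gpuLabel    = "cloud.google.com/gke-accelerator"
	gpuResource = apiv1.ResourceName("nvidia.com/gpu")
)

// hasUnreadyGPU reports whether a node carries the accelerator label but does
// not yet expose any allocatable GPU resource. CA could treat such a node as
// NotReady instead of scaling up again for the still-pending GPU pods.
func hasUnreadyGPU(node *apiv1.Node) bool {
	if _, labeled := node.Labels[gpuLabel]; !labeled {
		return false // not a GPU node, nothing to wait for
	}
	allocatable, found := node.Status.Allocatable[gpuResource]
	// The device plugin has not advertised any GPUs to the API server yet.
	return !found || allocatable.IsZero()
}
```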


#### GPU resources are not considered in scale-down
In every loop, CA filters scale-down candidates by low node CPU and memory utilization.

* If the utilization rate is high, scale-down will not be triggered even if all GPUs are idle.

Contributor:

Most users add taints to their GPU node groups that prevent non-GPU pods from running there. Note that even without autoscaling, if you don't do that, your GPU job may not run at all if too many non-GPU pods are scheduled on an unprotected node with idle GPUs.

If we're talking only about GPU workloads that could be spread across other nodes blocking scale-down, that can be solved by raising the utilization threshold.

Member Author:

1. Agree, this is what users normally do for training jobs. There's a case I'm not sure is common enough: for TensorFlow distributed training, parameter servers and workers are deployed on one GPU cluster, but the PS pods mostly use CPU and memory while the workers use GPU.

2. "If we're talking only about GPU workloads that could be spread across other nodes blocking scale-down" - that's the inference case, where workloads may be mixed together on some economical GPU nodes.

Contributor:

The problem with the "economical" solution of mixing GPU and non-GPU pods on the same node is that unless you very carefully control scheduling, there's always a risk that the GPU will end up idle, which is a rather expensive way of running non-GPU pods.

Either way, I still don't see a problem here that couldn't be solved with just safe-to-evict and raising the utilization threshold.

Member Author:

I agree with the safe-to-evict solution for training jobs.

However, GPU resources are not even considered in the utilization calculation here, so raising the utilization threshold won't help this case:

```
func CalculateUtilization(node *apiv1.Node, nodeInfo *schedulercache.NodeInfo, skipDaemonSetPods, skipMirrorPods bool) (utilInfo UtilizationInfo, err error) {
	cpu, err := calculateUtilizationOfResource(node, nodeInfo, apiv1.ResourceCPU, skipDaemonSetPods, skipMirrorPods)
	if err != nil {
		return UtilizationInfo{}, err
	}
	mem, err := calculateUtilizationOfResource(node, nodeInfo, apiv1.ResourceMemory, skipDaemonSetPods, skipMirrorPods)
	if err != nil {
		return UtilizationInfo{}, err
	}
	return UtilizationInfo{CpuUtil: cpu, MemUtil: mem, Utilization: math.Max(cpu, mem)}, nil
}
```

I think we can improve this part.

Contributor (@aleksandra-malinowska, Jan 9, 2019):

To be clear: this comment is on the section of the doc concerned with idle GPUs plus other pods using up CPU/memory above the threshold. I'm fairly certain raising the threshold will help with that.

I'm not sure what benefit is to be gained from adding GPU utilization into the mix, though, other than consistency?

Contributor:

@seh Your argument applies to any workload, regardless of whether it uses GPU or not. It's the autoscaler's job to kick pods around to limit resource usage. An application may not agree to be restarted, in which case it must specify that somehow.

> Why should the pod author need to learn about how to convince this particular autoscaler

PDB is a generic Kubernetes way of telling all automation (including CA) not to touch a given pod. The annotation exists because many people requested a way to specifically prevent CA from touching a pod while still allowing other automation to restart it.

> The machine has one GPU, and the pod is using one GPU

Which means GPU utilization for the node is 100%. Once we add the GPU utilization check described in the comments above, CA will not touch that node. You'll also be able to set the threshold to 0, in which case no node will be removed if any of its GPUs are in use.
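
For reference, a minimal sketch of such a PDB built with the Go client types (assuming the policy/v1beta1 API of that time and an illustrative `job: training` selector):

```
import (
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// trainingPDB returns a PodDisruptionBudget with maxUnavailable: 0, which tells
// CA (and any other automation honoring PDBs) not to evict the matching pods.
func trainingPDB() *policyv1beta1.PodDisruptionBudget {
	zero := intstr.FromInt(0)
	return &policyv1beta1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "training-no-evict"},
		Spec: policyv1beta1.PodDisruptionBudgetSpec{
			MaxUnavailable: &zero,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"job": "training"},
			},
		},
	}
}
```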

Comment:

> OK, how should the autoscaler determine which pods to move and which not to move? [...] For you, just using GPUs may mean you don't want the pod evicted, ever - but this may simply not work for other users.

The autoscaler normally doesn't "move" pods unless it thinks they're not utilizing a machine's resources adequately. That situation does not—or should not—apply here.

In my example, we have a machine with one GPU, a pod running there that requests and is using one GPU, and no other nodes advertising available GPUs. Why would an autoscaler consider this pod as a candidate to move? By analogy, if a machine has one CPU and a pod there is "using" (really, requested) the whole CPU, the autoscaler wouldn't consider moving that pod. Why do we trust use of a CPU but not use of a GPU?

> it can probably be addressed similarly to adding a toleration you described - by admission controller or other automation, which can add annotations or set up PDBs (maxUnavailable: 0 should have the same effect).

The toleration I mentioned is a convenience provided by the "ExtendedResourceToleration" admission control plugin; we used to write them manually.

Comment:

> Which means GPU utilization for node is 100%. Once we add GPU utilization check as described in comments above CA will not touch that node.

Well, that's great news! That's what I'm asking for here. If we're going to count GPU utilization and not move pods that are utilizing the machine's resources adequately, then I have no further complaint.

Contributor:

> no other nodes advertising available GPUs

In this case the pod won't be moved. It might only be moved if you had another node with an unused GPU (e.g. a machine with 4 GPUs, 1 of them unused). After the change discussed here, it won't be moved in that case either. Is that clearer?

Comment:

Yes, that helps. Thank you for clarifying.

Consider the machine learning inference use case, or training cases that can tolerate failures: we would actually like to try moving all workloads on that particular node to other nodes, so the user can scale down this GPU node and efficiently reduce cost.

* If the utilization rate is low but GPUs are in use, scale-down will be triggered.
If there's a distributed training task on that node, killing the task will cause the entire training job to fail. In this case, the node cannot be a scale-down candidate.

```
// CalculateUtilization calculates utilization of a node, defined as maximum of (cpu, memory) utilization.
// Per resource utilization is the sum of requests for it divided by allocatable. It also returns the individual
// cpu and memory utilization.
func CalculateUtilization(node *apiv1.Node, nodeInfo *schedulercache.NodeInfo, skipDaemonSetPods, skipMirrorPods bool) (utilInfo UtilizationInfo, err error) {
	cpu, err := calculateUtilizationOfResource(node, nodeInfo, apiv1.ResourceCPU, skipDaemonSetPods, skipMirrorPods)
	if err != nil {
		return UtilizationInfo{}, err
	}
	mem, err := calculateUtilizationOfResource(node, nodeInfo, apiv1.ResourceMemory, skipDaemonSetPods, skipMirrorPods)
	if err != nil {
		return UtilizationInfo{}, err
	}
	return UtilizationInfo{CpuUtil: cpu, MemUtil: mem, Utilization: math.Max(cpu, mem)}, nil
}
```
> Node utilization calculation logic
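
For illustration, a hedged sketch of how GPU could be folded into this calculation; the `gpuResource` value, the helper name, and the reuse of `calculateUtilizationOfResource` for an extended resource are assumptions for the sketch, not existing CA behavior:

```
var gpuResource = apiv1.ResourceName("nvidia.com/gpu")

// calculateUtilizationWithGPU wraps the existing calculation so that a node
// whose GPUs are requested is never reported as underutilized, even when its
// CPU/memory utilization is below the scale-down threshold.
func calculateUtilizationWithGPU(node *apiv1.Node, nodeInfo *schedulercache.NodeInfo, skipDaemonSetPods, skipMirrorPods bool) (UtilizationInfo, error) {
	utilInfo, err := CalculateUtilization(node, nodeInfo, skipDaemonSetPods, skipMirrorPods)
	if err != nil {
		return UtilizationInfo{}, err
	}
	// Only consult the GPU resource on nodes that actually advertise it.
	if _, found := node.Status.Allocatable[gpuResource]; !found {
		return utilInfo, nil
	}
	gpu, err := calculateUtilizationOfResource(node, nodeInfo, gpuResource, skipDaemonSetPods, skipMirrorPods)
	if err != nil {
		return UtilizationInfo{}, err
	}
	utilInfo.Utilization = math.Max(utilInfo.Utilization, gpu)
	return utilInfo, nil
}
```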


### Proposed Solution
1. Either move the `utils/gpu` logic into the cloud providers, or add a new command-line option that lets operators indicate the GPU node label their cloud provider uses (a sketch of the first option follows below).
2. The scale-down case is tricky because training and serving seem to be different use cases with conflicting needs. Fitting the GPU resource into the utilization formula doesn't solve this issue; instead, I think it's better to have a flag indicating whether GPU nodes can be scaled down or not.
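
A sketch of what the first option could look like; the `GPULabel()` method is an illustration of the proposal, not an existing method on the CloudProvider interface:

```
// Each cloud provider advertises the node label it uses to mark accelerator
// nodes, so the generic GPU handling in utils/gpu no longer needs to hard-code
// the GKE label.
type CloudProvider interface {
	// ...existing CloudProvider methods...

	// GPULabel returns the label this provider applies to GPU nodes,
	// e.g. "cloud.google.com/gke-accelerator" on GKE.
	GPULabel() string
}
```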

Contributor:

Why doesn't using GPU in the resource utilization formula solve the issue? The problem I'm aware of is high CPU or memory utilization preventing scale-down of nodes with GPUs. That can be solved by changing the utilization calculation to only care about GPU, as described in #1367 (comment).

Note that the utilization check is just one condition for scale-down (and one that we should ideally get rid of at some point anyway). Having low utilization doesn't mean a node will actually be deleted. We have plenty of mechanisms for preventing scale-down if you don't want a particular node removed (the scale-down-disabled annotation) or a specific pod restarted (the safe-to-evict: false annotation and PDBs, to name just a few).
Unless there is a use case that can't be addressed with existing mechanisms, I'd rather we improve documentation and recommend an existing solution than add another flag.

Member Author (@Jeffwan, Jan 7, 2019):

For a GPU node in the training case, the GPU utilization rate is not helpful for making the decision.
For example, take an 8-GPU machine running two tasks, one using 6 GPUs and the other using 2. Once the first task finishes and releases its GPUs, the GPU utilization rate drops to 25%. But the node cannot be killed, because failing the remaining task would fail the entire training job. It's a true-or-false question rather than a comparison against a threshold.

That's why I think replacing CPU and memory with GPU in the utilization formula cannot resolve this issue.
The GPU utilization rate will definitely help in the model inference case: most of those workloads are stateless and can easily be migrated to other nodes to reduce cost.

Member Author:

I looked at the comment, and it's reasonable to create a PDB and annotate the node. The special thing here is that most batch/ML workloads are short-lived, so it's kind of tedious to do that when jobs are created frequently.

Member Author (@Jeffwan, Jan 7, 2019):

I think maybe we can make the training case more generic, like a flag indicating whether we can remove a node immediately (killing in-flight jobs) or should wait until the workloads are done. It looks like this is not supported yet, but it's important for batch jobs.

Member Author:

"Note that utilization check is just one condition for scale-down (and one that we should ideally get rid of at some point anyway)." Do you mean CA plans to add some other triggers in the future? I think whole-cluster node utilization might be a reasonable one; the tradeoff there is scale-up speed vs. cost. Do you think this feature could be added?

Contributor:

Utilization is not a trigger; it's only the first step in the scale-down decision, a precondition. The other part is a "drain simulation" that calculates whether all pods are safe to move (whether they fit elsewhere in the cluster, whether they're owned by a controller that will restart them, whether PDBs allow moving them, and all the other conditions mentioned in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node).

The drain part is the important one; the utilization part is there to limit the number of evictions (don't restart pods if the gain is relatively small) and for performance reasons (avoid an expensive drain simulation if it's unlikely the node can be deleted).

Regarding your use case: it seems like putting "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" on your pods is a simple solution.

Contributor:

+1 to defining this per job, not per cluster. Users may want to run diverse workloads, that's what Kubernetes is for :)

Member Author:

Agree, using a label per job for fine-grained control makes sense. At the beginning I thought about covering both cases with a cluster-level flag, which may make things a little confusing and somewhat contradictory.


Here's the pseudocode of the updated scale-down logic.
```
// IdleGPUNodeScaledDownEnabled - the client passes this option into CA to control the behavior

// calculate node utilization rate

if (IdleGPUNodeScaledDownEnabled) {
	// reuse existing logic

	if (node.label[GPULabel] && node.utilization < threshold) {
		// try to move pods to other nodes if possible
	}

} else {
	// reuse existing logic

	if (node.label[GPULabel] && isGPUInUse(node)) {
		// remove the node from the scale-down candidate list if present
	}
}
```
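
A hedged Go sketch of the same branching; the function, its parameters, and the way `gpuInUse` is supplied are illustrative assumptions, not existing CA code:

```
import apiv1 "k8s.io/api/core/v1"

// keepAsScaleDownCandidate applies the proposed flag on top of the existing
// utilization check. gpuInUse would be derived from the GPU requests of the
// pods scheduled on the node; it is passed in here to keep the sketch short.
func keepAsScaleDownCandidate(node *apiv1.Node, utilization, threshold float64, gpuInUse, idleGPUNodeScaledDownEnabled bool) bool {
	if utilization >= threshold {
		return false // existing logic: node is too busy to be a candidate
	}
	if _, isGPUNode := node.Labels["cloud.google.com/gke-accelerator"]; isGPUNode {
		if !idleGPUNodeScaledDownEnabled && gpuInUse {
			// The operator disallowed scaling down GPU nodes that still run GPU workloads.
			return false
		}
	}
	return true
}
```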

### Related Issues
#1367
#1135