Could GPU support move to cloud provider? #1135

Closed
ringtail opened this issue Aug 10, 2018 · 18 comments

Comments

@ringtail
Contributor

The autoscaler has GPU support for GKE, and I want to add Alicloud GPU support as well. But I found that the code lives outside the cloud provider interface. Is there any plan to move that code into the cloud provider?

@MaciekPytel
Contributor

That's a very reasonable request, but I'd like to understand some details to figure out how this could be done.

Basically, most of gpu.go is one big hack for the problem discussed in #597 (comment). Ideally, once the problem is fixed upstream, this whole file should be removed and no special handling for GPUs should be required on the cloudprovider side.

That being said, the upstream issue has been open for almost a year now and we don't know when it will finally get fixed. The question is how you are planning to solve it for Alicloud. You need some way to recognize that a node will have a GPU before the resource shows up on the Node object. In GKE we rely on the fact that every node with a GPU always has a cloud.google.com/gke-accelerator label. Is there a similar label in Alicloud that you're planning to use, or is it something completely different?

@ringtail
Contributor Author

Yes, we do have a similar label, aliyun.accelerator/nvidia_name. We need to check whether a node is a GPU node and what its GPU type is. We could abstract two interface functions in the cloud provider that take the node info as a parameter: one to judge whether a node has a GPU, and the other to get the node's GPU type.
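A minimal sketch of what those two extension points could look like; the interface name and method signatures below are hypothetical, not the API that was eventually merged:

```go
package cloudprovider

import (
	apiv1 "k8s.io/api/core/v1"
)

// GpuAwareCloudProvider is a hypothetical extension of the CloudProvider
// interface along the lines suggested above; names and signatures are
// illustrative only.
type GpuAwareCloudProvider interface {
	// NodeHasGpu reports whether the node is expected to expose a GPU,
	// e.g. based on a provider-specific label such as
	// aliyun.accelerator/nvidia_name.
	NodeHasGpu(node *apiv1.Node) bool

	// GetNodeGpuType returns the GPU type of the node, or "" if the node
	// has no GPU.
	GetNodeGpuType(node *apiv1.Node) string
}
```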

@aleksandra-malinowska
Contributor

Technically, wouldn't TemplateNodeInfo work for this? I.e., the core autoscaler algorithm could get the template node for each node group at the beginning of the iteration, then check each node for GPUs and consider it not ready (for all cloud providers) if the template has GPUs but the node doesn't. The node template already has to have correct GPU information for scaling from 0 to work; the only thing left would be to optimize (cache) fetching template nodes so that it doesn't require N = number of node groups extra API calls each loop.
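A rough sketch of that idea, assuming the template *apiv1.Node has already been extracted from TemplateNodeInfo() and cached per node group; the helper names are made up for illustration:

```go
package core

import (
	apiv1 "k8s.io/api/core/v1"
)

// gpuResourceName is the extended resource advertised by the NVIDIA device plugin.
const gpuResourceName = "nvidia.com/gpu"

// nodeAdvertisesGpu reports whether the node currently exposes at least one
// GPU in its allocatable resources.
func nodeAdvertisesGpu(node *apiv1.Node) bool {
	quantity, found := node.Status.Allocatable[apiv1.ResourceName(gpuResourceName)]
	return found && !quantity.IsZero()
}

// treatAsNotReady sketches the idea above: if the node group's template node
// says the node should have a GPU, but the node does not advertise one yet,
// consider the node not ready.
func treatAsNotReady(templateNode, node *apiv1.Node) bool {
	return nodeAdvertisesGpu(templateNode) && !nodeAdvertisesGpu(node)
}
```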

@MaciekPytel
Contributor

IIRC, for the base GPU implementation all you need is FilterOutNodesWithUnreadyGpus. As a short-term fix you could probably just change that function to look at two different GPU labels instead of one.

Moving the implementation to cloudprovider makes sense, but it may take some time before we get to it. Also, all this GPU code is only a temporary hack, so it would only be a temporary addition until the upstream issue is fixed.

GetNodeTargetGpus is only used for resource limits (--gpu-total), which is experimental, not very well tested, and not used anywhere in production, so I'm not sure how much you care about that.
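The short-term label check could look roughly like the sketch below; the real FilterOutNodesWithUnreadyGpus in utils/gpu/gpu.go does more than this, so treat the helper as illustrative only:

```go
package gpu

import (
	apiv1 "k8s.io/api/core/v1"
)

const (
	// Label used by the existing GKE-specific code.
	gkeGpuLabel = "cloud.google.com/gke-accelerator"
	// Alicloud label discussed above, added as the temporary workaround.
	alicloudGpuLabel = "aliyun.accelerator/nvidia_name"
)

// nodeExpectsGpu reports whether a node carries any of the recognised GPU
// labels, so FilterOutNodesWithUnreadyGpus can keep treating it as unready
// until the device plugin has advertised nvidia.com/gpu on it.
func nodeExpectsGpu(node *apiv1.Node) bool {
	if _, found := node.Labels[gkeGpuLabel]; found {
		return true
	}
	_, found := node.Labels[alicloudGpuLabel]
	return found
}
```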

@ringtail
Contributor Author

@MaciekPytel Yes, that's the point. I'm preparing to open-source the Alicloud cloudprovider for the autoscaler, and I don't know how to handle the GPU support code for Alicloud. So your advice is to just put it alongside the GKE GPU code and wait for the move to cloudprovider?

@MaciekPytel
Contributor

@ringtail Correct. I'm not sure we can move it to cloudprovider in time for 1.12, and adding an additional label to the existing code should be a simple temporary workaround. Unless I misunderstood something and you need a more complex change?

@ringtail
Contributor Author

Hmm, that makes sense. Thank you! 😁

@MaciekPytel
Contributor

/reopen
Let's keep this open to track a more permanent solution (moving GPU stuff to cloudprovider or fixing the issues that make it necessary).

@k8s-ci-robot
Contributor

@MaciekPytel: you can't re-open an issue/PR unless you authored it or you are assigned to it.

In response to this:

/reopen
Let's keep this open to track a more permanent solution (moving GPU stuff to cloudprovider or fixing the issues that make it necessary).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@MaciekPytel MaciekPytel reopened this Aug 16, 2018
@mjstevens777

@MaciekPytel I have been running into the same issue on AWS EKS. The recommended GPU setup uses the NVIDIA device plugin, which (assuming I've configured it correctly) doesn't register the GPUs until after the instance is ready. As a workaround I am adding the label cloud.google.com/gke-accelerator to the nodes with GPUs, but if it were possible to choose a custom label on the command line, I would appreciate that feature.

@ringtail
Contributor Author

@mjstevens777 Agreed! We have a similar label.

@Jeffwan
Member

Jeffwan commented Apr 4, 2019

We have a PR merged, and every cloud provider can now implement its own GPULabel() and GetAvailableGPUTypes(). Let me know if there's anything I missed.
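With that change, a provider implementation looks roughly like the following sketch; the receiver type and the GPU type strings are placeholders, not the actual Alicloud code:

```go
package examplecloud

// exampleCloudProvider stands in for a concrete cloudprovider.CloudProvider
// implementation; only the two GPU-related methods are sketched here.
type exampleCloudProvider struct{}

// GPULabel returns the label that marks nodes expected to carry a GPU; for
// Alicloud this would be the label mentioned earlier in the thread.
func (p *exampleCloudProvider) GPULabel() string {
	return "aliyun.accelerator/nvidia_name"
}

// GetAvailableGPUTypes returns the GPU types the provider supports. The
// entries below are placeholders, not an authoritative list.
func (p *exampleCloudProvider) GetAvailableGPUTypes() map[string]struct{} {
	return map[string]struct{}{
		"Tesla-P100": {},
		"Tesla-V100": {},
	}
}
```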

@ringtail
Contributor Author

ringtail commented Apr 5, 2019

@Jeffwan That's really cool. I'll check the GPU types.

@ringtail
Contributor Author

@Jeffwan @aleksandra-malinowska @MaciekPytel Could I close this issue?

@ringtail
Contributor Author

We have a PR merged, and every cloud provider can now implement its own GPULabel() and GetAvailableGPUTypes(). Let me know if there's anything I missed.

I have already updated the GPU types. Thanks!

@MaciekPytel
Contributor

You're right, no reason to keep it open.

@ringtail
Contributor Author

You're right, no reason to keep it open.

OK

@dkozlov

dkozlov commented Aug 27, 2019

And what about custom extended resources advertised by custom device plugins? https://kubernetes.io/docs/tasks/administer-cluster/extended-resource-node/
