Could GPU support move to cloud provider? #1135

Closed
ringtail opened this issue Aug 10, 2018 · 18 comments

Comments

@ringtail
Contributor

The autoscaler has GPU support for GKE, and I want to add Alicloud GPU support as well. But I found that the code lives outside the cloud provider interface. Is there any plan to move that code into the cloud provider?

@MaciekPytel
Contributor

That's a very reasonable request, but I'd like to understand some details to figure out how this could be done.

Basically, most of gpu.go is one big hack for the problem discussed in #597 (comment). Ideally, once the problem is fixed upstream, this whole file should be removed and no special handling for GPUs should be required on the cloudprovider side.

That being said, the upstream issue has been open for almost a year now and we don't know when it will finally get fixed. The question is how you are planning to solve it for Alicloud. You need some way to recognize that a node will have a GPU before the resource shows up on the Node object. In GKE we rely on the fact that every node with a GPU always has a cloud.google.com/gke-accelerator label. Is there a similar label in Alicloud that you're planning to use, or is it something completely different?

@ringtail
Contributor Author

Yes, we do have a similar label, aliyun.accelerator/nvidia_name. We need to check whether a node is a GPU node and what its GPU type is. We could abstract two interface functions in the cloud provider that take the node info as a parameter: one to judge whether a node has a GPU, and the other to get the node's GPU type.
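A minimal sketch of what those two extension points could look like; the interface name and method signatures below are hypothetical, not the API that was eventually merged:

```go
package cloudprovider

import (
	apiv1 "k8s.io/api/core/v1"
)

// GpuAwareCloudProvider is a hypothetical extension of the CloudProvider
// interface along the lines suggested above; names and signatures are
// illustrative only.
type GpuAwareCloudProvider interface {
	// NodeHasGpu reports whether the node is expected to expose a GPU,
	// e.g. based on a provider-specific label such as
	// aliyun.accelerator/nvidia_name.
	NodeHasGpu(node *apiv1.Node) bool

	// GetNodeGpuType returns the GPU type of the node, or "" if the node
	// has no GPU.
	GetNodeGpuType(node *apiv1.Node) string
}
```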

@aleksandra-malinowska
Contributor

Technically, wouldn't TemplateNodeInfo work for this? I.e., the core autoscaler algorithm could get the template node for each node group at the beginning of the iteration, then check each node for GPUs and consider it not ready (for all cloud providers) if the template has GPUs but the node doesn't. The node template already has to have correct GPU information for scaling from 0 to work; the only thing left would be to optimize (cache) fetching template nodes so that it doesn't require N = number of node groups extra API calls each loop.
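A rough sketch of that idea, assuming the template *apiv1.Node has already been extracted from TemplateNodeInfo() and cached per node group; the helper names are made up for illustration:

```go
package core

import (
	apiv1 "k8s.io/api/core/v1"
)

// gpuResourceName is the extended resource advertised by the NVIDIA device plugin.
const gpuResourceName = "nvidia.com/gpu"

// nodeAdvertisesGpu reports whether the node currently exposes at least one
// GPU in its allocatable resources.
func nodeAdvertisesGpu(node *apiv1.Node) bool {
	quantity, found := node.Status.Allocatable[apiv1.ResourceName(gpuResourceName)]
	return found && !quantity.IsZero()
}

// treatAsNotReady sketches the idea above: if the node group's template node
// says the node should have a GPU, but the node does not advertise one yet,
// consider the node not ready.
func treatAsNotReady(templateNode, node *apiv1.Node) bool {
	return nodeAdvertisesGpu(templateNode) && !nodeAdvertisesGpu(node)
}
```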

@MaciekPytel
Contributor

IIRC, for the base GPU implementation all you need is FilterOutNodesWithUnreadyGpus. As a short-term fix you could probably just change that function to look at two different GPU labels instead of one.

Moving the implementation to cloudprovider makes sense, but it may take some time before we get to it. Also, all this GPU code is only a temporary hack, so it would only be a temporary addition until the upstream issue is fixed.

GetNodeTargetGpus is only used for resource limits (--gpu-total), which is experimental, not very well tested, and not used anywhere in production, so I'm not sure how much you care about that.
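The short-term label check could look roughly like the sketch below; the real FilterOutNodesWithUnreadyGpus in utils/gpu/gpu.go does more than this, so treat the helper as illustrative only:

```go
package gpu

import (
	apiv1 "k8s.io/api/core/v1"
)

const (
	// Label used by the existing GKE-specific code.
	gkeGpuLabel = "cloud.google.com/gke-accelerator"
	// Alicloud label discussed above, added as the temporary workaround.
	alicloudGpuLabel = "aliyun.accelerator/nvidia_name"
)

// nodeExpectsGpu reports whether a node carries any of the recognised GPU
// labels, so FilterOutNodesWithUnreadyGpus can keep treating it as unready
// until the device plugin has advertised nvidia.com/gpu on it.
func nodeExpectsGpu(node *apiv1.Node) bool {
	if _, found := node.Labels[gkeGpuLabel]; found {
		return true
	}
	_, found := node.Labels[alicloudGpuLabel]
	return found
}
```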

@ringtail
Contributor Author

@MaciekPytel Yes, that's the point. I'm preparing to open-source the Alicloud cloudprovider for the autoscaler, and I don't know how to handle the GPU support code for Alicloud. So your advice is to just put it alongside the GKE GPU code and wait for the move to cloudprovider?

@MaciekPytel
Contributor

@ringtail Correct. I'm not sure we can move it to cloudprovider in time for 1.12, and adding an additional label to the existing code should be a simple temporary workaround. Unless I misunderstood something and you need a more complex change?

@ringtail
Contributor Author

Hmm, that makes sense. Thank you! 😁

@MaciekPytel
Contributor

/reopen
Let's keep this open to track a more permanent solution (moving GPU stuff to cloudprovider or fixing the issues that make it necessary).

@k8s-ci-robot
Contributor

@MaciekPytel: you can't re-open an issue/PR unless you authored it or you are assigned to it.

In response to this:

/reopen
Let's keep this open to track a more permanent solution (moving GPU stuff to cloudprovider or fixing the issues that make it necessary).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@MaciekPytel MaciekPytel reopened this Aug 16, 2018
@mjstevens777

@MaciekPytel I have been running into the same issue on AWS EKS. The recommended GPU setup uses the NVIDIA device plugin, which (assuming I've configured it correctly) doesn't register the GPUs until after the instance is ready. As a workaround I am adding the label cloud.google.com/gke-accelerator to the nodes with GPUs, but if it were possible to choose a custom label on the command line, I would appreciate that feature.

@ringtail
Contributor Author

@mjstevens777 Agreed! We have a similar label.

@Jeffwan
Member

Jeffwan commented Apr 4, 2019

We have a PR merged, and every cloud provider can now implement its own GPULabel() and GetAvailableGPUTypes(). Let me know if there's anything I missed.
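With that change, a provider implementation looks roughly like the following sketch; the receiver type and the GPU type strings are placeholders, not the actual Alicloud code:

```go
package examplecloud

// exampleCloudProvider stands in for a concrete cloudprovider.CloudProvider
// implementation; only the two GPU-related methods are sketched here.
type exampleCloudProvider struct{}

// GPULabel returns the label that marks nodes expected to carry a GPU; for
// Alicloud this would be the label mentioned earlier in the thread.
func (p *exampleCloudProvider) GPULabel() string {
	return "aliyun.accelerator/nvidia_name"
}

// GetAvailableGPUTypes returns the GPU types the provider supports. The
// entries below are placeholders, not an authoritative list.
func (p *exampleCloudProvider) GetAvailableGPUTypes() map[string]struct{} {
	return map[string]struct{}{
		"Tesla-P100": {},
		"Tesla-V100": {},
	}
}
```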

@ringtail
Contributor Author

ringtail commented Apr 5, 2019

@Jeffwan That's really cool. I'll check the GPU types.

@ringtail
Contributor Author

@Jeffwan @aleksandra-malinowska @MaciekPytel Could I close this issue?

@ringtail
Contributor Author

We have a PR merged, and every cloud provider can now implement its own GPULabel() and GetAvailableGPUTypes(). Let me know if there's anything I missed.

I have already updated the GPU types. Thanks!

@MaciekPytel
Contributor

You're right, no reason to keep it open.

@ringtail
Contributor Author

You're right, no reason to keep it open.

OK

@dkozlov

dkozlov commented Aug 27, 2019

And what about custom extended resources advertised by custom device plugins? https://kubernetes.io/docs/tasks/administer-cluster/extended-resource-node/
