Could GPU support move to cloud provider? #1135
That's a very reasonable request, but I'd like to understand some details to figure out how this could be done. Basically, most of gpu.go is one big hack for the problem discussed in #597 (comment). Ideally, once the problem is fixed upstream, this whole file should be removed and no special handling for GPUs should be required on the cloudprovider side.
Yes, we do have a similar label.
Technically, wouldn't TemplateNodeInfo work for this? I.e., the core autoscaler algorithm could get the template node for each node group at the beginning of the iteration, then check each node for GPUs and consider it not ready (for all cloud providers) if the template has GPUs but the node doesn't. The node template already has to have correct GPU information for scaling from 0 to work; the only thing to do would be to optimize (cache) getting template nodes, so it doesn't require N (= number of node groups) extra API calls each loop.
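The template-based check described in the comment above could be sketched roughly like this. This is a hypothetical simplification in Go (the autoscaler's language): the real autoscaler works with NodeInfo objects from the scheduler framework and the node group's TemplateNodeInfo() call, whereas `nodeInfo` and `filterUnready` here are made-up names for illustration only.

```go
package main

import "fmt"

// nodeInfo is a hypothetical, simplified stand-in for the autoscaler's
// node representation.
type nodeInfo struct {
	Name           string
	GPUAllocatable int64 // GPU resources the node currently reports
}

// filterUnready marks nodes as not ready when the node group's template
// expects GPUs but the node has not registered any yet (e.g. the device
// plugin is still starting). templateHasGPU would come from the cached
// template node for the group.
func filterUnready(templateHasGPU bool, nodes []nodeInfo) (ready, unready []nodeInfo) {
	for _, n := range nodes {
		if templateHasGPU && n.GPUAllocatable == 0 {
			unready = append(unready, n)
		} else {
			ready = append(ready, n)
		}
	}
	return
}

func main() {
	nodes := []nodeInfo{{"node-a", 1}, {"node-b", 0}}
	ready, unready := filterUnready(true, nodes)
	fmt.Println(len(ready), len(unready)) // prints "1 1"
}
```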
IIRC, for the base GPU implementation all you need is FilterOutNodesWithUnreadyGpus. As a short-term fix you could probably just change that function to look at two different GPU labels instead of one. Moving the implementation to cloudprovider makes sense, but it may take some time before we have time to do it. Also, all this GPU code is only a temporary hack, so it would only be a temporary addition until the upstream issue is fixed. GetNodeTargetGpus is only used for resource limits (
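The short-term fix of checking more than one label could be sketched like this. Note the hedges: `cloud.google.com/gke-accelerator` is the label GKE actually uses, but the second key is a placeholder for another provider's label, and `hasGPULabel` is an illustrative name rather than the real function in gpu.go.

```go
package main

import "fmt"

// Candidate GPU label keys to check instead of a single hardcoded one.
// The first is GKE's real accelerator label; the second is a placeholder
// for another provider's equivalent.
var gpuLabels = []string{
	"cloud.google.com/gke-accelerator",
	"example.com/accelerator", // hypothetical second provider's label
}

// hasGPULabel reports whether any known GPU label is set on the node.
func hasGPULabel(nodeLabels map[string]string) bool {
	for _, key := range gpuLabels {
		if _, ok := nodeLabels[key]; ok {
			return true
		}
	}
	return false
}

func main() {
	labels := map[string]string{"cloud.google.com/gke-accelerator": "nvidia-tesla-v100"}
	fmt.Println(hasGPULabel(labels)) // prints "true"
}
```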
@MaciekPytel Yes, that's the point. I'm preparing the open-sourcing of the Alicloud cloudprovider for the autoscaler, and I don't know how to handle the code for Alicloud GPU support. So your advice is to just put it alongside the GKE GPU code and wait for the move to cloudprovider?
@ringtail Correct. I'm not sure we can move it to cloudprovider in time for 1.12, and adding an additional label to the existing code should be a simple temporary workaround. Unless I misunderstood something and you need a more complex change?
Hmm, that makes sense. Thank you! 😁
/reopen
@MaciekPytel: you can't re-open an issue/PR unless you authored it or you are assigned to it. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@MaciekPytel I have been running into the same issue on AWS EKS. The recommended GPU setup uses the NVIDIA device plugin, which (assuming I've configured it correctly) doesn't register the GPUs until after the instance is ready. As a workaround I am adding the label
@mjstevens777 Agreed! We have a similar label.
We have a PR merged, and every cloud provider can now implement its own GPULabel() and GetAvailableGPUTypes(). Let me know if there's anything I missed.
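For reference, a provider-side implementation of those two methods could look roughly like this. This is a minimal sketch: the method names and shapes follow the comment above, but `exampleCloudProvider`, the label key, and the GPU type values are all illustrative, not any provider's real code.

```go
package main

import "fmt"

// exampleCloudProvider is a hypothetical provider implementing the two
// GPU-related methods mentioned in the thread.
type exampleCloudProvider struct{}

// GPULabel returns the label key this provider puts on GPU nodes.
// The key below is a made-up example, not a real provider's label.
func (p *exampleCloudProvider) GPULabel() string {
	return "example.com/gpu-accelerator"
}

// GetAvailableGPUTypes returns the set of GPU types the provider supports,
// as a set (map with empty-struct values).
func (p *exampleCloudProvider) GetAvailableGPUTypes() map[string]struct{} {
	return map[string]struct{}{
		"nvidia-tesla-v100": {},
		"nvidia-tesla-p100": {},
	}
}

func main() {
	p := &exampleCloudProvider{}
	fmt.Println(p.GPULabel())                   // prints "example.com/gpu-accelerator"
	fmt.Println(len(p.GetAvailableGPUTypes()))  // prints "2"
}
```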
@Jeffwan That's really cool. I'll check the GPU types.
@Jeffwan @aleksandra-malinowska @MaciekPytel Could I close this issue?
I have already updated the GPU types. Thanks!
You're right, no reason to keep it open.
OK
And what about custom extended resources advertised by custom device plugins? https://kubernetes.io/docs/tasks/administer-cluster/extended-resource-node/ |
The autoscaler has GKE GPU support, and I want to add Alicloud GPU support to the autoscaler. But I found the code is outside the scope of the cloud provider. Is there any plan to move that code into the cloud provider?