Move GPULabel and GPUTypes to cloud provider #1584
Conversation
/hold
/cc @aleksandra-malinowska @MaciekPytel @ringtail @feiskyer @hello2mao Please let me know if your cloud provider has a special label for accelerator nodes. We can add them in this PR. Otherwise, by default CA will use the GKE label.
(force-pushed from bcf5b51 to 3a9cf87)
@Jeffwan In Alibaba Cloud the label is
@Jeffwan Thanks for shimming the code for Alibaba Cloud. Could we reach an agreement on the interface and develop the code ourselves? Some of the logic may differ between providers.
@Jeffwan no special GPU label for Baidu Cloud, thanks. |
Thanks. I think this is a good direction.
One general suggestion, though: if we add new methods to the cloudprovider interface, implementing them should be optional. After all, not all cloud providers necessarily support GPUs.
Also, it might be better to add just one method that returns an object encapsulating all GPU-related information. That will be more natural to extend later on, e.g. if we need to know which GPU types are available in which zones.
So I would rather see the interface change as:
func GetGpuInfo() (GpuInfo, error)
where the error can be cloudprovider.ErrNotImplemented for cloud providers that don't support GPUs.
I am not sure whether GpuInfo should be a struct or an interface. I am leaning towards the latter because it allows hiding implementation details, but I am open to arguments here.
One more nit: since we are changing this part of the code anyway, we could also change availableGPUTypes to be a map[string]bool. The lookup value can then be used directly, and we don't have to unpack the second tuple value to check whether the key is in the map.
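To illustrate the map[string]bool nit above, here is a minimal, runnable sketch; the variable and function names are illustrative, not taken from the PR:

```go
package main

import "fmt"

// Hypothetical set of supported GPU types, keyed by type name.
var availableGPUTypes = map[string]bool{
	"nvidia-tesla-k80":  true,
	"nvidia-tesla-p100": true,
}

// isSupportedGPUType uses the lookup value directly: a missing key yields
// the zero value false, so no second "ok" return value needs unpacking.
func isSupportedGPUType(t string) bool {
	return availableGPUTypes[t]
}

func main() {
	fmt.Println(isSupportedGPUType("nvidia-tesla-k80"))
	fmt.Println(isSupportedGPUType("amd-mi25"))
}
```

With map[string]struct{} the same check would need the comma-ok form (`_, ok := m[t]`); the bool-valued map trades a byte per entry for the shorter lookup.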
Thanks @Jeffwan! That seems like a good first step, at least for now, since it seems most cloud providers expose GPUs via a special label. If it works for all of us, let's do it.
Eh, I'm not sure if it'll make any difference. Instead of modifying the CloudProvider interface, we'll have to modify this special struct (and all of its implementations). But I have no strong feelings about it either way.
+1
@losipiuk @aleksandra-malinowska
Agree. The only concern I have is this: the reason for not using bool is that in the future CA may want to know more info about a specific GPU type, like GPU memory to calculate utilization (if GPU sharing is supported later). This might be a good case for making GPUInfo more extensible.
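One possible shape for such an extensible GpuInfo is sketched below; the field names (Label, AvailableTypes, MemoryByType) are hypothetical, not from the PR:

```go
package main

import "fmt"

// GpuInfo is an illustrative, extensible container for GPU metadata.
// New fields can be added later without changing the cloudprovider
// interface signature again.
type GpuInfo struct {
	Label          string          // node label identifying GPU nodes
	AvailableTypes map[string]bool // GPU types the provider supports
	// MemoryByType (hypothetical) could support utilization math
	// once GPU sharing is supported; bytes per GPU type.
	MemoryByType map[string]int64
}

func main() {
	info := GpuInfo{
		Label:          "example.com/accelerator",
		AvailableTypes: map[string]bool{"nvidia-tesla-v100": true},
		MemoryByType:   map[string]int64{"nvidia-tesla-v100": 16 << 30},
	}
	fmt.Println(info.AvailableTypes["nvidia-tesla-v100"])
	fmt.Println(info.MemoryByType["nvidia-tesla-v100"])
}
```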
@ringtail If we'd like to go a step further, I think we can remove gpu.go and move everything into the cloud providers, or keep some interface in the cloud provider like GPUInfo() and reference it in the main logic (scale_up and scale_down). Could you give some examples or use cases?
The current approach of returning a label or simple GPU info is perfectly fine, though I agree with @losipiuk that there should be a more explicit way to say 'not implemented' than returning the GKE label. Maybe the way to proceed is to allow ErrNotImplemented, to make it explicit when GPUs are not supported by a given cloudprovider, and let cloudprovider owners do their own implementation. To keep compatibility for users who add the GKE label manually, we can handle a cloudprovider returning ErrNotImplemented by defaulting to the GKE label (it's ugly, but it keeps current behavior while still being explicit about what is officially supported by a given cloudprovider and what is not).

I'm against moving utils/gpu logic into cloudprovider, though. cloudprovider is meant to be a client for communicating with the underlying platform. FilterOutNodesWithUnreadyGpus is a hack, but it's also a complex part of CA logic based on knowledge of internal implementation (it abuses how clusterstate is implemented to work around the driver installation problem). Moving this kind of logic into cloudprovider means we have to start treating a lot of internal CA logic as a quasi-API, as we do with cloudprovider today.

This kind of complex, provider-specific logic is why we have Processors. If we think we really need custom GPU hacks, I'd suggest moving StaticAutoscaler.obtainNodeLists() into a new processor and implementing the custom logic there. (If someone were to actually do this, I'm really interested to learn how you're planning to deal with the problem; it took us a while to come up with the current hack.)
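The ErrNotImplemented fallback described above could be sketched roughly as follows. This is a sketch under assumptions: the sentinel error, label constant, and one-method interface are stand-ins, not the code that was actually merged.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotImplemented stands in for cloudprovider.ErrNotImplemented.
var ErrNotImplemented = errors.New("not implemented")

// defaultGPULabel stands in for the legacy GKE accelerator label.
const defaultGPULabel = "cloud.google.com/gke-accelerator"

// CloudProvider is a pared-down, hypothetical slice of the interface.
type CloudProvider interface {
	GPULabel() (string, error)
}

// effectiveGPULabel keeps current behavior: a provider that returns
// ErrNotImplemented falls back to the GKE label.
func effectiveGPULabel(cp CloudProvider) (string, error) {
	label, err := cp.GPULabel()
	if errors.Is(err, ErrNotImplemented) {
		return defaultGPULabel, nil
	}
	return label, err
}

// stubProvider models a cloud provider without GPU support.
type stubProvider struct{}

func (stubProvider) GPULabel() (string, error) { return "", ErrNotImplemented }

func main() {
	label, err := effectiveGPULabel(stubProvider{})
	fmt.Println(label, err)
}
```

The point of the sentinel error is that callers can distinguish "this provider deliberately has no GPU label" from a genuine failure, while unmodified deployments keep working.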
Thanks. I will make the change to support ErrNotImplemented. As for moving utils/gpu logic into the cloud provider, unless we get reasonable requirements it will not be considered in this PR's scope.
@Jeffwan @MaciekPytel @aleksandra-malinowska Where are we with this PR? Who is waiting for whom? :)
Sorry for leaving this hanging for so long. I don't think all outstanding comments have been applied yet. In particular, we should decide whether we want to keep the gke-accelerator label working on cloud providers that have their own label, to avoid breaking people who are already using it as a workaround. The current version will work fine for people on EKS, but it may break people running self-hosted clusters on AWS.
@MaciekPytel I sent you a message on Slack to ask about this. I am not sure how to handle func (aws *awsCloudProvider) Name() (string, errors.AutoscalerError) {
(force-pushed from 3a9cf87 to c8afd66)
(force-pushed from 6c042eb to 9a5a9ef)
There's no label for GPU type on Azure yet. Does this change still support the default device-plugin-based GPU scaling?
Yes. It won't change the default behavior; it only applies to cloud providers that have their own label. Otherwise, the default label is still used.
So, what's the status of this PR? Could we follow your interface to implement it?
If we wanted to keep backward compatibility, we'd need to respect the GKE label even if the cloudprovider defines a different one (i.e. AWS would respect both the EKS and GKE labels). There are already people running on AWS who depend on the GKE label (e.g. #1659). I'd imagine that would be implemented in the code calling the cloudprovider method (e.g. in NodeHasGpu, etc). @ringtail Is the current interface (GPULabel() + GetAvailableGPUTypes()) sufficient for you?
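A minimal sketch of the dual-label check described here; the function name is illustrative, and the AWS label string shown is an assumption rather than something stated in this thread:

```go
package main

import "fmt"

// gkeGPULabel is the legacy label some AWS users already set manually.
const gkeGPULabel = "cloud.google.com/gke-accelerator"

// nodeHasGPULabel accepts either the provider's own GPU label or the
// legacy GKE label, preserving backward compatibility.
func nodeHasGPULabel(nodeLabels map[string]string, providerLabel string) bool {
	if _, ok := nodeLabels[providerLabel]; ok {
		return true
	}
	_, ok := nodeLabels[gkeGPULabel] // legacy workaround label
	return ok
}

func main() {
	// Hypothetical AWS node that only carries the legacy GKE label.
	labels := map[string]string{
		"cloud.google.com/gke-accelerator": "nvidia-tesla-v100",
	}
	fmt.Println(nodeHasGPULabel(labels, "k8s.amazonaws.com/accelerator"))
}
```

Keeping this check in the calling code (rather than in each provider) means every provider gets the compatibility behavior for free.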
@mwielgus I think I am waiting for this PR to merge so that I can implement it for my provider. I don't know what this PR is waiting for.
(force-pushed from 9a5a9ef to c9261da)
Rebased on the upstream changes.
(force-pushed from c9261da to 86ce312)
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: MaciekPytel. The full list of commands accepted by this bot can be found here; the pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
(force-pushed from 86ce312 to 9066688)
No problem. I rebased on the upstream changes and need another /lgtm.
/hold cancel
/lgtm
Change-Id: If23e786352836d5c76ce563c9e428683028c9005
This PR addresses #1135.
Please don't merge this PR until I have confirmed with all cloud provider owners that they're OK with the labels and supported GPU types.