
Move GPULabel and GPUTypes to cloud provider #1584

Merged · 1 commit · Mar 26, 2019

Conversation

@Jeffwan (Member) commented Jan 16, 2019

This PR is to address #1135

  1. Add GPULabel() and GetAvailableGPUTypes() to the cloudprovider interface.
  2. Callers pass GPULabel to the methods in gpu.go that originally relied on it.
  3. Pass cloud.CloudProvider to the methods in scale_up.go and scale_down.go that don't use GPULabel directly.

Please don't merge this PR until I confirm with all cloud provider owners that they're OK with the labels and supported GPU types.
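For readers, here is a minimal sketch of the two methods being added, as described in the list above. The signatures are inferred from the discussion; the provider type and the GPU type names are purely illustrative, not the actual autoscaler code:

```go
package main

import "fmt"

// Sketch of the two methods this PR adds to the cloudprovider
// interface; the concrete signatures in the real code may differ.
type CloudProvider interface {
	// GPULabel returns the node label this provider uses to mark
	// accelerator nodes.
	GPULabel() string
	// GetAvailableGPUTypes returns the GPU types the provider supports.
	GetAvailableGPUTypes() map[string]struct{}
}

// Hypothetical provider implementation, for illustration only.
type gkeProvider struct{}

func (p *gkeProvider) GPULabel() string { return "cloud.google.com/gke-accelerator" }

func (p *gkeProvider) GetAvailableGPUTypes() map[string]struct{} {
	return map[string]struct{}{
		"nvidia-tesla-k80":  {},
		"nvidia-tesla-v100": {},
	}
}

func main() {
	var cp CloudProvider = &gkeProvider{}
	fmt.Println(cp.GPULabel()) // cloud.google.com/gke-accelerator
}
```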

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 16, 2019
@Jeffwan (Member, Author) commented Jan 16, 2019

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 16, 2019
@Jeffwan (Member, Author) commented Jan 16, 2019

/cc @aleksandra-malinowska @MaciekPytel @ringtail @feiskyer @hello2mao Please let me know if your cloud provider has a special label for accelerator nodes; we can add it in this PR. Otherwise, by default CA will use cloud.google.com/gke-accelerator.

@Jeffwan Jeffwan force-pushed the gpu_to_cloudprovider branch from bcf5b51 to 3a9cf87 Compare January 16, 2019 07:27
@ringtail (Contributor)

@Jeffwan in Alibaba Cloud the label is aliyun.accelerator/nvidia_name

@ringtail (Contributor)

@Jeffwan Thanks for shimming the code for Alibaba Cloud. Could we agree on the interface and then develop the implementation ourselves? Some of the logic may differ between providers.

@hello2mao (Contributor)

@Jeffwan no special GPU label for Baidu Cloud, thanks.

@losipiuk (Contributor) left a comment

Thanks. I think this is a good direction.

One general suggestion, though: if we add new methods to the cloudprovider interface, implementing them should be optional. After all, not all cloud providers necessarily support GPUs.
Also, maybe it would be better to add just one method that returns an object encapsulating all GPU-related information. It will be more natural to extend later on, e.g. if we need to know which GPU types are available in which zone.

So I would rather see the interface change as:

func GetGpuInfo() (GpuInfo, error)

where the error can be cloudprovider.ErrNotImplemented for cloud providers not supporting GPUs.

I am not sure if GpuInfo should be a struct or an interface. I lean towards the latter because it allows hiding implementation details, but I am open to arguments here.

Also one more nit: as we are changing this part of the code, we can also change availableGPUTypes to map[string]bool. The lookup value can then be used directly, and we don't have to unpack the second tuple value to check whether the key is in the map.
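To illustrate the map[string]bool nit above: a lookup on a missing key yields the zero value false, so the result can be used directly as a condition. A small sketch (variable and function names are illustrative):

```go
package main

import "fmt"

// With map[string]bool, a failed lookup returns the zero value (false),
// so the result can be used directly in a condition, without the
// two-value "comma ok" form needed for map[string]struct{}.
var availableGPUTypes = map[string]bool{
	"nvidia-tesla-k80":  true,
	"nvidia-tesla-v100": true,
}

func isSupportedGPU(gpuType string) bool {
	return availableGPUTypes[gpuType] // missing keys simply yield false
}

func main() {
	fmt.Println(isSupportedGPU("nvidia-tesla-k80")) // true
	fmt.Println(isSupportedGPU("amd-radeon"))       // false
}
```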

@aleksandra-malinowska (Contributor)

Thanks @Jeffwan! That seems like a good first step, at least for now, since it seems most cloud providers support GPUs via a special label. If it works for all of us, let's do it.

> Also maybe it would be better to add just one method returns object encapsulating all GPU related information. It will be more natural to extend if later on, e.g if we need some information in which zone which GPU types are available.

Eh, I'm not sure if it'll make any difference. Instead of modifying the CloudProvider interface, we'll have to modify this special struct (and all of its implementations). But I have no strong feelings about it either way.

> Also one more nit. As we are changing this part of code we can also change availableGPUTypes to be map[string]bool. The lookup value can then be used directly and we do not have to unpack the second tuple value to check if key is found in the map.

+1

@Jeffwan (Member, Author) commented Jan 17, 2019

@losipiuk @aleksandra-malinowska

> Where error can be cloudprovider.ErrNotImplemented for cloudproviders not supporting GPUs

Agree. My only concern is that gpu.go is a utility: whether or not the CloudProvider supports GPUs, the scaling logic now goes through its helper functions. And even when a cloud provider doesn't support it, I know some users label nodes and use it as a workaround.
Once we use an error to indicate GPU support, that flow changes. Using a default value is a compromise. If we want a struct or interface to capture all GPU info, an error return would be necessary.

> Also one more nit. As we are changing this part of code we can also change availableGPUTypes to be map[string]bool. The lookup value can then be used directly and we do not have to unpack the second tuple value to check if key is found in the map.

The reason for not using bool is that in the future CA may want more info about a specific GPU type, like GPU memory, to calculate utilization (if GPU sharing is supported later). This might be a good case for making GPUInfo more extensible.
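One hypothetical shape for the extensibility described above: mapping the GPU type name to a struct leaves room for fields like memory. Everything here (the GPUTypeInfo struct, its field, and the memory figures) is illustrative, not part of the PR:

```go
package main

import "fmt"

// GPUTypeInfo is a hypothetical per-type record; a struct value leaves
// room for future fields (e.g. memory for utilization calculations).
type GPUTypeInfo struct {
	MemoryMiB int64 // assumed field, for illustration only
}

// Illustrative registry; the memory values are rough, not authoritative.
var availableGPUTypes = map[string]GPUTypeInfo{
	"nvidia-tesla-k80":  {MemoryMiB: 11441},
	"nvidia-tesla-v100": {MemoryMiB: 16160},
}

func main() {
	info, ok := availableGPUTypes["nvidia-tesla-v100"]
	fmt.Println(ok, info.MemoryMiB) // true 16160
}
```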

@Jeffwan (Member, Author) commented Jan 17, 2019

@ringtail If we'd like to go a step further, I think we can remove gpu.go and move everything into the cloud providers, or keep some interface in the cloud provider like GPUInfo() and reference it in the main logic (scale_up and scale_down). Could you give some examples or use cases?

@MaciekPytel (Contributor)

The current approach of returning a label or simple GPU info is perfectly fine, though I agree with @losipiuk that there should be a more explicit way to say 'not implemented' than returning the GKE label.

Maybe the way to proceed is to allow ErrNotImplemented, to make it explicit that GPUs are not supported by a given cloud provider, and let cloud provider owners do their own implementation. To keep compatibility for users who add the GKE label manually, we can handle a cloud provider returning ErrNotImplemented by defaulting to the GKE label (it's ugly, but it keeps the current behavior while still being explicit about what is officially supported by a given cloud provider and what is not).
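The caller-side fallback described here could look roughly like the sketch below. The sentinel error and the provider type are hypothetical stand-ins for how the real code might spell cloudprovider.ErrNotImplemented:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel mirroring cloudprovider.ErrNotImplemented.
var ErrNotImplemented = errors.New("not implemented")

const defaultGPULabel = "cloud.google.com/gke-accelerator"

// Hypothetical provider that has not implemented GPULabel.
type noGPUProvider struct{}

func (p *noGPUProvider) GPULabel() (string, error) {
	return "", ErrNotImplemented
}

// gpuLabelFor sketches the fallback: if the provider reports
// ErrNotImplemented, default to the GKE label to preserve the
// current behavior for users who rely on it.
func gpuLabelFor(p interface{ GPULabel() (string, error) }) string {
	label, err := p.GPULabel()
	if errors.Is(err, ErrNotImplemented) {
		return defaultGPULabel
	}
	return label
}

func main() {
	fmt.Println(gpuLabelFor(&noGPUProvider{})) // cloud.google.com/gke-accelerator
}
```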


I'm against moving utils/gpu logic into cloudprovider though. cloudprovider is meant to be a client for communicating with underlying platform. FilterOutNodesWithUnreadyGpus is a hack, but it's also a complex part of CA logic based on knowledge of internal implementation (it's abusing how clusterstate is implemented to hack around the driver installation problem). Moving this kind of logic to cloudprovider means we have to start treating a lot of internal CA logic as a quasi-API, as we do with cloudprovider today.

This kind of complex, provider-specific logic is why we have Processors. If we think we really need custom GPU-hacks I'd suggest moving StaticAutoscaler.obtainNodeLists() into a new processor and implementing custom logic there. (If someone were to actually do this - I'm really interested to learn how you're planning to deal with the problem - it took us a while to come up with the current hack).

@Jeffwan (Member, Author) commented Jan 28, 2019

Thanks. I will make the change to support ErrNotImplemented. Moving the utils/gpu logic into cloudprovider will not be considered in this PR's scope unless we get concrete requirements.

@mwielgus (Contributor) commented Mar 1, 2019

@Jeffwan @MaciekPytel @aleksandra-malinowska - Where are we with this PR? Who is waiting for who :)

@MaciekPytel (Contributor)

Sorry for leaving this hanging for so long. I don't think all outstanding comments have been addressed yet. In particular, we should decide whether we want to keep the gke-accelerator label working on cloud providers that have their own label, to avoid breaking people who are already using it as a workaround. The current version will work fine for people on EKS, but it may break people running self-hosted clusters on AWS.
Alternatively, we can leave this as is and put it in the 'Action required' part of the Kubernetes release notes for the release it lands in (1.14 if we merge this in time).

@Jeffwan (Member, Author) commented Mar 8, 2019

> To keep compatibility for users who add gke label manually we can just handle cloudprovider returning ErrNotImplemented by defaulting to gke label (it's ugly, but it keeps current behavior while still being explicit about what is officially supported by given cloudprovider and what is not).

@MaciekPytel I sent you a message on Slack to ask about this. I'm not sure how to return ErrNotImplemented while defaulting to the GKE label. You mean for cloud providers that don't have this label, we return something like the snippet below? The logic calling these two functions assumes they always return a value; how should we use ErrNotImplemented there?

    func (aws *awsCloudProvider) Name() (string, errors.AutoscalerError) {
        return "cloud.google.com/gke-accelerator", cloudprovider.ErrNotImplemented
    }

@Jeffwan Jeffwan force-pushed the gpu_to_cloudprovider branch from 3a9cf87 to c8afd66 Compare March 9, 2019 00:42
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 9, 2019
@Jeffwan Jeffwan force-pushed the gpu_to_cloudprovider branch 2 times, most recently from 6c042eb to 9a5a9ef Compare March 9, 2019 01:03
@feiskyer (Member)

There's no label for GPU type on Azure yet; does this change still support the default device-plugin based GPU scaling?

@Jeffwan (Member, Author) commented Mar 11, 2019

> There's no label for GPU type on Azure yet, does this change still supports the default device-plugin based GPU scaling?

Yes, it won't change the default behavior. This only matters for cloud providers that have their own label; otherwise the default label is still used.

@ringtail (Contributor)

> There's no label for GPU type on Azure yet, does this change still supports the default device-plugin based GPU scaling?
>
> Yes. It won't change default behavior, just for cloud provider which has its own label, otherwise, default label is still being used.

So, what's the status of this PR? Could we follow your interface to implement ours?

@MaciekPytel (Contributor)

> I am not sure how to returning ErrNotImplemented by defaulting to gke label? You mean for cloud providers doesn't have this label, we return something like this? The logic calling these two functions assume they always return value back, how should we use ErrNotImplemented there?

If we wanted to keep backward compatibility, we'd need to respect the GKE label even if a cloud provider defines a different one (i.e. AWS would respect both the EKS and GKE labels). There are already people running on AWS who depend on the GKE label (e.g. #1659). I'd imagine that would be implemented in the code calling the cloudprovider method (e.g. in NodeHasGpu, etc.).
That being said, I'm not sure how many people actually use it; maybe we can just handle it with an "action required" release note?
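The dual-label compatibility described here could look roughly like this sketch. The helper name echoes NodeHasGpu but is not the actual CA code, and the EKS label string is an assumption used only for the example:

```go
package main

import "fmt"

const gkeGPULabel = "cloud.google.com/gke-accelerator"

// nodeHasGPULabel sketches how a NodeHasGpu-style helper could accept
// both the provider-specific label and the legacy GKE label, so users
// relying on the GKE label as a workaround are not broken.
func nodeHasGPULabel(nodeLabels map[string]string, providerLabel string) bool {
	if _, ok := nodeLabels[providerLabel]; ok {
		return true
	}
	_, ok := nodeLabels[gkeGPULabel] // legacy fallback for compatibility
	return ok
}

func main() {
	// "k8s.amazonaws.com/accelerator" is an assumed EKS-style label.
	eksNode := map[string]string{"k8s.amazonaws.com/accelerator": "nvidia-tesla-v100"}
	legacyNode := map[string]string{gkeGPULabel: "nvidia-tesla-k80"}
	fmt.Println(nodeHasGPULabel(eksNode, "k8s.amazonaws.com/accelerator"))    // true
	fmt.Println(nodeHasGPULabel(legacyNode, "k8s.amazonaws.com/accelerator")) // true
}
```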

@ringtail Is the current interface (GPULabel() + GetAvailableGPUTypes()) sufficient for you?

@ringtail (Contributor)

> @MaciekPytel @Jeffwan @ringtail - Where are we with this PR? Who is waiting for what/whom?

@mwielgus I think I am waiting for this PR to merge so that I can implement it for my provider. I don't know what this PR is waiting for.

@Jeffwan Jeffwan force-pushed the gpu_to_cloudprovider branch from 9a5a9ef to c9261da Compare March 22, 2019 17:08
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 22, 2019
@Jeffwan (Member, Author) commented Mar 22, 2019

Rebased on the upstream changes.

@Jeffwan Jeffwan force-pushed the gpu_to_cloudprovider branch from c9261da to 86ce312 Compare March 22, 2019 20:25
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 22, 2019
@MaciekPytel (Contributor)

/approve
/lgtm
Sorry to keep it waiting so long.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 25, 2019
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MaciekPytel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 25, 2019
@Jeffwan Jeffwan force-pushed the gpu_to_cloudprovider branch from 86ce312 to 9066688 Compare March 25, 2019 20:12
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 25, 2019
@Jeffwan (Member, Author) commented Mar 25, 2019

No problem. I rebased on the upstream changes and need another /lgtm.
@MaciekPytel @mwielgus

@Jeffwan (Member, Author) commented Mar 25, 2019

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 25, 2019
@mwielgus (Contributor) left a comment

/lgtm

Labels: approved · cncf-cla: yes · lgtm · size/XL
9 participants