
Move GPULabel and GPUTypes to cloud provider #1584

Merged · 1 commit · Mar 26, 2019

Conversation

@Jeffwan (Member) commented Jan 16, 2019

This PR is to address #1135

  1. Add GPULabel() and GetAvailableGPUTypes() to the cloudprovider interface.
  2. Callers pass GPULabel to the methods in gpu.go that originally relied on it.
  3. Pass cloud.CloudProvider to the methods in scale_up.go and scale_down.go that don't use GPULabel directly.

Please don't merge this PR until I confirm with all cloud provider owners that they're OK with the labels and supported GPU types.
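For readers, here is a minimal sketch of the two methods being added, as described in the list above. The signatures are inferred from the discussion; the provider type and the GPU type names are purely illustrative, not the actual autoscaler code:

```go
package main

import "fmt"

// Sketch of the two methods this PR adds to the cloudprovider
// interface; the concrete signatures in the real code may differ.
type CloudProvider interface {
	// GPULabel returns the node label this provider uses to mark
	// accelerator nodes.
	GPULabel() string
	// GetAvailableGPUTypes returns the GPU types the provider supports.
	GetAvailableGPUTypes() map[string]struct{}
}

// Hypothetical provider implementation, for illustration only.
type gkeProvider struct{}

func (p *gkeProvider) GPULabel() string { return "cloud.google.com/gke-accelerator" }

func (p *gkeProvider) GetAvailableGPUTypes() map[string]struct{} {
	return map[string]struct{}{
		"nvidia-tesla-k80":  {},
		"nvidia-tesla-v100": {},
	}
}

func main() {
	var cp CloudProvider = &gkeProvider{}
	fmt.Println(cp.GPULabel()) // cloud.google.com/gke-accelerator
}
```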

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 16, 2019
@Jeffwan (Member, Author) commented Jan 16, 2019

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 16, 2019
@Jeffwan (Member, Author) commented Jan 16, 2019

/cc @aleksandra-malinowska @MaciekPytel @ringtail @feiskyer @hello2mao Please let me know if your cloud provider has a special label for accelerator nodes; we can add it in this PR. Otherwise, by default CA will use cloud.google.com/gke-accelerator.

@Jeffwan Jeffwan force-pushed the gpu_to_cloudprovider branch from bcf5b51 to 3a9cf87 Compare January 16, 2019 07:27
@ringtail (Contributor)

@Jeffwan in Alibaba Cloud the label is aliyun.accelerator/nvidia_name

@ringtail (Contributor)

@Jeffwan Thanks for shimming the code for Alibaba Cloud. Could we agree on the interface and then develop the implementation ourselves? Some of the logic may differ between providers.

@hello2mao (Contributor)

@Jeffwan no special GPU label for Baidu Cloud, thanks.

@losipiuk (Contributor) left a comment

Thanks. I think this is a good direction.

One general suggestion, though: if we add new methods to the cloudprovider interface, implementing them should be optional. After all, not all cloud providers necessarily support GPUs.
Also, maybe it would be better to add just one method that returns an object encapsulating all GPU-related information. It will be more natural to extend later on, e.g. if we need to know which GPU types are available in which zone.

So I would rather see the interface change as:

func GetGpuInfo() (GpuInfo, error)

where the error can be cloudprovider.ErrNotImplemented for cloud providers not supporting GPUs.

I am not sure if GpuInfo should be a struct or an interface. I lean towards the latter because it allows hiding implementation details, but I am open to arguments here.

Also one more nit: as we are changing this part of the code, we can also change availableGPUTypes to map[string]bool. The lookup value can then be used directly, and we don't have to unpack the second tuple value to check whether the key is in the map.
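To illustrate the map[string]bool nit above: a lookup on a missing key yields the zero value false, so the result can be used directly as a condition. A small sketch (variable and function names are illustrative):

```go
package main

import "fmt"

// With map[string]bool, a failed lookup returns the zero value (false),
// so the result can be used directly in a condition, without the
// two-value "comma ok" form needed for map[string]struct{}.
var availableGPUTypes = map[string]bool{
	"nvidia-tesla-k80":  true,
	"nvidia-tesla-v100": true,
}

func isSupportedGPU(gpuType string) bool {
	return availableGPUTypes[gpuType] // missing keys simply yield false
}

func main() {
	fmt.Println(isSupportedGPU("nvidia-tesla-k80")) // true
	fmt.Println(isSupportedGPU("amd-radeon"))       // false
}
```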

@aleksandra-malinowska (Contributor)

Thanks @Jeffwan! That seems like a good first step, at least for now, since it seems most cloud providers support GPUs via a special label. If it works for all of us, let's do it.

> Also maybe it would be better to add just one method returns object encapsulating all GPU related information. It will be more natural to extend if later on, e.g if we need some information in which zone which GPU types are available.

Eh, I'm not sure if it'll make any difference. Instead of modifying the CloudProvider interface, we'll have to modify this special struct (and all of its implementations). But I have no strong feelings about it either way.

> Also one more nit. As we are changing this part of code we can also change availableGPUTypes to be map[string]bool. The lookup value can then be used directly and we do not have to unpack the second tuple value to check if key is found in the map.

+1

@Jeffwan (Member, Author) commented Jan 17, 2019

@losipiuk @aleksandra-malinowska

> Where error can be cloudprovider.ErrNotImplemented for cloudproviders not supporting GPUs

Agree. My only concern is that gpu.go is a utility: whether or not the CloudProvider supports GPUs, the scaling logic now goes through its helper functions. And even when a cloud provider doesn't support it, I know some users label nodes and use it as a workaround.
Once we use an error to indicate GPU support, that flow changes. Using a default value is a compromise. If we want a struct or interface to capture all GPU info, an error return would be necessary.

> Also one more nit. As we are changing this part of code we can also change availableGPUTypes to be map[string]bool. The lookup value can then be used directly and we do not have to unpack the second tuple value to check if key is found in the map.

The reason for not using bool is that in the future CA may want more info about a specific GPU type, like GPU memory, to calculate utilization (if GPU sharing is supported later). This might be a good case for making GPUInfo more extensible.
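One hypothetical shape for the extensibility described above: mapping the GPU type name to a struct leaves room for fields like memory. Everything here (the GPUTypeInfo struct, its field, and the memory figures) is illustrative, not part of the PR:

```go
package main

import "fmt"

// GPUTypeInfo is a hypothetical per-type record; a struct value leaves
// room for future fields (e.g. memory for utilization calculations).
type GPUTypeInfo struct {
	MemoryMiB int64 // assumed field, for illustration only
}

// Illustrative registry; the memory values are rough, not authoritative.
var availableGPUTypes = map[string]GPUTypeInfo{
	"nvidia-tesla-k80":  {MemoryMiB: 11441},
	"nvidia-tesla-v100": {MemoryMiB: 16160},
}

func main() {
	info, ok := availableGPUTypes["nvidia-tesla-v100"]
	fmt.Println(ok, info.MemoryMiB) // true 16160
}
```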

@Jeffwan (Member, Author) commented Jan 17, 2019

@ringtail If we'd like to go a step further, I think we can remove gpu.go and move everything into the cloud providers, or keep some interface in the cloud provider like GPUInfo() and reference it in the main logic (scale_up and scale_down). Could you give some examples or use cases?

@MaciekPytel (Contributor)

The current approach of returning a label or simple GPU info is perfectly fine, though I agree with @losipiuk that there should be a more explicit way to say 'not implemented' than returning the GKE label.

Maybe the way to proceed is to allow ErrNotImplemented, to make it explicit that GPUs are not supported by a given cloud provider, and let cloud provider owners do their own implementation. To keep compatibility for users who add the GKE label manually, we can handle a cloud provider returning ErrNotImplemented by defaulting to the GKE label (it's ugly, but it keeps the current behavior while still being explicit about what is officially supported by a given cloud provider and what is not).
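The caller-side fallback described here could look roughly like the sketch below. The sentinel error and the provider type are hypothetical stand-ins for how the real code might spell cloudprovider.ErrNotImplemented:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel mirroring cloudprovider.ErrNotImplemented.
var ErrNotImplemented = errors.New("not implemented")

const defaultGPULabel = "cloud.google.com/gke-accelerator"

// Hypothetical provider that has not implemented GPULabel.
type noGPUProvider struct{}

func (p *noGPUProvider) GPULabel() (string, error) {
	return "", ErrNotImplemented
}

// gpuLabelFor sketches the fallback: if the provider reports
// ErrNotImplemented, default to the GKE label to preserve the
// current behavior for users who rely on it.
func gpuLabelFor(p interface{ GPULabel() (string, error) }) string {
	label, err := p.GPULabel()
	if errors.Is(err, ErrNotImplemented) {
		return defaultGPULabel
	}
	return label
}

func main() {
	fmt.Println(gpuLabelFor(&noGPUProvider{})) // cloud.google.com/gke-accelerator
}
```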


I'm against moving utils/gpu logic into cloudprovider though. cloudprovider is meant to be a client for communicating with underlying platform. FilterOutNodesWithUnreadyGpus is a hack, but it's also a complex part of CA logic based on knowledge of internal implementation (it's abusing how clusterstate is implemented to hack around the driver installation problem). Moving this kind of logic to cloudprovider means we have to start treating a lot of internal CA logic as a quasi-API, as we do with cloudprovider today.

This kind of complex, provider-specific logic is why we have Processors. If we think we really need custom GPU-hacks I'd suggest moving StaticAutoscaler.obtainNodeLists() into a new processor and implementing custom logic there. (If someone were to actually do this - I'm really interested to learn how you're planning to deal with the problem - it took us a while to come up with the current hack).

@Jeffwan (Member, Author) commented Jan 28, 2019

Thanks. I will make the change to support ErrNotImplemented. Moving the utils/gpu logic into cloudprovider will not be considered in this PR's scope unless we get concrete requirements.

@mwielgus (Contributor) commented Mar 1, 2019

@Jeffwan @MaciekPytel @aleksandra-malinowska - Where are we with this PR? Who is waiting for who :)

@MaciekPytel (Contributor)

Sorry for leaving this hanging for so long. I don't think all outstanding comments have been addressed yet. In particular, we should decide whether we want to keep the gke-accelerator label working on cloud providers that have their own label, to avoid breaking people who are already using it as a workaround. The current version will work fine for people on EKS, but it may break people running self-hosted clusters on AWS.
Alternatively, we can leave this as is and put it in the 'Action required' part of the Kubernetes release notes for the release it lands in (1.14 if we merge this in time).

@Jeffwan (Member, Author) commented Mar 8, 2019

> To keep compatibility for users who add gke label manually we can just handle cloudprovider returning ErrNotImplemented by defaulting to gke label (it's ugly, but it keeps current behavior while still being explicit about what is officially supported by given cloudprovider and what is not).

@MaciekPytel I sent you a message on Slack to ask about this. I'm not sure how to return ErrNotImplemented while defaulting to the GKE label. You mean for cloud providers that don't have this label, we return something like the snippet below? The logic calling these two functions assumes they always return a value; how should we use ErrNotImplemented there?

    func (aws *awsCloudProvider) Name() (string, errors.AutoscalerError) {
        return "cloud.google.com/gke-accelerator", cloudprovider.ErrNotImplemented
    }

@Jeffwan Jeffwan force-pushed the gpu_to_cloudprovider branch from 3a9cf87 to c8afd66 Compare March 9, 2019 00:42
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 9, 2019
@Jeffwan Jeffwan force-pushed the gpu_to_cloudprovider branch 2 times, most recently from 6c042eb to 9a5a9ef Compare March 9, 2019 01:03
@feiskyer (Member)

There's no label for GPU type on Azure yet; does this change still support the default device-plugin based GPU scaling?

@Jeffwan (Member, Author) commented Mar 11, 2019

> There's no label for GPU type on Azure yet, does this change still supports the default device-plugin based GPU scaling?

Yes, it won't change the default behavior. This only matters for cloud providers that have their own label; otherwise the default label is still used.

@ringtail (Contributor)

> There's no label for GPU type on Azure yet, does this change still supports the default device-plugin based GPU scaling?
>
> Yes. It won't change default behavior, just for cloud provider which has its own label, otherwise, default label is still being used.

So, what's the status of this PR? Could we follow your interface to implement ours?

@MaciekPytel (Contributor)

> I am not sure how to returning ErrNotImplemented by defaulting to gke label? You mean for cloud providers doesn't have this label, we return something like this? The logic calling these two functions assume they always return value back, how should we use ErrNotImplemented there?

If we wanted to keep backward compatibility, we'd need to respect the GKE label even if a cloud provider defines a different one (i.e. AWS would respect both the EKS and GKE labels). There are already people running on AWS who depend on the GKE label (e.g. #1659). I'd imagine that would be implemented in the code calling the cloudprovider method (e.g. in NodeHasGpu, etc.).
That being said, I'm not sure how many people actually use it; maybe we can just handle it with an "action required" release note?
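The dual-label compatibility described here could look roughly like this sketch. The helper name echoes NodeHasGpu but is not the actual CA code, and the EKS label string is an assumption used only for the example:

```go
package main

import "fmt"

const gkeGPULabel = "cloud.google.com/gke-accelerator"

// nodeHasGPULabel sketches how a NodeHasGpu-style helper could accept
// both the provider-specific label and the legacy GKE label, so users
// relying on the GKE label as a workaround are not broken.
func nodeHasGPULabel(nodeLabels map[string]string, providerLabel string) bool {
	if _, ok := nodeLabels[providerLabel]; ok {
		return true
	}
	_, ok := nodeLabels[gkeGPULabel] // legacy fallback for compatibility
	return ok
}

func main() {
	// "k8s.amazonaws.com/accelerator" is an assumed EKS-style label.
	eksNode := map[string]string{"k8s.amazonaws.com/accelerator": "nvidia-tesla-v100"}
	legacyNode := map[string]string{gkeGPULabel: "nvidia-tesla-k80"}
	fmt.Println(nodeHasGPULabel(eksNode, "k8s.amazonaws.com/accelerator"))    // true
	fmt.Println(nodeHasGPULabel(legacyNode, "k8s.amazonaws.com/accelerator")) // true
}
```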

@ringtail Is the current interface (GPULabel() + GetAvailableGPUTypes()) sufficient for you?

@ringtail (Contributor)

> @MaciekPytel @Jeffwan @ringtail - Where are we with this PR? Who is waiting for what/whom?

@mwielgus I think I am waiting for this PR to merge so that I can implement it for my provider. I don't know what this PR is waiting for.

@Jeffwan Jeffwan force-pushed the gpu_to_cloudprovider branch from 9a5a9ef to c9261da Compare March 22, 2019 17:08
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 22, 2019
@Jeffwan (Member, Author) commented Mar 22, 2019

Rebased on the upstream changes.

@Jeffwan Jeffwan force-pushed the gpu_to_cloudprovider branch from c9261da to 86ce312 Compare March 22, 2019 20:25
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 22, 2019
@MaciekPytel (Contributor)

/approve
/lgtm
Sorry to keep it waiting so long.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 25, 2019
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MaciekPytel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 25, 2019
@Jeffwan Jeffwan force-pushed the gpu_to_cloudprovider branch from 86ce312 to 9066688 Compare March 25, 2019 20:12
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 25, 2019
@Jeffwan (Member, Author) commented Mar 25, 2019

No problem. I rebased on the upstream changes and need another /lgtm.
@MaciekPytel @mwielgus

@Jeffwan (Member, Author) commented Mar 25, 2019

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 25, 2019
@mwielgus (Contributor) left a comment

/lgtm

Labels: approved · cncf-cla: yes · lgtm · size/XL
9 participants