Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

doc: proposal custom cloud provider over gRPC #3140

Merged

Conversation

hectorj2f
Copy link
Contributor

As discussed over Slack with @MaciekPytel, I share a proposal to make CA more plugable and allow calling custom cloud providers without having to fork the CA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 17, 2020
@k8s-ci-robot k8s-ci-robot requested review from feiskyer and Jeffwan May 17, 2020 10:51
@hectorj2f
Copy link
Contributor Author

/assign @MaciekPytel

Copy link
Contributor

@MaciekPytel MaciekPytel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea is sound, but this design is missing a lot of details (most importantly all the stuff related to NodeGroups).

cluster-autoscaler/proposals/plugable-provider-grpc.md Outdated Show resolved Hide resolved
cluster-autoscaler/proposals/plugable-provider-grpc.md Outdated Show resolved Hide resolved
cluster-autoscaler/proposals/plugable-provider-grpc.md Outdated Show resolved Hide resolved
Copy link
Contributor

@elmiko elmiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall i think it's a nice idea, i added a few of my nits.

i also agree with @MaciekPytel's comments.

@bhargav-naik
Copy link

@hectorj2f

I went through your proposal and had a query:

  • How can ExternalCloudProvider plugin nodeInfoComparator? (a very common requirement for cloudProvider implementation)

@bhargav-naik
Copy link

Some more points:

  • Is it a design goal to give flexibility to let cloudProvider to use any language/framework of choice for their implementation?
  • If someone wants to override the autoscaling_processor, they will still have to maintain their fork and add override the processor.
  • Why do we want to go for a solve which will increase runtime complexity (need of separate process for CloudProvider and have a gRPC communication between CA and CloudProvider process) instead of making changes in compile time workflow ?

@hardikdr
Copy link
Member

We would also need to define the contract between gRPC client and server, basically the messages, error-codes, and so on. Wondering if that should be part of this proposal as well?

  • Also, there could be a scope of learning a little from here the csi-model, around using error-schemes, secret management, and so on, only if relatable.

@hectorj2f
Copy link
Contributor Author

@bhargav-naik

Is it a design goal to give flexibility to let cloudProvider to use any language/framework of choice for their implementation?

I think a bit of common sense would help here, golang is the language commonly used in Kubernetes. If users want to code on a different language that would be up to them. Besides of that, this is not a requirement neither a motivation for this proposal.

If someone wants to override the autoscaling_processor, they will still have to maintain their fork and add override the processor.

This external provider could be extended to satisfy that purpose. It should not be considered as the initial requirement in my opinion, (solving a problem at the time). We could iterate over it and leave the design open to expose the processors per provider in the future.

Why do we want to go for a solve which will increase runtime complexity (need of separate process for CloudProvider and have a gRPC communication between CA and CloudProvider process) instead of making changes in compile time workflow ?

I have to argue about this. The runtime complexity would depend on where this external provider would seat. I believe the purpose is being flexible and dynamic here. gRPC doesn't offer a bit penalty in terms of computation.

Regarding #3127, I think the development and maintenance cost of moving out all the cloud providers is really high. This proposal tries to add a new external provider without re-designing the whole CA. The impact is minimal and everyone could benefit from it.

@bhargav-naik
Copy link

Inline replies:
I think a bit of common sense would help here, golang is the language commonly used in Kubernetes. If users want to code on a different language that would be up to them. Besides of that, this is not a requirement neither a motivation for this proposal.

  • In absence of this requirement, what is the motivation to decouple CloudProvider using gRPC.

This external provider could be extended to satisfy that purpose. It should not be considered as the initial requirement in my opinion, (solving a problem at the time). We could iterate over it and leave the design open to expose the processors per provider in the future.

I have to argue about this. The runtime complexity would depend on where this external provider would seat. I believe the purpose is being flexible and dynamic here. gRPC doesn't offer a bit penalty in terms of computation.

  • Now user will have to solve on how to run 1 extra process and deal with issues/debugging of distributed process. If the cloudProvider is able to plugin it's implementation in the same runtime wouldn't it be simpler.

@hectorj2f
Copy link
Contributor Author

hectorj2f commented May 21, 2020

I called out this as it was called out as a requirement in #3127. Also it came out as requirement primarily for nodeInfoComparator.

Does #3127 cover this scenario ?

Now user will have to solve on how to run 1 extra process and deal with issues/debugging of distributed process. If the cloudProvider is able to plugin it's implementation in the same runtime wouldn't it be simpler.

You are making deployment assumptions that will depend on how and where the grpc server leaves. I suggest you have a look at grafana go plugins.

If the cloudProvider is able to plugin it's implementation in the same runtime wouldn't it be simpler.

Honestly, I didn't like your initial approach because was unrealistic and hard to maintain in a longer term. However you just made some changes that might ease this integration, that makes me like it more now. There are many questions to be answered there but there is some progress.

@bhargav-naik
Copy link

bhargav-naik commented May 22, 2020

Inline replies:
Does #3127 cover this scenario ?

  • I have updated the current proposal to enable injection of autoscaling_processor in a similar manner handled in the original proposal. Please have a look at it and provide inputs.

Honestly, I didn't like your initial approach because was unrealistic and hard to maintain in a longer term.

  • The motivation for earlier approach was to decouple the Cloudprovider completely from the core (not having client libraries even in a vendored form). Though I understand the apprehension on such a big change as it radically changes the dev/build/release cycles. I leave that call with the project maintainers. Though I believe that in the long term it will be benificial as core, cloudprovider modules can have their own independent release per CloudProvider. End user can just use the required go module (based on their CP) and not have to maitain their separate builds. They will not get any unnecessary dependencies (even in vendored form).

@mwielgus
Copy link
Contributor

What is the status of this PR? Did you address the comments?

@hectorj2f
Copy link
Contributor Author

@mwielgus I didn't address them yet. I'd try to address them during this week.

@Bessonov
Copy link

@hectorj2f great stuff, keep going! It can resolve a bunch of use cases: #3644

@hectorj2f hectorj2f force-pushed the hectorj2f/cloudprovider_grpc branch from 731269c to 469984e Compare December 2, 2020 18:48
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 2, 2020
@hectorj2f hectorj2f force-pushed the hectorj2f/cloudprovider_grpc branch from c82924c to ec51d46 Compare December 2, 2020 18:53
@hectorj2f
Copy link
Contributor Author

@MaciekPytel @Bessonov @hardikdr @elmiko I finally managed to get some time to invest on this KEP to address the last comments. Let me know if you want me to specify the error codes as well.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 10, 2020
cluster-autoscaler/proposals/plugable-provider-grpc.md Outdated Show resolved Hide resolved
cluster-autoscaler/proposals/plugable-provider-grpc.md Outdated Show resolved Hide resolved
cluster-autoscaler/proposals/plugable-provider-grpc.md Outdated Show resolved Hide resolved
cluster-autoscaler/proposals/plugable-provider-grpc.md Outdated Show resolved Hide resolved
cluster-autoscaler/proposals/plugable-provider-grpc.md Outdated Show resolved Hide resolved
cluster-autoscaler/proposals/plugable-provider-grpc.md Outdated Show resolved Hide resolved
cluster-autoscaler/proposals/plugable-provider-grpc.md Outdated Show resolved Hide resolved
cluster-autoscaler/proposals/plugable-provider-grpc.md Outdated Show resolved Hide resolved
cluster-autoscaler/proposals/plugable-provider-grpc.md Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 2, 2021
@hectorj2f hectorj2f requested a review from MaciekPytel January 2, 2021 12:53
@hectorj2f hectorj2f force-pushed the hectorj2f/cloudprovider_grpc branch from 474f987 to a0f94ff Compare January 5, 2021 11:28
Hector Fernandez and others added 4 commits January 8, 2021 11:06
Signed-off-by: Hector Fernandez <hfernandez@mesosphere.com>
Signed-off-by: Hector Fernandez <hfernandez@mesosphere.com>
@hectorj2f hectorj2f force-pushed the hectorj2f/cloudprovider_grpc branch 2 times, most recently from 0beaa93 to c4c4de2 Compare January 8, 2021 10:12
@hectorj2f
Copy link
Contributor Author

It seems i hit the docker request limit in my travis build, toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading.

@ricoleabricot
Copy link
Contributor

Hi @hectorj2f 👋
I really like the idea, I just implemented a cloudprovider for OVHcloud so I know the pain behind copy-pasting the go sdk for building the provider.
Also, I currently work on gRPC eco-system so, if you need any help, we should definitively check this out together !

Signed-off-by: Hector Fernandez <hectorj@gmail.com>
@hectorj2f hectorj2f force-pushed the hectorj2f/cloudprovider_grpc branch from c4c4de2 to 68c9844 Compare January 11, 2021 19:51
@hectorj2f
Copy link
Contributor Author

@MaciekPytel I addressed the last minor comments.

Copy link
Contributor

@MaciekPytel MaciekPytel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

Left some minor comments, but I don't think those are blocking for merging this.

### Refresh

Refresh is called before every main loop and can be used to dynamically update cloud provider state.
In particular the list of node groups returned by NodeGroups can change as a result of this action.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: To clarify - results of other calls (such as NodeGroups(), NodeGroup.MinSize(), etc) must not change unless Refresh is called. We should document that explicitly.

int32 minSize = 2;

// MaxSize of the node group on the cloud provider
int32 maxSize = 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we should actually include min, max and debug. min/max can change (after Refresh() call), while id is the consistent group identifier. For something like NodeGroupForNode it seems more consistent to return permanent id and not some settings that can change later on.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 25, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko, hectorj2f, Kafei59, MaciekPytel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 25, 2021
@k8s-ci-robot k8s-ci-robot merged commit 9cffba5 into kubernetes:master Jan 25, 2021
@MaciekPytel
Copy link
Contributor

One more comment - in the meantime #3789 extended the cloudprovider API with another optional call. It should trivially convert into grpc (struct with a field for each option in NodeGroupAutoscalingOptions) or alternatively it can be skipped in initial version.

@hectorj2f
Copy link
Contributor Author

@MaciekPytel Good. I'll update the proposal to include that new api call, and address the minor comments.

@hectorj2f hectorj2f deleted the hectorj2f/cloudprovider_grpc branch January 25, 2021 17:08
@migueleliasweb
Copy link

Hi all, is there any work being done to implement this provider? If not, would a PR be welcomed?

@hectorj2f
Copy link
Contributor Author

Hi @migueleliasweb,
Unfortunately there is not any on-going work. Our company switched gears to focus on the CAPI autoscaler as a solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.