Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add gRPC expander plugin #4452

Merged
merged 2 commits into from
Feb 21, 2022

Conversation

evansheng
Copy link
Contributor

Adds a new expander grpc, implementing the gRPC plugin outlined in: #4134

This PR adds a gRPC client as an expander in CA, and calls out to a gRPC server with config specified by two new flags --grpcExpanderCert and --grpcExpanderUrl. Should initialization of the client fail, or any returned response be invalid or malformed, the expander fails open, and doesn't filter any Options.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 9, 2021
@evansheng
Copy link
Contributor Author

I've built this and deployed it locally, alongside an internal gRPC server - verified that this is working e2e successfully. I'm able to return options successfully via bestOptions.
I've also successfully tested failure modes in which the gRPC server doesn't exist, we fall back to appropriate specified expanders in the list.

@@ -48,6 +49,7 @@ type AutoscalerOptions struct {
EstimatorBuilder estimator.EstimatorBuilder
Processors *ca_processors.AutoscalingProcessors
Backoff backoff.Backoff
GRPCOpts grpcplugin.GRPCOpts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is needed here. It's not used by anything except the grpc plugin and I think it's just leaking implementation details of this specific expander.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -101,9 +103,12 @@ func initializeDefaultOptions(opts *AutoscalerOptions) error {
if opts.CloudProvider == nil {
opts.CloudProvider = cloudBuilder.NewCloudProvider(opts.AutoscalingOptions)
}
if strings.Contains(opts.ExpanderNames, expander.GRPCExpanderName) && opts.GRPCExpanderURL != "" && opts.GRPCExpanderCert != "" {
opts.GRPCOpts = grpcplugin.GRPCOpts{opts.GRPCExpanderCert, opts.GRPCExpanderURL}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the comment above - I don't think code in core/ should contain any logic specific to a particular expander. I'd much rather pass AutoscalingOptions to expander factory (so it has access to all flags) or just define expander-specific flags directly in the expander.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opted for passing the grpc expander flags as two more args, instead of either creating an import cycle (core & expander factory), or moving the AutoscalerOptions struct to config

@@ -25,6 +25,8 @@ require (
golang.org/x/crypto v0.0.0-20210513164829-c07d793c2f9a
golang.org/x/oauth2 v0.0.0-20210514164344-f6687ab2804c
google.golang.org/api v0.46.0
google.golang.org/grpc v1.38.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kubernetes already uses grpc in a different version: https://github.com/kubernetes/kubernetes/blob/master/go.mod#L102, which makes it a transitive dependency of CA. In my experience is that different grpc versions don't always play well together. Right now the difference in version number is pretty small, but is there some way in which we can guarantee the same grpc version is used by CA and kubernetes going forward?

Copy link
Contributor Author

@evansheng evansheng Nov 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"

"google.golang.org/grpc"
"google.golang.org/grpc/credentials"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a big dependency to push on anyone forking CA (which is how a lot of cloudproviders work), especially given that they're quite likely to already use grpc, possibly in some other way. Can we try and make it easy to get rid of this dependency by tightly controlling where we import this expander from (ideally try and avoid importing it directly anywhere outside of expander/)?

Copy link
Contributor Author

@evansheng evansheng Nov 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call - removing the grpcOpts struct and passing flags directly to expander factory directly fixes this

@evansheng
Copy link
Contributor Author

added the autogenerated protobuf code to exclusion from golint

@evansheng
Copy link
Contributor Author

/assign @aleksandra-malinowska

@evansheng
Copy link
Contributor Author

@MaciekPytel I believe this PR is ready for another round of review when you get the chance

@evansheng evansheng force-pushed the es--grpc-expander-plugin branch from 269d9d0 to d821603 Compare November 30, 2021 22:00
Copy link
Contributor

@elmiko elmiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @evansheng , i'm still reviewing this code, but one thing i am wondering about is documenting how to use this expander. is the enhancement meant to be the main documentation or have you considered adding detailed usage instructions to one of our READMEs?

i would like to try this out if possible, but i am still understanding how to setup the grpc expander. perhaps a small "fake" expander would be nice to demonstrate how a user could build one. or is the mock expander meant to be an example?

@evansheng
Copy link
Contributor Author

@elmiko added a README on how to use this new expander, and added some starter code
PTAL when you get the change

Copy link
Contributor

@elmiko elmiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @evansheng , that readme and example are exactly what i was looking for. i'd like to test drive this a little more, but i'm +1 for merging.

@evansheng
Copy link
Contributor Author

@MaciekPytel friendly ping for another review on this pr

message BestOptionsRequest {
repeated Option options = 1;
// key is node id from options
map<string, k8s.io.api.core.v1.Node> nodeInfoMap = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So over gRPC nodeInfoMap is really a Node map: is that required to save rpc calls throughput and simplify the proto?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a very good point. If it's a node map we should name it nodeMap (or maybe templateNodeMap?) and not nodeInfoMap. NodeInfo is a well defined concept that represent a Node + pods running on that node.

The limitation of passing just the nodes is that any information about daemonsets and static pods is no longer available to expander. It's probably not a blocker (most expander implementations can probably live with that), but it's probably worth calling out in the README.

Alternatively you could consider actually making it an actual nodeInfoMap by including both the node and pods running on it in the value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

called out in readme, and renamed variable to be more appropriate

@@ -67,6 +70,9 @@ func ExpanderStrategyFromStrings(expanderFlags []string, cloudProvider cloudprov
stopChannel := make(chan struct{})
lister := kubernetes.NewConfigMapListerForNamespace(kubeClient, stopChannel, configNamespace)
filters = append(filters, priority.NewFilter(lister.ConfigMaps(configNamespace), autoscalingKubeClients.Recorder))
case expander.GRPCExpanderName:
klog.V(1).Info("GRPC expander chosen")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: that feels somewhat redundant, we already log all flag values by default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// copy the protos/expander.pb.go to your other application's repo, so it has access to the protobuf definitions

// Serve should be called by the main() function in main.go of the Expander Server repo to start serving
func Serve(certPath string, keyPath string, port uint) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to move this file to a separate package (grpcplugin/example?) to make it clear it's an example implementation and not part of CA integration.

Also this is already really helpful, but if you already have all the logic maybe also consider adding the main function and a simple Makefile so it's a working example?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

including a simple main function but leaving out the makefile, as it is up to the user how they would like to structure their project repo

message BestOptionsRequest {
repeated Option options = 1;
// key is node id from options
map<string, k8s.io.api.core.v1.Node> nodeInfoMap = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a very good point. If it's a node map we should name it nodeMap (or maybe templateNodeMap?) and not nodeInfoMap. NodeInfo is a well defined concept that represent a Node + pods running on that node.

The limitation of passing just the nodes is that any information about daemonsets and static pods is no longer available to expander. It's probably not a blocker (most expander implementations can probably live with that), but it's probably worth calling out in the README.

Alternatively you could consider actually making it an actual nodeInfoMap by including both the node and pods running on it in the value.


package grpcplugin;
import "k8s.io/api/core/v1/generated.proto";
//import "google/protobuf/struct.proto";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove if unused.


// if no Cert file specified, use insecure
if expanderCert == "" {
dialOpt = grpc.WithInsecure()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like an unnecessary risk. Kubernetes itself moved away from allowing insecure connections and I think we also shouldn't allow them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed this option


// call gRPC server to get BestOption
klog.V(2).Info("GPRC call of best options to server with ", len(nodeGroupIDOptionMap), " options")
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider making 5s a const

}
}

// populateNodeInfoForGRPC modifies the nodeInfo object, and replaces it with the v1.Node to pass through grpc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That comment is misleading, original nodeInfos are not modified (and they must not be!).

Copy link
Contributor Author

@evansheng evansheng Feb 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rewrote comment so it is more accurate

for _, option := range bestOptionsResponseOptions {
if option == nil {
klog.Errorf("gRPC server returned nil Option")
return nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't continue make more sense here? Other options may not be nil.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, fixed

options = append(options, nodeGroupIDOptionMap[option.NodeGroupId])
} else {
klog.Errorf("gRPC server returned invalid nodeGroup ID: ", option.NodeGroupId)
return nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, continue seems more natural.

}
)

func TestPopulateOptionsForGrpc(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider a table-based test to cover various inputs, including edge cases (ex. empty list of options).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added some tests, as requested

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 29, 2022
@evansheng evansheng force-pushed the es--grpc-expander-plugin branch from 4a607d9 to 9855a94 Compare February 2, 2022 23:55
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 2, 2022
@evansheng
Copy link
Contributor Author

evansheng commented Feb 3, 2022

@MaciekPytel addressed your comments, PTAL! sorry for the delay

@elmiko
Copy link
Contributor

elmiko commented Feb 14, 2022

we talked about this pr at the sig meeting today, i think things are mostly looking good, but there was a suggestion from the sig that these commits should be squashed before merging. ideally to the minimum number of logical, atomic, commits as possible.

@evansheng evansheng force-pushed the es--grpc-expander-plugin branch from 9855a94 to a2b24e0 Compare February 16, 2022 20:36
@evansheng
Copy link
Contributor Author

@elmiko @MaciekPytel - addressed all comments, and squashed into two commits.
I think this PR is ready for another round of review!

Copy link
Contributor

@mwielgus mwielgus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 21, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: evansheng, mwielgus

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 21, 2022
@k8s-ci-robot k8s-ci-robot merged commit 0123869 into kubernetes:master Feb 21, 2022
evansheng pushed a commit to airbnb/autoscaler that referenced this pull request Mar 24, 2022
jiancheung pushed a commit to airbnb/autoscaler that referenced this pull request Jul 29, 2022
akirillov pushed a commit to airbnb/autoscaler that referenced this pull request Oct 27, 2022
akirillov pushed a commit to airbnb/autoscaler that referenced this pull request Nov 2, 2022
bcostabatista pushed a commit to airbnb/autoscaler that referenced this pull request Nov 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants