
Feature: parallel scale up #5731

Merged

merged 2 commits into kubernetes:master on May 11, 2023

Conversation

Contributor

@pmendelski pmendelski commented May 5, 2023

What type of PR is this?

/kind feature
/kind cleanup

What this PR does / why we need it:

This PR introduces an optional flag, --parallel-scale-up=true/false, that makes the scale-up process run in parallel. With this change, scale-up across multiple node groups is faster.

The default behavior is unchanged - node groups are scaled synchronously, one by one.

Special notes for your reviewer:

  • Why is it a runtime flag and not the default behavior? Because some CloudProviders are not thread-safe. It may become the default in the future.
  • Why is there no new/dedicated error to combine concurrent errors? I wanted to keep the API simple without introducing big changes. Concurrent errors are preserved in the final error message, and each of them is logged separately, so no information is lost.
  • Why are there so many changes in test files? Scale-up tests were designed to test the scale-up process on a single node group. I had to open up the test mechanism to cover scale-ups across multiple node groups simultaneously.

Does this PR introduce a user-facing change?

A new "parallel-scale-up" flag, that makes the scale-up process run parallel across multiple node groups. The default behavior is unchanged, node groups are scaled synchronously, one by one. Before using the flag make sure that the cloud provider you're using is thread-safe and ready for concurrent actions. 

/assign @x13n

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels May 5, 2023
@linux-foundation-easycla

linux-foundation-easycla bot commented May 5, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot
Contributor

Welcome @pmendelski!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 5, 2023
@k8s-ci-robot k8s-ci-robot requested a review from BigDarkClown May 5, 2023 23:11
@k8s-ci-robot k8s-ci-robot requested a review from x13n May 5, 2023 23:11
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels May 5, 2023
@pmendelski pmendelski force-pushed the feature-parallel-scale-up branch 2 times, most recently from c4e76d5 to 9342963 on May 5, 2023 23:43
@pmendelski
Contributor Author

If you're reviewing this PR, please take a look at this test. I wrote it to learn more about the autoscaler, but the result seems a bit strange to me.

},
expectedErr: errors.NewAutoscalerError(
errors.CloudProviderError,
"A ...other concurrent errors: [\"[cloudProviderError] B\", \"[internalError] A\"]"),
Contributor

It seems non-intuitive that we skip the type (CloudProviderError) for the first error, A, but print it for the other errors.
The ... looks rather artificial.
Also, this message loses the information about which node group the given error was encountered on, which I think is quite valuable, especially when debugging.

Contributor Author

@pmendelski pmendelski May 11, 2023

It seems non-intuitive that we skip the type (CloudProviderError) for the first error, A, but print it for the other errors.

It's not that the first error is skipped. Errors are:

  • deduplicated - to keep the message short and increase the chance of fitting within the log length limit.
  • sorted - the error order is not meaningful in concurrent execution, and sorting makes log aggregation (and testing) easier.

I could go with a dedicated error type that aggregates all the others, but it would require a rather big change to the API and error handlers.

The ... looks rather artificial.

What do you propose?

Also, this message loses the information about which node group the given error was encountered on, which I think is quite valuable, especially when debugging.

The mapping from each error to the failed node group scale-up is preserved in the logs and in the clusterStateRegistry.
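
A minimal sketch of how this kind of concurrent-error aggregation could look, reconstructed from the fragments quoted in this review (the sort by error type, the map-based deduplication, and the " ...and other concurrent errors: [" message); the package name, the deduplicateErrors helper, and the exact formatting are assumptions, not necessarily the merged code. It relies on the utils/errors package's AutoscalerError interface exposing Type() and Error().

package orchestrator

import (
	"fmt"
	"sort"
	"strings"

	"k8s.io/autoscaler/cluster-autoscaler/utils/errors"
)

// combineConcurrentScaleUpErrors merges errors collected from parallel
// scale-ups into a single AutoscalerError: duplicates are dropped, the
// remaining errors are sorted for deterministic output, the first error's
// type is reused for the combined error, and the rest are appended to the
// message.
func combineConcurrentScaleUpErrors(errs []errors.AutoscalerError) errors.AutoscalerError {
	if len(errs) == 0 {
		return nil
	}
	if len(errs) == 1 {
		return errs[0]
	}
	uniqueErrs := deduplicateErrors(errs)
	if len(uniqueErrs) == 1 {
		return uniqueErrs[0]
	}
	sort.Slice(uniqueErrs, func(i, j int) bool {
		errA, errB := uniqueErrs[i], uniqueErrs[j]
		if errA.Type() == errB.Type() {
			return errA.Error() < errB.Error()
		}
		return errA.Type() < errB.Type()
	})
	firstErr := uniqueErrs[0]
	var builder strings.Builder
	builder.WriteString(firstErr.Error())
	builder.WriteString(" ...and other concurrent errors: [")
	for i, err := range uniqueErrs[1:] {
		if i > 0 {
			builder.WriteString(", ")
		}
		builder.WriteString(fmt.Sprintf("%q", fmt.Sprintf("[%s] %s", err.Type(), err.Error())))
	}
	builder.WriteString("]")
	return errors.NewAutoscalerError(firstErr.Type(), builder.String())
}

// deduplicateErrors keeps only one instance of each (type, message) pair,
// so the combined message stays short enough to fit log length limits.
func deduplicateErrors(errs []errors.AutoscalerError) []errors.AutoscalerError {
	seen := make(map[string]bool, len(errs))
	unique := make([]errors.AutoscalerError, 0, len(errs))
	for _, err := range errs {
		key := fmt.Sprintf("[%s] %s", err.Type(), err.Error())
		if !seen[key] {
			seen[key] = true
			unique = append(unique, err)
		}
	}
	return unique
}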


// Asserts all groups are scaled only once.
// Scaling one group multiple times concurrently may cause problems.
func assertUniqueNodeGroups(scaleUpInfos []nodegroupset.ScaleUpInfo) errors.AutoscalerError {
Contributor

I think we usually use assert in the context of unit tests; I would use check or something similar in this context.

Contributor Author

Changed
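
A minimal sketch of what the renamed check could look like, using the same signature and imports as the snippet quoted above and assuming ScaleUpInfo.Group exposes an Id() method (as cloudprovider.NodeGroup does); the error type and message here are illustrative, not taken from the PR.

// checkUniqueNodeGroups verifies that every node group appears at most once
// in the list, since scaling one group multiple times concurrently may cause
// problems.
func checkUniqueNodeGroups(scaleUpInfos []nodegroupset.ScaleUpInfo) errors.AutoscalerError {
	seenGroups := make(map[string]bool, len(scaleUpInfos))
	for _, info := range scaleUpInfos {
		if seenGroups[info.Group.Id()] {
			return errors.NewAutoscalerError(
				errors.InternalError,
				"node group %s is scheduled to be scaled up more than once",
				info.Group.Id(),
			)
		}
		seenGroups[info.Group.Id()] = true
	}
	return nil
}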

}
return errA.Type() < errB.Type()
})
firstErrMsg := errs[0].Error()
Contributor

Why can't we make two map[string]bool values, one holding types and the second messages?
That way, len(typesMap) > 1 should be equivalent to the singleType variable.

Contributor Author

Yes, that's right but we would simply replace two booleans with two maps, making it a bit less optimal. Is it a readability issue?

availableGPUTypes map[string]struct{},
now time.Time,
) errors.AutoscalerError {
gpuConfig := e.autoscalingContext.CloudProvider.GetNodeGpuConfig(nodeInfo.Node())
Contributor

This function calls a multitude of other objects; did we check that all of them support concurrent calls?
I did my own quick check and didn't find any obvious issues, but I wanted to confirm, as any potential issues here will be hard to debug.

Contributor Author

Good point, but the more critical line is info.Group.IncreaseSize(increase), as it modifies the state. I did a thorough analysis and concluded that not all cloudProviders are ready for concurrent operations. That's why it's a flag and the decision is left to the user. I've also just added a warning message to the flag description.
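
To make the concurrency concern concrete, a hedged sketch of running the scale-ups in parallel is shown below. The function name executeScaleUpsInParallel and the scaleUp callback are illustrative rather than the merged implementation; it assumes the standard sync package plus the checkUniqueNodeGroups and combineConcurrentScaleUpErrors helpers discussed in this review. The point is that each goroutine eventually reaches info.Group.IncreaseSize, which is exactly the call that requires a thread-safe cloud provider.

// Illustrative only: run each node group scale-up in its own goroutine and
// combine any failures into a single error. Because every scale-up ends up
// calling info.Group.IncreaseSize, which mutates cloud provider state, this
// path is opt-in via --parallel-scale-up.
func executeScaleUpsInParallel(
	scaleUpInfos []nodegroupset.ScaleUpInfo,
	scaleUp func(nodegroupset.ScaleUpInfo) errors.AutoscalerError,
) errors.AutoscalerError {
	if err := checkUniqueNodeGroups(scaleUpInfos); err != nil {
		return err
	}
	var (
		wg          sync.WaitGroup
		mu          sync.Mutex
		scaleUpErrs []errors.AutoscalerError
	)
	for _, info := range scaleUpInfos {
		wg.Add(1)
		go func(info nodegroupset.ScaleUpInfo) {
			defer wg.Done()
			// scaleUp is expected to call info.Group.IncreaseSize internally.
			if err := scaleUp(info); err != nil {
				mu.Lock()
				scaleUpErrs = append(scaleUpErrs, err)
				mu.Unlock()
			}
		}(info)
	}
	wg.Wait()
	if len(scaleUpErrs) == 0 {
		return nil
	}
	return combineConcurrentScaleUpErrors(scaleUpErrs)
}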

Member

@x13n x13n left a comment

Thanks for the change! Left a couple of comments.

kube_client "k8s.io/client-go/kubernetes"
kube_record "k8s.io/client-go/tools/record"
schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"

"k8s.io/autoscaler/cluster-autoscaler/utils/backoff"
Member

nit: I think it would make sense to group all k8s.io/autoscaler/cluster-autoscaler imports together. In general, I suggest the following order (see the sketch after this list):

  • core libs (sorted lexicographically)
  • 1 line space
  • imports internal to the project (sorted lexicographically)
  • 1 line space
  • external imports (sorted lexicographically)
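
For illustration, an import block arranged in the order described above; the internal and external packages are taken from the snippet under review, while the standard-library imports are assumed for the example.

import (
	// core libs
	"fmt"
	"time"

	// imports internal to the project
	"k8s.io/autoscaler/cluster-autoscaler/utils/backoff"

	// external imports
	kube_client "k8s.io/client-go/kubernetes"
	kube_record "k8s.io/client-go/tools/record"
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)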

Contributor Author

Changed

return &option
}
}
assert.Fail(r.t, "did not find expansionOptionToChoose %s", r.optionToChoose)
Member

nit: %v (or even %+v) will give a better format

Contributor Author

Changed

}

// NewMockRepotingStrategy creates an expander strategy with reporting and mocking capabilities.
func NewMockRepotingStrategy(t *testing.T, optionToChoose *GroupSizeChange) *MockReportingStrategy {
Member

nit: I'd consider passing optionToChoose by value - we're not comparing pointers and the struct is quite small, so there won't be a lot of overhead anyway.

Contributor Author

I agree, but it's passed as a pointer because it can be nil. If you don't care about the optionToChoose, one will be picked according to the default strategy.

Member

Ah, good point, let's leave it as is then.

return nil
}

func combineConcurrentScaleUpErrors(errs []errors.AutoscalerError) errors.AutoscalerError {
Member

This is quite generic; WDYT about moving it to the errors package and just calling it Combine? I don't think there's anything scale-up specific here, apart from the current name.

Contributor Author

I thought about it as well. I didn't go down that path because:

  • this function is dedicated to combining concurrent errors (errors are deduplicated and sorted)
  • it's based on the assumption that in this particular place we're OK with losing some information about the aggregated errors (as it's reported in logs and state). Thanks to this, we avoid changes to the API.

...so there are some assumptions we're aware of. Having it in the errors package could make this function the default way of combining errors. WDYT?

Member

I think this is something that could be addressed by documenting it in the comment and perhaps reflecting it better in the function name (Aggregate? Coalesce? Naming is hard...). Leaving it up to you, though.


// Asserts all groups are scaled only once.
// Scaling one group multiple times concurrently may cause problems.
func assertUniqueNodeGroups(scaleUpInfos []nodegroupset.ScaleUpInfo) errors.AutoscalerError {
Member

That works, but I wonder if it would make sense to enforce this at compile time by creating type ScaleUpInfos map[string]ScaleUpInfo and using that instead? Obviously that's yet another refactor, but hopefully not a huge one (perhaps could be carved out into its own commit though).

Contributor Author

Ha, what stands in the way of putting the same object under different keys? 😉
We would still require some validation, therefore I left the array.

We could introduce a hashset type (with group.id as the hash key), but it felt like too much for this particular usage, as the array size is guaranteed to be small.

Member

A setter does: the key can be derived from the group in a setter function. Agreed it might be overkill, so not blocking on this.
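
A minimal sketch of the idea being discussed, not code from the PR: a map type keyed by node group id, with a setter that derives the key from the group itself, makes double-scaling a group impossible by construction. Defining it outside the nodegroupset package (hence the qualified type name) is an assumption for the example.

// ScaleUpInfos keys scale-up requests by node group id, so each group can
// appear at most once. Illustrative only; the PR kept a slice plus a runtime
// uniqueness check instead.
type ScaleUpInfos map[string]nodegroupset.ScaleUpInfo

// Put derives the key from the group itself, so callers cannot register the
// same group under two different keys.
func (infos ScaleUpInfos) Put(info nodegroupset.ScaleUpInfo) {
	infos[info.Group.Id()] = info
}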

}
var builder strings.Builder
builder.WriteString(firstErrMsg)
builder.WriteString(" ...other concurrent errors: [")
Member

super nit: s/...other/... and other/ ?

Contributor Author

changed

@pmendelski pmendelski force-pushed the feature-parallel-scale-up branch 2 times, most recently from 213e649 to 7ee8070 on May 11, 2023 08:57
@@ -147,7 +147,8 @@ var (
maxGracefulTerminationFlag = flag.Int("max-graceful-termination-sec", 10*60, "Maximum number of seconds CA waits for pod termination when trying to scale down a node.")
maxTotalUnreadyPercentage = flag.Float64("max-total-unready-percentage", 45, "Maximum percentage of unready nodes in the cluster. After this is exceeded, CA halts operations")
okTotalUnreadyCount = flag.Int("ok-total-unready-count", 3, "Number of allowed unready nodes, irrespective of max-total-unready-percentage")
-scaleUpFromZero = flag.Bool("scale-up-from-zero", true, "Should CA scale up when there 0 ready nodes.")
+scaleUpFromZero = flag.Bool("scale-up-from-zero", true, "Should CA scale up when there are 0 ready nodes.")
+parallelScaleUp = flag.Bool("parallel-scale-up", false, "Whether to allow parallel node groups scale up. Make sure you cloud provider supports parallel operations.")
Member

typo: "you cloud provider" -> "your cloud provider"

Also, as a user, how do I do that? This seems a bit non-actionable; I'd rather just mark it as "experimental: may not work on some cloud providers, enable at your own risk" or something.

Contributor Author

Good point. Your message is much better

@pmendelski pmendelski force-pushed the feature-parallel-scale-up branch 5 times, most recently from e6c4ea0 to 1704ec6 on May 11, 2023 11:01
Contributor

@kisieland kisieland left a comment

One comment, otherwise LGTM.

builder.WriteString(firstErr.Error())
builder.WriteString(" ...and other concurrent errors: [")
formattedErrs := make(map[errors.AutoscalerError]bool)
formattedErrs[firstErr] = true
Contributor

I would write it as one operation:

formattedErrs := map[errors.AutoscalerError]bool{
firstErr: true,
}

Contributor Author

changed

@pmendelski pmendelski force-pushed the feature-parallel-scale-up branch from 1704ec6 to 51f1660 on May 11, 2023 12:56
@x13n
Member

x13n commented May 11, 2023

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 11, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pmendelski, x13n

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 11, 2023
@k8s-ci-robot k8s-ci-robot merged commit ac8253f into kubernetes:master May 11, 2023
@grosser
Contributor

grosser commented Nov 15, 2023

I tried to use this, but it seems it always ends up over-scaling when using 1 ASG per AZ.
For example, with 1 pending pod + 3 ASGs (1 per AZ), since the pod could go into any of them, it seems to scale all 3.
Is that expected / should it be documented as a gotcha?

@grosser
Contributor

grosser commented Nov 15, 2023

I was unable to reproduce it, so it must have been a one-off 👍

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
area/cluster-autoscaler
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt.
kind/feature Categorizes issue or PR as related to a new feature.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.