
Feature: parallel scale up #5731

Merged

merged 2 commits into kubernetes:master on May 11, 2023

Conversation

Contributor

@pmendelski pmendelski commented May 5, 2023

What type of PR is this?

/kind feature
/kind cleanup

What this PR does / why we need it:

This PR introduces an optional flag, --parallel-scale-up=true/false, that makes the scale-up process run in parallel. With this change, scale-up across multiple node groups is faster.

The default behavior is unchanged - node groups are scaled synchronously, one by one.

Special notes for your reviewer:

  • Why is it a runtime flag and not the default behavior? Because some CloudProviders are not thread-safe. It may become the default in the future.
  • Why is there no new/dedicated error to combine concurrent errors? I wanted to keep the API simple without introducing big changes. Concurrent errors are preserved in the final error message, and each of them is logged separately, so no information is lost.
  • Why are there so many changes in test files? Scale-up tests were designed to test the scale-up process on a single node group. I had to open up the test mechanism to cover scale-ups across multiple node groups simultaneously.

Does this PR introduce a user-facing change?

A new "parallel-scale-up" flag, that makes the scale-up process run parallel across multiple node groups. The default behavior is unchanged, node groups are scaled synchronously, one by one. Before using the flag make sure that the cloud provider you're using is thread-safe and ready for concurrent actions. 

/assign @x13n

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels May 5, 2023
@linux-foundation-easycla

linux-foundation-easycla bot commented May 5, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot
Contributor

Welcome @pmendelski!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 5, 2023
@k8s-ci-robot k8s-ci-robot requested a review from BigDarkClown May 5, 2023 23:11
@k8s-ci-robot k8s-ci-robot requested a review from x13n May 5, 2023 23:11
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels May 5, 2023
@pmendelski pmendelski force-pushed the feature-parallel-scale-up branch 2 times, most recently from c4e76d5 to 9342963 on May 5, 2023 23:43
@pmendelski
Contributor Author

If you're reviewing this PR, please take a look at this test. I wrote it to learn more about the autoscaler, but the result seems a bit strange to me.

},
expectedErr: errors.NewAutoscalerError(
errors.CloudProviderError,
"A ...other concurrent errors: [\"[cloudProviderError] B\", \"[internalError] A\"]"),
Contributor

It seems non-intuitive that we skip the type (CloudProviderError) for the first error, A, but print it for the other errors.
The ... looks rather artificial.
Also, this message loses the information about which node group the given error was encountered on, which I think is quite valuable, especially when debugging.

Contributor Author

@pmendelski pmendelski May 11, 2023

It seems non-intuitive that we skip the type (CloudProviderError) for the first error, A, but print it for the other errors.

It's not that the first error is skipped. Errors are:

  • deduplicated - to keep the message short and increase the chance of fitting within the log length limit.
  • sorted - the error order is not meaningful in concurrent execution, and sorting makes log aggregation (and testing) easier.

I could go with a dedicated error type that aggregates all the others, but it would require a rather big change to the API and error handlers.

The ... looks rather artificial.

What do you propose?

Also, this message loses the information about which node group the given error was encountered on, which I think is quite valuable, especially when debugging.

The mapping from each error to the failed node group scale-up is preserved in the logs and in the clusterStateRegistry.
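
A minimal sketch of how this kind of concurrent-error aggregation could look, reconstructed from the fragments quoted in this review (the sort by error type, the map-based deduplication, and the " ...and other concurrent errors: [" message); the package name, the deduplicateErrors helper, and the exact formatting are assumptions, not necessarily the merged code. It relies on the utils/errors package's AutoscalerError interface exposing Type() and Error().

package orchestrator

import (
	"fmt"
	"sort"
	"strings"

	"k8s.io/autoscaler/cluster-autoscaler/utils/errors"
)

// combineConcurrentScaleUpErrors merges errors collected from parallel
// scale-ups into a single AutoscalerError: duplicates are dropped, the
// remaining errors are sorted for deterministic output, the first error's
// type is reused for the combined error, and the rest are appended to the
// message.
func combineConcurrentScaleUpErrors(errs []errors.AutoscalerError) errors.AutoscalerError {
	if len(errs) == 0 {
		return nil
	}
	if len(errs) == 1 {
		return errs[0]
	}
	uniqueErrs := deduplicateErrors(errs)
	if len(uniqueErrs) == 1 {
		return uniqueErrs[0]
	}
	sort.Slice(uniqueErrs, func(i, j int) bool {
		errA, errB := uniqueErrs[i], uniqueErrs[j]
		if errA.Type() == errB.Type() {
			return errA.Error() < errB.Error()
		}
		return errA.Type() < errB.Type()
	})
	firstErr := uniqueErrs[0]
	var builder strings.Builder
	builder.WriteString(firstErr.Error())
	builder.WriteString(" ...and other concurrent errors: [")
	for i, err := range uniqueErrs[1:] {
		if i > 0 {
			builder.WriteString(", ")
		}
		builder.WriteString(fmt.Sprintf("%q", fmt.Sprintf("[%s] %s", err.Type(), err.Error())))
	}
	builder.WriteString("]")
	return errors.NewAutoscalerError(firstErr.Type(), builder.String())
}

// deduplicateErrors keeps only one instance of each (type, message) pair,
// so the combined message stays short enough to fit log length limits.
func deduplicateErrors(errs []errors.AutoscalerError) []errors.AutoscalerError {
	seen := make(map[string]bool, len(errs))
	unique := make([]errors.AutoscalerError, 0, len(errs))
	for _, err := range errs {
		key := fmt.Sprintf("[%s] %s", err.Type(), err.Error())
		if !seen[key] {
			seen[key] = true
			unique = append(unique, err)
		}
	}
	return unique
}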


// Asserts all groups are scaled only once.
// Scaling one group multiple times concurrently may cause problems.
func assertUniqueNodeGroups(scaleUpInfos []nodegroupset.ScaleUpInfo) errors.AutoscalerError {
Contributor

I think we usually use assert in the context of unit tests; I would use check or something similar in this context.

Contributor Author

Changed
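
A minimal sketch of what the renamed check could look like, using the same signature and imports as the snippet quoted above and assuming ScaleUpInfo.Group exposes an Id() method (as cloudprovider.NodeGroup does); the error type and message here are illustrative, not taken from the PR.

// checkUniqueNodeGroups verifies that every node group appears at most once
// in the list, since scaling one group multiple times concurrently may cause
// problems.
func checkUniqueNodeGroups(scaleUpInfos []nodegroupset.ScaleUpInfo) errors.AutoscalerError {
	seenGroups := make(map[string]bool, len(scaleUpInfos))
	for _, info := range scaleUpInfos {
		if seenGroups[info.Group.Id()] {
			return errors.NewAutoscalerError(
				errors.InternalError,
				"node group %s is scheduled to be scaled up more than once",
				info.Group.Id(),
			)
		}
		seenGroups[info.Group.Id()] = true
	}
	return nil
}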

}
return errA.Type() < errB.Type()
})
firstErrMsg := errs[0].Error()
Contributor

Why can't we make two map[string]bool values, one holding types and the second messages?
That way, len(typesMap) > 1 should be equivalent to the singleType variable.

Contributor Author

Yes, that's right but we would simply replace two booleans with two maps, making it a bit less optimal. Is it a readability issue?

availableGPUTypes map[string]struct{},
now time.Time,
) errors.AutoscalerError {
gpuConfig := e.autoscalingContext.CloudProvider.GetNodeGpuConfig(nodeInfo.Node())
Contributor

This function calls a multitude of other objects; did we check that all of them support concurrent calls?
I did my own quick check and didn't find any obvious issues, but I wanted to confirm, as any potential issues here will be hard to debug.

Contributor Author

Good point, but the more critical line is info.Group.IncreaseSize(increase), as it modifies the state. I did a thorough analysis and concluded that not all cloudProviders are ready for concurrent operations. That's why it's a flag and the decision is left to the user. I've also just added a warning message to the flag description.
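
To make the concurrency concern concrete, a hedged sketch of running the scale-ups in parallel is shown below. The function name executeScaleUpsInParallel and the scaleUp callback are illustrative rather than the merged implementation; it assumes the standard sync package plus the checkUniqueNodeGroups and combineConcurrentScaleUpErrors helpers discussed in this review. The point is that each goroutine eventually reaches info.Group.IncreaseSize, which is exactly the call that requires a thread-safe cloud provider.

// Illustrative only: run each node group scale-up in its own goroutine and
// combine any failures into a single error. Because every scale-up ends up
// calling info.Group.IncreaseSize, which mutates cloud provider state, this
// path is opt-in via --parallel-scale-up.
func executeScaleUpsInParallel(
	scaleUpInfos []nodegroupset.ScaleUpInfo,
	scaleUp func(nodegroupset.ScaleUpInfo) errors.AutoscalerError,
) errors.AutoscalerError {
	if err := checkUniqueNodeGroups(scaleUpInfos); err != nil {
		return err
	}
	var (
		wg          sync.WaitGroup
		mu          sync.Mutex
		scaleUpErrs []errors.AutoscalerError
	)
	for _, info := range scaleUpInfos {
		wg.Add(1)
		go func(info nodegroupset.ScaleUpInfo) {
			defer wg.Done()
			// scaleUp is expected to call info.Group.IncreaseSize internally.
			if err := scaleUp(info); err != nil {
				mu.Lock()
				scaleUpErrs = append(scaleUpErrs, err)
				mu.Unlock()
			}
		}(info)
	}
	wg.Wait()
	if len(scaleUpErrs) == 0 {
		return nil
	}
	return combineConcurrentScaleUpErrors(scaleUpErrs)
}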

Member

@x13n x13n left a comment

Thanks for the change! Left a couple of comments.

kube_client "k8s.io/client-go/kubernetes"
kube_record "k8s.io/client-go/tools/record"
schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"

"k8s.io/autoscaler/cluster-autoscaler/utils/backoff"
Member

nit: I think it would make sense to group all k8s.io/autoscaler/cluster-autoscaler imports together. In general, I suggest the following order (see the sketch after this list):

  • core libs (sorted lexicographically)
  • 1 line space
  • imports internal to the project (sorted lexicographically)
  • 1 line space
  • external imports (sorted lexicographically)
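
For illustration, an import block arranged in the order described above; the internal and external packages are taken from the snippet under review, while the standard-library imports are assumed for the example.

import (
	// core libs
	"fmt"
	"time"

	// imports internal to the project
	"k8s.io/autoscaler/cluster-autoscaler/utils/backoff"

	// external imports
	kube_client "k8s.io/client-go/kubernetes"
	kube_record "k8s.io/client-go/tools/record"
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)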

Contributor Author

Changed

return &option
}
}
assert.Fail(r.t, "did not find expansionOptionToChoose %s", r.optionToChoose)
Member

nit: %v (or even %+v) will give a better format

Contributor Author

Changed

}

// NewMockRepotingStrategy creates an expander strategy with reporting and mocking capabilities.
func NewMockRepotingStrategy(t *testing.T, optionToChoose *GroupSizeChange) *MockReportingStrategy {
Member

nit: I'd consider passing optionToChoose by value - we're not comparing pointers and the struct is quite small, so there won't be a lot of overhead anyway.

Contributor Author

I agree, but it's passed as a pointer because it can be nil. If you don't care about the optionToChoose, one will be picked according to the default strategy.

Member

Ah, good point, let's leave it as is then.

return nil
}

func combineConcurrentScaleUpErrors(errs []errors.AutoscalerError) errors.AutoscalerError {
Member

This is quite generic; WDYT about moving it to the errors package and just calling it Combine? I don't think there's anything scale-up specific here, apart from the current name.

Contributor Author

I thought about it as well. I didn't go down that path because:

  • this function is dedicated to combining concurrent errors (errors are deduplicated and sorted)
  • it's based on the assumption that in this particular place we're OK with losing some information about the aggregated errors (as it's reported in logs and state). Thanks to this, we avoid changes to the API.

...so there are some assumptions we're aware of. Having it in the errors package could make this function the default way of combining errors. WDYT?

Member

I think this is something that could be addressed by documenting it in the comment and perhaps reflecting it better in the function name (Aggregate? Coalesce? Naming is hard...). Leaving it up to you, though.


// Asserts all groups are scaled only once.
// Scaling one group multiple times concurrently may cause problems.
func assertUniqueNodeGroups(scaleUpInfos []nodegroupset.ScaleUpInfo) errors.AutoscalerError {
Member

That works, but I wonder if it would make sense to enforce this at compile time by creating type ScaleUpInfos map[string]ScaleUpInfo and using that instead? Obviously that's yet another refactor, but hopefully not a huge one (perhaps could be carved out into its own commit though).

Contributor Author

Ha, what stands in the way of putting the same object under different keys? 😉
We would still require some validation, therefore I left the array.

We could introduce a hashset type (with group.id as the hash key), but it felt like too much for this particular usage, as the array size is guaranteed to be small.

Member

A setter does: the key can be derived from the group in a setter function. Agreed it might be overkill, so not blocking on this.
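
A minimal sketch of the idea being discussed, not code from the PR: a map type keyed by node group id, with a setter that derives the key from the group itself, makes double-scaling a group impossible by construction. Defining it outside the nodegroupset package (hence the qualified type name) is an assumption for the example.

// ScaleUpInfos keys scale-up requests by node group id, so each group can
// appear at most once. Illustrative only; the PR kept a slice plus a runtime
// uniqueness check instead.
type ScaleUpInfos map[string]nodegroupset.ScaleUpInfo

// Put derives the key from the group itself, so callers cannot register the
// same group under two different keys.
func (infos ScaleUpInfos) Put(info nodegroupset.ScaleUpInfo) {
	infos[info.Group.Id()] = info
}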

}
var builder strings.Builder
builder.WriteString(firstErrMsg)
builder.WriteString(" ...other concurrent errors: [")
Member

super nit: s/...other/... and other/ ?

Contributor Author

changed

@pmendelski pmendelski force-pushed the feature-parallel-scale-up branch 2 times, most recently from 213e649 to 7ee8070 on May 11, 2023 08:57
@@ -147,7 +147,8 @@ var (
maxGracefulTerminationFlag = flag.Int("max-graceful-termination-sec", 10*60, "Maximum number of seconds CA waits for pod termination when trying to scale down a node.")
maxTotalUnreadyPercentage = flag.Float64("max-total-unready-percentage", 45, "Maximum percentage of unready nodes in the cluster. After this is exceeded, CA halts operations")
okTotalUnreadyCount = flag.Int("ok-total-unready-count", 3, "Number of allowed unready nodes, irrespective of max-total-unready-percentage")
-scaleUpFromZero = flag.Bool("scale-up-from-zero", true, "Should CA scale up when there 0 ready nodes.")
+scaleUpFromZero = flag.Bool("scale-up-from-zero", true, "Should CA scale up when there are 0 ready nodes.")
+parallelScaleUp = flag.Bool("parallel-scale-up", false, "Whether to allow parallel node groups scale up. Make sure you cloud provider supports parallel operations.")
Member

typo: "you cloud provider" -> "your cloud provider"

Also, as a user, how do I do that? This seems a bit non-actionable; I'd rather just mark it as "experimental: may not work on some cloud providers, enable at your own risk" or something.

Contributor Author

Good point. Your message is much better

@pmendelski pmendelski force-pushed the feature-parallel-scale-up branch 5 times, most recently from e6c4ea0 to 1704ec6 on May 11, 2023 11:01
Contributor

@kisieland kisieland left a comment

One comment, otherwise LGTM.

builder.WriteString(firstErr.Error())
builder.WriteString(" ...and other concurrent errors: [")
formattedErrs := make(map[errors.AutoscalerError]bool)
formattedErrs[firstErr] = true
Contributor

I would write it as one operation:

formattedErrs := map[errors.AutoscalerError]bool{
firstErr: true,
}

Contributor Author

changed

@pmendelski pmendelski force-pushed the feature-parallel-scale-up branch from 1704ec6 to 51f1660 on May 11, 2023 12:56
@x13n
Member

x13n commented May 11, 2023

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 11, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pmendelski, x13n

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 11, 2023
@k8s-ci-robot k8s-ci-robot merged commit ac8253f into kubernetes:master May 11, 2023
@grosser
Contributor

grosser commented Nov 15, 2023

I tried to use this, but it seems it always ends up over-scaling when using 1 ASG per AZ.
For example, with 1 pending pod + 3 ASGs (1 per AZ), since the pod could go into any of them, it seems to scale all 3.
Is that expected / should it be documented as a gotcha?

@grosser
Contributor

grosser commented Nov 15, 2023

I was unable to reproduce it, so it must have been a one-off 👍

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
area/cluster-autoscaler
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt.
kind/feature Categorizes issue or PR as related to a new feature.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.