aws: Don't pile up successive full refreshes during AWS scaledowns #3797

Merged
merged 1 commit into from
May 3, 2021


Contributor

@bpineau bpineau commented Jan 6, 2021

Forcing a full refresh on every DeleteNodes call causes slowdowns
and throttling on large clusters with many ASGs and a lot of activity.

That function might be called many times in a row during scale-down.
Each time, the forced refresh re-discovers all ASGs and all LaunchConfigurations,
then re-lists all instances from the discovered ASGs.

That immediate refresh isn't required anyway, as the cache's DeleteInstances
concrete implementation will decrement the nodegroup size, and we can
schedule a grouped refresh for the next loop iteration.
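
To illustrate the idea, here is a minimal Go sketch of "adjust the cached size locally, refresh once on the next loop". It is not the actual cluster-autoscaler code: the `asgCache` struct, its fields, `regenerate`, `Refresh`, and the `refreshInterval` constant are all hypothetical stand-ins, and the expensive AWS rediscovery is replaced by a counter.

```go
package main

import (
	"fmt"
	"time"
)

// refreshInterval is an assumed value, chosen only for this sketch.
const refreshInterval = time.Minute

// asgCache is a hypothetical, simplified stand-in for the provider's ASG cache.
type asgCache struct {
	sizes         map[string]int // ASG name -> cached target size
	lastRefresh   time.Time
	fullRefreshes int // counts expensive full rediscoveries
}

func newASGCache(sizes map[string]int) *asgCache {
	return &asgCache{sizes: sizes, lastRefresh: time.Now()}
}

// regenerate stands in for the costly rediscovery of all ASGs,
// launch configurations and instances.
func (c *asgCache) regenerate() {
	c.fullRefreshes++
	c.lastRefresh = time.Now()
}

// DeleteInstances decrements the cached size right away and only schedules
// a refresh (by backdating lastRefresh) instead of performing one.
func (c *asgCache) DeleteInstances(asg string, n int) {
	c.sizes[asg] -= n
	c.lastRefresh = time.Now().Add(-refreshInterval)
}

// Refresh is meant to be called once per main loop iteration; it does the
// full rediscovery only when the cache is due.
func (c *asgCache) Refresh() {
	if time.Since(c.lastRefresh) < refreshInterval {
		return
	}
	c.regenerate()
}

func main() {
	c := newASGCache(map[string]int{"asg-a": 5, "asg-b": 3})
	c.DeleteInstances("asg-a", 1) // no refresh triggered here
	c.DeleteInstances("asg-b", 2) // nor here
	c.Refresh()                   // one grouped refresh
	c.Refresh()                   // no-op: cache is fresh again
	fmt.Println(c.sizes["asg-a"], c.sizes["asg-b"], c.fullRefreshes) // prints: 4 1 1
}
```

However many deletions happen in a row, the sketch performs a single full rediscovery on the next `Refresh` call, while the cached sizes stay consistent in the meantime.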

As a later step, we can consider splitting the asgCache.generate() function
to support per-ASG refreshes (and maybe per-ASG cache TTLs plus jitter, to
spread API calls). But this change should address the current issue for now.
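
For the possible follow-up, a rough Go sketch of per-ASG expiry with jitter might look like the following. Nothing here exists in the autoscaler: `perASGCache`, `nextTTL`, `markRefreshed`, `isStale`, and the `baseTTL` and `maxJitter` values are assumptions made up for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Assumed values, for illustration only.
const (
	baseTTL   = time.Minute
	maxJitter = 20 * time.Second
)

// perASGCache is a hypothetical cache that tracks one expiry time per ASG.
type perASGCache struct {
	rng    *rand.Rand
	expiry map[string]time.Time
}

func newPerASGCache(seed int64) *perASGCache {
	return &perASGCache{
		rng:    rand.New(rand.NewSource(seed)),
		expiry: map[string]time.Time{},
	}
}

// nextTTL returns baseTTL plus a random extra in [0, maxJitter), so that
// ASGs refreshed together do not all expire at the same instant.
func (c *perASGCache) nextTTL() time.Duration {
	return baseTTL + time.Duration(c.rng.Int63n(int64(maxJitter)))
}

// markRefreshed records that an ASG was refreshed at the given time.
func (c *perASGCache) markRefreshed(asg string, now time.Time) {
	c.expiry[asg] = now.Add(c.nextTTL())
}

// isStale reports whether an ASG is unknown or its entry has expired.
func (c *perASGCache) isStale(asg string, now time.Time) bool {
	exp, ok := c.expiry[asg]
	return !ok || !now.Before(exp)
}

func main() {
	c := newPerASGCache(42)
	now := time.Now()
	for _, asg := range []string{"asg-a", "asg-b", "asg-c"} {
		c.markRefreshed(asg, now)
		fmt.Printf("%s expires in %v\n", asg, c.expiry[asg].Sub(now))
	}
}
```

With something along these lines, each loop iteration would refresh only the ASGs for which `isStale` returns true, spreading the API calls over time instead of issuing them in one burst.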

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 6, 2021
@k8s-ci-robot k8s-ci-robot requested review from feiskyer and towca January 6, 2021 19:31
@bpineau bpineau force-pushed the aws-not-refreshes-dogpiles branch from 0f745a5 to 09ab1ea Compare January 7, 2021 11:49
@gjtempleton
Member

/assign @gjtempleton

@gjtempleton
Member

Thanks for the PR.

My only concern is whether we could somehow make the log messages clearer, in particular the one at https://github.com/kubernetes/autoscaler/pull/3797/files#diff-22984a3a02b16ff49b2a94a43b49f3aa61c856483c3612b06b69dd51e347746fL269, though it's already a bit misleading.

@bpineau
Contributor Author

bpineau commented Apr 8, 2021

This might be clearer, though perhaps still not great (or do you have a suggestion, @gjtempleton?):

DeleteInstances was called: scheduling an ASG list refresh for next accesses

@bpineau bpineau changed the title Don't pile up successive full refreshes during AWS scaledowns aws: Don't pile up successive full refreshes during AWS scaledowns Apr 9, 2021
@gjtempleton
Member

@bpineau that reads far better to me. One minor nit, maybe:

DeleteInstances was called: scheduling an ASG list refresh for next main loop evaluation

Forcing a full refresh on every DeleteNodes call causes slowdowns
and throttling on large clusters with many ASGs (and a lot of activity).

That function might be called several times in a row during scale-down
(once for each ASG having a node to be removed). Each time, the forced
refresh re-discovers all ASGs and all LaunchConfigurations, then re-lists all
instances from the discovered ASGs.

That immediate refresh isn't required anyway, as the cache's DeleteInstances
concrete implementation will decrement the nodegroup size, and we can
schedule a grouped refresh for the next loop iteration.
@bpineau bpineau force-pushed the aws-not-refreshes-dogpiles branch from 09ab1ea to 037dc73 Compare April 19, 2021 13:12
@bpineau
Contributor Author

bpineau commented Apr 19, 2021

Thanks, that's better indeed! Updated accordingly.

@gjtempleton
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 3, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bpineau, gjtempleton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 3, 2021
@k8s-ci-robot k8s-ci-robot merged commit 6c4101b into kubernetes:master May 3, 2021
evansheng pushed a commit to airbnb/autoscaler that referenced this pull request Mar 24, 2022
…piles

aws: Don't pile up successive full refreshes during AWS scaledowns
jiancheung pushed a commit to airbnb/autoscaler that referenced this pull request Jul 29, 2022
…piles

aws: Don't pile up successive full refreshes during AWS scaledowns
akirillov pushed a commit to airbnb/autoscaler that referenced this pull request Oct 27, 2022
…piles

aws: Don't pile up successive full refreshes during AWS scaledowns
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.