azure vmss cache fixes and improvements #4685

marwanad · 2022-02-16T23:22:54Z

With the merge of #3717, we lost the optimistic in-memory cache we had added. That PR stopped passing sizeRefreshPeriod down, and the field was then removed in #4541 because it was unused.

It turns out we still need it for the following scenario:

Assume a cache TTL of 60s, minCount = 2, currentTarget = 3.
Two nodes are eligible for deletion by the autoscaler. The provider has to enforce minCount = 2 itself, because CA core won't do the min count check for us in empty node removal.
The first node gets removed and we decrement scaleSet.curSize to 2.
The second call to scaleSet.DeleteNodes() comes in and calls GetScaleSetSize, which reads from manager.cache. The cache still holds a count of 3 and returns 3, so we end up deleting that node as well.

With this PR, we extend the last refresh time by sizeRefreshPeriod, so that by the time the in-memory size expires, manager.cache has had the chance to refresh and give us fresh data. This is essentially the behaviour prior to #3717 (see 1.19).

The PR also cleans up the logging to refer to "in-memory size" vs the one we get back from the manager cache.

This will impact 1.21+.

/area cloudprovider/azure

k8s-ci-robot · 2022-02-16T23:22:55Z

@marwanad: The label(s) area/cloudprovider/azure cannot be applied, because the repository doesn't have them.

k8s-ci-robot · 2022-02-16T23:23:37Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: marwanad

Needs approval from an approver in each of these files:

~~cluster-autoscaler/cloudprovider/azure/OWNERS~~ [marwanad]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

marwanad · 2022-02-16T23:24:23Z

cluster-autoscaler/cloudprovider/azure/azure_scale_set.go

@@ -567,6 +579,7 @@ func (scaleSet *ScaleSet) setInstanceStatusByProviderID(providerID string, statu
 			scaleSet.instanceCache[k].Status = &status
 		}
 	}
+	scaleSet.lastInstanceRefresh = time.Now()


This was added recently with the same motivation. If we proactively update the instance state to Deleting, we don't want that update to be invalidated in the next loop in case the cache is stale.

marwanad · 2022-02-16T23:39:56Z

/area provider/azure

nilo19 · 2022-02-17T02:26:15Z

/lgtm

k8s-ci-robot added the cncf-cla: yes (the PR's author has signed the CNCF CLA) and size/M (a PR that changes 30-99 lines, ignoring generated files) labels Feb 16, 2022

k8s-ci-robot requested review from feiskyer and nilo19 February 16, 2022 23:23

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 16, 2022

k8s-ci-robot added the area/provider/azure Issues or PRs related to azure provider label Feb 16, 2022

azure vmss cache fixes and improvements

d49a131

marwanad force-pushed the stable-cache branch from bcd247e to d49a131 February 17, 2022 01:52

k8s-ci-robot assigned nilo19 Feb 17, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 17, 2022

k8s-ci-robot merged commit 2f0a452 into kubernetes:master Feb 17, 2022

This was referenced Apr 5, 2022

remove check for returning in-memory size when VMSS is in updating state #4787

Merged

Cherry-pick #4685, #47874 - Azure vmss cache improvements #4794

Merged

k8s-ci-robot added a commit that referenced this pull request Apr 7, 2022

Merge pull request #4794 from marwanad/azure-scale-set-cherry-picks-1.23

9efb637

Cherry-pick #4685, #47874 - Azure vmss cache improvements
