cleanup: refactor Azure cache and remove redundant API calls #3717

CecileRobertMichon · 2020-11-24T00:33:26Z

This PR cleans up the Azure cloud provider cache to optimize API calls and facilitate further improvements.

Currently, multiple Azure VMSS List calls are made throughout the code, such as here and here. In addition, the agent_pools and aks providers each have their own code to list VMs.

This PR centralizes all VM and VMSS List() API calls in a single cache that refreshes every minute, so we should see 60 VM list and 60 VMSS list API calls per hour, regardless of the number of agent pools or scale sets. The remaining VMSS calls are 1) capacity increases and decreases for scale-ups and scale-downs, and 2) VMSS VM list (which lists all the instances in a scale set). Call 2) is still quite costly and grows linearly with the number of scale sets; I'd like to tackle it next once this PR merges. For agent pools, the only other calls are the Deletes and Creates that remove and add VMs.
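To illustrate why a shared cache makes the List() call count independent of the number of scale sets, here is a minimal sketch in Go. It is not the code from this PR: `centralCache`, `listFunc`, and `simulateCalls` are hypothetical names, the list function is a stand-in rather than an Azure SDK call, and the cache refreshes lazily on read, which is an assumption made to keep the example deterministic.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// listFunc stands in for a cloud List() call (hypothetical, not the Azure SDK).
type listFunc func() ([]string, error)

// centralCache shares the result of one List() call among all readers
// until the refresh interval has elapsed.
type centralCache struct {
	mu          sync.Mutex
	list        listFunc
	interval    time.Duration
	lastRefresh time.Time
	items       []string
}

func newCentralCache(list listFunc, interval time.Duration) *centralCache {
	return &centralCache{list: list, interval: interval}
}

// get returns the cached items and calls list only when the cache is stale.
func (c *centralCache) get(now time.Time) ([]string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.lastRefresh.IsZero() || now.Sub(c.lastRefresh) >= c.interval {
		items, err := c.list()
		if err != nil {
			return nil, err
		}
		c.items = items
		c.lastRefresh = now
	}
	return c.items, nil
}

// simulateCalls counts List() calls over one simulated hour in which each of
// `consumers` node groups reads the cache every 10 seconds.
func simulateCalls(consumers int) int {
	calls := 0
	cache := newCentralCache(func() ([]string, error) {
		calls++
		return []string{"vmss-a", "vmss-b"}, nil
	}, time.Minute)
	start := time.Date(2020, 11, 24, 0, 0, 0, 0, time.UTC)
	for t := time.Duration(0); t < time.Hour; t += 10 * time.Second {
		for i := 0; i < consumers; i++ {
			if _, err := cache.get(start.Add(t)); err != nil {
				panic(err)
			}
		}
	}
	return calls
}

func main() {
	// One consumer and twenty consumers trigger the same number of List() calls.
	fmt.Println(simulateCalls(1), simulateCalls(20))
}
```

With a one-minute interval, the simulated hour produces 60 List() calls whether one node group or twenty read the cache.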

Here is some initial data I gathered. Both of these clusters have 10 scale sets. The first one is running the latest cluster-autoscaler release, and the second is running with an image built from this PR's code. Chart shows number of API calls per 5 minutes:

With 20 scale sets when autoscaler is idle (no workloads running):
Before (54-56 calls per 5 minutes):

After (50 calls per 5 minutes):

CecileRobertMichon · 2020-11-24T00:34:01Z

/cc @khenidak @marwanad @feiskyer

cluster-autoscaler/cloudprovider/azure/azure_scale_set.go

cluster-autoscaler/cloudprovider/azure/azure_cache.go

CecileRobertMichon · 2020-11-30T17:29:29Z

@feiskyer @nilo19 please review

cluster-autoscaler/cloudprovider/azure/azure_cache.go

feiskyer

LGTM in general

@marwanad since the changes here are not small, could you run an e2e test with the patch?

feiskyer · 2020-12-02T11:26:52Z

Please fix the unit test failures:

# k8s.io/autoscaler/cluster-autoscaler/cloudprovider/azure [k8s.io/autoscaler/cluster-autoscaler/cloudprovider/azure.test]
cloudprovider/azure/azure_cache_test.go:31:26: not enough arguments in call to newAzureCache
	have (*azClient, string)
	want (*azClient, time.Duration, string, string)
cloudprovider/azure/azure_cache_test.go:49:26: not enough arguments in call to newAzureCache
	have (*azClient, string)
	want (*azClient, time.Duration, string, string)
cloudprovider/azure/azure_cache_test.go:60:26: not enough arguments in call to newAzureCache
	have (*azClient, string)
	want (*azClient, time.Duration, string, string)
cloudprovider/azure/azure_cloud_provider_test.go:71:31: not enough arguments in call to newAzureCache
	have (*azClient, string)
	want (*azClient, time.Duration, string, string)
cloudprovider/azure/azure_scale_set_test.go:44:22: undefined: defaultVmssSizeRefreshPeriod

nilo19 · 2020-12-03T01:40:53Z

/lgtm

nilo19 · 2020-12-03T03:34:58Z

the unit test still fails

marwanad · 2020-12-05T06:07:14Z

cluster-autoscaler/cloudprovider/azure/azure_cache.go

+// - limit repetitive Azure API calls.
+type azureCache struct {
+	mutex           sync.Mutex
+	interrupt       chan struct{}


What is this channel used for?

CecileRobertMichon (Dec 7, 2020): I'm not entirely sure; it was already there before this PR: https://github.com/kubernetes/autoscaler/pull/3717/files#diff-63481ca096e322d8f48e57a4b21089a30481cc3798e3dc123a1b4ba0938ceb1dL37
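The thread does not settle what the `interrupt` channel does in this cache. For context, a common Go pattern pairs a `chan struct{}` with a background refresh loop so the loop can be stopped by closing the channel. The sketch below shows that general pattern only; `refresher`, `run`, and `runAndStop` are hypothetical names, and nothing here is taken from azure_cache.go.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// refresher is a hypothetical stand-in with the same two fields as the
// snippet under review; it is not the azureCache type from the PR.
type refresher struct {
	mutex     sync.Mutex
	interrupt chan struct{}
	refreshes int
}

// run performs a (fake) refresh on every tick until interrupt is closed,
// then closes done to signal that the loop has exited.
func (r *refresher) run(interval time.Duration, done chan<- struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	defer close(done)
	for {
		select {
		case <-r.interrupt:
			return
		case <-ticker.C:
			r.mutex.Lock()
			r.refreshes++
			r.mutex.Unlock()
		}
	}
}

// runAndStop starts the loop, interrupts it, and reports whether it exited.
func runAndStop() bool {
	r := &refresher{interrupt: make(chan struct{})}
	done := make(chan struct{})
	go r.run(10*time.Millisecond, done)
	time.Sleep(35 * time.Millisecond)
	close(r.interrupt) // closing the channel unblocks the select in run
	select {
	case <-done:
		return true
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	fmt.Println("loop stopped:", runAndStop())
}
```

Closing the channel, rather than sending on it, wakes every goroutine that selects on it, which is why `chan struct{}` is the usual choice for this kind of stop signal.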

marwanad · 2020-12-05T06:11:34Z

Were the before and after cases run with different TTLs? That is, was the before case run with a 15-second TTL and the after case with this PR's TTL of 60 seconds?

cluster-autoscaler/cloudprovider/azure/azure_cache.go

CecileRobertMichon · 2020-12-07T21:07:39Z

> Were the before and after cases run with different TTLs? That is, was the before case run with a 15-second TTL and the after case with this PR's TTL of 60 seconds?

Before there were 4 caches:

vmInstancesRefreshPeriod (list agent pool VMs): TTL 5 minutes
vmssSizeRefreshPeriod (list scale sets): TTL 15 seconds
vmssInstancesRefreshPeriod (list scale set VMs): TTL 5 minutes
manager asgCache (also calling list scale sets): TTL 1 minute

Now there is:

manager azureCache (lists VMs or VMSS depending on config): TTL 1 minute
vmssInstancesRefreshPeriod (list scale set VMs): TTL 5 minutes

So yes, one TTL changed: the scale set cache's default TTL for listing scale sets went from 15 seconds to 1 minute. The VmssCacheTTL config value is still honored, though, and overrides that 1-minute refresh interval if set.

CecileRobertMichon · 2020-12-10T17:28:06Z

@marwanad were you able to run the e2e test that @feiskyer mentioned above? I can also help with that if you show me how to do it.

marwanad · 2020-12-11T06:54:18Z

I'll have a soak cluster running over the weekend and will take another pass at the PR.

marwanad · 2020-12-14T17:11:56Z

/lgtm
/approve

k8s-ci-robot · 2020-12-14T17:12:12Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon, marwanad

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster-autoscaler/cloudprovider/azure/OWNERS~~ [marwanad]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the do-not-merge/work-in-progress, cncf-cla: yes, and size/XL labels Nov 24, 2020

k8s-ci-robot requested review from feiskyer, khenidak, marwanad and nilo19 November 24, 2020 00:34

CecileRobertMichon commented Nov 24, 2020

cluster-autoscaler/cloudprovider/azure/azure_scale_set.go

marwanad reviewed Nov 24, 2020

cluster-autoscaler/cloudprovider/azure/azure_cache.go

CecileRobertMichon commented Nov 24, 2020

cluster-autoscaler/cloudprovider/azure/azure_cache.go

CecileRobertMichon changed the title ~~[WIP] cleanup: refactor Azure cache and remove redundant API calls~~ cleanup: refactor Azure cache and remove redundant API calls Nov 30, 2020

k8s-ci-robot removed the do-not-merge/work-in-progress label Nov 30, 2020

CecileRobertMichon force-pushed the reduce-api-calls branch from 5904044 to ce915f9 Compare November 30, 2020 17:28

feiskyer reviewed Dec 2, 2020

cluster-autoscaler/cloudprovider/azure/azure_cache.go (outdated)

feiskyer reviewed Dec 2, 2020

CecileRobertMichon force-pushed the reduce-api-calls branch 2 times, most recently from 20a3ca3 to 9b59359 Compare December 2, 2020 22:56

k8s-ci-robot assigned nilo19 Dec 3, 2020

k8s-ci-robot added the lgtm label Dec 3, 2020

CecileRobertMichon force-pushed the reduce-api-calls branch from 9b59359 to a0ba266 Compare December 3, 2020 17:36

k8s-ci-robot added the size/XXL label and removed the lgtm and size/XL labels Dec 3, 2020

CecileRobertMichon force-pushed the reduce-api-calls branch 2 times, most recently from 1679e5d to 653137c Compare December 4, 2020 17:13

marwanad reviewed Dec 5, 2020

cluster-autoscaler/cloudprovider/azure/azure_cache.go (outdated)

marwanad reviewed Dec 5, 2020

cluster-autoscaler/cloudprovider/azure/azure_cache.go

cleanup: refactor Azure cache and remove redundant API calls

28badba

CecileRobertMichon force-pushed the reduce-api-calls branch from 653137c to 28badba Compare December 7, 2020 18:55

k8s-ci-robot assigned marwanad Dec 14, 2020

k8s-ci-robot added the lgtm label Dec 14, 2020

k8s-ci-robot added the approved label Dec 14, 2020

k8s-ci-robot merged commit 7af23ba into kubernetes:master Dec 14, 2020

marwanad mentioned this pull request Mar 24, 2021

fix: add missing call to fetch autodiscovered nodegroups #3972

Merged

towca mentioned this pull request Sep 30, 2021

Cluster Autoscaler: unit tests not passing on 1.19 and 1.20 #4368

Closed

marwanad mentioned this pull request Dec 20, 2021

improve logging for scale set size changes #4541

Merged

marwanad mentioned this pull request Feb 16, 2022

azure vmss cache fixes and improvements #4685

Merged
