cleanup: refactor Azure cache and remove redundant API calls #3717

Merged
merged 1 commit into from
Dec 14, 2020

Conversation

CecileRobertMichon
Member

@CecileRobertMichon CecileRobertMichon commented Nov 24, 2020

This PR cleans up the Azure cloud provider cache to optimize API calls and facilitate further improvements.

Currently, multiple Azure VMSS List calls are made throughout the code, such as here and here. In addition, agent_pools and aks providers both have their own code to list VMs.

This PR moves all VM and VMSS List() API calls into a single cache that refreshes every minute, so we should see 60 VM list and 60 VMSS list API calls per hour, regardless of the number of agent pools or scale sets. The other VMSS calls are 1) capacity increases/decreases for scale-ups and scale-downs, and 2) VMSS VM list (which lists all the instances in a scale set). Call 2) is still quite costly and grows linearly with the number of scale sets; I'd like to tackle it next once this PR merges. For agent pools, the only other calls are Deletes and Creates to add/remove VMs.

Here is some initial data I gathered. Both of these clusters have 10 scale sets. The first one is running the latest cluster-autoscaler release, and the second is running an image built from this PR's code. The charts show the number of API calls per 5 minutes:

Screen Shot 2020-11-23 at 5 14 39 PM

Screen Shot 2020-11-23 at 5 14 51 PM

With 20 scale sets when autoscaler is idle (no workloads running):
Before (54-56 calls per 5 minutes):
Screen Shot 2020-11-24 at 6 24 11 PM

After (50 calls per 5 minutes):
Screen Shot 2020-11-24 at 6 23 52 PM

@k8s-ci-robot added the do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress), cncf-cla: yes (indicates the PR's author has signed the CNCF CLA), and size/XL (denotes a PR that changes 500-999 lines, ignoring generated files) labels on Nov 24, 2020
@CecileRobertMichon
Member Author

/cc @khenidak @marwanad @feiskyer

@CecileRobertMichon changed the title from "[WIP] cleanup: refactor Azure cache and remove redundant API calls" to "cleanup: refactor Azure cache and remove redundant API calls" on Nov 30, 2020
@k8s-ci-robot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Nov 30, 2020
@CecileRobertMichon
Member Author

@feiskyer @nilo19 please review

Member

@feiskyer feiskyer left a comment


LGTM in general

@marwanad since the changes here are not small, could you run an e2e test with the patch?

@feiskyer
Member

feiskyer commented Dec 2, 2020

Please fix the unit test failures:

# k8s.io/autoscaler/cluster-autoscaler/cloudprovider/azure [k8s.io/autoscaler/cluster-autoscaler/cloudprovider/azure.test]
cloudprovider/azure/azure_cache_test.go:31:26: not enough arguments in call to newAzureCache
	have (*azClient, string)
	want (*azClient, time.Duration, string, string)
cloudprovider/azure/azure_cache_test.go:49:26: not enough arguments in call to newAzureCache
	have (*azClient, string)
	want (*azClient, time.Duration, string, string)
cloudprovider/azure/azure_cache_test.go:60:26: not enough arguments in call to newAzureCache
	have (*azClient, string)
	want (*azClient, time.Duration, string, string)
cloudprovider/azure/azure_cloud_provider_test.go:71:31: not enough arguments in call to newAzureCache
	have (*azClient, string)
	want (*azClient, time.Duration, string, string)
cloudprovider/azure/azure_scale_set_test.go:44:22: undefined: defaultVmssSizeRefreshPeriod

@CecileRobertMichon force-pushed the reduce-api-calls branch 2 times, most recently from 20a3ca3 to 9b59359, on December 2, 2020 22:56
@nilo19
Member

nilo19 commented Dec 3, 2020

/lgtm

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) on Dec 3, 2020
@nilo19
Member

nilo19 commented Dec 3, 2020

the unit test still fails

@k8s-ci-robot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) and removed the lgtm ("Looks good to me", indicates that a PR is ready to be merged) and size/XL (denotes a PR that changes 500-999 lines, ignoring generated files) labels on Dec 3, 2020
@CecileRobertMichon force-pushed the reduce-api-calls branch 2 times, most recently from 1679e5d to 653137c, on December 4, 2020 17:13
// - limit repetitive Azure API calls.
type azureCache struct {
	mutex     sync.Mutex
	interrupt chan struct{}
Member


what is this channel used for?

Member Author


@marwanad
Member

marwanad commented Dec 5, 2020

Were the before and after cases running with different TTLs? That is, was the before case run with 15 seconds, while the after case is this PR (which has a TTL of 60 seconds)?

@CecileRobertMichon
Member Author

Were the before and after cases running with different TTLs? That is, was the before case run with 15 seconds, while the after case is this PR (which has a TTL of 60 seconds)?

Before, there were 4 caches:

  • vmInstancesRefreshPeriod (list agent pool VMs): TTL 5 minutes
  • vmssSizeRefreshPeriod (list scale sets): TTL 15 seconds
  • vmssInstancesRefreshPeriod (list scale set VMs): TTL 5 minutes
  • manager asgCache (also calling list scale sets): TTL 1 minute

Now there are:

  • manager azureCache (lists VMs or VMSS depending on config): TTL 1 minute
  • vmssInstancesRefreshPeriod (list scale set VMs): TTL 5 minutes

So yes, the TTL for listing scale sets in the scale set cache changed from 15 seconds to 1 minute by default. The VmssCacheTTL config value is still honored, though, and overrides that 1 minute refresh interval if set.

@CecileRobertMichon
Member Author

@marwanad were you able to run the e2e test that @feiskyer mentioned above? I can also help with that if you show me how to do it.

@marwanad
Member

will have a soak cluster over the weekend and take another pass at the PR.

@marwanad
Member

/lgtm
/approve

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) on Dec 14, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon, marwanad


@k8s-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Dec 14, 2020
@k8s-ci-robot merged commit 7af23ba into kubernetes:master on Dec 14, 2020
Labels
  • approved (indicates a PR has been approved by an approver from all required OWNERS files)
  • cncf-cla: yes (indicates the PR's author has signed the CNCF CLA)
  • lgtm ("Looks good to me", indicates that a PR is ready to be merged)
  • size/XXL (denotes a PR that changes 1000+ lines, ignoring generated files)

5 participants