-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cleanup: refactor Azure cache and remove redundant API calls #3717
cleanup: refactor Azure cache and remove redundant API calls #3717
Conversation
5904044
to
ce915f9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM in general
@marwanad since the changes here are not small, could you run an e2e test with the patch?
Please fix the unit test failures:
|
20a3ca3
to
9b59359
Compare
/lgtm |
the unit test still fails |
9b59359
to
a0ba266
Compare
1679e5d
to
653137c
Compare
// - limit repetitive Azure API calls. | ||
type azureCache struct { | ||
mutex sync.Mutex | ||
interrupt chan struct{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this channel used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely sure, it was there before https://github.com/kubernetes/autoscaler/pull/3717/files#diff-63481ca096e322d8f48e57a4b21089a30481cc3798e3dc123a1b4ba0938ceb1dL37
Were the before and after case running with a different TTL? So, before was run with 15 seconds and the after case is this PR (which has a TTL of 60 seconds) |
653137c
to
28badba
Compare
Before there were 4 caches:
Now there is:
So yes, one of the TTLs which was at 15s to list scale sets in the scale set cache changed from 15s to 1 minute by default. The |
will have a soak cluster over the weekend and take another pass at the PR. |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: CecileRobertMichon, marwanad The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This PR cleans up the Azure cloud provider cache to optimize API calls and facilitate further improvements.
Currently, multiple Azure VMSS List calls are made throughout the code, such as here and here. In addition, agent_pools and aks providers both have their own code to list VMs.
This centralizes all VM and VMSS List() API calls in one central cache which refreshes every minute, which means we should see 60 VM list and 60 VMSS list API calls per hour, regardless of # of agent pools/scale sets. The other calls for VMSS are 1) capacity increase/decrease for scale ups and scale downs, and 2) VMSS VM list (which lists all the instances in a scale set). Number 2) is still quite costly and increases linearly with the number of scale sets. I'd like to try and tackle it next once this PR merges. For agent pools, the only other calls are Deletes and Creates to add/remove VMs.
Here is some initial data I gathered. Both of these clusters have 10 scale sets. The first one is running the latest cluster-autoscaler release, and the second is running with an image built from this PR's code. Chart shows number of API calls per 5 minutes:
With 20 scale sets when autoscaler is idle (no workloads running):
Before (54-56 calls per 5 minutes):
After (50 calls per 5 minutes):