Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sync with upstream 1.22 #121

Conversation

himanshu-kun
Copy link

@himanshu-kun himanshu-kun commented May 16, 2022

What this PR does / why we need it:

Which issue(s) this PR fixes:
Fixes #123

Special notes for your reviewer:

  • CA IT are running locally
  • will update SYNC_CHANGES

Release note:

sync the changes till v1.22.0 of upstream autoscaler

mcristina422 and others added 30 commits March 12, 2021 12:51
This change adds 4 metrics that can be used to monitor the minimum and
maximum limits for CPU and memory, as well as the current counts in
cores and bytes, respectively.

The four metrics added are:
* `cluster_autoscaler_cpu_limits_cores`
* `cluster_autoscaler_cluster_cpu_current_cores`
* `cluster_autoscaler_memory_limits_bytes`
* `cluster_autoscaler_cluster_memory_current_bytes`

This change also adds the `max_cores_total` metric to the metrics
proposal doc, as it was previously not recorded there.

User story: As a cluster autoscaler user, I would like to monitor my
cluster through metrics to determine when the cluster is nearing its
limits for cores and memory usage.
Now supported by magnum.
https://review.opendev.org/c/openstack/magnum/+/737580/

If using node group autodiscovery, older versions
of magnum will still forbid scaling to zero or
setting the minimum node count to zero.
Force refreshing everything at every DeleteNodes calls causes slow down
and throttling on large clusters with many ASGs (and lot of activity).

That function might be called several times in a row during scale-down
(once for each ASG having a node to be removed). Each time the forced
refresh will re-discover all ASGs, all LaunchConfigurations, then re-list all
instances from discovered ASGs.

That immediate refresh isn't required anyway, as the cache's DeleteInstances
concrete implementation will decrement the nodegroup size, and we can
schedule a grouped refresh for the next loop iteration.
Sets the `kubernetes.io/arch` (and legacy `beta.kubernetes.io/arch`)
to the proper instance architecture.

While at it, re-gen the instance types list (adding new instance types
that were missing)
The current implementation assumes MIG ids have the
"https://content.googleapis.com" prefix, while the
canonical id format seems to begin with "https://www.googleapis.com".
Both formats work while talking to the GCE API, but
the API returns the latter and other GCP services seem to
assume it as well.
Cluster Autoscaler GCE: change the format of MIG id
…ycloud-provider

cloudprovider: add Bizflycloud provider
Remove vivekbagade, add towca as an approver in cluster-autoscaler/OWNERS
…piles

aws: Don't pile up successive full refreshes during AWS scaledowns
Release leader election lock on shutdown
FetchAllMigs (unfiltered InstanceGroupManagers.List) is costly as it's not
bounded to MIGs attached to the current cluster, but rather lists all MIGs
in the project/zone, and therefore equally affects all clusters in that
project/zone. Running the calls concurrently over the region's zones (so at
most, 4 concurrent API calls, about once per minute) contains that impact.

findMigsInRegion might be scoped to the current cluster (name pattern),
but also benefits from the same improvement, as it's also costly and
called at each refreshInterval (1mn).

Also: we're calling out GCE mig.Get() API again for each MIG (at ~300ms per
API call, in my tests), sequentially and with the global cache lock held
(when updateClusterState -> ...-> GetMigForInstance kicks in). Yet we
already get that bit of information (MIG's basename) from any other
mig.Get or mig.List call, like the one fetching target sizes. Leveraging
this helps significantly on large fleets (for instance this shaves 8mn
startup time on the large cluster I tested on).
…locatables

support "/"separators in custom allocatable overrides via vmss tags
add stable zone labels in azure template generation
Document that TLS bootstrapping may be necessary for scale-up
Right now the file is breaking `go mod` commands.
@gardener-robot-ci-1 gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels May 24, 2022
@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label May 26, 2022
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label May 26, 2022
@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels May 27, 2022
@himanshu-kun himanshu-kun force-pushed the sync-with-upstream-1.22 branch from f71030b to e6dc0af Compare May 27, 2022 08:42
@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels May 27, 2022
@himanshu-kun himanshu-kun added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels May 30, 2022
@gardener-robot-ci-1 gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels May 30, 2022
@himanshu-kun himanshu-kun force-pushed the sync-with-upstream-1.22 branch from 8487624 to 03d4e3d Compare May 30, 2022 13:05
@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label May 30, 2022
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label May 30, 2022
@gardener-robot-ci-2 gardener-robot-ci-2 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels May 30, 2022
@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label May 30, 2022
@himanshu-kun himanshu-kun requested a review from ashwani2k May 30, 2022 13:27
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label May 30, 2022
Copy link

@ashwani2k ashwani2k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @himanshu-kun for the PR.
Looks good to me.

@himanshu-kun himanshu-kun merged commit b000e84 into gardener:machine-controller-manager-provider Jun 1, 2022
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Jun 1, 2022
@himanshu-kun himanshu-kun deleted the sync-with-upstream-1.22 branch June 1, 2022 08:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/review Needs review needs/second-opinion Needs second review by someone else size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py) status/closed Issue is closed (either delivered or triaged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sync with upstream v1.22.0