[cluster-autoscaler] Publish node group min/max metrics #4022

amrmahdi · 2021-04-21T07:28:52Z

This change publishes the node groups min and max count set on the cluster autoscaler. The current use case that these metrics are needed for is to create alerts when the number of nodes in the cluster reach a certain percentage of the max allowed nodes for example.

k8s-ci-robot · 2021-04-21T07:28:59Z

Welcome @amrmahdi!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

elmiko

although i like these kind of metrics, i think there will be issues with the cardinality around the node group IDs.

cluster-autoscaler/metrics/metrics.go

dntosas

very nice addition, one minor comment 👍

cluster-autoscaler/core/autoscaler.go

feiskyer · 2021-04-23T03:19:15Z

+1 as well to add those metrics. could you fix the CI failures?

amrmahdi · 2021-04-24T20:03:04Z

/test

k8s-ci-robot · 2021-04-24T20:03:17Z

@amrmahdi: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

feiskyer · 2021-04-25T00:44:03Z

/test

k8s-ci-robot · 2021-04-25T00:44:04Z

@feiskyer: No presubmit jobs available for kubernetes/autoscaler@master

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

feiskyer · 2021-04-25T00:44:54Z

@mwielgus @MaciekPytel any ideas the tests didn't run now for this PR?

elmiko · 2021-04-26T13:12:12Z

+1 as well to add those metrics.

i'm curious if we could get some consensus around using node group IDs in the metrics. i think there are other metrics that could benefit from using these identifiers, but i have been warned against using them in the past.

i think if we are going to agree that node group IDs are acceptable to use as metrics labels, then we should also establish guidance as to when it is appropriate. for example, this PR proposes a metric that will only get updated once (although this misses the notion of node groups being adjusted after initialization), is this usage of node group IDs acceptable? if this usage is acceptable, when is it unacceptable to use them?

amrmahdi · 2021-04-30T19:43:40Z

@mwielgus @MaciekPytel any ideas the tests didn't run now for this PR?

Can we get help on the CI? It seems it is waiting for a maintainer.

feiskyer · 2021-05-06T12:13:40Z

@mwielgus @MaciekPytel could you help to approve the github actions? The new changes from github actions make it required of approval first from maintainers (this is also true for other PRs).

amrmahdi · 2021-05-06T12:25:51Z

/assign @vivekbagade

elmiko

i like the addition of the flag, i'm happy with this solution.
/lgtm

elmiko

actually, i just thought about something. @amrmahdi would you mind adding something to the FAQ to mention the new flag please?

/lgtm cancel

amrmahdi · 2021-05-21T14:09:23Z

i just thought about something. @amrmahdi would you mind adding som

Yes I missed that. Thanks

cluster-autoscaler/metrics/metrics.go

MaciekPytel · 2021-05-21T14:52:02Z

Left one more comment, otherwise lgtm.

elmiko

thanks for the update
/lgtm

k8s-ci-robot · 2021-05-21T15:47:20Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: amrmahdi, elmiko, feiskyer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster-autoscaler/OWNERS~~ [feiskyer]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

amrmahdi · 2021-05-26T21:03:35Z

I approved the github workflows. I'll defer to @mwielgus on cardinality.

@towca Any more concerns?

amrmahdi · 2021-06-17T09:48:33Z

I approved the github workflows. I'll defer to @mwielgus on cardinality.

@towca Any more concerns?

@towca @mwielgus can we close on this PR? Thanks

towca · 2021-07-05T14:37:33Z

/unhold

Each test works in isolation, but they cause panic when the entire suite is run (ex. make test-in-docker), because the underlying metrics library panics when the same metric is registered twice.

Skipping metrics tests added in #4022

Each test works in isolation, but they cause panic when the entire suite is run (ex. make test-in-docker), because the underlying metrics library panics when the same metric is registered twice.

…e-1.21-nodegroup-minmax [cluster-autoscaler] backport #4022 Publish node group min/max metrics into 1.21

* Set maxAsgNamesPerDescribe to the new maximum value While this was previously effectively limited to 50, `DescribeAutoScalingGroups` now supports fetching 100 ASG per calls on all regions, matching what's documented: https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API_DescribeAutoScalingGroups.html ``` AutoScalingGroupNames.member.N The names of the Auto Scaling groups. By default, you can only specify up to 50 names. You can optionally increase this limit using the MaxRecords parameter. MaxRecords The maximum number of items to return with this call. The default value is 50 and the maximum value is 100. ``` Doubling this halves API calls on large clusters, which should help to prevent throttling. * Break out unmarshal from GenerateEC2InstanceTypes Refactor to allow for optimisation * Optimise GenerateEC2InstanceTypes unmarshal memory usage The pricing json for us-east-1 is currently 129MB. Currently fetching this into memory and parsing results in a large memory footprint on startup, and can lead to the autoscaler being OOMKilled. Change the ReadAll/Unmarshal logic to a stream decoder to significantly reduce the memory use. * use aws sdk to find region * Merge pull request kubernetes#4274 from kinvolk/imran/cloud-provider-packet-fix Cloud provider[Packet] fixes * Fix templated nodeinfo names collisions in BinpackingNodeEstimator Both upscale's `getUpcomingNodeInfos` and the binpacking estimator now uses the same shared DeepCopyTemplateNode function and inherits its naming pattern, which is great as that fixes a long standing bug. Due to that, `getUpcomingNodeInfos` will enrich the cluster snapshots with generated nodeinfos and nodes having predictable names (using template name + an incremental ordinal starting at 0) for upcoming nodes. Later, when it looks for fitting nodes for unschedulable pods (when upcoming nodes don't satisfy those (FitsAnyNodeMatching failing due to nodes capacity, or pods antiaffinity, ...), the binpacking estimator will also build virtual nodes and place them in a snapshot fork to evaluate scheduler predicates. Those temporary virtual nodes are built using the same pattern (template name and an index ordinal also starting at 0) as the one previously used by `getUpcomingNodeInfos`, which means it will generate the same nodeinfos/nodes names for nodegroups having upcoming nodes. But adding nodes by the same name in an existing cluster snapshot isn't allowed, and the evaluation attempt will fail. Practically this blocks re-upscales for nodegroups having upcoming nodes, which can cause a significant delay. * Improve misleading log Signed-off-by: Sylvain Rabot <sylvain@abstraction.fr> * dont proactively decrement azure cache for unregistered nodes * annotate fakeNodes so that cloudprovider implementations can identify them if needed * move annotations to cloudprovider package * Cluster Autoscaler 1.21.1 * CA - AWS - Instance List Update 03-10-21 - 1.21 release branch * CA - AWS - Instance List Update 29-10-21 - 1.21 release branch * Cluster-Autoscaler update AWS EC2 instance types with g5, m6 and r6 * CA - AWS Instance List Update - 13/12/21 - 1.21 * Merge pull request kubernetes#4497 from marwanad/add-more-azure-instance-types add more azure instance types * Cluster Autoscaler 1.21.2 * Add `--feature-gates` flag to support scale up on volume limits (CSI migration enabled) Signed-off-by: ialidzhikov <i.alidjikov@gmail.com> * [Cherry pick 1.21] Remove TestDeleteBlob UT Signed-off-by: Zhecheng Li <zhechengli@microsoft.com> * cherry-pick kubernetes#4022 [cluster-autoscaler] Publish node group min/max metrics * Skipping metrics tests added in kubernetes#4022 Each test works in isolation, but they cause panic when the entire suite is run (ex. make test-in-docker), because the underlying metrics library panics when the same metric is registered twice. (cherry picked from commit 52392b3) * cherry-pick kubernetes#4162 and kubernetes#4172 [cluster-autoscaler]Add flag to control DaemonSet eviction on non-empty nodes & Allow DaemonSet pods to opt in/out from eviction. * CA - AWS Cloud Provider - 1.21 Static Instance List Update 02-06-2022 * fix instance type fallback Instead of logging a fatal error, log a standard error and fall back to loading instance types from the static list. * Cluster Autoscaler - 1.21.3 release * FAQ updated * Sync_changes file updated Co-authored-by: Benjamin Pineau <benjamin.pineau@datadoghq.com> Co-authored-by: Adrian Lai <aidy@loathe.me.uk> Co-authored-by: darkpssngr <shreyas300691@gmail.com> Co-authored-by: Kubernetes Prow Robot <k8s-ci-robot@users.noreply.github.com> Co-authored-by: Sylvain Rabot <sylvain@abstraction.fr> Co-authored-by: Marwan Ahmed <marwanad@microsoft.com> Co-authored-by: Jakub Tużnik <jtuznik@google.com> Co-authored-by: GuyTempleton <guy.templeton@skyscanner.net> Co-authored-by: sturman <4456572+sturman@users.noreply.github.com> Co-authored-by: Maciek Pytel <maciekpytel@google.com> Co-authored-by: ialidzhikov <i.alidjikov@gmail.com> Co-authored-by: Zhecheng Li <zhechengli@microsoft.com> Co-authored-by: Shubham Kuchhal <shubham.kuchhal@india.nec.com> Co-authored-by: Todd Neal <tnealt@amazon.com>

Each test works in isolation, but they cause panic when the entire suite is run (ex. make test-in-docker), because the underlying metrics library panics when the same metric is registered twice.

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 21, 2021

k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 21, 2021

k8s-ci-robot requested review from aleksandra-malinowska and feiskyer April 21, 2021 07:29

elmiko reviewed Apr 21, 2021

View reviewed changes

cluster-autoscaler/metrics/metrics.go Show resolved Hide resolved

dntosas reviewed Apr 22, 2021

View reviewed changes

cluster-autoscaler/core/autoscaler.go Outdated Show resolved Hide resolved

jbartosik added the area/cluster-autoscaler label Apr 23, 2021

amrmahdi force-pushed the amrh/nodegroupminmaxmetrics branch 2 times, most recently from 51f0613 to 0e97a40 Compare April 24, 2021 19:34

amrmahdi closed this Apr 25, 2021

amrmahdi reopened this Apr 25, 2021

amrmahdi force-pushed the amrh/nodegroupminmaxmetrics branch 2 times, most recently from 51b4dc9 to 42ce8b7 Compare April 30, 2021 19:38

k8s-ci-robot assigned vivekbagade May 6, 2021

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 14, 2021

amrmahdi force-pushed the amrh/nodegroupminmaxmetrics branch from 42ce8b7 to 2b4f971 Compare May 17, 2021 19:22

[cluster-autoscaler] Publish node group min/max metrics

2bd7f0e

amrmahdi force-pushed the amrh/nodegroupminmaxmetrics branch from 2b4f971 to 2bd7f0e Compare May 17, 2021 19:27

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 17, 2021

amrmahdi added 2 commits May 20, 2021 16:49

Emit the node group metrics behind a flag

f5c2ab7

Update node group min/max on cloud provider refresh

3ac32b8

amrmahdi force-pushed the amrh/nodegroupminmaxmetrics branch from f2b343b to 3ac32b8 Compare May 21, 2021 00:36

amrmahdi requested review from towca and MaciekPytel May 21, 2021 00:37

elmiko approved these changes May 21, 2021

View reviewed changes

k8s-ci-robot assigned elmiko May 21, 2021

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 21, 2021

elmiko suggested changes May 21, 2021

View reviewed changes

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 21, 2021

MaciekPytel reviewed May 21, 2021

View reviewed changes

cluster-autoscaler/metrics/metrics.go Show resolved Hide resolved

Update FAQ to mention the new flag

8b2aee0

amrmahdi requested a review from elmiko May 21, 2021 15:33

elmiko approved these changes May 21, 2021

View reviewed changes

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 21, 2021

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 5, 2021

k8s-ci-robot merged commit 9f84d39 into kubernetes:master Jul 5, 2021

MaciekPytel mentioned this pull request Jul 8, 2021

Skipping metrics tests added in #4022 #4184

Merged

k8s-ci-robot added a commit that referenced this pull request Jul 8, 2021

Merge pull request #4184 from MaciekPytel/fix_ut

389cfd2

Skipping metrics tests added in #4022

k8s-ci-robot added a commit that referenced this pull request Apr 25, 2022

Merge pull request #4700 from mongodb-forks/cluster-autoscaler-releas…

0ad1731

…e-1.21-nodegroup-minmax [cluster-autoscaler] backport #4022 Publish node group min/max metrics into 1.21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[cluster-autoscaler] Publish node group min/max metrics #4022

[cluster-autoscaler] Publish node group min/max metrics #4022

amrmahdi commented Apr 21, 2021

k8s-ci-robot commented Apr 21, 2021

elmiko left a comment

dntosas left a comment

feiskyer commented Apr 23, 2021

amrmahdi commented Apr 24, 2021

k8s-ci-robot commented Apr 24, 2021

feiskyer commented Apr 25, 2021

k8s-ci-robot commented Apr 25, 2021

feiskyer commented Apr 25, 2021

elmiko commented Apr 26, 2021 •

edited

Loading

amrmahdi commented Apr 30, 2021

feiskyer commented May 6, 2021

amrmahdi commented May 6, 2021

elmiko left a comment

elmiko left a comment

amrmahdi commented May 21, 2021

MaciekPytel commented May 21, 2021

elmiko left a comment

k8s-ci-robot commented May 21, 2021

amrmahdi commented May 26, 2021

amrmahdi commented Jun 17, 2021

towca commented Jul 5, 2021

[cluster-autoscaler] Publish node group min/max metrics #4022

[cluster-autoscaler] Publish node group min/max metrics #4022

Conversation

amrmahdi commented Apr 21, 2021

k8s-ci-robot commented Apr 21, 2021

elmiko left a comment

Choose a reason for hiding this comment

dntosas left a comment

Choose a reason for hiding this comment

feiskyer commented Apr 23, 2021

amrmahdi commented Apr 24, 2021

k8s-ci-robot commented Apr 24, 2021

feiskyer commented Apr 25, 2021

k8s-ci-robot commented Apr 25, 2021

feiskyer commented Apr 25, 2021

elmiko commented Apr 26, 2021 • edited Loading

amrmahdi commented Apr 30, 2021

feiskyer commented May 6, 2021

amrmahdi commented May 6, 2021

elmiko left a comment

Choose a reason for hiding this comment

elmiko left a comment

Choose a reason for hiding this comment

amrmahdi commented May 21, 2021

MaciekPytel commented May 21, 2021

elmiko left a comment

Choose a reason for hiding this comment

k8s-ci-robot commented May 21, 2021

amrmahdi commented May 26, 2021

amrmahdi commented Jun 17, 2021

towca commented Jul 5, 2021

elmiko commented Apr 26, 2021 •

edited

Loading