Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove safeToStartCycle check #91

Merged

Conversation

MinyiZ
Copy link
Collaborator

@MinyiZ MinyiZ commented Jan 20, 2025

This section of code does not behave as intended. It was introduced in 2020 to attempt to fix a race condition between the observer and cluster autoscaler. Due to recent improvements, the race condition is now solved so this code can be removed.

Why does this code not work?
From https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/metrics.md#cluster-autoscaler-execution

last_activity records last time certain part of cluster autoscaler logic executed. Represented with unix timestamp. autoscaler-activity values are:

main - main loop iteration started.

autoscaling - current state of the cluster has been updated, started autoscaling logic.

scaleUp - autoscaler will check if scale up is necessary.

scaleDown - autoscaler will try to scale down some nodes.

The last_activity metric reflects whether cluster autoscaler has attempted to scale up or down. However, it does not necessarily indicate actual changes in the ASG, resulting in numerous false positives. This behaviour can unnecessarily block the creation of CNRs, even when it is safe to proceed with cycling.

@MinyiZ MinyiZ changed the title KUBE-9199: remove safeToStartCycle check remove safeToStartCycle check Jan 20, 2025
@MinyiZ MinyiZ force-pushed the mzhong/KUBE-9199-remove-prom-metrics-check-for-safetocycle branch 4 times, most recently from c95603b to 92b795b Compare January 20, 2025 05:37
@MinyiZ MinyiZ force-pushed the mzhong/KUBE-9199-remove-prom-metrics-check-for-safetocycle branch from 424f027 to 2d93170 Compare January 20, 2025 05:56
nodeStartupTime: rootCmd.PersistentFlags().Duration("node-startup-time", 2*time.Minute, "duration to wait after a cluster-autoscaler scaleUp event is detected"),
runImmediately: rootCmd.PersistentFlags().Bool("now", false, "makes the check loop run straight away on program start rather than wait for the check interval to elapse"),
runOnce: rootCmd.PersistentFlags().Bool("once", false, "run the check loop once then exit. also works with --now"),
prometheusAddress: rootCmd.PersistentFlags().String("prometheus-address", "prometheus", "Prometheus service address used to query cluster-autoscaler metrics"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discuss: is it worth allowing --prometheus-address to remain as a flags but be unused, in order to avoid erroring out people's installations when they automatically bump the version, or is it better for the version bump to immediately show this feature isn't used any more by this error message stopping it starting?

docker run --rm -it ghcr.io/atlassian-labs/cyclops:v1.10.1 cyclops --prometheus-address 127.0.0.1:8080
cyclops: error: unknown long flag '--prometheus-address', try --help

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think i prefer to remove the flag immediately because having a flag that does nothing can be confusing. leaving it would also require us to remember to remove it later. since this is already a breaking change, i think it's better to alert users to the change in behaviour right away with the flag removal, rather than silently ignoring it.

Copy link
Member

@vincentportella vincentportella left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@MinyiZ MinyiZ merged commit 913fc99 into master Jan 20, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants