fix(trafficrouting): Fix rollback behavior for canary with trafficrouting and .DynamicStableScale=true #4035

Open: ArenSH wants to merge 1 commit into base: master

@ArenSH ArenSH commented Jan 9, 2025

Currently, on rollback the controller always scales the stable ReplicaSet up to 100% while dynamically scaling the canary down. If the rollback or abort occurs at a later step of the rollout, this can cause various issues. For example, our deployment has several hundred pods, and a rollback can cause a surge of hundreds of new pods.

This fix ensures that when a rollout is aborted and DynamicStableScale is enabled, the StableRS dynamically scales up to 100% as NewRS scales down based on steps in reverse order.
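
To make the difference concrete, here is a minimal Go sketch of the replica arithmetic only. It is an illustrative model, not the controller code from this PR: `replicasFor`, `Snapshot`, `ReverseRollback`, and `ImmediatePeak` are hypothetical names, the step weights are made up, and rounding replica counts up is an assumption.

```go
package main

import "fmt"

// replicasFor returns ceil(total * weight / 100) for a weight in percent.
// Rounding up is an assumption made for this sketch.
func replicasFor(total, weight int) int {
	return (total*weight + 99) / 100
}

// Snapshot is a hypothetical view of both ReplicaSets at one rollback step.
type Snapshot struct {
	CanaryWeight int
	Canary       int
	Stable       int
}

// ReverseRollback models the behavior described above: starting from the last
// weight reached, it walks the earlier weights in reverse order and ends at 0,
// scaling stable up by the same share that canary gives back.
func ReverseRollback(total int, reachedWeights []int) []Snapshot {
	var out []Snapshot
	for i := len(reachedWeights) - 2; i >= -1; i-- {
		w := 0 // final state: all traffic and replicas back on stable
		if i >= 0 {
			w = reachedWeights[i]
		}
		out = append(out, Snapshot{
			CanaryWeight: w,
			Canary:       replicasFor(total, w),
			Stable:       replicasFor(total, 100-w),
		})
	}
	return out
}

// ImmediatePeak models the behavior without the fix: stable jumps to the full
// replica count while canary still runs at its current weight.
func ImmediatePeak(total, currentWeight int) int {
	return total + replicasFor(total, currentWeight)
}

func main() {
	total := 300
	weights := []int{20, 40, 60, 80} // hypothetical setWeight steps, aborted at 80

	fmt.Println("peak pods without reverse stepping:", ImmediatePeak(total, 80))
	for _, s := range ReverseRollback(total, weights) {
		fmt.Printf("canary weight %3d%%: canary=%d stable=%d total=%d\n",
			s.CanaryWeight, s.Canary, s.Stable, s.Canary+s.Stable)
	}
}
```

In this model, aborting a 300-replica rollout at 80% canary weight peaks at 540 pods when stable jumps straight to 100%, while the reverse-step walk keeps the total at 300 throughout.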



By spec, SetCanaryScale is used to make canary scaling diverge from the traffic weights (see https://argo-rollouts.readthedocs.io/en/stable/features/canary/#dynamic-canary-scale-with-traffic-routing). I therefore expect it to be used only as an intermediate step in a rollout, so it can be ignored.
Trying to guess and handle every combination of setCanaryScale with matchTrafficWeight/weight/replicas is complex, and I don't have a robust solution at hand (especially without actual use cases).



Since I don't have a comprehensive understanding of the argo-rollouts architecture, this fix might be somewhat clunky or suboptimal. Please feel free to suggest a better implementation.

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional with a list of types and scopes found here, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

…ting and .DynamicStableScale=true

Without the fix, controller always scales stable up to 100%, while dynamically scaling canary down.
If rollback or abort occurs on later steps of rollout, it can cause various issues, e.g. surge of hundreds of new pods.

This fix ensures that when a rollout is aborted and .DynamicStableScale is enabled,
the StableRS dynamically scales up to 100% as NewRS scales down based on steps in reverse order.

Signed-off-by: Armen Shakhbazian <armen.shakhbazian@gmail.com>

sonarqubecloud bot commented Jan 9, 2025

Quality Gate failed

Failed conditions
53.4% Duplication on New Code (required ≤ 40%)

See analysis details on SonarQube Cloud

Contributor

github-actions bot commented Jan 9, 2025

Published E2E Test Results

4 files, 4 suites, duration 3h 10m 58s ⏱️
113 tests: 102 passed ✅, 7 skipped 💤, 4 failed ❌
456 runs: 424 passed ✅, 28 skipped 💤, 4 failed ❌

For more details on these failures, see this check.

Results for commit 5be5b26.

Contributor

github-actions bot commented Jan 9, 2025

Published Unit Test Results

1 file, 128 suites, duration 2m 59s ⏱️
2 295 tests: 2 295 passed ✅, 0 skipped 💤, 0 failed ❌

Results for commit 5be5b26.

@zachaller zachaller added this to the v1.9 milestone Jan 13, 2025