fix(trafficrouting): Fix rollback behavior for canary with trafficrouting and .DynamicStableScale=true #4035
Currently, on rollback the controller always scales the stable ReplicaSet up to 100% while dynamically scaling the canary down. If the rollback or abort occurs at a later step of the rollout, this can cause various issues. For example, our deployment has several hundred pods, and a rollback can cause a surge of hundreds of new pods.
This fix ensures that when a rollout is aborted and `DynamicStableScale` is enabled, the StableRS dynamically scales up to 100% as the NewRS scales down, following the steps in reverse order.
By spec, `SetCanaryScale` is used to diverge from scaling the canary according to traffic weights (https://argo-rollouts.readthedocs.io/en/stable/features/canary/#dynamic-canary-scale-with-traffic-routing). Therefore, I expect it to be used only as an intermediate step in a rollout, one that can be ignored. Attempting to guess and handle all the combinations of setCanaryScale with matchWeights/weights/replicas is complex, and I don't have a robust solution at hand (especially without actual use cases).
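To make the intended behavior concrete, here is a minimal sketch of the reverse-order scaling described above. It is an illustrative model only, not the controller code from this PR: the helpers `replicasForWeight` and `rollbackPlan`, the replica count of 300, and the weights 20/40/80 are all hypothetical, and the rounding (each side rounded up) is an assumption.

```go
package main

import "fmt"

// replicasForWeight is a hypothetical helper (not an argo-rollouts API).
// Given the desired total replica count and the canary traffic weight (0-100),
// it returns pod counts for the canary and stable ReplicaSets, rounding each
// side up so that neither is under-provisioned for its share of traffic.
func replicasForWeight(total, canaryWeight int) (canary, stable int) {
	canary = (total*canaryWeight + 99) / 100       // ceiling division
	stable = (total*(100-canaryWeight) + 99) / 100 // ceiling division
	return canary, stable
}

// rollbackPlan is a hypothetical helper. It takes the canary weights that the
// rollout had already reached and returns them in reverse order, ending at 0,
// i.e. the sequence of canary weights to pass through during an abort.
func rollbackPlan(reachedWeights []int) []int {
	plan := make([]int, 0, len(reachedWeights)+1)
	for i := len(reachedWeights) - 1; i >= 0; i-- {
		plan = append(plan, reachedWeights[i])
	}
	return append(plan, 0)
}

func main() {
	total := 300                 // assumed deployment size
	reached := []int{20, 40, 80} // assumed setWeight steps passed before the abort

	for _, w := range rollbackPlan(reached) {
		canary, stable := replicasForWeight(total, w)
		fmt.Printf("canary weight %3d%%: canary=%d stable=%d total=%d\n",
			w, canary, stable, canary+stable)
	}
}
```

In this model the combined pod count stays at 300 at every stage of the abort. By contrast, scaling the stable ReplicaSet straight to 300 while the canary still holds 240 pods would temporarily require 540 pods, which is the kind of surge described above.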
Since I don't have a comprehensive understanding of the argo-rollouts architecture, this fix might be somewhat clunky or suboptimal. Please feel free to suggest a better implementation.