Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: remove condition where header routes can stay directed at empty service in preemption #3898

Merged

Conversation

schlags
Copy link
Contributor

@schlags schlags commented Oct 16, 2024

Attempts to address some concerns with behavior in:

It seems like we are forgetting to remove the managed routes when the following occurs:

  1. Rollout is stable on V1
  2. Rollout is updated with V2, and begins to progress
  3. Managed routes are defined (e.g. to send traffic with specific header values to the canary service)
  4. Rollout is updated with V3, and must start back at the beginning again where V1 is still considered stable (ditching any ongoing progress with V2)
  5. Canary service selector is swapped to V3 replicaset
  6. Traffic weights are set back to 0

It does not seem like any managed routes are considered between steps 4-6. I believe this code that I've made an addition to is where this would need to occur, but would love some feedback.

NOTE: I still don't think this fixes another situation where the service selector is swapped to the new replicaset before the traffic can be reconciled back at the 0% weight, but this at least allows the managed routes to fall back to the stable V1 version, instead of the behavior in the graph below:

image

c8e is V1 5ba is V2 and aa9 is V3. Note how for a client meeting the requirements for the managed route, it never actually falls back to V1 but simply 503's until the replicaset for V3 can finally mark ready (it is never taken away from the canary service)

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional with a list of types and scopes found here, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

…service in preemption

Signed-off-by: Dylan Schlager <dylan.schlager@lattice.com>
@schlags schlags force-pushed the schlags-removeManagedRoutesInPreemption branch from b54f980 to cb43cc5 Compare October 16, 2024 18:04
Copy link
Contributor

github-actions bot commented Oct 16, 2024

Published E2E Test Results

  4 files    4 suites   3h 16m 2s ⏱️
113 tests 103 ✅  7 💤 3 ❌
462 runs  425 ✅ 28 💤 9 ❌

For more details on these failures, see this check.

Results for commit 5ca1d5e.

♻️ This comment has been updated with latest results.

Copy link
Contributor

github-actions bot commented Oct 16, 2024

Published Unit Test Results

2 279 tests   2 279 ✅  2m 59s ⏱️
  128 suites      0 💤
    1 files        0 ❌

Results for commit 5ca1d5e.

♻️ This comment has been updated with latest results.

@maschwenk
Copy link

We had some downtime related to this so would be sweet to get a look!

@zachaller
Copy link
Collaborator

Can we get a unit test for this?

@zachaller zachaller added this to the v1.8 milestone Oct 17, 2024
@zachaller zachaller self-requested a review October 17, 2024 02:08
Signed-off-by: Dylan Schlager <dylan.schlager@lattice.com>
@schlags
Copy link
Contributor Author

schlags commented Oct 22, 2024

Can we get a unit test for this?
@zachaller

Added one just now, was able to verify that this test fails without the presence of that function call!

Signed-off-by: Dylan Schlager <dylan.schlager@lattice.com>
Copy link

f.ingressLister = append(f.ingressLister, ingressutil.NewLegacyIngress(ingress))

f.expectPatchRolloutAction(r3)
f.run(getKey(r3, t))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This first reconciliation shows that it skips scaling down the old RS since it is still referenced.

Comment on lines +1418 to +1425
r3.Status.Canary.Weights = &v1alpha1.TrafficWeights{
Canary: v1alpha1.WeightDestination{
PodTemplateHash: replicasetutil.GetPodTemplateHash(rs3),
},
}

f.expectUpdateReplicaSetAction(rs3)
f.run(getKey(r3, t))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We swap that reference here, and reconcile again to observe the scaling down intermediate RS log - where the else if c.newRS == nil || c.newRS.Status.AvailableReplicas == 0 condition is met while reconciling the traffic routing.

Adding the removal of the managed traffic routes here allows us to ensure that no traffic will be directed to that intermediate RS.

Copy link

codecov bot commented Oct 22, 2024

Codecov Report

Attention: Patch coverage is 66.66667% with 3 lines in your changes missing coverage. Please review.

Project coverage is 82.72%. Comparing base (806d07c) to head (5ca1d5e).
Report is 19 commits behind head on master.

Files with missing lines Patch % Lines
rollout/trafficrouting.go 66.66% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3898      +/-   ##
==========================================
- Coverage   83.92%   82.72%   -1.20%     
==========================================
  Files         163      163              
  Lines       18564    22895    +4331     
==========================================
+ Hits        15579    18939    +3360     
- Misses       2116     3084     +968     
- Partials      869      872       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@zachaller zachaller merged commit 53c4f12 into argoproj:master Nov 3, 2024
23 of 25 checks passed
@schlags schlags deleted the schlags-removeManagedRoutesInPreemption branch November 4, 2024 16:25
Rizwana777 pushed a commit to Rizwana777/argo-rollouts that referenced this pull request Dec 12, 2024
…service in preemption (argoproj#3898)

* fix: remove condition where header routes can stay directed at empty service in preemption

Signed-off-by: Dylan Schlager <dylan.schlager@lattice.com>

* add unit test

Signed-off-by: Dylan Schlager <dylan.schlager@lattice.com>

* lint

Signed-off-by: Dylan Schlager <dylan.schlager@lattice.com>

---------

Signed-off-by: Dylan Schlager <dylan.schlager@lattice.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants