✨ block move with annotation #8690
Conversation
I will take a look at this this week :)
I'm not sure the solution proposed here addresses #8473; I have summarized my understanding of the issue in #8473 (comment); if possible, let's continue the discussion on the issue first.
Just updated this with something closer to what I think @fabriziopandini had in mind. /retitle ✨ block move with annotation
@ykakarap PTAL if you're still interested in going through this. @fabriziopandini I added some more thoughts to the issue and updated this PR accordingly.
I updated the PR description to match how the current changes work.
(reminder to squash)
/retest
Sorry for taking so long to provide feedback.
Overall I think we are really near the finish line; a couple of suggestions from my side, but nothing really blocking.
Also, could you add the new annotation in https://cluster-api.sigs.k8s.io/reference/labels_and_annotations.html and a note in https://cluster-api.sigs.k8s.io/clusterctl/provider-contract.html#move; it might also be worth surfacing the new feature to other providers in https://main.cluster-api.sigs.k8s.io/developer/providers/migrations/v1.4-to-v1.5.html
}
key := client.ObjectKeyFromObject(obj)

if err := retryWithExponentialBackoff(backoff, func() error {
As it is implemented, the backoff applies to every CR; thus, assuming that we are moving 100 objects, this can result in 100 * 5sec, which is a lot. Is that the expected behavior?
That's what I had in mind for the time being. I generally wouldn't expect this to ever wait for N resources * timeout for one resource, since while clusterctl is waiting for one resource to be paused, the provider will continue doing work out-of-band to pause that and other resources in parallel. Overall I think the total time this waits will trend toward the longest wait for an individual resource in that case.
That same parallelism could be applied to the waits here as well, but that didn't seem worth the additional complexity under the assumption that resources are going to be paused in parallel.
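To make that timing argument concrete, here is a toy simulation, not the PR's code: the unblock times and the fixed 5s poll interval are made-up assumptions, but they show that when the provider unblocks objects in parallel, a sequential wait finishes roughly when the slowest object unblocks rather than at the sum of the per-object waits.

```go
package main

import "fmt"

func main() {
	// Hypothetical times (seconds) at which the provider, working in
	// parallel, unblocks each object.
	unblockAt := []float64{3, 7, 12}
	// Simplified fixed polling interval; the real code uses exponential backoff.
	poll := 5.0

	now := 0.0
	for i, t := range unblockAt {
		// The mover waits on objects one at a time, but wall-clock time
		// keeps advancing for every object while it does so.
		for now < t {
			now += poll
		}
		fmt.Printf("obj%d observed unblocked at t=%.0fs\n", i+1, now)
	}
	fmt.Printf("total wait: %.0fs (slowest object unblocked at %.0fs)\n", now, unblockAt[len(unblockAt)-1])
}
```

With these numbers the whole sequence finishes at ~15s, close to the 12s of the slowest object, not 3+7+12.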
I get your point, but it is not ideal because ultimately the time users have to wait can vary significantly. However, I can be OK with this approach for the first iteration if we can get a better UX as suggested in other comments.
TL;DR: if (and only if) there is a block on an object, before starting the wait loop we should log "Waiting for the block to be removed on ..." or something similar that makes users aware of where the move process is blocked.
Based on my read of the code, it looks like the "resource is not ready to move" error returned here will get logged here, which happens before any time.Sleep:
log.V(5).Info("Retrying with backoff", "Cause", err.Error())
Does that cover what you describe?
Not really; this is the debug log of every retry and it is very noisy. It will also be printed for every object, not only the objects with an actual block on move.
So it does not comply with

> if (and only if) there is a block on an object, before starting the wait loop we should log "Waiting for the block to be removed on ..." or something similar that makes the users aware of where the move process is blocked at

It just provides debug/trace details on every step of the wait loop when you are already in it.
So you'd like to see a single log message indicating that at least one object is blocking the move?
- obj1, no block-move annotation --> no message at all (and possibly skip the wait loop entirely)
- obj2, block-move annotation --> before starting the wait loop, "Waiting for the block to be removed on ..."
- obj3-50, no block-move annotation --> no message at all (and possibly skip the wait loop entirely)
- obj51, block-move annotation --> before starting the wait loop, "Waiting for the block to be removed on ..."
- and so on

NOTE: this is consistent with most of the logs in clusterctl, where we inform about the next operation before starting it. That is even more important in this case, because otherwise the system will seem to be stuck waiting. The caveat here is that, according to the spec, we should wait only for objects with the annotation, not for all the objects (this will be much easier if you collect "wait for move" during discovery, as I have suggested in another comment on the same topic).
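A minimal sketch of the behavior being asked for, not the PR's actual code: the getAnnotations callback, the hasBlockMove helper, and the use of fmt.Printf in place of clusterctl's logger are all hypothetical stand-ins; the point is that the message is printed once, and only for objects that actually carry the block-move annotation, before their wait loop starts.

```go
package move

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/wait"
)

const blockMoveAnnotation = "clusterctl.cluster.x-k8s.io/block-move"

func hasBlockMove(annotations map[string]string) bool {
	_, ok := annotations[blockMoveAnnotation]
	return ok
}

// waitForUnblock returns immediately (no message, no wait loop) for objects
// without the annotation, and otherwise announces the wait before entering it.
func waitForUnblock(name string, getAnnotations func() (map[string]string, error), backoff wait.Backoff) error {
	annotations, err := getAnnotations()
	if err != nil {
		return err
	}
	if !hasBlockMove(annotations) {
		return nil
	}
	// Stand-in for clusterctl's logger.
	fmt.Printf("Waiting for the block to be removed on %s\n", name)
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		latest, err := getAnnotations()
		if err != nil {
			return false, err
		}
		return !hasBlockMove(latest), nil
	})
}
```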
That makes sense; I've updated this to do that. FWIW we need to enter the wait loop no matter what to determine if the annotation exists, but the loop is exited as soon as the annotation doesn't exist (which may be in the first iteration). You already suggested gathering whether the resource is blocked during object discovery, which could make this short-circuit earlier, but I'd like to do that as a follow-up if that's still acceptable.
> I'd like to do that as a follow-up if that's still acceptable

this would also be good to track in an issue
Opened #9196
@@ -595,6 +610,49 @@ func setClusterClassPause(proxy Proxy, clusterclasses []*node, pause bool, dryRu
	return nil
}

func waitReadyForMove(proxy Proxy, nodes []*node, dryRun bool, backoff wait.Backoff) error {
It would be great if we could collect "wait for move" during discovery; then we could make this a no-op for most of the objects, and implement a better UX for objects that are actually blocking move by printing "Waiting for ... to unblock move" before starting to wait.
Yeah that seems better. I can take a look at that as a follow-up.
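A rough sketch of that follow-up idea, under the assumption of a simplified node type (clusterctl's real object graph node looks different): record whether each object carries the annotation while the graph is being discovered, so the later wait phase can skip unannotated objects without extra API calls.

```go
package graph

const blockMoveAnnotation = "clusterctl.cluster.x-k8s.io/block-move"

// node is a simplified stand-in for clusterctl's object graph node.
type node struct {
	name        string
	annotations map[string]string
	blocksMove  bool
}

// markBlocksMove would run once per object during discovery.
func (n *node) markBlocksMove() {
	_, n.blocksMove = n.annotations[blockMoveAnnotation]
}

// nodesBlockingMove narrows the wait phase down to only the objects that
// actually block the move, making it a no-op for everything else.
func nodesBlockingMove(nodes []*node) []*node {
	blocked := []*node{}
	for _, n := range nodes {
		if n.blocksMove {
			blocked = append(blocked, n)
		}
	}
	return blocked
}
```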
@fabriziopandini Do you have any other thoughts on this?
@nojnhuh if I'm not wrong, this point is still open.
What I see in the code is instead the opposite: the operation starts to wait without giving any info to the users (so the user is blind to what is happening), and only when the wait is finished does it inform the user that it has finished waiting for an object.
Responded in the thread. I think that was the last piece of outstanding feedback.
Need to squash again
/lgtm
/assign @fabriziopandini
LGTM label has been added. Git tree hash: 1c1addfcd6f634bd5dd89787829658463bc4da08
/lgtm
@@ -333,6 +335,19 @@ func (o *objectMover) move(graph *objectGraph, toProxy Proxy, mutators ...Resour
		return errors.Wrap(err, "error pausing ClusterClasses")
	}

	log.Info("Waiting for all resources to be ready to move")
	// exponential backoff configuration which returns durations for a total time of ~2m.
	// Example: 0, 5s, 8s, 11s, 17s, 26s, 38s, 57s, 86s, 128s
Just curious, are these the actual numbers it would use or did you pull this from somewhere?
The jitter makes this inherently random to some degree, but I threw together something to simulate what these numbers could be and this is one example.
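For reference, a small program along these lines can reproduce numbers in that ballpark with apimachinery's wait.Backoff. The parameters below (5s initial duration, factor 1.5, 10% jitter, 10 steps) are my guess, not necessarily what the PR uses, and the jitter means every run prints a slightly different sequence.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// Assumed parameters; the PR's actual backoff configuration may differ.
	backoff := wait.Backoff{
		Duration: 5 * time.Second, // first delay
		Factor:   1.5,             // growth per step
		Jitter:   0.1,             // up to +10% randomness, so every run differs
		Steps:    10,              // number of attempts
	}

	// Mirror how wait.ExponentialBackoff schedules attempts: the first one is
	// immediate, then each retry sleeps for backoff.Step().
	delays := []time.Duration{0}
	for backoff.Steps > 1 {
		delays = append(delays, backoff.Step())
	}
	for _, d := range delays {
		fmt.Printf("%v ", d.Round(time.Second))
	}
	fmt.Println()
}
```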
/test pull-cluster-api-e2e-full-main
LGTM label has been added. Git tree hash: ef774513a8b0bd56b6b039d4ad1163fd56f6abf5
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: fabriziopandini
What this PR does / why we need it:
This PR adds a new well-known clusterctl.cluster.x-k8s.io/block-move annotation that, when defined on any resource to be moved, blocks clusterctl move from recreating any resources in the target cluster. Infrastructure providers can set this annotation ahead of time to give themselves time to do any extra work in reaction to a cluster being paused. More details on CAPZ's specific use case are here: #8473.

The expected flow is as follows:
1. The infrastructure provider sets the block-move annotation on any resource(s) in the first reconciliation.
2. clusterctl move starts.
3. clusterctl sets the Cluster's spec.paused to true, then waits for the block-move annotation to not be defined on any resource to be moved.
4. The infrastructure provider does whatever extra work it needs, then removes block-move.
5. clusterctl move continues.

A POC implementation can be found here, which I've verified with one of the self-hosted e2e tests: main...nojnhuh:cluster-api:wait-pause-docker
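As a rough illustration of the provider side of this flow (not CAPZ's or any provider's actual code), here is a pair of hypothetical helpers a provider might call at the start of reconciliation and again once its pause-related work is done. The annotation value is arbitrary, since per the flow above clusterctl only checks whether the annotation is defined.

```go
package provider

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const blockMoveAnnotation = "clusterctl.cluster.x-k8s.io/block-move"

// setBlockMove adds the annotation so that clusterctl move waits before
// recreating resources in the target cluster (step 1 of the flow above).
// The caller is expected to persist the change, e.g. via a patch.
func setBlockMove(obj metav1.Object) {
	annotations := obj.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[blockMoveAnnotation] = "" // presence is what matters, not the value
	obj.SetAnnotations(annotations)
}

// clearBlockMove removes the annotation once the provider has finished its
// pause-related work, letting clusterctl move continue (step 4 of the flow).
func clearBlockMove(obj metav1.Object) {
	annotations := obj.GetAnnotations()
	if annotations == nil {
		return
	}
	delete(annotations, blockMoveAnnotation)
	obj.SetAnnotations(annotations)
}
```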
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #8473