✨ block move with annotation #8690
Conversation
I will take a look at this this week :)
I'm not sure the solution proposed here addresses #8473; I have summarized my understanding of the issue in #8473 (comment); if possible, let's continue the discussion on the issue first.
Just updated this with something closer to what I think @fabriziopandini had in mind. /retitle ✨ block move with annotation
@ykakarap PTAL if you're still interested in going through this. @fabriziopandini I added some more thoughts to the issue and updated this PR accordingly.
I updated the PR description to match how the current changes work.
(reminder to squash)
/retest
Sorry for taking so long to provide feedback.
Overall I think we are really near the finish line; a couple of suggestions from my side, but nothing really blocking.
Also, could you add the new annotation in https://cluster-api.sigs.k8s.io/reference/labels_and_annotations.html and a note in https://cluster-api.sigs.k8s.io/clusterctl/provider-contract.html#move; it might also be worth surfacing the new feature to other providers in https://main.cluster-api.sigs.k8s.io/developer/providers/migrations/v1.4-to-v1.5.html
}
key := client.ObjectKeyFromObject(obj)

if err := retryWithExponentialBackoff(backoff, func() error {
As it is implemented, the backoff applies to every CR; thus, assuming that we are moving 100 objects, this can result in 100 * 5sec, which is a lot. Is that the expected behavior?
That's what I had in mind for the time being. I generally wouldn't expect this to ever wait for N resources * timeout for one resource, since while clusterctl is waiting for one resource to be paused, the provider will continue doing work out-of-band to pause that and other resources in parallel. Overall I think the total time this waits will trend toward the longest wait for an individual resource in that case.
That same parallelism could be applied to the waits here as well, but that didn't seem worth the additional complexity under the assumption that resources are going to be paused in parallel.
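To make that timing argument concrete, here is a toy simulation, not the PR's code: the unblock times and the fixed 5s poll interval are made-up assumptions, but they show that when the provider unblocks objects in parallel, a sequential wait finishes roughly when the slowest object unblocks rather than at the sum of the per-object waits.

```go
package main

import "fmt"

func main() {
	// Hypothetical times (seconds) at which the provider, working in
	// parallel, unblocks each object.
	unblockAt := []float64{3, 7, 12}
	// Simplified fixed polling interval; the real code uses exponential backoff.
	poll := 5.0

	now := 0.0
	for i, t := range unblockAt {
		// The mover waits on objects one at a time, but wall-clock time
		// keeps advancing for every object while it does so.
		for now < t {
			now += poll
		}
		fmt.Printf("obj%d observed unblocked at t=%.0fs\n", i+1, now)
	}
	fmt.Printf("total wait: %.0fs (slowest object unblocked at %.0fs)\n", now, unblockAt[len(unblockAt)-1])
}
```

With these numbers the whole sequence finishes at ~15s, close to the 12s of the slowest object, not 3+7+12.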
I get your point, but it is not ideal because ultimately the time users have to wait can vary significantly. However, I can be OK with this approach for the first iteration if we can get a better UX as suggested in other comments.
TL;DR: if (and only if) there is a block on an object, before starting the wait loop we should log "Waiting for the block to be removed on ..." or something similar that makes users aware of where the move process is blocked.
Based on my read of the code, it looks like the "resource is not ready to move" error returned here will get logged here, which happens before any time.Sleep:
log.V(5).Info("Retrying with backoff", "Cause", err.Error())
Does that cover what you describe?
Not really; this is the debug log of every retry and it is very noisy. It will also be printed for every object, not only the objects with an actual block on move.
So it does not comply with

> if (and only if) there is a block on an object, before starting the wait loop we should log "Waiting for the block to be removed on ..." or something similar that makes the users aware of where the move process is blocked at

It just provides debug/trace details on every step of the wait loop when you are already in it.
So you'd like to see a single log message indicating that at least one object is blocking the move?
- obj1, no block-move annotation --> no message at all (and possibly skip the wait loop entirely)
- obj2, block-move annotation --> before starting the wait loop, "Waiting for the block to be removed on ..."
- obj3-50, no block-move annotation --> no message at all (and possibly skip the wait loop entirely)
- obj51, block-move annotation --> before starting the wait loop, "Waiting for the block to be removed on ..."
- and so on

NOTE: this is consistent with most of the logs in clusterctl, where we inform about the next operation before starting it. That is even more important in this case, because otherwise the system will seem to be stuck waiting. The caveat here is that, according to the spec, we should wait only for objects with the annotation, not for all the objects (this will be much easier if you collect "wait for move" during discovery, as I have suggested in another comment on the same topic).
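A minimal sketch of the behavior being asked for, not the PR's actual code: the getAnnotations callback, the hasBlockMove helper, and the use of fmt.Printf in place of clusterctl's logger are all hypothetical stand-ins; the point is that the message is printed once, and only for objects that actually carry the block-move annotation, before their wait loop starts.

```go
package move

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/wait"
)

const blockMoveAnnotation = "clusterctl.cluster.x-k8s.io/block-move"

func hasBlockMove(annotations map[string]string) bool {
	_, ok := annotations[blockMoveAnnotation]
	return ok
}

// waitForUnblock returns immediately (no message, no wait loop) for objects
// without the annotation, and otherwise announces the wait before entering it.
func waitForUnblock(name string, getAnnotations func() (map[string]string, error), backoff wait.Backoff) error {
	annotations, err := getAnnotations()
	if err != nil {
		return err
	}
	if !hasBlockMove(annotations) {
		return nil
	}
	// Stand-in for clusterctl's logger.
	fmt.Printf("Waiting for the block to be removed on %s\n", name)
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		latest, err := getAnnotations()
		if err != nil {
			return false, err
		}
		return !hasBlockMove(latest), nil
	})
}
```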
That makes sense; I've updated this to do that. FWIW we need to enter the wait loop no matter what to determine if the annotation exists, but the loop is exited as soon as the annotation doesn't exist (which may be in the first iteration). You already suggested gathering whether the resource is blocked during object discovery, which could make this short-circuit earlier, but I'd like to do that as a follow-up if that's still acceptable.
> I'd like to do that as a follow-up if that's still acceptable

this would also be good to track in an issue
Opened #9196
@@ -595,6 +610,49 @@ func setClusterClassPause(proxy Proxy, clusterclasses []*node, pause bool, dryRu
	return nil
}

func waitReadyForMove(proxy Proxy, nodes []*node, dryRun bool, backoff wait.Backoff) error {
It would be great if we could collect "wait for move" during discovery; then we could make this a no-op for most of the objects, and implement a better UX for objects that are actually blocking move by printing "Waiting for ... to unblock move" before starting to wait.
Yeah that seems better. I can take a look at that as a follow-up.
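A rough sketch of that follow-up idea, under the assumption of a simplified node type (clusterctl's real object graph node looks different): record whether each object carries the annotation while the graph is being discovered, so the later wait phase can skip unannotated objects without extra API calls.

```go
package graph

const blockMoveAnnotation = "clusterctl.cluster.x-k8s.io/block-move"

// node is a simplified stand-in for clusterctl's object graph node.
type node struct {
	name        string
	annotations map[string]string
	blocksMove  bool
}

// markBlocksMove would run once per object during discovery.
func (n *node) markBlocksMove() {
	_, n.blocksMove = n.annotations[blockMoveAnnotation]
}

// nodesBlockingMove narrows the wait phase down to only the objects that
// actually block the move, making it a no-op for everything else.
func nodesBlockingMove(nodes []*node) []*node {
	blocked := []*node{}
	for _, n := range nodes {
		if n.blocksMove {
			blocked = append(blocked, n)
		}
	}
	return blocked
}
```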
@fabriziopandini Do you have any other thoughts on this?
@nojnhuh if I'm not wrong, this point is still open.
What I see in the code is instead the opposite: the operation starts to wait without giving any info to the users (so the user is blind to what is happening), and only when the wait is finished does it inform the user that it has finished waiting for an object.
Responded in the thread. I think that was the last piece of outstanding feedback.
Need to squash again
/lgtm
/assign @fabriziopandini
LGTM label has been added. Git tree hash: 1c1addfcd6f634bd5dd89787829658463bc4da08
/lgtm
@@ -333,6 +335,19 @@ func (o *objectMover) move(graph *objectGraph, toProxy Proxy, mutators ...Resour
		return errors.Wrap(err, "error pausing ClusterClasses")
	}

	log.Info("Waiting for all resources to be ready to move")
	// exponential backoff configuration which returns durations for a total time of ~2m.
	// Example: 0, 5s, 8s, 11s, 17s, 26s, 38s, 57s, 86s, 128s
Just curious, are these the actual numbers it would use or did you pull this from somewhere?
The jitter makes this inherently random to some degree, but I threw together something to simulate what these numbers could be and this is one example.
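For reference, a small program along these lines can reproduce numbers in that ballpark with apimachinery's wait.Backoff. The parameters below (5s initial duration, factor 1.5, 10% jitter, 10 steps) are my guess, not necessarily what the PR uses, and the jitter means every run prints a slightly different sequence.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// Assumed parameters; the PR's actual backoff configuration may differ.
	backoff := wait.Backoff{
		Duration: 5 * time.Second, // first delay
		Factor:   1.5,             // growth per step
		Jitter:   0.1,             // up to +10% randomness, so every run differs
		Steps:    10,              // number of attempts
	}

	// Mirror how wait.ExponentialBackoff schedules attempts: the first one is
	// immediate, then each retry sleeps for backoff.Step().
	delays := []time.Duration{0}
	for backoff.Steps > 1 {
		delays = append(delays, backoff.Step())
	}
	for _, d := range delays {
		fmt.Printf("%v ", d.Round(time.Second))
	}
	fmt.Println()
}
```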
/test pull-cluster-api-e2e-full-main
LGTM label has been added. Git tree hash: ef774513a8b0bd56b6b039d4ad1163fd56f6abf5
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: fabriziopandini
What this PR does / why we need it:
This PR adds a new well-known clusterctl.cluster.x-k8s.io/block-move annotation that, when defined on any resource to be moved, blocks clusterctl move from recreating any resources in the target cluster. Infrastructure providers can set this annotation ahead of time to give themselves time to do any extra work in reaction to a cluster being paused. More details on CAPZ's specific use case are here: #8473.

The expected flow is as follows:
1. The infrastructure provider sets the block-move annotation on any resource(s) in the first reconciliation.
2. clusterctl move starts.
3. clusterctl sets the Cluster's spec.paused to true, then waits for the block-move annotation to not be defined on any resource to be moved.
4. The infrastructure provider does whatever extra work it needs, then removes block-move.
5. clusterctl move continues.

A POC implementation can be found here, which I've verified with one of the self-hosted e2e tests: main...nojnhuh:cluster-api:wait-pause-docker
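As a rough illustration of the provider side of this flow (not CAPZ's or any provider's actual code), here is a pair of hypothetical helpers a provider might call at the start of reconciliation and again once its pause-related work is done. The annotation value is arbitrary, since per the flow above clusterctl only checks whether the annotation is defined.

```go
package provider

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const blockMoveAnnotation = "clusterctl.cluster.x-k8s.io/block-move"

// setBlockMove adds the annotation so that clusterctl move waits before
// recreating resources in the target cluster (step 1 of the flow above).
// The caller is expected to persist the change, e.g. via a patch.
func setBlockMove(obj metav1.Object) {
	annotations := obj.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[blockMoveAnnotation] = "" // presence is what matters, not the value
	obj.SetAnnotations(annotations)
}

// clearBlockMove removes the annotation once the provider has finished its
// pause-related work, letting clusterctl move continue (step 4 of the flow).
func clearBlockMove(obj metav1.Object) {
	annotations := obj.GetAnnotations()
	if annotations == nil {
		return
	}
	delete(annotations, blockMoveAnnotation)
	obj.SetAnnotations(annotations)
}
```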
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #8473