
🐛 Do not update KCP and MS status when unable to get workload cluster #10229

Closed
wants to merge 3 commits into from

Conversation

jessehu
Contributor

@jessehu jessehu commented Mar 6, 2024

What this PR does / why we need it:

Which issue(s) this PR fixes:
Fixes #10195

Test:

  • Added UT for KCP.Status update logic in the error scenario.
  • The existing UT code for the MS.Status update logic does not contain a fake Reconciler.Tracker object and does not cover the getNode logic, so it needs more changes to add a UT for this error scenario.
  • Ran an e2e test that creates a 3 CP & 3 worker cluster and then upgrades the CAPI components (based on CAPI 1.5.2) after the cluster is created, and observed the following expected behavior:
    • MD.Status.ReadyReplicas won't change from 3 to 0 when getMachineNode() hits the ErrClusterLocked error. The log "Requeuing because another worker has the lock on the ClusterCacheTracker" was printed (a sketch of this requeue pattern follows after this list).
    • KCP.Status.ReadyReplicas won't change from 3 to 0 when controlPlane.GetWorkloadCluster(ctx) hits the ErrClusterLocked error.
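
For reference, a minimal sketch of the requeue pattern behind that log line (illustrative only, not code from this PR; the helper name and requeue interval are assumptions, and the usual imports from sigs.k8s.io/cluster-api/controllers/remote and controller-runtime are assumed):

	// getRemoteClientOrRequeue shows how a reconciler can requeue instead of failing
	// when the ClusterCacheTracker is temporarily locked by another worker.
	func getRemoteClientOrRequeue(ctx context.Context, tracker *remote.ClusterCacheTracker, cluster client.ObjectKey) (client.Client, ctrl.Result, error) {
		remoteClient, err := tracker.GetClient(ctx, cluster)
		if err != nil {
			if errors.Is(err, remote.ErrClusterLocked) {
				// Another worker holds the lock for this cluster; retry shortly instead of
				// treating the cluster as unreachable (this is where the controllers log
				// "Requeuing because another worker has the lock on the ClusterCacheTracker").
				return nil, ctrl.Result{RequeueAfter: time.Minute}, nil
			}
			return nil, ctrl.Result{}, err
		}
		return remoteClient, ctrl.Result{}, nil
	}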

@k8s-ci-robot k8s-ci-robot added this to the v1.5 milestone Mar 6, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area PR is missing an area label labels Mar 6, 2024
@k8s-ci-robot
Contributor

Welcome @jessehu!

It looks like this is your first PR to kubernetes-sigs/cluster-api 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @jessehu. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 6, 2024
Contributor

@killianmuldoon killianmuldoon left a comment

Thanks for taking this on!

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 6, 2024
@@ -43,8 +43,7 @@ func (r *KubeadmControlPlaneReconciler) updateStatus(ctx context.Context, contro
 
 	// set basic data that does not require interacting with the workload cluster
 	controlPlane.KCP.Status.Replicas = replicas
-	controlPlane.KCP.Status.ReadyReplicas = 0
-	controlPlane.KCP.Status.UnavailableReplicas = replicas
+	controlPlane.KCP.Status.UnavailableReplicas = replicas - controlPlane.KCP.Status.ReadyReplicas
Contributor Author

Note: This change avoids resetting controlPlane.KCP.Status.ReadyReplicas from non-zero to 0 when GetWorkloadCluster(ctx) hits ErrClusterLocked or another error.

Contributor Author

BTW: https://github.com/kubernetes-sigs/cluster-api/blob/release-1.5/controlplane/kubeadm/internal/controllers/status.go#L93 uses replicas rather than desiredReplicas to calculate UnavailableReplicas. I'm not sure if it's a bug or on purpose. cc @fabriziopandini, please take a look.

Member

it would be odd to have two different ways to calculate the unavailable replicas within the same status update function.

Contributor Author

Hi @neolit123, it actually uses the same way to calculate the unavailable replicas. Line 46 only covers the case where KCP.Status.ReadyReplicas is 0, since the default value of UnavailableReplicas is 0 when the KCP is newly created. If ReadyReplicas is not 0, UnavailableReplicas should equal replicas - controlPlane.KCP.Status.ReadyReplicas (a small worked example follows the snippets below).

line 45-46:
	controlPlane.KCP.Status.Replicas = replicas
	controlPlane.KCP.Status.UnavailableReplicas = replicas - controlPlane.KCP.Status.ReadyReplicas

line 91-92:
	controlPlane.KCP.Status.ReadyReplicas = status.ReadyNodes
	controlPlane.KCP.Status.UnavailableReplicas = replicas - status.ReadyNodes
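
A small worked example of the two paths above (illustrative values only):

	replicas := int32(3)                    // machines owned by this KCP
	readyReplicas := int32(0)               // newly created KCP, nothing ready yet
	unavailable := replicas - readyReplicas // line 46 path: 3 - 0 = 3

	// later, once the workload cluster status is available:
	readyNodes := int32(2)
	unavailable = replicas - readyNodes // line 92 path: 3 - 2 = 1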

Contributor Author

Hi @neolit123 @fabriziopandini, could you please take some time to help move this forward? We need this patch to fix an issue hit in our CAPI-based project.

Contributor Author

@jessehu jessehu Apr 6, 2024

@sbueringer Got your point. If we do it this way, we may also need to move controlPlane.KCP.Status.Replicas = replicas to before line 96 controlPlane.KCP.Status.ReadyReplicas = status.ReadyNodes; otherwise only Status.Replicas increases or decreases by 1 while the other replica fields stay unchanged (e.g. UpdatedReplicas, ReadyReplicas, UnavailableReplicas, etc.), which seems unreasonable. Even if it's moved, the conditions are updated while no replica fields are updated (which is unreasonable, but maybe acceptable since it will be corrected in upcoming reconciles soon?).

If updating both Status.Replicas and Status.UnavailableReplicas before getting ClusterStatus, as this PR currently does, is not expected:

  1. it sounds unreasonable when replicas changes from 3 to 4 (when rolling out KCP): Replicas -> 4 & UnavailableReplicas -> 1 (incorrect)
  2. it sounds unreasonable when replicas changes from 3 to 2 (when a KCP Machine is manually deleted?): Replicas -> 2 & UnavailableReplicas -> -1 (should be 1)
    The root cause is that line 93 controlPlane.KCP.Status.UnavailableReplicas = replicas - status.ReadyNodes (and line 47 as well) should be controlPlane.KCP.Status.UnavailableReplicas = desiredReplicas - status.ReadyNodes (see the worked example below).
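
A worked example of the rollout case in point 1 (illustrative numbers):

	desiredReplicas := int32(3) // KCP spec.replicas
	replicas := int32(4)        // machines that exist during the rollout (old + new)
	readyNodes := int32(3)

	unavailableCurrent := replicas - readyNodes         // = 1, even though all 3 desired replicas are ready
	unavailableProposed := desiredReplicas - readyNodes // = 0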

Member

@sbueringer sbueringer Apr 8, 2024

Good point. Let's do the following:

  1. In the create case: let's set .status.replicas, .status.unavailableReplicas, .status.readyReplicas (we'll set readyReplicas to 0, but let's do it anyway for clarity)
  2. Then we set .status.version & conditions (l.55-82)
  3. Then let's get the cluster status and only afterwards modify the status replica fields (let's also move down the update of status updatedReplicas)

I think it makes sense to update the conditions in 2 as they are all based on information from the mgmt cluster.

Overall we're just missing a way to signal that we can't communicate with the workload cluster anymore, but we can follow up on that.
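
A very rough sketch of that ordering (field and method names follow the existing updateStatus code; version/condition handling and the exact create-case check are elided):

	// 1. Fields derivable purely from the management cluster (create case).
	controlPlane.KCP.Status.Replicas = replicas
	controlPlane.KCP.Status.ReadyReplicas = 0
	controlPlane.KCP.Status.UnavailableReplicas = replicas

	// 2. Set .status.version & conditions here (all based on mgmt cluster info).

	// 3. Only after the workload cluster is reachable, refine the replica fields.
	workloadCluster, err := controlPlane.GetWorkloadCluster(ctx)
	if err != nil {
		return errors.Wrap(err, "failed to create remote cluster client")
	}
	status, err := workloadCluster.ClusterStatus(ctx)
	if err != nil {
		return err
	}
	controlPlane.KCP.Status.UpdatedReplicas = int32(len(controlPlane.UpToDateMachines()))
	controlPlane.KCP.Status.ReadyReplicas = status.ReadyNodes
	controlPlane.KCP.Status.UnavailableReplicas = replicas - status.ReadyNodes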

Contributor Author

@sbueringer Got it. I pushed a commit to use Status.UnavailableReplicas = desiredReplicas - status.ReadyNodes, which should solve the problems we discussed above. Please take a look and see if it does. If not, I will make the changes as you described, but I'm not sure why we need to move down the update of status updatedReplicas, because it doesn't depend on the workload cluster status: controlPlane.KCP.Status.UpdatedReplicas = int32(len(controlPlane.UpToDateMachines()))

Member

The idea would have been to not update any status replica fields if we can't communicate with the workload cluster, so that they don't get into an inconsistent state with each other. I'll think a bit about it.

Contributor Author

Yup, I agree that the ideal way would have been to not update any status replica fields or condition fields if we can't communicate with the workload cluster.

@jessehu
Contributor Author

jessehu commented Mar 6, 2024

/area machineset

@k8s-ci-robot k8s-ci-robot added area/machineset Issues or PRs related to machinesets and removed do-not-merge/needs-area PR is missing an area label labels Mar 6, 2024
@sbueringer sbueringer removed this from the v1.5 milestone Mar 12, 2024
Member

@neolit123 neolit123 left a comment

general idea for the change LGTM
approval is up to CAPI maintainers.

Member

@neolit123 neolit123 left a comment

LGTM
leaving lgtm/approve to active maintainers

internal/controllers/machineset/machineset_controller.go (outdated review comment, resolved)
Member

@neolit123 neolit123 left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 28, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: d4abd1501740aab050ed6f27e26ac81af5433520

@sbueringer
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 5, 2024
@k8s-ci-robot k8s-ci-robot requested a review from neolit123 April 5, 2024 07:40
@k8s-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 5, 2024
@jessehu jessehu changed the base branch from release-1.5 to main April 5, 2024 07:40
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 5, 2024
@k8s-ci-robot

This comment was marked as outdated.

@k8s-ci-robot k8s-ci-robot removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 5, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from neolit123. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 5, 2024
@jessehu
Contributor Author

jessehu commented Apr 5, 2024

/retest

g.Expect(kcp.Status.ReadyReplicas).To(BeEquivalentTo(3))
g.Expect(kcp.Status.UnavailableReplicas).To(BeEquivalentTo(0))
g.Expect(kcp.Status.Ready).To(BeTrue())
}
Member

@sbueringer sbueringer Apr 5, 2024

Let's add an additional call to updateStatus here (rough sketch after the list):

  • drop one controlPlane.machines (or add another one)
  • call updateStatus
  • check that status.Replicas changed and the other ones stayed the same
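
A sketch of what that additional check could look like (machine name and exact assertions are illustrative, assuming the updateStatus signature from the diff above):

	// Drop one control plane machine, then run updateStatus again.
	delete(controlPlane.Machines, "machine-3") // illustrative machine name
	g.Expect(r.updateStatus(ctx, controlPlane)).To(Succeed())

	// Replicas reflects the new machine count; the workload-cluster-derived fields stay the same.
	g.Expect(kcp.Status.Replicas).To(BeEquivalentTo(2))
	g.Expect(kcp.Status.ReadyReplicas).To(BeEquivalentTo(3))
	g.Expect(kcp.Status.UnavailableReplicas).To(BeEquivalentTo(0))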

Contributor Author

Sure. Will add the case after the discussion is resolved: #10229 (comment)

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 9, 2024
Member

@fabriziopandini fabriziopandini left a comment

@jessehu sorry, but I don't think we should rush this fix in in its current state, for a couple of reasons:

  • the problem description in the issue is about the MD controller handling the cluster cache tracker lock while restarting the controllers.
  • the problem description in the issue also talks about KCP flipping from 3 to 2 and back to 3 while restarting controllers. This seems unrelated to the lock above (there is only one reconciler running in KCP, so there are no other reconcilers that could hold the lock for the same cluster); it means this is probably a different issue, but we probably don't have enough data to triage it.
  • the discussion on the issue led to a wider problem about how to handle "we cannot get the node state", and somehow shifted to the semantics of the replica fields, but this IMO requires more thinking / design.

I admit this is not ideal. We should try to do better as maintainers when providing guidance.

Unfortunately, the effect of this lack of analysis / scope creep on the issue is that the current PR might introduce more confusion (and not fix the original issue). I have tried to explain this in the comments below, but it is probably more productive to think about a way forward.

My suggestion is to focus on the specific issue about the MD controller handling the cluster cache tracker lock while restarting the controllers.
We can probably mitigate this with a simple change I have discussed with @sbueringer (at least for the happy case where there is connectivity to the workload cluster).

@@ -956,6 +960,9 @@ func (r *Reconciler) getMachineNode(ctx context.Context, cluster *clusterv1.Clus
 	}
 	node := &corev1.Node{}
 	if err := remoteClient.Get(ctx, client.ObjectKey{Name: machine.Status.NodeRef.Name}, node); err != nil {
+		if apierrors.IsNotFound(err) {
Member

@fabriziopandini fabriziopandini Apr 9, 2024

I'm not sure why we are ignoring not found.
Is that for the case when someone manually deletes a node? If so, please add a comment (but it also seems unrelated to the issue we are trying to fix).

Member

@sbueringer sbueringer Apr 9, 2024

Just for clarification.

The idea behind this one was to signal to the calling function that the node doesn't exist (vs. it just being an error), because not finding the Node is also valuable information (it basically comes down to the node not being ready).

This was added to preserve the previous behavior one level above (where we only logged the error before but still considered a not-found node a node that is not ready).

But yeah, it becomes pretty hard to understand.

@@ -882,8 +887,7 @@ func (r *Reconciler) updateStatus(ctx context.Context, cluster *clusterv1.Cluste
 
 		node, err := r.getMachineNode(ctx, cluster, machine)
 		if err != nil && machine.GetDeletionTimestamp().IsZero() {
-			log.Error(err, "Unable to retrieve Node status", "node", klog.KObj(node))
-			continue
+			return errors.Wrapf(err, "unable to retrieve the status of Node %s", klog.KObj(node))
Member

@fabriziopandini fabriziopandini Apr 9, 2024

According to the issue

We probably also want to have a timeout on this behaviour. If we haven't seen the Node in x time then we assume its status is unknown

But if I'm not wrong, with the current implementation we are going to freeze the replica count indefinitely, which could be confusing or eventually also wrong, given that the number of replicas/machines might change in the meantime.

Frankly speaking, in order to properly fix this issue I think we first need to figure out how to get/store the "last seen" information for each node.

Given that info, we can decide whether it is ok to flip the replica status or to wait (but I don't have a concrete idea of how to do so; I need some time to research it).

Member

@sbueringer sbueringer Apr 9, 2024

Important point that I discussed with Fabrizio. We're not only freezing the replica fields, we would also freeze other status fields like ObservedGeneration and conditions.

@@ -844,7 +848,8 @@ func (r *Reconciler) shouldAdopt(ms *clusterv1.MachineSet) bool {
 }
 
 // updateStatus updates the Status field for the MachineSet
-// It checks for the current state of the replicas and updates the Status of the MachineSet.
+// It checks for the current state of the replicas and updates the Status field of the MachineSet.
+// When unable to retrieve the Node status, it returns error and won't update the Status field of the MachineSet.
 func (r *Reconciler) updateStatus(ctx context.Context, cluster *clusterv1.Cluster, ms *clusterv1.MachineSet, filteredMachines []*clusterv1.Machine) error {
Member

@sbueringer sbueringer Apr 9, 2024

Here a proposed solution to the ErrClusterLocked issue: (that Fabrizio was referring to)

	// Retry getting a remote client.
	// Note: This is to ensure that we don't run into errors here just 
	// because multiple reconcilers try to create the client at the same time
	// (for the case where the workload cluster is reachable).
	var canCommunicateWithWorkloadCluster bool
	var remoteClient client.Client
	err = retry.OnError(wait.Backoff{
		Steps:    5,
		Duration: 200 * time.Millisecond,
		Factor:   1.0,
	}, func(err error) bool {
		// Retry as long as we get remote.ErrClusterLocked errors. 
		return errors.Is(err, remote.ErrClusterLocked)
	}, func() error {
		remoteClient, err = r.Tracker.GetClient(ctx, util.ObjectKey(cluster))
		if err != nil {
			return err
		}
		canCommunicateWithWorkloadCluster = true
		return nil
	})
	if err != nil {
		log.Error(err, "Unable to retrieve status of Nodes: failed to create remote client")
	}

	for _, machine := range filteredMachines {
		log := log.WithValues("Machine", klog.KObj(machine))

		if templateLabel.Matches(labels.Set(machine.Labels)) {
			fullyLabeledReplicasCount++
		}

		if machine.Status.NodeRef == nil {
			log.V(4).Info("Waiting for the machine controller to set status.nodeRef on the Machine")
			continue
		}

		if !canCommunicateWithWorkloadCluster {
			// Skip the rest of the for loop if we can't communicate with the workload cluster.
			continue
		}

		node, err := r.getMachineNode(ctx, remoteClient, machine)
		if err != nil && machine.GetDeletionTimestamp().IsZero() {
			log.Error(err, "Unable to retrieve Node status", "Node", klog.KObj(node))
			continue
		}

		if noderefutil.IsNodeReady(node) {
			readyReplicasCount++
			if noderefutil.IsNodeAvailable(node, ms.Spec.MinReadySeconds, metav1.Now()) {
				availableReplicasCount++
			}
		} else if machine.GetDeletionTimestamp().IsZero() {
			log.V(4).Info("Waiting for the Kubernetes node on the machine to report ready state")
		}
	}

Member

The idea is that when the workload cluster is reachable, we only get ErrClusterLocked for a very short amount of time (the time it takes to create a client). For this case it is good enough to simply retry creating the client.

We will fallback to the previous behavior only if the workload cluster is actually not reachable.

Contributor Author

@jessehu jessehu Apr 10, 2024

Cool! I'm not quite sure it's worth adding this much code to resolve the temporary inconsistency of some status replica and condition fields caused by the ErrClusterLocked error. The simple change in the current PR can solve this problem; there is some inconsistency, but it seems acceptable and won't cause issues as far as I can tell (maybe I missed some cases).
cc @fabriziopandini for awareness.

Member

@sbueringer and I are in total agreement, and we are trying to help re-focus the PR on the original issue. We also took some additional steps to help find a way forward by proposing an alternative solution.

With regard to the current PR, I already tried to explain my concerns, and I will be happy to add more if you have any doubts (and I already answered one).

But given the concerns above, I'm personally -1 to merge the PR in its current state.

Instead, we both think the change proposed by @sbueringer solves the original issue, but ultimately it is up to you to accept it or not.

Contributor Author

@jessehu jessehu Apr 11, 2024

Thanks @fabriziopandini for the details. Here are my two cents:

  1. Should this new if block return err instead of continue, since canCommunicateWithWorkloadCluster won't change in the for loop?
		if !canCommunicateWithWorkloadCluster {
			// Skip the rest of the for loop if we can't communicate with the workload cluster.
			continue
		}
  2. There are many places that return ErrClusterLocked (not only in MS controller and KCP controller). If we want this retry logic, shall we add it in other places as well? Or just here to resolve the original status updating issue?
  3. The current PR code returns err in the for loop in updateStatus() in the MS controller when hitting ErrClusterLocked or another error. In this case, it won't update MS.Status because the status update code is after the for loop at line 903 newStatus.Replicas = int32(len(filteredMachines)).
    However, the KCP controller is different and already updates some status fields before hitting ErrClusterLocked.
machineset_controller.go:

		if err != nil && machine.GetDeletionTimestamp().IsZero() {
-		    log.Error(err, "Unable to retrieve Node status", "node", klog.KObj(node))
-			continue
+			return errors.Wrapf(err, "unable to retrieve the status of Node %s", klog.KObj(node))
		}

[screenshot omitted]

Contributor Author

One more thing: I'm not sure whether canCommunicateWithWorkloadCluster == true can 100% guarantee that getMachineNode() won't hit ErrClusterLocked, since they are two sequential function calls rather than an atomic call?

Member

Should this new if block return err instead of continue, since canCommunicateWithWorkloadCluster won't change in the for loop?

Our main concern is basically that if we return an error here as soon as we hit ErrClusterLocked we don't update the status at all anymore. This should be okay in the happy path, which is that we can actually communicate with the workload cluster and it's just a matter of waiting until another goroutine successfully created the client so we can use it. The problem we see is if the workload cluster is actually not reachable. Because in that case we will just continuously return the error forever. In this case we "freeze" the values of status: FullyLabeledReplicas, ReadyReplicas, AvailableReplicas, ObservedGeneration and a few conditions.

The concern Fabrizio raised (and I didn't have on my radar before) is that if we freeze the status indefinitely in these cases (or rather until the workload cluster is reachable) this can be pretty confusing for users. So for this case we should actually have a more extensive solution which also covers signalling to users that we can't communicate with the workload cluster anymore.

What we were trying to suggest was a mitigation for the issue for the happy path, where we actually can talk to the workload cluster, but replicas are flipping only because of the way we create the client initially for a cluster.

There are many places that return ErrClusterLocked (not only in MS controller and KCP controller). If we want this retry logic, shall we add it in other places as well? Or just here to resolve the original status updating issue?

I'm not really sure about this one. Basically, we introduced the current "TryLock" over a "Lock" to make sure we don't block all reconciles for a cluster when a workload cluster is not reachable. The "Lock" we had before led to really problematic performance issues when some workload clusters (with a lot of Machines) were unreachable.
So I think we should be careful with introducing these retries as they can lead to serious performance degradation (basically by just introducing this here, every reconcile of a MachineSet of an unreachable cluster will take 1 second).

@sbueringer
Member

sbueringer commented Apr 11, 2024

@jessehu

Just trying to summarize the overall direction of my and Fabrizio's recent comments:

  • We realized that by not updating status at all as soon as we get ErrClusterLocked we freeze the status of unreachable clusters indefinitely. We think we need a more extensive solution for that which should include in some way signalling to the user via status that we can't communicate with the workload cluster anymore (e.g. just like Node conditions can flip to unknown).
  • But it probably requires more work and more extensive changes to figure out how to do this correctly
  • Because of this we were trying to figure out how we can solve your concrete issue for the "happy path" (== the workload cluster is actually reachable, the cluster cache tracker just temporarily returns ErrClusterLocked) in the meantime
  • For KCP, I don't understand yet how it is possible that we get ErrClusterLocked errors. I would like to understand what I'm missing there. (if we understand what is happening there we might be able to just avoid ErrClusterLocked errors in KCP altogether)

Sorry for the confusion overall, it's just a fairly complex topic and we're trying to figure out what the right thing to do is

@jessehu
Contributor Author

jessehu commented Apr 11, 2024

@sbueringer Thanks very much for the summary. I totally agree on bullet 1/2/3.

  • To achieve 1 (the ideal solution), it probably requires more work and more extensive changes (as you mentioned in 2);
  • To achieve 3 (the workable solution in most cases), the current PR is almost enough, except for the KCP status update patch that @fabriziopandini and you have concerns about.
  • For 4, I'm not 100% sure that the unexpected update of KCP.Status.ReadyReplicas is caused by the ErrClusterLocked error. However, the main branch code allows this possibility when hitting ErrClusterLocked or any other error at line 86 or 90:

[screenshots omitted]

@sbueringer
Member

sbueringer commented Apr 11, 2024

For 4, I'm not 100% sure that the unexpected update of KCP.Status.ReadyReplicas is caused by the ErrClusterLocked error. However, the main branch code allows this possibility when hitting ErrClusterLocked or any other error at line 86 or 90.

Yup absolutely no question that if we hit ErrClusterLocked, this happens. What I don't understand is how we can hit ErrClusterLocked. This should only be possible by concurrent access to the ClusterCacheTracker for the same cluster. But we only have one reconciler (KubeadmControlPlaneReconciler) in this controller and this one can't run concurrently for the same KCP/Cluster object.

I think what could help to figure this out is to add a log to print the call stack whenever we enter ClusterCacheTracker.GetClient (+ for which cluster GetClient is called)

Q: Are you also getting ErrClusterLocked printed in the KCP controller logs? (IIRC the issue had a log from the MachineSet controller)

@jessehu
Contributor Author

jessehu commented Apr 11, 2024

I observed the issue when upgrading the CAPI components. Could a rolling update of the KCP controller deployment produce the ClusterCacheTracker lock scenario?

@sbueringer
Member

To achieve 3 (the workable solution in most cases), the current PR is almost enough

For MachineSet specifically: it definitely solves the "happy path", but it makes the "unhappy path" worse by freezing the status indefinitely. Without the changes, the ready and available replicas drop to 0 when the communication breaks down. It's probably a matter of preference whether, in case of a communication breakdown, we prefer to keep the ready and available replicas as-is or drop them to 0. But another side effect is that we don't update ObservedGeneration anymore, which definitely seems wrong (similar for the conditions).

@sbueringer
Member

sbueringer commented Apr 11, 2024

I observed the issue when upgrading the CAPI components. Could a rolling update of the KCP controller deployment produce the ClusterCacheTracker lock scenario?

No, the lock is purely in-memory; basically it just ensures that only one worker (within a controller) at a time can call GetClient (which times out in 10 seconds when the workload cluster is unreachable). The goal there was that one unreachable cluster doesn't block a lot of workers (that would be especially bad if you have an unreachable cluster with a lot of Machines).
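
Conceptually, the mechanism looks roughly like this (not the actual ClusterCacheTracker code; names are illustrative):

	// Inside a GetClient-style call: only one worker per cluster may build the client at a time.
	if !t.clusterLock.TryLock(clusterKey) {
		// Someone else is already creating (or failing to create) the client for this cluster.
		return nil, remote.ErrClusterLocked
	}
	defer t.clusterLock.Unlock(clusterKey)

	// Creating the client can block for ~10s if the workload cluster is unreachable,
	// which is why concurrent workers are turned away instead of queued.
	return t.createClient(ctx, clusterKey)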

@smartxworks smartxworks closed this by deleting the head repository Apr 14, 2024
@jessehu
Contributor Author

jessehu commented Apr 14, 2024

I'm sorry for the confusion; the forked repo was deleted by someone else today by mistake. I will create a new PR to handle the MS status update issue first, as suggested.

Labels
area/machineset Issues or PRs related to machinesets cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
7 participants