🌱 ClusterCacheTracker: fix accessor deletion on health check failure #9025
Conversation
Signed-off-by: Stefan Büringer buringerst@vmware.com
/assign @fabriziopandini @chrischdi @killianmuldoon I think we should have a follow-up to ensure we improve the test coverage of this critical component for cases where workload clusters become unreachable
/cherry-pick release-1.5 This issue was introduced in the 1.5 cycle during the CR bump, so we shouldn't need it in v1.4. EDIT: PR to fix the other half of the issue: #9028
@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.5 in a new PR and assign it to you. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lgtm
LGTM label has been added. Git tree hash: de607ad529c83b17301f7729c470877bedc92a45
/assign @vincepri
/lgtm
nice find!
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: fabriziopandini
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/cherry-pick release-1.5
@killianmuldoon: once the present PR merges, I will cherry-pick it on top of release-1.5 in a new PR and assign it to you. In response to this:
@sbueringer: new pull request created: #9031 In response to this:
Just more context for future us: we don't need this on release-1.4 or older because there we could still precisely check for the error returned by the wait func. This changed, and with >= release-1.5 we can't differentiate between a wait func timeout and a client-go timeout.

1.4 code:

```go
err := wait.PollImmediateUntil(in.interval, runHealthCheckWithThreshold, ctx.Done())

// An error returned implies the health check has failed a sufficient number of
// times for the cluster to be considered unhealthy.
// NB. We are ignoring ErrWaitTimeout because this error happens when the channel
// is closed, which in this case happens when the cache is explicitly stopped.
if err != nil && err != wait.ErrWaitTimeout {
	t.log.Error(err, "Error health checking cluster", "Cluster", klog.KRef(in.cluster.Namespace, in.cluster.Name))
	t.deleteAccessor(ctx, in.cluster)
}
```

(But the old code was a bit brittle anyway, so I'm happy with the new one in any case.)
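For future readers, here is a minimal sketch of what the >= release-1.5 shape of this logic can look like (an illustration, not the exact code from this PR): newer client-go deprecates `wait.PollImmediateUntil` and the `wait.ErrWaitTimeout` sentinel, and a client-go timeout is no longer distinguishable from a poll cancellation by error value, so the code has to inspect the context instead. The surrounding identifiers (`t`, `in`, `runHealthCheckWithThreshold`) are assumed from the health-check goroutine above, with the condition func adapted to the context-aware signature.

```go
// Hypothetical sketch: runHealthCheckWithThreshold is assumed to be a
// wait.ConditionWithContextFunc, i.e. func(ctx context.Context) (bool, error).
err := wait.PollUntilContextCancel(ctx, in.interval, true, runHealthCheckWithThreshold)

// If the context was canceled, the accessor was stopped deliberately
// (e.g. the cache was shut down), so there is nothing to clean up.
if ctx.Err() != nil {
	return
}

// Any other error means the health check failed often enough for the
// cluster to be considered unhealthy, so drop the accessor.
if err != nil {
	t.log.Error(err, "Error health checking cluster", "Cluster", klog.KRef(in.cluster.Namespace, in.cluster.Name))
	t.deleteAccessor(ctx, in.cluster)
}
```

Checking `ctx.Err()` replaces the old sentinel comparison: the "stopped on purpose" case is detected via the context rather than a specific error value.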
/area clustercachetracker
Signed-off-by: Stefan Büringer buringerst@vmware.com
What this PR does / why we need it:
Which issue(s) this PR fixes: Part of #8948 (should fix the issue for main and release-1.5)