KCP reconcile hangs when the workload cluster API Server is unreachable #8948
Comments
/triage accepted I'll take a look at reproducing this. Thanks for reporting, and thanks for supplying clear steps to reproduce this issue. |
@killianmuldoon Have you tried to reproduce the problem? |
@killianmuldoon we can easily reproduce this issue in 1 Control Plane cluster by injecting a fault to ensure that the workload cluster apiserver is unreachable |
This problem does not seem to be reproduced in the latest code.
|
Hm no this code just ignores errors if the cache was explicitly shutdown. But we only do this after Cluster deletion or when we couldn't sync in createClient |
smartxworks@2f1d311
It looks like the error is indeed being ignored.
|
Thx! I think now I understand it. The problem is that the wait utils in k8s.io/apimachinery were changed with v0.27 (we picked it up with CR v0.15):
// ErrWaitTimeout is returned when the condition was not satisfied in time.
//
// Deprecated: This type will be made private in favor of Interrupted()
// for checking errors or ErrorInterrupted(err) for returning a wrapped error.
var ErrWaitTimeout = ErrorInterrupted(errors.New("timed out waiting for the condition"))
Previously we just checked for this error. But now we use the util func they recommended, and that one captures more:
func Interrupted(err error) bool {
	switch {
	case errors.Is(err, errWaitTimeout),
		errors.Is(err, context.Canceled),
		errors.Is(err, context.DeadlineExceeded):
		return true
	default:
		return false
	}
} |
But it doesn't make any sense because this change is only in main & release-1.5. It doesn't exist on release-1.4 |
I wonder if we should just always delete the accessor |
This should fix it for main and release-1.5 #9025 I'll take a look at release-1.4 now |
@Levi080513 @jessehu Just to get as much data as possible. With which versions of CAPI did you encounter this issue? |
v1.4.3, with the lazyRestMapper feature disabled. |
Okay so back to your issue on release-1.4. I think I know what's going on.
Initial client creation:
APIserver becomes unreachable
In a Reconcile after APIServer becomes unreachable
I think at this point it doesn't matter if the healthcheck deletes the clusterAccessor / client / cache as we'll just stay stuck. |
Yes, what you said is correct. |
Can I try to fix this? @sbueringer |
Oh sorry. Already working on a fix. PR should be up in a bit (want to get this up / reviewed / merged ASAP to get it into the release & patch releases on Tuesday) |
@jessehu @Levi080513 It would be super helpful if you can take a look at #9027 and potentially verify if the fix works for you (it works for me) |
Thanks a lot @sbueringer. We verified the PR works as expected. |
We'll cherry-pick both fixes. So with the next series of releases (next Tuesday) all should be fixed |
What steps did you take and what happened?
When upgrading KCP, I found that the KCP reconcile hangs. Even if the annotation is manually updated to trigger a reconcile, no reconcile log is printed.
KCP log
I killed the KCP process with kill -SIGQUIT and found that the KCP reconcile was stuck in Workload.EnsureResource and would not exit.
When analyzing this problem, I found that KCP obtains the workload cluster client through ClusterCacheTracker.GetClient. When this client is used to get a resource for the first time, it starts an informer to cache resources, and the get request does not return until informer.HasSynced=true.
When the workload cluster API Server is unreachable, the get request is blocked because informer.HasSynced=false.
After ClusterCacheTracker.healthCheckCluster fails, the informer cache no longer tries to synchronize, and informer.HasSynced stays false.
cluster-api/controllers/remote/cluster_cache_tracker.go
Lines 527 to 605 in 14b88ca
cluster-api/controllers/remote/cluster_cache_tracker.go
Lines 419 to 435 in 14b88ca
After that, even if the workload cluster API Server becomes reachable again, the get request stays blocked and causes the KCP reconcile to hang.
What did you expect to happen?
When the workload cluster API server is accessible, KCP can reconcile normally.
Cluster API version
v1.4.3
Kubernetes version
v1.24.13
Anything else you would like to add?
This issue can be easily reproduced by the following method.
start wait for bug
, inject a fault to ensure that the workload cluster apiserver is unreachable
remote/ClusterCacheTracker: Error health checking cluster
, the problem can be reproduced. Even if the workload cluster apiserver can be accessed later, KCP will not reconcile.
Label(s) to be applied
/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.