Cache sync timeout after upgrading to version 0.11.0 #1786
Comments
Do you have any conversion webhook for multi-version CRDs? (Actually this should be solved in 0.11; it used to cause cache sync timeouts before.)
@FillZpp Yes (and no): in one of the two failing controllers we have a multi-version CRD and a conversion webhook on the owner side. But it also fails for a controller that does not have multiple versions - at least not on the owner side. It might have a multi-version CRD with a conversion webhook on the owned side, though. Could that also be relevant?
So you do use the default two-minute timeout? Does it fail like that after two minutes, or at a different interval?
Thanks for challenging me to report this issue better! 😊 It seems like I didn't investigate our issue well enough, but now I think I have gotten a bit further. The issue is reproducible when enabling leader election, and occurs during shutdown on the non-leader replica. It seems like the non-leader attempts to start syncing its cache just before it is shut down - which is aborted, resulting in this annoying and misleading error message. The full log for a non-leader replica running controller-runtime v0.10.3:
And the full log for a non-leader replica running controller-runtime v0.11.0:
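For reference, a minimal sketch of a manager configured with leader election, roughly matching the setup described above. The lock ID and namespace are illustrative assumptions, not values from the actual project:

```go
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Enable leader election so only one replica runs the controllers;
	// the other replicas stay idle until they win the lock or shut down.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "example-controller-lock", // illustrative ID
		LeaderElectionNamespace: "example-system",          // illustrative namespace
	})
	if err != nil {
		os.Exit(1)
	}

	// Start blocks until the context is canceled, e.g. by SIGTERM during a
	// rolling update of the controller deployment.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```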
If the manager starts to terminate while the controllers and cache are starting, the cache sync will time out because of the canceled context (see controller-runtime/pkg/source/source.go, lines 172 to 180 at 273e608).
So the question is: what is the cause of your manager termination?
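As a rough illustration of the mechanism described above (hypothetical code, not the actual controller-runtime source; waitForSync is an invented helper), a sync wait derived from the manager's context surfaces a shutdown as a timeout error:

```go
package sketch

import (
	"context"
	"fmt"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// waitForSync bounds the cache sync wait by a timeout derived from the
// manager's context. If the parent ctx is canceled during shutdown,
// WaitForCacheSync returns false just as it would on a real timeout, and the
// caller reports "timed out waiting for cache to be synced".
func waitForSync(ctx context.Context, c cache.Cache, timeout time.Duration) error {
	syncCtx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	if !c.WaitForCacheSync(syncCtx) {
		return fmt.Errorf("timed out waiting for cache to be synced")
	}
	return nil
}
```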
@FillZpp That is quite easy to answer, since there is a new replica set available - as part of a normal rolling update of the controller deployment. As noted above, this happens when the non-leader is about to shut down. For some reason it seems to initiate a startup sequence as part of the shutdown, and that is doomed... If we compare the behavior on c-r 0.10.3 with c-r 0.11.0, it seems like some mechanics have changed, causing this behavior. Could it be that it thinks it has become the new leader? I would prefer if the new leader were elected from the newest replica set, but that is probably not possible to control. 😉 But at least I would expect this shutdown to be more graceful than to always end with an (IMO misleading) error in the logs.
So it happened while you were updating your controller deployment. When the old leader terminated, another old replica became leader and began to start its runnables. Then the deployment deleted it, so it terminated while the cache was starting. So... I don't think there is anything wrong with the cache sync being canceled; it is the expected behavior IMHO. Maybe you are expecting the log message to be "cache starting has been canceled" instead of "timed out waiting"?
I agree that this is a probable cause, and to some extent the expected behavior. But I would not expect it to be logged as an error. Would it be possible to fix that? A cancellation from the control plane is not an error, IMO.
@alvaroaleman What do you think? Does it make sense to check whether the context was canceled before reporting this as an error? (controller-runtime/pkg/internal/controller/controller.go, lines 195 to 205 at 273e608)
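A minimal sketch of the kind of check being suggested (hypothetical code, not the actual controller-runtime implementation; classifySyncError is an invented name):

```go
package sketch

import (
	"context"
	"errors"

	"github.com/go-logr/logr"
)

// classifySyncError distinguishes a sync wait that failed only because the
// parent context was canceled (manager shutdown) from a genuine timeout, and
// downgrades the former from an error to an informational log message.
func classifySyncError(ctx context.Context, log logr.Logger, syncErr error) error {
	if syncErr == nil {
		return nil
	}
	if errors.Is(ctx.Err(), context.Canceled) {
		// Shutdown in progress; not a real failure, so don't report an error.
		log.Info("cache sync aborted because the manager is shutting down")
		return nil
	}
	return syncErr
}
```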
After upgrading our controller's K8s dependencies to more recent versions (controller-runtime 0.11.0, K8s 0.23.1), we see an increased frequency of this type of error:
Any clues as to what might cause this, and how to fix it? The default cache timeout seems to be 2 minutes, and I think that should be (more than) enough time to sync the cache... 🤔 We see the problem in 2 out of 3 controllers.
Is there a change that might have caused this error to emerge, so that we just have to increase the cache timeout? After reverting to the previous versions (controller-runtime 0.10.3, K8s 0.22.6), this error does not occur at all.
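If increasing the timeout is worth trying, controller-runtime exposes a per-controller CacheSyncTimeout option (defaulting to two minutes). A sketch of how it could be set, assuming a typical builder-based setup; MyResource and MyReconciler are placeholders for the project's own types:

```go
package sketch

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

// setupWithLongerCacheSyncTimeout wires a controller with a five-minute cache
// sync timeout instead of the default two minutes. MyResource and MyReconciler
// are placeholders.
func setupWithLongerCacheSyncTimeout(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&MyResource{}).
		WithOptions(controller.Options{
			CacheSyncTimeout: 5 * time.Minute,
		}).
		Complete(&MyReconciler{})
}
```

Note that, per the discussion above, the error in this thread turned out to be caused by shutdown cancellation on the non-leader replica rather than by a slow sync, so raising the timeout would not have helped in this particular case.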