Cache sync timeout after upgrading to version 0.11.0 #1786
Comments
Do you have any conversion webhook for multi-version CRDs? (Actually this should be solved in 0.11; it used to cause cache sync timeouts before.)
@FillZpp Yes (and no): in one of the two failing controllers we have a multi-version CRD and a conversion webhook on the owner side. But it also fails for a controller that does not have multiple versions - at least not on the owner side. It might have a multi-version CRD with a conversion webhook on the owned side, though. Could that also be relevant?
So you do use the default two-minute timeout? Does it fail like that after two minutes, or at a different interval?
Thanks for challenging me to report this issue better! 😊 It seems like I didn't investigate our issue well enough, but now I think I have gotten a bit further. The issue is reproducible when enabling leader election, and occurs during shutdown on the non-leader replica. It seems like the non-leader attempts to start syncing its cache just before it is shut down - which is aborted, resulting in this annoying and misleading error message. The full log for a non-leader replica running controller-runtime v0.10.3:
And the full log for a non-leader replica running controller-runtime v0.11.0:
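For reference, a minimal sketch of a manager configured with leader election, roughly matching the setup described above. The lock ID and namespace are illustrative assumptions, not values from the actual project:

```go
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Enable leader election so only one replica runs the controllers;
	// the other replicas stay idle until they win the lock or shut down.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "example-controller-lock", // illustrative ID
		LeaderElectionNamespace: "example-system",          // illustrative namespace
	})
	if err != nil {
		os.Exit(1)
	}

	// Start blocks until the context is canceled, e.g. by SIGTERM during a
	// rolling update of the controller deployment.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```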
If the manager starts to terminate while the controllers and cache are starting, the cache sync will time out because of the canceled context (see controller-runtime/pkg/source/source.go, lines 172 to 180 at 273e608).
So the question is: what is the cause of your manager termination?
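As a rough illustration of the mechanism described above (hypothetical code, not the actual controller-runtime source; waitForSync is an invented helper), a sync wait derived from the manager's context surfaces a shutdown as a timeout error:

```go
package sketch

import (
	"context"
	"fmt"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// waitForSync bounds the cache sync wait by a timeout derived from the
// manager's context. If the parent ctx is canceled during shutdown,
// WaitForCacheSync returns false just as it would on a real timeout, and the
// caller reports "timed out waiting for cache to be synced".
func waitForSync(ctx context.Context, c cache.Cache, timeout time.Duration) error {
	syncCtx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	if !c.WaitForCacheSync(syncCtx) {
		return fmt.Errorf("timed out waiting for cache to be synced")
	}
	return nil
}
```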
@FillZpp That is quite easy to answer, since there is a new replica set available - as part of a normal rolling update of the controller deployment. As noted above, this happens when the non-leader is about to shut down. For some reason it seems to initiate a startup sequence as part of the shutdown, and that is doomed... If we compare the behavior on c-r 0.10.3 with c-r 0.11.0, it seems like some mechanics have changed, causing this behavior. Could it be that it thinks it has become the new leader? I would prefer if the new leader were elected from the newest replica set, but that is probably not possible to control. 😉 But at least I would expect this shutdown to be more graceful than to always end with an (IMO misleading) error in the logs.
So it happened while you were updating your controller deployment. When the old leader terminated, another old replica became leader and began to start its runnables. Then the deployment deleted it, so it terminated while the cache was starting. So... I don't think there is anything wrong with the cache sync being canceled; it is the expected behavior IMHO. Maybe you are expecting the log message to be "cache starting has been canceled" instead of "timed out waiting"?
I agree that this is a probable cause, and to some extent the expected behavior. But I would not expect it to be logged as an error. Would it be possible to fix that? A cancellation from the control plane is not an error, IMO.
@alvaroaleman What do you think? Does it make sense to check whether the context was canceled before reporting this as an error? (controller-runtime/pkg/internal/controller/controller.go, lines 195 to 205 at 273e608)
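A minimal sketch of the kind of check being suggested (hypothetical code, not the actual controller-runtime implementation; classifySyncError is an invented name):

```go
package sketch

import (
	"context"
	"errors"

	"github.com/go-logr/logr"
)

// classifySyncError distinguishes a sync wait that failed only because the
// parent context was canceled (manager shutdown) from a genuine timeout, and
// downgrades the former from an error to an informational log message.
func classifySyncError(ctx context.Context, log logr.Logger, syncErr error) error {
	if syncErr == nil {
		return nil
	}
	if errors.Is(ctx.Err(), context.Canceled) {
		// Shutdown in progress; not a real failure, so don't report an error.
		log.Info("cache sync aborted because the manager is shutting down")
		return nil
	}
	return syncErr
}
```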
After upgrading our controller's K8s dependencies to more recent versions (controller-runtime 0.11.0, K8s 0.23.1), we see an increased frequency of this type of error:
Any clues as to what might cause this, and how to fix it? The default cache timeout seems to be 2 minutes, and I think that should be (more than) enough time to sync the cache... 🤔 We see the problem in 2 out of 3 controllers.
Is there a change that might have caused this error to emerge, so that we just have to increase the cache timeout? After reverting to the previous versions (controller-runtime 0.10.3, K8s 0.22.6), this error does not occur at all.
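If increasing the timeout is worth trying, controller-runtime exposes a per-controller CacheSyncTimeout option (defaulting to two minutes). A sketch of how it could be set, assuming a typical builder-based setup; MyResource and MyReconciler are placeholders for the project's own types:

```go
package sketch

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

// setupWithLongerCacheSyncTimeout wires a controller with a five-minute cache
// sync timeout instead of the default two minutes. MyResource and MyReconciler
// are placeholders.
func setupWithLongerCacheSyncTimeout(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&MyResource{}).
		WithOptions(controller.Options{
			CacheSyncTimeout: 5 * time.Minute,
		}).
		Complete(&MyReconciler{})
}
```

Note that, per the discussion above, the error in this thread turned out to be caused by shutdown cancellation on the non-leader replica rather than by a slow sync, so raising the timeout would not have helped in this particular case.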