controller: increase sync and cleanup intervals #33

Merged
merged 2 commits into master from fix/cleanup-interval on Jul 1, 2022

Conversation

horazont (Collaborator) commented on Jul 1, 2022

Previously, we saw issues like the following (note that this log has
been captured without the log message which we just now added in the
parent commit):

```
W0621 08:49:17.637865       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0621 08:49:18.750799       1 controller.go:113] Setting up event handlers
I0621 08:49:18.752439       1 controller.go:182] Starting Load Balancer controller
I0621 08:49:18.757767       1 controller.go:185] Waiting for informer caches to sync
I0621 08:49:18.837646       1 controller.go:224] handleObject called
I0621 08:49:18.837667       1 controller.go:229] Processing object: monitoring/prometheus-stack-prometheus-node-exporter
[…]
I0621 08:49:18.841761       1 controller.go:229] Processing object: monitoring/prometheus-adapter
I0621 08:49:18.842126       1 controller.go:224] handleObject called
[…]
I0621 08:49:18.840476       1 controller.go:224] handleObject called
I0621 08:49:18.840516       1 controller.go:229] Processing object: k8s-svc-ingress/ingress-ingress-nginx-controller-admission
[…]
I0621 08:49:18.930515       1 controller.go:224] handleObject called
I0621 08:49:18.930578       1 controller.go:229] Processing object: k8s-svc-ingress/ingress-ingress-nginx-controller
I0621 08:49:18.932766       1 controller.go:224] handleObject called
I0621 08:49:18.932808       1 controller.go:229] Processing object: kube-system/ch-k8s-lbaas-controller
I0621 08:49:18.930315       1 controller.go:193] Starting workers
I0621 08:49:18.932896       1 controller.go:204] Started workers
I0621 08:49:18.932944       1 controller.go:212] Triggering periodic cleanup
I0621 08:49:18.933026       1 worker.go:277] Worker started
I0621 08:49:25.401770       1 worker.go:306] Successfully executed job UpdateConfigJob
I0621 08:49:25.401879       1 worker.go:306] Successfully executed job SyncServiceJob("monitoring/prometheus-stack-prometheus-node-exporter")
[…]
I0621 08:49:25.404497       1 worker.go:306] Successfully executed job SyncServiceJob("monitoring/prometheus-adapter")
I0621 08:49:25.404634       1 worker.go:306] Successfully executed job RemoveCleanupBarrier
I0621 08:49:25.680933       1 port_manager.go:239] Used ports=[]
I0621 08:49:25.681074       1 port_manager.go:255] Trying to delete port "90bc1474-e1ed-4d58-87b7-3d281ff0a191"
I0621 08:49:28.298263       1 port_manager.go:224] Trying to delete floating ip "5b0463e4-7c70-4c66-af46-6f2a204c8032"
I0621 08:49:29.162748       1 worker.go:306] Successfully executed job CleanupJob
I0621 08:49:29.163014       1 worker.go:306] Successfully executed job SyncServiceJob("k8s-svc-cert-manager/cert-manager-webhook")
```

The RemoveCleanupBarrier and CleanupJob are enqueued right before
`Starting workers`. Yet, we can see that they are executed in the middle
of a series of jobs enqueued between `Waiting for informer caches to
sync` and `Starting workers`.

In addition, we captured this log from an intermediate working state;
"Executing ... inline!" is not a log line you will find in the current
state of the code:

```
I0630 11:47:31.014386       1 controller.go:188] Waiting for informer caches to sync
I0630 11:47:31.070488       1 controller.go:231] handleObject called
I0630 11:47:31.074975       1 controller.go:236] Processing object: k8s-svc-cert-manager/cert-manager
I0630 11:47:31.075248       1 controller.go:271] Executing SyncServiceJob("k8s-svc-cert-manager/cert-manager") inline!
I0630 11:47:31.075474       1 controller.go:231] handleObject called
I0630 11:47:31.075675       1 controller.go:236] Processing object: k8s-svc-ingress/ingress-ingress-nginx-controller
I0630 11:47:31.075893       1 controller.go:271] Executing SyncServiceJob("k8s-svc-ingress/ingress-ingress-nginx-controller") inline!
I0630 11:47:31.076104       1 port_mapper.go:209] Lookup l3port[93d89231-c0c4-4650-8142-19b2a6fcf592]={map[]}
I0630 11:47:31.076323       1 port_mapper.go:211] Allocating port {TCP 80} to service k8s-svc-ingress/ingress-ingress-nginx-controller
I0630 11:47:31.076514       1 port_mapper.go:211] Allocating port {TCP 443} to service k8s-svc-ingress/ingress-ingress-nginx-controller
I0630 11:47:31.114802       1 controller.go:192] Informer caches are synchronized, enqueueing job to remove the cleanup barrier
```

As is obvious here, the WaitForCacheSync function returns *before*
all services have been seen; the cluster has far more than just these
two services.

Therefore, we cannot rely on the return of WaitForCacheSync as a marker
that all services have been observed. In addition, the first log is hard
to explain without assuming that the queue somehow reorders jobs (0.5
seconds between the handleObject call where the jobs are injected and
the log line indicating that the cleanup jobs have been enqueued seems
an awful lot for a race condition).
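
For reference, this is roughly the standard client-go wait we are relying on
(a minimal sketch; the factory wiring and names are illustrative, not this
controller's exact code). WaitForCacheSync only reports that the informers'
initial LIST has reached their local stores; event handlers registered on the
informer are notified asynchronously and may therefore still be running after
it returns, which matches what the logs above show.

```go
package controller

import (
	"fmt"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// waitForServiceCache is an illustrative sketch, not the controller's actual
// startup code: it waits until the Service informer's initial LIST has been
// written to the local store. Handlers added via AddEventHandler receive
// their notifications asynchronously, so "synced" does not mean "every
// Service has already been handled".
func waitForServiceCache(client kubernetes.Interface, stopCh <-chan struct{}) error {
	factory := informers.NewSharedInformerFactory(client, 0)
	serviceInformer := factory.Core().V1().Services().Informer()

	factory.Start(stopCh)

	if ok := cache.WaitForCacheSync(stopCh, serviceInformer.HasSynced); !ok {
		return fmt.Errorf("failed to wait for informer caches to sync")
	}
	return nil
}
```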

As a remedy, we decided on this slightly ugly solution: we let the
cleanup jobs run only after three sync intervals have passed. In
addition, we raise the sync interval to five minutes to avoid running
into scalability issues in clusters with many services to sync. A rough
sketch of the idea is shown below.
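
A minimal sketch of that gating, assuming a plain ticker-driven loop; the
constants and callbacks (syncInterval, cleanupBarrierTicks, enqueueSync,
enqueueCleanup) are illustrative names, not the controller's actual API:

```go
package controller

import (
	"log"
	"time"
)

const (
	// Assumed values matching the description above, not the actual constants.
	syncInterval        = 5 * time.Minute
	cleanupBarrierTicks = 3 // cleanup may run only after three full sync intervals
)

// runPeriodicJobs enqueues a sync on every tick, but holds back the cleanup
// until enough intervals have elapsed that every existing Service should have
// been observed and synced at least once.
func runPeriodicJobs(enqueueSync, enqueueCleanup func(), stopCh <-chan struct{}) {
	ticker := time.NewTicker(syncInterval)
	defer ticker.Stop()

	ticks := 0
	for {
		select {
		case <-ticker.C:
			ticks++
			enqueueSync()
			if ticks >= cleanupBarrierTicks {
				enqueueCleanup()
			} else {
				log.Printf("skipping cleanup: only %d of %d sync intervals elapsed", ticks, cleanupBarrierTicks)
			}
		case <-stopCh:
			return
		}
	}
}
```

Counting whole sync intervals instead of trusting WaitForCacheSync sidesteps
both the early return and the apparent queue reordering described above.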

This means that the first cleanup will now happen only 15 minutes after
the controller has started up. This is generally not a problem, because the
cleanup is only needed when services are deleted without the LBaaS
controller knowing about it. Deleting and recreating services will work
just fine (without new port allocations) as the LBaaS controller still
knows about the ports it used.

Fixes #30.
horazont force-pushed the fix/cleanup-interval branch from 2da6da6 to 3beea0a on July 1, 2022 07:30
horazont marked this pull request as ready for review on July 1, 2022 07:54
horazont merged commit 7d5310a into master on Jul 1, 2022

Successfully merging this pull request may close these issues.

Race condition or reordering of jobs during startup causes loss of L3 ports