Race condition or reordering of jobs during startup causes loss of L3 ports #30

Closed
horazont opened this issue Jun 22, 2022 · 0 comments · Fixed by #33

horazont commented Jun 22, 2022

Look at the following log (redacted for reasons, but the key idea is visible). I have cut it into snippets to point out the code locations we pass through.

```
W0621 08:49:17.637865       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0621 08:49:18.750799       1 controller.go:113] Setting up event handlers
I0621 08:49:18.752439       1 controller.go:182] Starting Load Balancer controller
I0621 08:49:18.757767       1 controller.go:185] Waiting for informer caches to sync
```

controller.go:185

At this point, we get a full initial sync and are presented with the list of service objects:

```
I0621 08:49:18.837646       1 controller.go:224] handleObject called
I0621 08:49:18.837667       1 controller.go:229] Processing object: monitoring/prometheus-stack-prometheus-node-exporter
[…]
I0621 08:49:18.841761       1 controller.go:229] Processing object: monitoring/prometheus-adapter
I0621 08:49:18.842126       1 controller.go:224] handleObject called
[…]
I0621 08:49:18.840476       1 controller.go:224] handleObject called
I0621 08:49:18.840516       1 controller.go:229] Processing object: k8s-svc-ingress/ingress-ingress-nginx-controller-admission
[…]
I0621 08:49:18.930515       1 controller.go:224] handleObject called
I0621 08:49:18.930578       1 controller.go:229] Processing object: k8s-svc-ingress/ingress-ingress-nginx-controller
```

controller.go:229

These all cause SyncServiceJob instances to be enqueued. Then the sync is eventually over:

```
I0621 08:49:18.932766       1 controller.go:224] handleObject called
I0621 08:49:18.932808       1 controller.go:229] Processing object: kube-system/ch-k8s-lbaas-controller
I0621 08:49:18.930315       1 controller.go:193] Starting workers
```

Starting workers is logged right after we enqueued the RemoveCleanupBarrier.

```
I0621 08:49:18.932896       1 controller.go:204] Started workers
I0621 08:49:18.932944       1 controller.go:212] Triggering periodic cleanup
I0621 08:49:18.933026       1 worker.go:277] Worker started
I0621 08:49:25.401770       1 worker.go:306] Successfully executed job UpdateConfigJob
I0621 08:49:25.401879       1 worker.go:306] Successfully executed job SyncServiceJob("monitoring/prometheus-stack-prometheus-node-exporter")
[…]
I0621 08:49:25.404497       1 worker.go:306] Successfully executed job SyncServiceJob("monitoring/prometheus-adapter")
I0621 08:49:25.404634       1 worker.go:306] Successfully executed job RemoveCleanupBarrier
I0621 08:49:25.680933       1 port_manager.go:239] Used ports=[]
I0621 08:49:25.681074       1 port_manager.go:255] Trying to delete port "90bc1474-e1ed-4d58-87b7-3d281ff0a191"
I0621 08:49:28.298263       1 port_manager.go:224] Trying to delete floating ip "5b0463e4-7c70-4c66-af46-6f2a204c8032"
I0621 08:49:29.162748       1 worker.go:306] Successfully executed job CleanupJob
I0621 08:49:29.163014       1 worker.go:306] Successfully executed job SyncServiceJob("k8s-svc-cert-manager/cert-manager-webhook")
I0621 08:49:29.163298       1 port_mapper.go:209] Lookup l3port[90bc1474-e1ed-4d58-87b7-3d281ff0a191]={map[]}
I0621 08:49:29.163464       1 port_mapper.go:211] Allocating port {TCP 80} to service k8s-svc-ingress/ingress-ingress-nginx-controller
I0621 08:49:29.163802       1 port_mapper.go:211] Allocating port {TCP 443} to service k8s-svc-ingress/ingress-ingress-nginx-controller
E0621 08:49:29.241823       1 worker.go:330] error processing job SyncServiceJob("k8s-svc-ingress/ingress-ingress-nginx-controller"): Resource not found: [GET https://network-f1a.cloudandheat.com:9696/v2.0/ports/90bc1474-e1ed-4d58-87b7-3d281ff0a191], error message: {"NeutronError": {"type": "PortNotFound", "message": "Port 90bc1474-e1ed-4d58-87b7-3d281ff0a191 could not be found.", "detail": ""}}; requeueing
```

And here we can observe what goes wrong: the RemoveCleanupBarrier is executed after monitoring/prometheus-adapter is synced, but before the k8s-svc-ingress/ingress-ingress-nginx-controller service has been synced even once!

That way, the CleanupJob does not see that the L3 port is used by that service and deletes it. Later on, when the service is synced, ch-k8s-lbaas tries to use the port and fails hard because it no longer exists.

That breaks one of the fundamental assumptions in ch-k8s-lbaas, namely that all services are synced at least once before RemoveCleanupBarrier is invoked (and thus before the first CleanupJob can run).

The question is now why this happens. I could imagine multiple scenarios:

  • There is a race condition, not observable in the logs, which causes cache.WaitForCacheSync to return early, so that the RemoveCleanupBarrierJob is enqueued too early.
  • The queue itself has somewhat unstable ordering (conceivable, because it also does job deduplication; maybe something is weird with the hashes?).

Either way, this needs more investigation and fixing. A workaround could involve not enqueueing a CleanupJob initially (only the barrier removal), causing it to be delayed by 17s (due to controller.go:202).
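
For illustration, here is a minimal sketch of the invariant this relies on; the type and function names below are hypothetical stand-ins, not the actual ch-k8s-lbaas code:

```go
package sketch

// Hypothetical stand-in for the port mapper/cleanup logic; the real
// ch-k8s-lbaas types differ. The cleanup barrier must only be lifted once
// every service from the initial listing has been synced at least once,
// otherwise the first CleanupJob sees their L3 ports as unused and
// deletes them.
type portCleaner struct {
	barrierLifted bool
}

// removeCleanupBarrier corresponds to RemoveCleanupBarrier: it is only
// safe to run after all initial SyncServiceJobs have been executed.
func (p *portCleaner) removeCleanupBarrier() {
	p.barrierLifted = true
}

// cleanup corresponds to CleanupJob: every existing port not marked as
// used is deleted, so a service that has not been synced yet loses its
// port here.
func (p *portCleaner) cleanup(existing []string, isUsed func(id string) bool, deletePort func(id string)) {
	if !p.barrierLifted {
		return // barrier still in place: never delete anything
	}
	for _, id := range existing {
		if !isUsed(id) {
			deletePort(id)
		}
	}
}
```

If the barrier can be lifted before every service has marked its port as used, the loop above deletes a port that is in fact still needed, which is exactly what the log shows.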

horazont added a commit that referenced this issue Jun 30, 2022
This avoids going through the queue, which appears to be re-ordering
jobs. The evidence for this has been collected in #30, but I'll try to
summarize it here. We see the following logs:

```
W0621 08:49:17.637865       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0621 08:49:18.750799       1 controller.go:113] Setting up event handlers
I0621 08:49:18.752439       1 controller.go:182] Starting Load Balancer controller
I0621 08:49:18.757767       1 controller.go:185] Waiting for informer caches to sync
I0621 08:49:18.837646       1 controller.go:224] handleObject called
I0621 08:49:18.837667       1 controller.go:229] Processing object: monitoring/prometheus-stack-prometheus-node-exporter
[…]
I0621 08:49:18.841761       1 controller.go:229] Processing object: monitoring/prometheus-adapter
I0621 08:49:18.842126       1 controller.go:224] handleObject called
[…]
I0621 08:49:18.840476       1 controller.go:224] handleObject called
I0621 08:49:18.840516       1 controller.go:229] Processing object: k8s-svc-ingress/ingress-ingress-nginx-controller-admission
[…]
I0621 08:49:18.930515       1 controller.go:224] handleObject called
I0621 08:49:18.930578       1 controller.go:229] Processing object: k8s-svc-ingress/ingress-ingress-nginx-controller
I0621 08:49:18.932766       1 controller.go:224] handleObject called
I0621 08:49:18.932808       1 controller.go:229] Processing object: kube-system/ch-k8s-lbaas-controller
I0621 08:49:18.930315       1 controller.go:193] Starting workers
I0621 08:49:18.932896       1 controller.go:204] Started workers
I0621 08:49:18.932944       1 controller.go:212] Triggering periodic cleanup
I0621 08:49:18.933026       1 worker.go:277] Worker started
I0621 08:49:25.401770       1 worker.go:306] Successfully executed job UpdateConfigJob
I0621 08:49:25.401879       1 worker.go:306] Successfully executed job SyncServiceJob("monitoring/prometheus-stack-prometheus-node-exporter")
[…]
I0621 08:49:25.404497       1 worker.go:306] Successfully executed job SyncServiceJob("monitoring/prometheus-adapter")
I0621 08:49:25.404634       1 worker.go:306] Successfully executed job RemoveCleanupBarrier
I0621 08:49:25.680933       1 port_manager.go:239] Used ports=[]
I0621 08:49:25.681074       1 port_manager.go:255] Trying to delete port "90bc1474-e1ed-4d58-87b7-3d281ff0a191"
I0621 08:49:28.298263       1 port_manager.go:224] Trying to delete floating ip "5b0463e4-7c70-4c66-af46-6f2a204c8032"
I0621 08:49:29.162748       1 worker.go:306] Successfully executed job CleanupJob
I0621 08:49:29.163014       1 worker.go:306] Successfully executed job SyncServiceJob("k8s-svc-cert-manager/cert-manager-webhook")
```

The RemoveCleanupBarrier and CleanupJob are enqueued right before
`Starting workers`. Yet, we can see that they are executed in the middle
of a series of jobs enqueued between `Waiting for informer caches to
sync` and `Starting workers`.

There are only two explanations:

- WaitForCacheSync (controller.go:186) returns before all calls to
  handleObject have completed, and the code between :186 and :193 then
  takes more than 0.5 seconds to execute (the difference between the
  timestamp at which `SyncServiceJob("monitoring/prometheus-adapter")`
  is enqueued and the timestamp of the `Starting workers` log message).

  In this case, the order in the queue would be incorrect, but at least
  unaltered by the queue. We can't know if this is the case without an
  additional log message in line 189, which we'll add in a follow-up
  commit.

- The queue is actually reordering the entries. That would be weird for
  a queue, and from a quick glance at the code of the queue I cannot
  see how it would do that. But chances are I missed something, because
  "generic" golang code is dark and full of indirections.

Either way, this should fix the loss of IPs for good.
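
To illustrate the approach, a minimal sketch of the "execute inline instead of enqueueing" idea, with assumed names rather than the repository's actual types:

```go
package sketch

// SyncServiceJob and Controller are stand-ins for the real job and
// controller types; only the control flow matters here.
type SyncServiceJob struct {
	Service string
}

type Controller struct {
	workersStarted bool
	enqueue        func(SyncServiceJob) // stand-in for the work queue
	execute        func(SyncServiceJob) // runs the job synchronously
}

// handleService sketches the fix: while the workers have not been started
// yet (i.e. during the initial listing, before the cleanup barrier is
// removed), the job is executed inline so the queue cannot reorder it
// past the barrier removal.
func (c *Controller) handleService(key string) {
	job := SyncServiceJob{Service: key}
	if !c.workersStarted {
		c.execute(job)
		return
	}
	c.enqueue(job)
}
```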

Fixes #30.
horazont added a commit that referenced this issue Jun 30, 2022
This is useful to find the root cause of #30. Please also observe the
parent commit message.
horazont added a commit that referenced this issue Jul 1, 2022
This is useful to find the root cause of #30.
horazont added a commit that referenced this issue Jul 1, 2022
Previously, we saw issues like the following (note that this log has
been captured without the log message which we just now added in the
parent commit):

```
W0621 08:49:17.637865       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0621 08:49:18.750799       1 controller.go:113] Setting up event handlers
I0621 08:49:18.752439       1 controller.go:182] Starting Load Balancer controller
I0621 08:49:18.757767       1 controller.go:185] Waiting for informer caches to sync
I0621 08:49:18.837646       1 controller.go:224] handleObject called
I0621 08:49:18.837667       1 controller.go:229] Processing object: monitoring/prometheus-stack-prometheus-node-exporter
[…]
I0621 08:49:18.841761       1 controller.go:229] Processing object: monitoring/prometheus-adapter
I0621 08:49:18.842126       1 controller.go:224] handleObject called
[…]
I0621 08:49:18.840476       1 controller.go:224] handleObject called
I0621 08:49:18.840516       1 controller.go:229] Processing object: k8s-svc-ingress/ingress-ingress-nginx-controller-admission
[…]
I0621 08:49:18.930515       1 controller.go:224] handleObject called
I0621 08:49:18.930578       1 controller.go:229] Processing object: k8s-svc-ingress/ingress-ingress-nginx-controller
I0621 08:49:18.932766       1 controller.go:224] handleObject called
I0621 08:49:18.932808       1 controller.go:229] Processing object: kube-system/ch-k8s-lbaas-controller
I0621 08:49:18.930315       1 controller.go:193] Starting workers
I0621 08:49:18.932896       1 controller.go:204] Started workers
I0621 08:49:18.932944       1 controller.go:212] Triggering periodic cleanup
I0621 08:49:18.933026       1 worker.go:277] Worker started
I0621 08:49:25.401770       1 worker.go:306] Successfully executed job UpdateConfigJob
I0621 08:49:25.401879       1 worker.go:306] Successfully executed job SyncServiceJob("monitoring/prometheus-stack-prometheus-node-exporter")
[…]
I0621 08:49:25.404497       1 worker.go:306] Successfully executed job SyncServiceJob("monitoring/prometheus-adapter")
I0621 08:49:25.404634       1 worker.go:306] Successfully executed job RemoveCleanupBarrier
I0621 08:49:25.680933       1 port_manager.go:239] Used ports=[]
I0621 08:49:25.681074       1 port_manager.go:255] Trying to delete port "90bc1474-e1ed-4d58-87b7-3d281ff0a191"
I0621 08:49:28.298263       1 port_manager.go:224] Trying to delete floating ip "5b0463e4-7c70-4c66-af46-6f2a204c8032"
I0621 08:49:29.162748       1 worker.go:306] Successfully executed job CleanupJob
I0621 08:49:29.163014       1 worker.go:306] Successfully executed job SyncServiceJob("k8s-svc-cert-manager/cert-manager-webhook")
```

The RemoveCleanupBarrier and CleanupJob are enqueued right before
`Starting workers`. Yet, we can see that they are executed in the middle
of a series of jobs enqueued between `Waiting for informer caches to
sync` and `Starting workers`.

In addition, we captured this log from an intermediate working state
("Executing ... inline!" is not a log line you will find in the current
state of the code):

```
I0630 11:47:31.014386       1 controller.go:188] Waiting for informer caches to sync
I0630 11:47:31.070488       1 controller.go:231] handleObject called
I0630 11:47:31.074975       1 controller.go:236] Processing object: k8s-svc-cert-manager/cert-manager
I0630 11:47:31.075248       1 controller.go:271] Executing SyncServiceJob("k8s-svc-cert-manager/cert-manager") inline!
I0630 11:47:31.075474       1 controller.go:231] handleObject called
I0630 11:47:31.075675       1 controller.go:236] Processing object: k8s-svc-ingress/ingress-ingress-nginx-controller
I0630 11:47:31.075893       1 controller.go:271] Executing SyncServiceJob("k8s-svc-ingress/ingress-ingress-nginx-controller") inline!
I0630 11:47:31.076104       1 port_mapper.go:209] Lookup l3port[93d89231-c0c4-4650-8142-19b2a6fcf592]={map[]}
I0630 11:47:31.076323       1 port_mapper.go:211] Allocating port {TCP 80} to service k8s-svc-ingress/ingress-ingress-nginx-controller
I0630 11:47:31.076514       1 port_mapper.go:211] Allocating port {TCP 443} to service k8s-svc-ingress/ingress-ingress-nginx-controller
I0630 11:47:31.114802       1 controller.go:192] Informer caches are synchronized, enqueueing job to remove the cleanup barrier
```

As is obvious here, WaitForCacheSync returns *before* all services
have been seen; the cluster has far more services than just these two.

Therefore we cannot rely on the return of WaitForCacheSync as a marker
that all services have been observed. In addition, the first log is
hard to explain without assuming that the queue somehow reorders jobs
(0.5 seconds between the handleObject calls which enqueue the jobs and
the log line indicating that the cleanup jobs have been enqueued seems
an awful lot for a race condition).
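
To make the timing problem concrete, here is a sketch of the unreliable startup pattern; the function and job names are assumptions, while cache.WaitForCacheSync and cache.InformerSynced are the real client-go APIs:

```go
package sketch

import (
	"fmt"

	"k8s.io/client-go/tools/cache"
)

// RemoveCleanupBarrierJob stands in for the controller's real job type.
type RemoveCleanupBarrierJob struct{}

// startUp sketches the unreliable sequence: WaitForCacheSync only reports
// that the informer's local store is populated. The add-event handlers
// that enqueue the SyncServiceJobs may still be running when it returns,
// so the barrier removal can overtake some of them in the queue.
func startUp(stopCh <-chan struct{}, servicesSynced cache.InformerSynced, enqueue func(interface{})) error {
	if ok := cache.WaitForCacheSync(stopCh, servicesSynced); !ok {
		return fmt.Errorf("failed to wait for caches to sync")
	}
	// Too early: a service whose handler has not fired yet is invisible
	// to the first CleanupJob once the barrier is gone.
	enqueue(RemoveCleanupBarrierJob{})
	return nil
}
```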

As a remedy, we decided on this slightly ugly solution: We let the
cleanup jobs run only after three sync intervals have passed. In
addition, we raise the sync interval to five minutes in order to not run
into scalability issues in clusters with lots of services to sync.

This means that a cleanup will now only happen 15 minutes after the
controller has started up. This is generally not a problem, because the
cleanup is only needed when services are deleted without the LBaaS
controller knowing about it. Deleting and recreating services will work
just fine (without new port allocations) as the LBaaS controller still
knows about the ports it used.
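
A minimal sketch of that remedy, again with assumed names and wiring rather than the actual implementation:

```go
package sketch

import "time"

const (
	// Values matching the description above: a five-minute sync interval
	// and a cleanup that only becomes possible after three intervals
	// (i.e. 15 minutes after startup).
	syncInterval = 5 * time.Minute
	cleanupAfter = 3
)

// runPeriodic triggers a sync on every tick but only allows cleanup once
// enough intervals have elapsed; by then every service should have been
// synced at least once, so the first CleanupJob cannot delete live ports.
func runPeriodic(stopCh <-chan struct{}, enqueueSync, enqueueCleanup func()) {
	ticker := time.NewTicker(syncInterval)
	defer ticker.Stop()
	ticks := 0
	for {
		select {
		case <-stopCh:
			return
		case <-ticker.C:
			ticks++
			enqueueSync()
			if ticks >= cleanupAfter {
				enqueueCleanup()
			}
		}
	}
}
```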

Fixes #30.