ClusterCacheTracker: Memory leak when kubeconfig is rotated too often, and CAPI is running on the workload cluster #9542
Labels: kind/bug, triage/accepted
What steps did you take and what happened?
Steps to reproduce the error:

1. Deploy an EKS cluster via cluster-api, with MachinePool and AWSManagedMachinePool resources, and with CAPI running on the workload cluster.
2. After some time, notice the capi-controller-manager logs being spammed and the pod memory increasing.

We have a v1.27.4 EKS cluster deployed via cluster-api, and we recently upgraded from v1.3.7 to v1.5.2. Everything went fine with the upgrade. However, after a week of running, I noticed the capi-controller-manager pod being spammed (about 300 log lines per minute) and its memory usage being high.

Restarting the pod makes everything fine again, but over time the same issue slowly reproduces.
What did you expect to happen?
Pod logs not being spammed, and pod memory not increasing over time.
Cluster API version
v1.5.2
Kubernetes version
v1.27.4-eks-2d98532
Anything else you would like to add?
I have already investigated and fixed this issue, but first I want to provide more context on it.
When deploying EKS Kubernetes clusters via the cluster-api AWS provider (CAPA), two kubeconfig secrets are created:

- <CLUSTER_NAME>-kubeconfig
- <CLUSTER_NAME>-user-kubeconfig

The <CLUSTER_NAME>-user-kubeconfig is meant to be used by users, while <CLUSTER_NAME>-kubeconfig is used internally by CAPI and is rotated every sync-period, since it contains an expiring token to access the EKS API. More details here.

When running cluster-api in the workload cluster, this code path causes a memory leak when <CLUSTER_NAME>-kubeconfig is rotated frequently:

cluster-api/controllers/remote/cluster_cache_tracker.go, lines 298 to 330 in a82c340
Notice that in this part of the code, when CAPI is running in the workload cluster, t.createClient is called twice:

cluster-api/controllers/remote/cluster_cache_tracker.go, lines 298 to 299 in a82c340
cluster-api/controllers/remote/cluster_cache_tracker.go, lines 324 to 325 in a82c340
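To make the problem easier to follow without the source open, here is a condensed, hypothetical paraphrase of that code path (simplified signatures and error handling, not the exact upstream code):

```go
// Condensed, hypothetical paraphrase of the newClusterAccessor code path
// (not the exact upstream source; signatures and error handling are simplified).
package remote

import (
	"context"

	"k8s.io/client-go/rest"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type clusterAccessor struct {
	client client.Client
	cache  cache.Cache
	stop   func() // cancels the context the cache goroutine runs under
}

type tracker struct{}

// createClient builds a cache-backed client and starts the cache in a background
// goroutine; the returned stop function cancels that goroutine (see next snippet).
func (t *tracker) createClient(ctx context.Context, cfg *rest.Config) (client.Client, cache.Cache, func(), error) {
	// ...elided...
	return nil, nil, func() {}, nil
}

func (t *tracker) newClusterAccessor(ctx context.Context, kubeconfigCfg, inClusterCfg *rest.Config, runningOnWorkloadCluster bool) (*clusterAccessor, error) {
	// First call: builds a client from the <CLUSTER_NAME>-kubeconfig secret's
	// config and starts a cache goroutine for it.
	c, cch, stop, err := t.createClient(ctx, kubeconfigCfg)
	if err != nil {
		return nil, err
	}

	if runningOnWorkloadCluster {
		// Second call: client, cache and stop function are all overwritten with
		// the in-cluster ones, but the first cache is never stopped, so its
		// goroutine (and any watches registered on it) keeps running in the background.
		c, cch, stop, err = t.createClient(ctx, inClusterCfg)
		if err != nil {
			return nil, err
		}
	}

	return &clusterAccessor{client: c, cache: cch, stop: stop}, nil
}
```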
However, the function t.createClient also starts a goroutine with the clusterAccessor cache:

cluster-api/controllers/remote/cluster_cache_tracker.go, lines 449 to 450 in a82c340
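In the same paraphrased form, the cache start inside createClient looks roughly like this: the goroutine only exits when the cache's context is cancelled, which is exactly what never happens for the first cache in the path above.

```go
// Hypothetical paraphrase of the cache start inside createClient (not the exact
// upstream source): the cache serves its informers and watches from a goroutine
// that only exits when its context is cancelled.
func startCache(ctx context.Context, c cache.Cache) (stop func()) {
	cacheCtx, cancel := context.WithCancel(ctx)
	go func() {
		// Blocks until cacheCtx is cancelled; any watch registered against this
		// cache (e.g. the MachinePool nodes watch below) is served from here.
		_ = c.Start(cacheCtx)
	}()
	return cancel
}
```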
Using more verbose logging, I noticed that immediately after the cache goroutine starts, a cluster nodes watcher is registered.
I believe this is happening because we have the following watcher already registered on the ClusterCacheTracker:

cluster-api/exp/internal/controllers/machinepool_controller.go, lines 325 to 331 in a82c340
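Paraphrased, that registration has roughly this shape (illustrative, not the exact upstream source; the watch name and handler names may differ slightly):

```go
// Illustrative paraphrase of the MachinePool controller's nodes watch registration.
// The Name field is what the tracker later uses for deduplication.
func (r *MachinePoolReconciler) watchClusterNodes(ctx context.Context, cluster *clusterv1.Cluster) error {
	return r.Tracker.Watch(ctx, remote.WatchInput{
		Name:         "machinepool-watchNodes",
		Cluster:      util.ObjectKey(cluster),
		Watcher:      r.controller,
		Kind:         &corev1.Node{},
		EventHandler: handler.EnqueueRequestsFromMapFunc(r.nodeToMachinePool),
	})
}
```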
However, looking at the ClusterCacheTracker code, this watcher is supposed to be started just once:

cluster-api/controllers/remote/cluster_cache_tracker.go, lines 541 to 545 in a82c340
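Paraphrased, that deduplication is keyed on the watch name per cluster accessor (sketch, not the exact upstream source):

```go
// Sketch of the dedup inside ClusterCacheTracker.Watch: a watch is only created
// the first time a given name is seen for a cluster accessor; later calls with
// the same name are no-ops.
if accessor.watches.Has(input.Name) {
	log.V(6).Info("Watch already exists", "name", input.Name)
	return nil
}
// ...register the watch against accessor.cache and record the name...
accessor.watches.Insert(input.Name)
```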
But when I restart the capi-controller-manager, I see that the watcher registers twice for the same cluster; this is confirmed by the same registration log line appearing twice. So, one of the goroutines is leaked.
This is happening because, in the initially mentioned code path, we start the cache twice. Before overwriting the cached client with the in-cluster configuration, we never stop the initial cache, and that runs continuously in the background.
I noticed that stopping the cache would also stop the goroutine that watches the nodes continuously.
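Based on that observation, a minimal fix sketch (illustrative only; not necessarily the exact upstream patch) is to stop the first cache before overwriting it, reusing the stop function from the paraphrase above:

```go
// Illustrative fix sketch: tear down the first cache before creating the second
// client, so the first cache's goroutine and its watches do not leak.
c, cch, stop, err := t.createClient(ctx, kubeconfigCfg)
if err != nil {
	return nil, err
}

if runningOnWorkloadCluster {
	// Stop the kubeconfig-based cache before switching to the in-cluster one.
	stop()

	c, cch, stop, err = t.createClient(ctx, inClusterCfg)
	if err != nil {
		return nil, err
	}
}
```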
This wouldn't be a memory leak problem if the kubeconfig were not rotated frequently, since the clusterAccessor would never be re-created. However, when the loaded kubeconfig expires, healthCheckCluster fails and the clusterAccessor is deleted:

cluster-api/controllers/remote/cluster_cache_tracker.go, lines 656 to 668 in a82c340
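Paraphrased, the health-check goroutine ends up deleting the accessor once the checks keep failing (sketch, not the exact upstream source; runHealthCheck stands in for the actual poll condition):

```go
// Sketch of the health check tail: when polling against the workload cluster
// keeps failing (e.g. because the loaded kubeconfig token expired), the accessor
// is deleted so it can be recreated on a later reconcile.
if err := wait.PollUntilContextCancel(cacheCtx, in.interval, true, runHealthCheck); err != nil {
	t.log.Error(err, "Error health checking cluster", "Cluster", in.cluster)
}
// deleteAccessor stops the accessor's current cache, but the cache leaked by the
// first createClient call is no longer referenced and keeps running.
t.deleteAccessor(ctx, in.cluster)
```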
This causes the clusterAccessor to be recreated on the next Reconcile loop, and the troublesome code path leaks another goroutine with the cluster node watcher. This happens every time the loaded EKS cluster kubeconfig expires and the clusterAccessor is recreated.

The log spam comes from the fact that, after a long running time, there are many leaked goroutines trying to list cluster nodes with an expired kubeconfig.

Label(s) to be applied
/kind bug
/area clustercachetracker