
Possible netlink leak on 3.29.1 #9603

Closed
imbstack opened this issue Dec 14, 2024 · 10 comments · Fixed by #9609

Comments

@imbstack

We recently updated calico to 3.29.1 on one of our staging clusters and found that after a few hours there was a clear upward trend in the number of file descriptors held by calico-node pods.

[Graph: open file descriptor count for calico-node pods, trending steadily upward over several hours]

Checking on a running instance after a couple days, we found that the calico-node -felix process had nearly 6000 file descriptors according to lsof, nearly all of which were like the following:

# lsof -p 1517383 | tail
calico-no 1517383 root 5778u  netlink                 0t0  689245146 ROUTE
calico-no 1517383 root 5779u  netlink                 0t0  688913733 ROUTE
calico-no 1517383 root 5780u  netlink                 0t0  689385180 ROUTE
calico-no 1517383 root 5781u  netlink                 0t0  689391746 ROUTE
calico-no 1517383 root 5782u  netlink                 0t0  689402663 ROUTE
calico-no 1517383 root 5783u  netlink                 0t0  689407738 ROUTE
calico-no 1517383 root 5784u  netlink                 0t0  689296536 ROUTE
calico-no 1517383 root 5785u  netlink                 0t0  689301559 ROUTE
calico-no 1517383 root 5790u  netlink                 0t0  689395559 ROUTE
calico-no 1517383 root 5791u  netlink                 0t0  689400833 ROUTE
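(For reference, the same count can be pulled straight out of /proc without lsof, by cross-referencing the netlink socket inodes listed in /proc/net/netlink against the target process's fd table. The following is only a rough, illustrative Go sketch, not anything from calico-node; the pid argument and output format are arbitrary.)

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: nlcount <pid>")
		os.Exit(1)
	}
	pid := os.Args[1]

	// Collect the inodes of every netlink socket on the host; the inode is the
	// last column of each row in /proc/net/netlink.
	netlinkInodes := map[string]bool{}
	f, err := os.Open("/proc/net/netlink")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	sc.Scan() // skip the header row
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) > 0 {
			netlinkInodes[fields[len(fields)-1]] = true
		}
	}

	// Walk the target process's fd table; socket fds are symlinks that read as
	// "socket:[<inode>]", so count the ones whose inode belongs to a netlink socket.
	fdDir := fmt.Sprintf("/proc/%s/fd", pid)
	entries, err := os.ReadDir(fdDir)
	if err != nil {
		panic(err)
	}
	count := 0
	for _, e := range entries {
		target, err := os.Readlink(fdDir + "/" + e.Name())
		if err != nil {
			continue
		}
		if strings.HasPrefix(target, "socket:[") {
			inode := strings.TrimSuffix(strings.TrimPrefix(target, "socket:["), "]")
			if netlinkInodes[inode] {
				count++
			}
		}
	}
	fmt.Printf("pid %s holds %d netlink socket fds\n", pid, count)
}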

Deleting that pod dropped the fds, although the new pod is starting the trend all over again.

Let me know if there is any other debugging data I can provide.

Expected Behavior

A relatively steady number of open file descriptors for a calico-node pod.

Current Behavior

A steady increase in open file descriptors.

Possible Solution

Steps to Reproduce (for bugs)

  1. Just deploy Calico 3.29.1, AFAICT.

Context

This is OK for now in our staging environment, but we are worried about going to production this way. It is entirely possible this is due to some weird config on our side, but nothing is jumping out at me so far.

Your Environment

  • Calico version: 3.29.1
  • Calico dataplane (iptables, windows etc.): iptables
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes
  • Operating System and version: Linux ip-10-213-23-129 6.8.0-1018-aws #19~22.04.1-Ubuntu SMP Wed Oct 9 16:48:22 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Link to your project (optional):
@fasaxc
Member

fasaxc commented Dec 17, 2024

Uh oh, likely to be one of my PRs to rework route programming:

Any interesting logs in calico-node from route_table.go? I'd expect netlink sockets to be re-opened after a failure, so it might help to know which failure you're hitting (if any).

@fasaxc
Member

fasaxc commented Dec 17, 2024

Actually, I think it might be this one: #9135. Some of the calls to netlink.XXX got converted to use a Handle (which is a socket under the covers), and it looks like we might have forgotten to close the Handle when done.
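To make the pattern concrete, here is a hand-written sketch of the general shape being described, not the code touched by #9135. It assumes a recent vishvananda/netlink, where Handle has a Close method (older versions call it Delete): netlink.NewHandle opens its own netlink socket, so any code path that creates a Handle and never closes it pins one fd for the life of the process.

package main

import (
	"fmt"

	"github.com/vishvananda/netlink"
)

// listRoutesV4 shows the Handle lifecycle: NewHandle opens a dedicated netlink
// socket, and that socket stays open until Close is called on the Handle.
func listRoutesV4() ([]netlink.Route, error) {
	h, err := netlink.NewHandle()
	if err != nil {
		return nil, err
	}
	// Dropping this Close is exactly the kind of leak described above: each
	// call would then leave one more ROUTE netlink fd behind.
	defer h.Close()

	// nil link means "all interfaces"; FAMILY_V4 restricts the dump to IPv4.
	return h.RouteList(nil, netlink.FAMILY_V4)
}

func main() {
	routes, err := listRoutesV4()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("got %d routes\n", len(routes))
}

Either deferring Close right after NewHandle, or reusing one long-lived Handle, keeps the fd count flat.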

@imbstack
Author

Well, that was quick. Just responding to confirm that, from poking around a bit, I'm not seeing any logs from route_table.go.

@sridhartigera
Member

@imbstack Is it possible for you to test an image with the fix? Image: docker.io/calico/node:v3.29.1-9-g95d2ad73b69e

@fasaxc
Member

fasaxc commented Dec 19, 2024

Manifests here if that's more convenient; this is a nightly build of v3.29.next: https://2024-12-19-v3-29-viper.docs.eng.tigera.net/

@imbstack
Author

I've only had it deployed for an hour or so, but it definitely looks better to me. At this point before the patch, a pod would've had over 400 open fds, but this time around they are sitting at ~160 and it's looking flat!

I'll report back here if I see something different in the next couple days but this looks fixed to me. Thanks for the quick turnaround!

@sridhartigera
Member

Thanks for testing this.

@sridhartigera
Member

@imbstack I hope you did not see any further FD leaks.

@imbstack
Author

Just checked and it's still flat! Looks fixed to me.

@sridhartigera
Member

Thank you.
