Memory leak of TCP connections #8219
@eminence on my machines IPFS plateaus at around 13 GB (with `AcceleratedDHTClient` enabled). Do you know if IPFS does the same without it? What datastore are you using (this could also be badger that is using a lot)? What are you using these nodes for? Can you send a profile of IPFS when its memory usage is high?
I have a similar problem. Recently, ipfs began to eat up the 1 GB of memory allocated to it within half a day and stop responding to requests. Before that, everything worked great.
Here is all the requested data:
@akhilman a few questions:
Note: if your node is running in a low-powered environment you may want to consider the `lowpower` profile.
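For reference, a minimal sketch of applying that profile to an existing repo (assuming go-ipfs's built-in `lowpower` config profile and a systemd-managed daemon; adjust the restart step to however you run ipfs):

```sh
# Apply the built-in low-power profile (lowers connection manager limits,
# among other things) to the existing repo.
ipfs config profile apply lowpower

# Restart the daemon so the new settings take effect.
sudo systemctl restart ipfs
```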
Data was collected while the node was slow. Then I restarted my node to upload those files; without restarting, uploading was too slow. I have a cron job that restarts ipfs every day to avoid these slowdowns.
My node is almost idle with no requests, yet it eats memory over time. Until recently, 1 GB of memory was enough for several days, and only after that did the memory run out.
I'll try v0.8.0.
My node is almost idle.
Now I use 300-600, but I used to have 600-900 without any problem for a long time.
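For anyone else adjusting these, if the numbers above are the connection manager's low/high watermarks, they can be set like this (a sketch with illustrative values, not a recommendation):

```sh
# Set the connection manager watermarks (example values).
ipfs config --json Swarm.ConnMgr.LowWater 300
ipfs config --json Swarm.ConnMgr.HighWater 600

# Verify the result; restart the daemon for the change to apply.
ipfs config Swarm.ConnMgr
```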
Hi all, thanks for the good questions @Jorropo. Unfortunately my graphing tool stopped working after a NAS hiccup, so the above graphs are all I have for now in terms of pictures. But the node is now up to 14 GB of virtual memory (compared to the ~8 GB from the last datapoint in the graphs I uploaded).
No, but I will spend several days trying this
I'm using whatever the default datastore is. I haven't yet collected a profile, but I will later.
The node wasn't entirely idle. It was likely serving some content to the network, and every now and again it would be adding content (both new stuff and rehashing stuff it already had in its store). To be clear, the node seems to be performing just fine, except for the memory usage. It may be true that the memory usage just hasn't hit its ceiling yet, but I feel pretty confident in saying that 0.9's memory usage is higher than 0.8's. I will also try downgrading to 0.8 to confirm. Thanks!
v0.8.0 has been up for 22 hours. Bpytop reports 440 MB of memory usage; all works fine, no slowdowns.
I'm happy to announce that my node has broken its ceiling too, and it does seem that there is a memory leak in 0.9.0 that wasn't in 0.8.0.
This only affects `AcceleratedDHTClient`.
Ok, good info, thanks! I'm still having problems with 0.9.0 (with
Saw some strange behavior on one of my nodes as well, while the system had 2 GB of free memory and 9 GB of ZFS ARC memory left.

The system specs:
- 32 GB memory
- IPFS: go-ipfs version 0.9.0-rc2
- Datastore: flatfs

Maybe also have a look at the memory IO.
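One quick way to look at the daemon's memory breakdown and paging activity (a sketch using standard Linux tooling; it assumes the daemon process is named `ipfs` and that the sysstat package is installed for `pidstat`):

```sh
# Resident, virtual, shared, and swapped memory of the go-ipfs daemon.
pid=$(pidof ipfs)
grep -E 'VmSize|VmRSS|RssShmem|VmSwap' "/proc/${pid}/status"

# Per-process memory and page-fault activity, sampled every 5 seconds.
pidstat -r -p "${pid}" 5
```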
I had the same problem on 0.9.0 (AcceleratedDHTClient disabled). I rolled back to 0.8.0 and now memory consumption is growing much more slowly.
@qwertyforce how much RAM do you have? My 4 GB VPS on 0.9.0 without
@Jorropo 1 GB (RPi 3)
I've had to downgrade another node (on a VPS with 4 GB of memory) from v0.9.0 to v0.8.0 due to OOM issues (it happened 3 times, about once every 26 hours). It's running a default config (the default profile, with no tuning of things like the ConnMgr settings).
Okay, it's hard to debug something without any debug info. Can you please read the debug manual and attach the debug info in the cases where you think the memory consumption is too high?
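A sketch of the kind of data that is useful here, pulled from the daemon's pprof endpoints (this assumes the API is listening on the default 127.0.0.1:5001):

```sh
# Heap and goroutine dumps from the daemon's built-in pprof endpoints.
curl -s -o ipfs.heap       'http://127.0.0.1:5001/debug/pprof/heap'
curl -s -o ipfs.goroutines 'http://127.0.0.1:5001/debug/pprof/goroutine?debug=2'

# A 30-second CPU profile, in case CPU usage is also suspicious.
curl -s -o ipfs.cpuprof    'http://127.0.0.1:5001/debug/pprof/profile?seconds=30'

# Version and config to attach alongside the profiles (strip anything private).
ipfs version --all > ipfs.version
ipfs config show   > ipfs.config
```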
This link doesn't work.
Is this the debug data for 0.8 which works fine? If so, why?
Are you looking for debug info for nodes using
This issue is a bit confusing, with comments being fairly anecdotal and not easy to analyze. If people who are experiencing increasing memory usage over time could drop profiles here (as mentioned above in #8219 (comment)), along with config files, that would make it much easier to analyze.

I've done a little bit of local testing (lots-of-noise-ram.zip) and it seems that v0.9.0 has some suspicious memory usage, where it keeps holding onto buffers from Noise connections for a long time even with low connection limits. This was with the standard config (i.e. AcceleratedDHTClient disabled) on Windows. I haven't tested this against v0.8.0 yet, but the resource usage does appear to climb over time even when the number of connections remains the same.

If other people can post their pprof dumps from when they have high RAM usage, it'll be easier to tell whether these problems are similar or not.
Here are those files archived.
For comparison with the anomalous data.
Pretty sure this is related to libp2p/go-tcp-transport#81. One way to test this is to periodically scrape the Prometheus metrics endpoint.
Thanks, nice find @aschmahmann. I curled the prometheus endpoint, and it took a full 30 seconds the first time I did (so I'm guessing the bulk of those 30 seconds was the metrics collector doing some cleanup). It didn't immediately reduce memory usage, but it started a downward trend in RSS memory usage. I've also captured some debug info (run after curling the prometheus endpoint); it is available here: https://ipfs.io/ipfs/QmTacvQzyHMwcRzciejfTT38ECU1n8MdbaeXxppYv77B4b (Also, I had totally forgotten about the prometheus endpoint, so I'll start graphing that data now; thanks for reminding me about it.)
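For anyone applying the same workaround, a minimal periodic scrape could look like this (a sketch; it assumes the API on the default 127.0.0.1:5001, and the output path and schedule are just examples):

```sh
#!/bin/sh
# Scraping the Prometheus metrics triggers the metrics collector's cleanup of
# closed TCP connections (the workaround above); keeping the output also gives
# data points to graph later.
ts=$(date +%Y%m%dT%H%M%S)
curl -s 'http://127.0.0.1:5001/debug/metrics/prometheus' \
  > "/var/tmp/ipfs-metrics-${ts}.prom"
```

Scheduled from cron, e.g. `*/15 * * * * /usr/local/bin/ipfs-metrics-scrape.sh` (the script path is hypothetical).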
I think #8195 is related
@eminence is that profile after restarting the node? Do things seem to be stabilizing over time now? It'd be good to know if the workaround (and the fix, once one is out) is enough to resolve the issue. Side note: your most recent profile shows about 1 GB of heap size, but a lot of memory unclaimed by the OS. I'm not sure if this is just a combination of Go taking some time to GC (since it does it in pieces) and your OS being really slow at reclaiming the memory Go relinquishes, or something else.
No. This profile was taken after IPFS had been running for about 10 days. Unfortunately, I've since restarted the node (in an attempt to debug some unrelated content-resolution problems), so I can't tell you how things have stabilized. I am now collecting the prometheus metrics every 15 seconds (and graphing them), so I expect this will (1) give more detailed metrics and (2) be a good test of the workaround.
I see that there is a
@eminence you can turn on debugging of the garbage collector by setting gctrace to 1. See here for more details. You can also follow the debug guide of go-ipfs and have a look at the memory consumption by allocations to routines and libraries. There are several programs designed to plot them out as graphs.

I mean, this issue is somewhat recent. It didn't happen in 0.7, so we could just bisect the history to see when it starts happening? 🤔 The only issue here is the repo version. Not sure if I can do this on my main machine where it happens.
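A sketch of enabling the GC trace via the standard Go `GODEBUG` environment variable (the systemd commands are only one way to set it; adjust to your setup):

```sh
# Run the daemon with GC tracing: the runtime prints one line per collection
# with heap sizes and how much memory was returned to the OS.
GODEBUG=gctrace=1 ipfs daemon

# If the daemon runs under systemd, set the variable in an override instead:
sudo systemctl edit ipfs      # add:  [Service]
                              #       Environment=GODEBUG=gctrace=1
sudo systemctl restart ipfs
journalctl -u ipfs -f         # GC trace lines show up in the unit's log
```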
Well, so the real memory usage of IPFS increases (without shared memory): ... and systemd reports much more memory consumption, apparently due to highly shared memory(?). These graphs all show the same timeframe, starting from the last time ipfs was started. When systemd is asked to contain ipfs within 12 GB of memory, ipfs basically comes to a halt, trying to stay within that amount of memory. See #8195 for the symptoms; there's also a set of debug info available for the current startup of ipfs, along with my config. I'll close the other ticket since I think this is the same issue.
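To compare what systemd charges to the unit against the daemon's own resident set (a sketch; `ipfs.service` is an assumed unit name, and note that the cgroup counter includes page cache attributed to the unit, which can make it much larger than the process RSS):

```sh
# Memory systemd accounts to the unit's cgroup (includes page cache).
systemctl show ipfs.service --property=MemoryCurrent

# The daemon's resident set size, for comparison (in kilobytes).
ps -o rss= -p "$(pidof ipfs)"

# Cap the unit at 12 GB; the kernel reclaims cache / applies pressure above this.
sudo systemctl set-property ipfs.service MemoryMax=12G
```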
If people could try building from master and see if the issue remains, that would be great. If it does, please post new pprofs for analysis so we can reopen.
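For anyone who wants to test, a sketch of building from master (it assumes a recent Go toolchain and the Makefile's `build`/`install` targets in the go-ipfs repo):

```sh
git clone https://github.com/ipfs/go-ipfs
cd go-ipfs
make build                    # builds the ipfs binary under cmd/ipfs/
./cmd/ipfs/ipfs version --all # confirm the commit you are running

# or install it into your Go bin directory
make install
```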
Thanks! I'm now testing 9599ad5. For the record (and for future comparisons), here's a visual graph of memory usage captured over a few days running 0.9.0:
@aschmahmann it looks way, way better now! Thank you for the fast fix :)
Version information:
go-ipfs version: 0.9.0-179d1d150
Repo version: 11
System version: amd64/linux
Golang version: go1.15.2
Description:
IPFS 0.9 seems to be showing increasing memory usage over time. On a small VPS (with only 2 GB of memory), this results in IPFS being killed every few days by the kernel OOM killer.
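A quick way to confirm the OOM kills from the kernel log (a sketch; whether `journalctl -k` keeps the history depends on the journald setup):

```sh
# Kernel OOM-killer events mentioning the ipfs process.
dmesg -T | grep -iE 'out of memory|oom-kill' | grep -i ipfs

# Or via the journal, if kernel messages are kept there.
journalctl -k | grep -iE 'oom-kill.*ipfs|killed process.*ipfs'
```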
On a much bigger machine, I've been graphing the memory usage over the past few days:
On the small VPS, there are no experimental features enabled. On the bigger machine, I have:
(The RRD graphs are configured to store up to a week of data, so I'll keep monitoring things for a few more days)