Geth 1.10.4 high memory use - how to gather data #23195
I restarted my node recently. Here are two scans that were taken a few minutes apart, with increasing Bitmap sizes. VERSION: Note: I'm running
Approx. 5 GB of RAM was used by geth at the time of the first report, 12 GB before the restart. Reports: first report here: https://gist.github.com/adrienlacombe/a9d89abcf44d165f1ddf4af55076b8a1 Geth is used for eth2 only.
Tangentially relevant, but if I enable the --pprof server, I can't enable the metrics server, since both use port 6060. The port for metrics must be changed with
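For concreteness, a minimal sketch of running both servers side by side; it assumes your geth build exposes the --pprof.port and --metrics.port flags, so check geth --help for your version:

```sh
# Keep pprof on its default port and move the standalone metrics server to 6061.
# Flag names assume a recent 1.10.x geth; verify them with `geth --help`.
geth --pprof --pprof.addr 127.0.0.1 --pprof.port 6060 \
     --metrics --metrics.addr 127.0.0.1 --metrics.port 6061
```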
Here is a scan roughly 12 hours later; memory allocation is still increasing. Hope it helps 😄
@adrienlacombe @DanHenton Thank you for providing so many data points!
@MariusVanDerWijden thank you for your help! Here is the file.
@MariusVanDerWijden debug.stacks results
When I run, Running my node with
@colorfulcoin Could you also run
@MariusVanDerWijden no problem, happy to help. When I run
The debug namespace is not available over HTTP by default, but it should work if you attach to the IPC endpoint.
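As a concrete sketch of that workflow (the IPC path below assumes the default Linux datadir, so adjust it for your install; debug.writeMemProfile writes the heap profile on the node's own filesystem):

```sh
# Dump all goroutine stacks and write a heap profile via the IPC endpoint.
geth attach --exec 'debug.stacks()' ~/.ethereum/geth.ipc > /tmp/stacks.txt
geth attach --exec 'debug.writeMemProfile("/tmp/memprofile")' ~/.ethereum/geth.ipc
```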
Two reports attached:
@MariusVanDerWijden See attached. I ran a new scan as well. I was unable to get
@colorfulcoin I had the same issue, try
@adrienlacombe that outputs
That's good; you should now have a file named memprofile in the /tmp/ folder, @colorfulcoin.
@adrienlacombe It seems to be encoded/encrypted. Catting the file just gives a bunch of stuff like this
This is fine, @colorfulcoin; upload the file here.
OK, I've modified my comment above to add it.
@colorfulcoin The memprofile is a binary file. It contains information about objects in the memory used by geth. Specifically, it contains the place in the code where each object was created.
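For anyone who wants to look inside such a profile themselves, a sketch using Go's standard pprof tooling (it assumes a Go toolchain is installed and that the profile was saved as /tmp/memprofile, as above):

```sh
# Print the top allocation sites recorded in the heap profile.
go tool pprof -top /tmp/memprofile

# Or explore it interactively in a browser.
go tool pprof -http=:8081 /tmp/memprofile
```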
@yorickdowne, regarding the goroutines issue you mentioned: I found this geth node in the wild, and it apparently has about 1.5K goroutines. Edit: because the server has also exposed the
@fjl ^^
This isn't entirely necessary for helping troubleshoot this, but it's relevant and something I've been meaning to add to my monitoring solution anyhow. The linked dashboard includes a graph for both memory usage and goroutines, so you could even set up a Grafana alert so you know when memory usage reaches 14 GB (when the pprof scan becomes useful for troubleshooting purposes).
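If you just want a quick spot check without a full dashboard, the metrics endpoint can be queried directly; this sketch assumes --metrics is enabled, the endpoint is served on port 6060, and the goroutine gauge keeps its current name:

```sh
# Pull the current goroutine gauge from geth's metrics endpoint.
# Port, path, and exact metric name depend on your flags and geth version.
curl -s http://127.0.0.1:6060/debug/metrics | grep -i goroutines
```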
@yorickdowne I sent you info on how to investigate further; see Discord.
I did a lot of snapshot syncing tests recently with version 1.10.4. Based on this ticket, I'm now running a test node with the snapshot feature disabled. Here is the current measurement:
I know we should wait until Geth grows to 7-9 GB before drawing any conclusion, but I would suggest digging in this direction.
I have been trying to sync, and geth will not run for more than 3 hours (often less) without stopping. I believe I am running out of memory. I only have 4 GB of RAM, but I thought this was the base requirement. I generated the files as suggested by @MariusVanDerWijden. Here are those files.
It seems that many of these reports concern snap sync. I'm doing a benchmark run of snap sync now. If that's not it, then another possibility might be that the "common ground" is the RPC load from being eth2 backends, and something has regressed in that department.
@holiman We have counted ~20 OOM kill events on about a dozen nodes using 1.10.4. After searching for the root cause to no avail, the problem magically fixed itself two weeks later (still the same nodes on 1.10.4, no config changes). We've spun up two 1.10.4 nodes as a test run: --syncmode=snap --snapshot=true (Jul 12th) and --syncmode=fast --snapshot=false (Jul 19th).
I have a concern about snapshot mode support, not about initial sync. I've changed the snapshot settings for the same synced node twice (without cleaning up the blockchain data) and always saw the same behavior. The geth process is running with options
Here's my memsize scan using 1.10.4, snapshotted when geth was using 12.8 GB of memory (per https://gist.github.com/agro1986/4195473c093fa78704d1bf19c9de9518). As you can see below, since I upgraded to 1.10.4 on July 1st, geth brings the available memory of my system down until it crashes or until I restart geth (at which point the available memory jumps back up). I launch geth using this
Thanks everyone for providing all this memory usage data. We have been checking it out, and have enough reports now. I've looked at a lot of these reports and found that the out-of-memory crashes cannot be explained using the information seen in memsize. In most reports, the top item is the 'cached trie nodes', which is expected.
As additional info, I updated from 1.10.4 to 1.10.6 and the issue still persists.
What geth configuration is necessary to achieve stable memory consumption? A linear increase in memory over time is not desirable.
This has been hitting Bor ever since they rebased themselves on Geth 1.10.8. In their case it hits some, but not all, nodes, and seems to be related to P2P. Keep an eye on maticnetwork/bor#179.
So, from the Polygon/Bor side of things: it's a lot worse with many (150-200) peers, and considerably more mellow with few (<=50) peers. Hopefully this gives the Geth team something to look for. It should be possible to reproduce this issue more rapidly by setting maxpeers to 200.
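For anyone trying to reproduce this faster, a minimal sketch of such a setup (keep your usual sync and cache flags; --maxpeers is the relevant knob, and 200 is just the value suggested above):

```sh
# Raise the peer cap so the suspected P2P-related memory growth shows up sooner,
# and keep pprof enabled so a profile can be taken once memory climbs.
geth --maxpeers 200 --pprof
```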
From testing on Bor: pprof may show 12-13 GiB while docker stats shows ~24-25 GiB. No wonder pprof didn't help find the issue. The Bor team is testing a fix: maticnetwork/bor@54d3cf4
We fixed an issue in v1.10.9 (067084f) which should fix a potential memory leak in the transaction pool. Are there any nodes experiencing the memory leak post-1.10.9?
I am running 1.10.11-stable-7231b3ef and haven't yet experienced a spike or anything similar. I will let you know after a couple of days of running.
Using geth version 1.10.11-stable-7231b3ef in snapshot mode with --metrics --metrics.expensive --pprof --http --cache 1024: after one week of recording, a linear increase of 1 GB/week in memory usage. Before, it was 1 GB every 2 days.
I am on Geth 1.10.10 and get an unclean shutdown every few hours; htop shows me it is because Geth takes all my RAM. I have 16 GB of DDR4 3300 MHz on an Asus PN50. When I run with --pprof I get the following, but I cannot connect to the address to scan the node.
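As a quick sanity check for this kind of problem, here's a sketch for confirming the pprof server is actually listening; it assumes the default 127.0.0.1:6060 binding and standard Linux tooling:

```sh
# Confirm something is listening on the pprof port and that the pprof index responds.
ss -ltnp | grep 6060
curl -sI http://127.0.0.1:6060/debug/pprof/ | head -n 1
```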
Do you have two geth nodes running? |
This is interesting because I did have to rebuild the database after my SSD had less than 50 GB free for pruning. I followed the instructions for removing the database and then letting it re-sync, and I used the Somer Esat guide for the initial install. Any tips to help figure this out?
This just returns to the next command-line prompt and gives no information. It has got me thinking, though: the Somer Esat guide recommends a certain directory for geth, which is what I see when I check geth's status, but --pprof seems to show a different directory. Could this mean two instances of Geth after all?
It seems that this issue was fixed in v1.10.9. A race condition resulted in transactions not getting deallocated and garbage-collected correctly. If you are experiencing high memory usage with a version higher than v1.10.9, please open a new issue!
From Geth Discord, by fjl - this is a quote, not my words:
I have seen quite a few reports of high memory usage with Geth v1.10.4 last couple days in here,
some people have even reported that there seems to be a leak, with used memory steadily rising.
If you are experiencing this problem, we would appreciate it if you could do the following steps:
Note: If your system is very low on memory, the memsize scan can crash your node (because the scanning also uses some extra memory). Also, the node will not be operational while it does the scanning.
If you are using grafana, or another metrics solution, please monitor the geth/system/cpu/goroutines.gauge metric.
This metric is the number of lightweight Go threads used by geth. If this number keeps going up, it's a goroutine leak. This kind of leak can happen when there is a bug, and it will usually manifest as memory usage increasing.
It's kind of good to monitor this metric in general.
On our nodes, with 50 peers, the goroutine count is ~600-650
https://discord.com/channels/482467812179181568/482467812816977921/862323752695103519
in https://discord.gg/VNnEHqsHMr
Note: If you are looking to access the pprof web page over the LAN, you'll also need
--pprof.addr=0.0.0.0
and port 6060/tcp open on your host firewall, like ufw. Be careful not to expose the port to the Internet at large; see the sketch below.
Note: Wait until Geth uses 11 GiB+, ideally even 14 GiB+, of RAM before running a report. The initial increase up to 8-9 GiB might not be that instructive: Geth 1.10.3 used 8-9 GiB of RAM on mainnet with default settings, so that level of use is expected.
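A sketch of what that LAN setup can look like; the subnet is an example to substitute with your own, and the paths assume geth's pprof server exposes the memsize and pprof pages referenced in this thread:

```sh
# Allow only the local subnet (example value) to reach the pprof port, then start geth.
sudo ufw allow from 192.168.1.0/24 to any port 6060 proto tcp
geth --pprof --pprof.addr 0.0.0.0

# From another machine on the LAN:
#   memsize scan UI:  http://<node-ip>:6060/memsize/
#   pprof index:      http://<node-ip>:6060/debug/pprof/
```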