
Geth 1.10.4 high memory use - how to gather data #23195

Closed
yorickdowne opened this issue Jul 11, 2021 · 48 comments

@yorickdowne

yorickdowne commented Jul 11, 2021

From Geth Discord, by fjl - this is a quote, not my words:

I have seen quite a few reports of high memory usage with Geth v1.10.4 in here over the last couple of days;
some people have even reported that there seems to be a leak, with used memory steadily rising.
If you are experiencing this problem, we would appreciate it if you could do the following:

  • enable the pprof server (debugging tool) by starting geth with the --pprof flag
  • this will enable an HTTP server on port 6060 on localhost
  • wait for it to reach high memory usage
  • then go to http://127.0.0.1:6060/memsize
  • click the Scan "node" button
  • this operation will freeze the node for quite a while and perform an internal scan of used memory
  • when it's done, post the resulting report to https://gist.github.com/, then link it here on Geth Discord

Note: If your system is very low on memory, the memsize scan can crash your node (because the scanning also uses some extra memory). Also, the node will not be operational while it does the scanning.
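For reference, the steps above boil down to something like the following (a sketch; the memsize page is the one described above, and the /debug/pprof endpoints are the standard Go profiling handlers exposed by the same server):

    # start geth with the debug/pprof HTTP server (listens on 127.0.0.1:6060 by default)
    geth --pprof
    # once memory usage is high, open the memsize UI in a browser and click "Scan node":
    #   http://127.0.0.1:6060/memsize
    # a raw heap profile can also be captured from the same server for later analysis:
    curl -s http://127.0.0.1:6060/debug/pprof/heap > heap.pprof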

If you are using grafana, or another metrics solution, please monitor the geth/system/cpu/goroutines.gauge metric.
This metric is the number of lightweight Go threads used by geth. If this number keeps going up, it's a goroutine leak. This kind of leak can happen when there is a bug, and it will usually manifest as memory usage increasing.
It's good to monitor this metric in general.
On our nodes, with 50 peers, the goroutine count is ~600-650
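If Grafana is not set up, the gauge can also be spot-checked by hand. A rough sketch, assuming metrics are enabled with --metrics; depending on version and flags, the JSON metrics dump is served either on the pprof port or on a separate --metrics.addr/--metrics.port server:

    # poll the goroutine gauge from geth's JSON metrics endpoint
    curl -s http://127.0.0.1:6060/debug/metrics | grep -i goroutines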

https://discord.com/channels/482467812179181568/482467812816977921/862323752695103519
in https://discord.gg/VNnEHqsHMr

Note: If you are looking to access the pprof web page over the LAN, you'll also need --pprof.addr=0.0.0.0 and port 6060/tcp open on your host firewall, such as ufw. Be careful not to expose the port to the Internet at large.
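For example, a sketch assuming ufw and a 192.168.1.0/24 LAN (substitute your own subnet):

    geth --pprof --pprof.addr=0.0.0.0
    # allow the pprof port from the local subnet only, not from everywhere
    sudo ufw allow from 192.168.1.0/24 to any port 6060 proto tcp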

Note: Wait until Geth uses 11 GiB+, ideally even 14 GiB+, of RAM before running a report. The initial increase up to 8-9 GiB might not be that instructive: Geth 1.10.3 used 8-9 GiB of RAM on mainnet with default settings, so that level of use is expected.
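A quick way to check where a node currently sits before deciding to run a scan (a sketch; RSS is reported in kilobytes):

    ps -o pid,rss,etime,comm -C geth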

@DanHenton

I restarted my node recently. Here are two scans that were taken a few minutes apart, with increasing Bitmap sizes.

First scan
Second scan

VERSION:
1.10.4-stable-aa637fd3

Note: I'm running mainnet and pyrmont on one machine with 64 GB. My current average memory usage is ~65%, but it was over 80% before I restarted the machine.

@adrienlacombe
Contributor

adrienlacombe commented Jul 11, 2021

Approx. 5 GB of RAM used by geth at the time of the first report, 12 GB before restart.

Reports:
"node" @ 2021-07-11 16:48:31 +0000 UTC
"node" @ 2021-07-11 19:50:47 +0000 UTC
"node" @ 2021-07-12 04:26:07 +0000 UTC

First report here: https://gist.github.com/adrienlacombe/a9d89abcf44d165f1ddf4af55076b8a1
Second report here, RAM keeps increasing: https://gist.github.com/adrienlacombe/4fae6df9c6541aa110468e69ca6fbed1
Third report here, RAM keeps increasing: https://gist.github.com/adrienlacombe/4e3c9918ab4236a47fd83795634fd458

Geth used for eth2 only

@odyslam
Contributor

odyslam commented Jul 11, 2021

Tangentially relevant, but if I enable the --pprof server, I can't enable the metrics server, since both use port 6060 by default. The metrics port must be changed with --metrics.port so that the two can coexist.
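For example, something along these lines lets both servers coexist (6061 is an arbitrary free port, not a geth default):

    geth --pprof --metrics --metrics.addr 127.0.0.1 --metrics.port 6061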

@DanHenton

Here is a scan roughly 12 hours later. Memory allocation is still increasing.

12hr scan

Hope it helps 😄

@MariusVanDerWijden
Member

@adrienlacombe @DanHenton Thank you for providing so many data points!
Could you run geth attach --exec debug.writeMemProfile\(\"memprofile\"\) on your nodes? It creates a file memprofile which contains detailed information about where the memory is retained/allocated/etc.
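For anyone who wants to look at such a profile themselves: it can be read with Go's standard pprof tool, assuming a Go toolchain is installed. A sketch:

    # text summary of the largest in-use allocations
    go tool pprof -top memprofile
    # or interactively, with the geth binary for better symbol resolution
    go tool pprof ./geth memprofile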

@adrienlacombe
Contributor

@MariusVanDerWijden thank you for your help! Here is the file.
memprofile.zip

@adrienlacombe
Contributor

@MariusVanDerWijden debug.stacks results
stacks.txt

@colorfulcoin

colorfulcoin commented Jul 12, 2021

When I run sudo systemctl status geth, it shows 21.5 GB. Here's the output. The numbers look smaller in the report, but I'm not sure I am reading it correctly. I also just started running the node recently, so I don't have a baseline to compare against, but I was told in the Discord that anything above 10 GB should be posted.

Running my node with geth --http --datadir /var/lib/goethereum --cache 10240 --maxpeers 50 --pprof

gethscan.txt

@MariusVanDerWijden
Member

@colorfulcoin Could you also run
geth attach --exec debug.stacks\(\) > file and
geth attach --exec debug.writeMemProfile\(\"memprofile\"\) and paste the resulting files memprofile and file here

@colorfulcoin

colorfulcoin commented Jul 12, 2021

@MariusVanDerWijden no problem, happy to help. When I run geth attach http://127.0.0.1:8545 --exec debug.stacks\(\) > file and cat file I get this. Any suggestions for what flags I need to be running to enable debug?

ReferenceError: debug is not defined
	at <eval>:1:1(0)
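A likely cause, for readers hitting the same error: the HTTP endpoint only exposes the namespaces whitelisted via --http.api, which does not include debug by default, whereas the IPC socket exposes it. A sketch, using the datadir from the command above:

    # attach over IPC instead of HTTP; the debug API is available there by default
    sudo geth attach /var/lib/goethereum/geth.ipc --exec 'debug.stacks()' > file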

@holiman
Contributor

holiman commented Jul 12, 2021 via email

@dimsome

dimsome commented Jul 12, 2021

Two reports attached:
The first one not too long after the start, the second one roughly after 24 hours (maybe more).

SCANS

@colorfulcoin

colorfulcoin commented Jul 12, 2021

@MariusVanDerWijden See attached. I ran a new scan as well. I was unable to get sudo geth attach /var/lib/goethereum/geth.ipc --exec debug.writeMemProfile\(\"memprofile\"\) to run. Getting:

Error: open memprofile: permission denied
	at web3.js:6357:37(47)
	at web3.js:5091:62(37)
	at <eval>:1:22(4)

file.txt
gethscan.txt
memeprofile.txt

@adrienlacombe
Contributor

adrienlacombe commented Jul 12, 2021

@colorfulcoin I had the same issue, try sudo geth attach /var/lib/goethereum/geth.ipc --exec debug.writeMemProfile\(\"tmp\/memprofile\"\)

@colorfulcoin

@adrienlacombe that outputs null. What is that command doing? I am assuming it outputs something somewhere.

@adrienlacombe
Contributor

That's good; you should now have a file named memprofile in the /tmp/ folder, @colorfulcoin.

@colorfulcoin

colorfulcoin commented Jul 12, 2021

@adrienlacombe It seems to be encoded/encrypted. Cat the file just leads a bunch of stuff like this o�u24��Kٰ���d�.sI���Gc��Z�;��ު�{xF�C����A���n�2����dN��hG�? Not sure what the ask is here.

@adrienlacombe
Contributor

This is fine, @colorfulcoin; upload the file here.

@colorfulcoin

OK, I've modified my comment above to add it.

@fjl
Contributor

fjl commented Jul 12, 2021

@colorfulcoin The memprofile is a binary file. It contains information about objects in the memory used by geth. Specifically, it contains the place in the code where each object was created.

@odyslam
Contributor

odyslam commented Jul 13, 2021

@yorickdowne, regarding the goroutines issue you mentioned: I found this Geth node in the wild, and it apparently has about 1.5K goroutines.

edit: Because the server also has the pprof endpoint exposed, perhaps it's better to send it over DM on Discord. My username is odyslam.

@yorickdowne
Author

@fjl ^^

@SeaMonkey82

SeaMonkey82 commented Jul 13, 2021

This isn't strictly necessary for troubleshooting this, but it's relevant and something I've been meaning to add to my monitoring solution anyhow. The linked dashboard includes graphs for both memory usage and goroutines, so you could even set up a Grafana alert to let you know when memory usage reaches 14 GB (the point at which the pprof scan becomes useful for troubleshooting).

Monitoring Geth with InfluxDB and Grafana
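For anyone setting this up, geth can also push its metrics straight to InfluxDB with flags along these lines (a sketch; the endpoint, database name and credentials are placeholders to adapt to your own setup):

    geth --metrics --metrics.influxdb \
         --metrics.influxdb.endpoint "http://127.0.0.1:8086" \
         --metrics.influxdb.database "geth" \
         --metrics.influxdb.username "geth" \
         --metrics.influxdb.password "changeme"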

@holiman
Contributor

holiman commented Jul 15, 2021

@yorickdowne sent info to you on how to investigate further, see discord

@begetan

begetan commented Jul 16, 2021

I did a lot of snapshot sync tests recently with version 1.10.4.
In all configurations, I saw a Geth memory leak. It grows by around 1 GB per day. Geth on a test node with 8 GB RAM and --cache=3072 was killed by the OOM killer after 4-5 days of uptime.

I am now running a test node with the snapshot feature disabled, based on this ticket.

Here is the current measurement:

  • 4.7 GB after sync completed
  • 5.1 GB on day 1
  • 5.3 GB on day 2
  • 5.5 GB on day 3

I know we should wait until Geth grows to 7-9 GB before drawing any conclusions.

But I would suggest digging in this direction.

@jacobfrank250

I have been trying to sync, and geth will not run for more than 3 hours (often less) without stopping. I believe I am running out of memory. I only have 4 GB of RAM, but I thought this was the base requirement. I generated files as suggested by @MariusVanDerWijden, by running
./geth attach --exec debug.stacks() > file and
./geth attach --exec debug.writeMemProfile("memprofile")

Here are those files
file.txt
memprofile.txt

@holiman
Contributor

holiman commented Jul 16, 2021

Seems that many of these reports concern snap sync. I'm doing a benchmark run of 1.10.3 vs 1.10.5 now, to see if we can repro it in our internal environments.

If not, then another possibility might be that the "common ground" is the RPC load from being eth2 backends, and something has regressed in that department.

@holiman
Contributor

holiman commented Jul 17, 2021

During sync, 1.10.5 is indeed a bit higher in memory usage, which is a side effect of it being faster at downloading headers/block data. It achieves higher ingress speeds, and completes that phase sooner:

1.10.5 is green, and 1.10.3 is yellow:
Screenshot_2021-07-17 Dual Geth - Grafana

Screenshot_2021-07-17 Dual Geth - Grafana(1)

Although the usage is a bit higher, I don't think it is high enough to be the culprit behind the bug reports.

@riptl
Contributor

riptl commented Jul 19, 2021

@holiman We have counted ~20 OOM kill events on about a dozen nodes using 1.10.4 with --syncmode=snap --snapshot=true. The OOM hard limit was set to 11 GB. Nodes would allocate memory over about a week. All of these nodes were upgraded from older Geth versions that were initially synced with fast.
image

After searching for the root cause to no avail, this problem magically fixed itself two weeks later (still the same nodes on 1.10.4, no config changes).
image

We've spun up two 1.10.4 nodes as a test run:

--syncmode=snap --snapshot=true (Jul 12th)
9GB of memory used after 4.5 days of syncing (from scratch). About 1GB/day growth.

--syncmode=fast --snapshot=false (Jul 19th)
5.5 GB of memory used after 6 days of syncing (from scratch). Memory usage is stable, no OOM kills.

@begetan

begetan commented Jul 21, 2021

I have a concern about snapshot mode support, not about the initial sync.

I've changed the snapshot settings for the same synced node twice (without cleaning up blockchain data) and always saw the same behavior. A Geth process running with --syncmode=fast --snapshot=false has no memory leak, while a process without these options leaks 0.8-1 GB per day.

@agro1986

Here's my memsize scan using 1.10.4, snapshotted when geth is using 12.8 GB of memory (per sudo pmap $GETH_PID)

https://gist.github.com/agro1986/4195473c093fa78704d1bf19c9de9518

As you can see below, since I upgraded to 1.10.4 on July 1st, geth brings the available memory of my system down until it crashes or until I restart geth (at which point the available memory jumps back up).

Screen Shot 2021-07-24 at 19 45 29

I launch geth using this

/usr/local/bin/geth --http --datadir /somewhere/geth --cache 2048 --maxpeers 40 --pprof --pprof.addr=0.0.0.0

@fjl
Contributor

fjl commented Jul 29, 2021

Thanks everyone for providing all this memory usage data. We have been checking it out, and have enough reports now.

I've looked at a lot of these reports and found that the out-of-memory crashes cannot be explained using the information seen in memsize. In most reports, the top item is the 'cached trie nodes', which is expected.

@agro1986

As additional info: I updated from 1.10.4 -> 1.10.6 and the issue still persists.

@agro1986

agro1986 commented Aug 3, 2021

It seems that memory usage is better after pruning (1.10.6). You can see on the graph below that after the last restart I pruned, and since then the available memory doesn't drop as sharply as before. Might need to see how it behaves after several more days.

Screen Shot 2021-08-03 at 23 27 11
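For anyone wanting to try the same thing: the usual way to prune is the built-in offline state pruning, run with geth stopped first (a sketch; the datadir is the one from my launch command above):

    geth snapshot prune-state --datadir /somewhere/geth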

@bassoy

bassoy commented Aug 19, 2021

I have a concern about snapshot mode support, not about the initial sync.

I've changed the snapshot settings for the same synced node twice (without cleaning up blockchain data) and always saw the same behavior. A Geth process running with --syncmode=fast --snapshot=false has no memory leak, while a process without these options leaks 0.8-1 GB per day.

Same here:
image
Using geth version 1.10.6-stable-576681f2 with --metrics --metrics.expensive --pprof --http --cache 1024.

@bassoy

bassoy commented Aug 27, 2021

Thanks everyone for providing all this memory usage data. We have been checking it out, and have enough reports now.

I've looked at a lot of these reports and found that the out-of-memory crashes cannot be explained using the information seen in memsize. In most reports, the top item is the 'cached trie nodes', which is expected.

What geth configuration is necessary to get stable memory consumption? A linear increase of memory over time is not desirable.

@yorickdowne
Author

This has been hitting Bor ever since they rebased themselves on Geth 1.10.8. In their case it hits some, but not all, nodes and seems to be related to P2P. Keep an eye on maticnetwork/bor#179

@yorickdowne
Author

So from the Polygon/Bor side of things: It's a lot worse with many (150-200) peers, and considerably more mellow with few (<=50) peers. Hopefully this gives the Geth team something to look for. It should be possible to reproduce this issue more rapidly by setting maxpeers to 200.

@yorickdowne
Author

From testing on Bor, pprof may show 12-13 GiB while docker stats shows ~24-25 GiB. No wonder pprof didn't help find the issue.

Bor team is testing a fix: maticnetwork/bor@54d3cf4
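One way to see that gap on a running node is to compare the Go runtime's own view of memory with the container's (a sketch; on Bor the console binary is bor rather than geth):

    # Go runtime's view (heap in use, total memory obtained from the OS, etc.)
    geth attach --exec 'debug.memStats()'
    # versus the OS/cgroup view of the same process
    docker stats --no-stream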

@MariusVanDerWijden
Member

We fixed an issue in v1.10.9 here: 067084f. It should address a potential memory leak in the transaction pool. Are there any nodes still experiencing the memory leak post-1.10.9?

@bassoy

bassoy commented Oct 22, 2021

I am running 1.10.11-stable-7231b3ef and haven't yet experienced a spike or anything similar. Will let you know after a couple of days of running.

@bassoy

bassoy commented Oct 29, 2021

@MariusVanDerWijden

Using geth version 1.10.11-stable-7231b3ef in snapshot mode with --metrics --metrics.expensive --pprof --http --cache 1024: after one week of recording, a linear increase of about 1 GB/week in memory usage. Before, it was 1 GB every 2 days.

image

@StakeNoob

We fixed an issue in v1.10.9 here: 067084f. It should address a potential memory leak in the transaction pool. Are there any nodes still experiencing the memory leak post-1.10.9?

I am on Geth 1.10.10 and get an unclean shutdown every few hours; htop shows me it is because Geth takes all my RAM. I have 16 GB of DDR4-3300 on an Asus PN50. When I run with --pprof I get the following, but I cannot connect to the address to scan the node.

pprof
geth ram

@MariusVanDerWijden
Member

Do you have two geth nodes running? The bind: address already in use error only happens if some other process is already serving port 30303 (which is usually another instance of geth).

@StakeNoob

StakeNoob commented Nov 1, 2021

Do you have two geth nodes running? The bind: address already in use error only happens if some other process is already serving port 30303 (which is usually another instance of geth).

This is interesting, because I did have to rebuild the database after my SSD had less than 50 GB free for pruning. I followed instructions for removing the database and then letting it re-sync, and I used the Somer Esat guide for the initial install. Any tips to help figure this out?

@MariusVanDerWijden
Member

lsof -i :30303
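That is, something along these lines (pgrep is just a convenient way to list every running geth process):

    # show what is bound to the p2p port, and list all running geth processes
    sudo lsof -i :30303
    pgrep -a geth
    # more than one geth PID here would point to a second instance, e.g. started from a different datadir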

@StakeNoob

lsof -i :30303

This just returns to the command prompt and gives no information. It has got me thinking, though: the Somer Esat guide recommends a certain directory for geth, which is what I see when I check geth's status, but --pprof seems to show a different directory. Could this mean two instances of Geth after all?

--pprof geth
somer geth

@MariusVanDerWijden
Member

It seems that this issue was fixed in v1.10.9. A race condition resulted in transactions not getting deallocated and garbage collected correctly. If you are experiencing high memory usage with a version higher than v1.10.9, please open a new issue!
Thank you to everyone who helped us get to the bottom of this!
