
Geth 1.10.4 high memory use - how to gather data #23195

Closed
yorickdowne opened this issue Jul 11, 2021 · 48 comments

@yorickdowne

yorickdowne commented Jul 11, 2021

From Geth Discord, by fjl - this is a quote, not my words:

I have seen quite a few reports of high memory usage with Geth v1.10.4 in here over the last couple of days;
some people have even reported that there seems to be a leak, with used memory steadily rising.
If you are experiencing this problem, we would appreciate it if you could do the following:

  • enable the pprof server (debugging tool) by starting geth with the --pprof flag
  • this will enable an HTTP server on port 6060 on localhost
  • wait for it to reach high memory usage
  • then go to http://127.0.0.1:6060/memsize
  • click the Scan "node" button
  • this operation will freeze the node for quite a while and perform an internal scan of used memory
  • when it's done, post the resulting report to https://gist.github.com/, then link it here on Geth Discord

Note: If your system is very low on memory, the memsize scan can crash your node (because the scanning also uses some extra memory). Also, the node will not be operational while it does the scanning.
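For reference, the steps above boil down to something like the following (a sketch; the memsize page is the one described above, and the /debug/pprof endpoints are the standard Go profiling handlers exposed by the same server):

    # start geth with the debug/pprof HTTP server (listens on 127.0.0.1:6060 by default)
    geth --pprof
    # once memory usage is high, open the memsize UI in a browser and click "Scan node":
    #   http://127.0.0.1:6060/memsize
    # a raw heap profile can also be captured from the same server for later analysis:
    curl -s http://127.0.0.1:6060/debug/pprof/heap > heap.pprof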

If you are using grafana, or another metrics solution, please monitor the geth/system/cpu/goroutines.gauge metric.
This metric is the number of lightweight Go threads used by geth. If this number keeps going up, it's a goroutine leak. This kind of leak can happen when there is a bug, and it will usually manifest as memory usage increasing.
It's good to monitor this metric in general.
On our nodes, with 50 peers, the goroutine count is ~600-650
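If Grafana is not set up, the gauge can also be spot-checked by hand. A rough sketch, assuming metrics are enabled with --metrics; depending on version and flags, the JSON metrics dump is served either on the pprof port or on a separate --metrics.addr/--metrics.port server:

    # poll the goroutine gauge from geth's JSON metrics endpoint
    curl -s http://127.0.0.1:6060/debug/metrics | grep -i goroutines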

https://discord.com/channels/482467812179181568/482467812816977921/862323752695103519
in https://discord.gg/VNnEHqsHMr

Note: If you are looking to access the pprof web page over the LAN, you'll also need --pprof.addr=0.0.0.0 and port 6060/tcp open on your host firewall, such as ufw. Be careful not to expose the port to the Internet at large.
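For example, a sketch assuming ufw and a 192.168.1.0/24 LAN (substitute your own subnet):

    geth --pprof --pprof.addr=0.0.0.0
    # allow the pprof port from the local subnet only, not from everywhere
    sudo ufw allow from 192.168.1.0/24 to any port 6060 proto tcp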

Note: Wait until Geth uses 11 GiB+, ideally even 14 GiB+, of RAM before running a report. The initial increase up to 8-9 GiB might not be that instructive: Geth 1.10.3 used 8-9 GiB of RAM on mainnet with default settings, so that level of use is expected.
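A quick way to check where a node currently sits before deciding to run a scan (a sketch; RSS is reported in kilobytes):

    ps -o pid,rss,etime,comm -C geth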

@DanHenton

I restarted my node recently. Here are two scans that were taken a few minutes apart, with increasing Bitmap sizes.

First scan
Second scan

VERSION:
1.10.4-stable-aa637fd3

Note: I'm running mainnet and pyrmont on one machine with 64 GB. My current average memory usage is ~65%, but it was over 80% before I restarted the machine.

@adrienlacombe
Contributor

adrienlacombe commented Jul 11, 2021

Approx. 5 GB of RAM used by geth at the time of the first report, 12 GB before restart.

Reports:
"node" @ 2021-07-11 16:48:31 +0000 UTC
"node" @ 2021-07-11 19:50:47 +0000 UTC
"node" @ 2021-07-12 04:26:07 +0000 UTC

First report here: https://gist.github.com/adrienlacombe/a9d89abcf44d165f1ddf4af55076b8a1
Second report here, RAM keeps increasing: https://gist.github.com/adrienlacombe/4fae6df9c6541aa110468e69ca6fbed1
Third report here, RAM keeps increasing: https://gist.github.com/adrienlacombe/4e3c9918ab4236a47fd83795634fd458

Geth used for eth2 only

@odyslam
Contributor

odyslam commented Jul 11, 2021

Tangentially relevant, but if I enable the --pprof server, I can't enable the metrics server, since both use port 6060 by default. The metrics port must be changed with --metrics.port so that the two can coexist.
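For example, something along these lines lets both servers coexist (6061 is an arbitrary free port, not a geth default):

    geth --pprof --metrics --metrics.addr 127.0.0.1 --metrics.port 6061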

@DanHenton

Here is a scan roughly 12 hours later. Memory allocation is still increasing.

12hr scan

Hope it helps 😄

@MariusVanDerWijden
Member

@adrienlacombe @DanHenton Thank you for providing so many data points!
Could you run geth attach --exec debug.writeMemProfile\(\"memprofile\"\) on your nodes? It creates a file memprofile which contains detailed information about where the memory is retained/allocated/etc.
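For anyone who wants to look at such a profile themselves: it can be read with Go's standard pprof tool, assuming a Go toolchain is installed. A sketch:

    # text summary of the largest in-use allocations
    go tool pprof -top memprofile
    # or interactively, with the geth binary for better symbol resolution
    go tool pprof ./geth memprofile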

@adrienlacombe
Contributor

@MariusVanDerWijden thank you for your help! Here is the file.
memprofile.zip

@adrienlacombe
Contributor

@MariusVanDerWijden debug.stacks results
stacks.txt

@colorfulcoin

colorfulcoin commented Jul 12, 2021

When I run sudo systemctl status geth, it shows 21.5 GB. Here's the output. The numbers look smaller in the report, but I'm not sure I am reading it correctly. I also just started running the node recently, so I don't have a baseline to compare against, but I was told in the Discord that anything above 10 GB should be posted.

Running my node with geth --http --datadir /var/lib/goethereum --cache 10240 --maxpeers 50 --pprof

gethscan.txt

@MariusVanDerWijden
Member

@colorfulcoin Could you also run
geth attach --exec debug.stacks\(\) > file and
geth attach --exec debug.writeMemProfile\(\"memprofile\"\) and paste the resulting files memprofile and file here

@colorfulcoin

colorfulcoin commented Jul 12, 2021

@MariusVanDerWijden no problem, happy to help. When I run geth attach http://127.0.0.1:8545 --exec debug.stacks\(\) > file and cat file I get this. Any suggestions for what flags I need to be running to enable debug?

ReferenceError: debug is not defined
	at <eval>:1:1(0)
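A likely cause, for readers hitting the same error: the HTTP endpoint only exposes the namespaces whitelisted via --http.api, which does not include debug by default, whereas the IPC socket exposes it. A sketch, using the datadir from the command above:

    # attach over IPC instead of HTTP; the debug API is available there by default
    sudo geth attach /var/lib/goethereum/geth.ipc --exec 'debug.stacks()' > file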

@holiman
Contributor

holiman commented Jul 12, 2021 via email

@dimsome

dimsome commented Jul 12, 2021

Two reports attached:
The first one not too long after the start, the second one roughly after 24 hours (maybe more).

SCANS

@colorfulcoin

colorfulcoin commented Jul 12, 2021

@MariusVanDerWijden See attached. I ran a new scan as well. I was unable to get sudo geth attach /var/lib/goethereum/geth.ipc --exec debug.writeMemProfile\(\"memprofile\"\) to run. Getting:

Error: open memprofile: permission denied
	at web3.js:6357:37(47)
	at web3.js:5091:62(37)
	at <eval>:1:22(4)

file.txt
gethscan.txt
memeprofile.txt

@adrienlacombe
Contributor

adrienlacombe commented Jul 12, 2021

@colorfulcoin I had the same issue, try sudo geth attach /var/lib/goethereum/geth.ipc --exec debug.writeMemProfile\(\"tmp\/memprofile\"\)

@colorfulcoin

@adrienlacombe that outputs null. What is that command doing? I am assuming it outputs something somewhere.

@adrienlacombe
Contributor

That's good; you should now have a file named memprofile in the /tmp/ folder, @colorfulcoin.

@colorfulcoin

colorfulcoin commented Jul 12, 2021

@adrienlacombe It seems to be encoded/encrypted. Cat the file just leads a bunch of stuff like this o�u24��Kٰ���d�.sI���Gc��Z�;��ު�{xF�C����A���n�2����dN��hG�? Not sure what the ask is here.

@adrienlacombe
Contributor

This is fine, @colorfulcoin; upload the file here.

@colorfulcoin

OK, I've modified my comment above to add it.

@fjl
Contributor

fjl commented Jul 12, 2021

@colorfulcoin The memprofile is a binary file. It contains information about objects in the memory used by geth. Specifically, it contains the place in the code where each object was created.

@odyslam
Contributor

odyslam commented Jul 13, 2021

@yorickdowne, regarding the goroutines issue you mentioned: I found this Geth node in the wild, and it apparently has about 1.5K goroutines.

edit: Because the server also has the pprof endpoint exposed, perhaps it's better to send it over DM on Discord. My username is odyslam.

@yorickdowne
Author

@fjl ^^

@SeaMonkey82

SeaMonkey82 commented Jul 13, 2021

This isn't strictly necessary for troubleshooting this, but it's relevant and something I've been meaning to add to my monitoring solution anyhow. The linked dashboard includes graphs for both memory usage and goroutines, so you could even set up a Grafana alert to let you know when memory usage reaches 14 GB (the point at which the pprof scan becomes useful for troubleshooting).

Monitoring Geth with InfluxDB and Grafana
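For anyone setting this up, geth can also push its metrics straight to InfluxDB with flags along these lines (a sketch; the endpoint, database name and credentials are placeholders to adapt to your own setup):

    geth --metrics --metrics.influxdb \
         --metrics.influxdb.endpoint "http://127.0.0.1:8086" \
         --metrics.influxdb.database "geth" \
         --metrics.influxdb.username "geth" \
         --metrics.influxdb.password "changeme"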

@holiman
Contributor

holiman commented Jul 15, 2021

@yorickdowne sent info to you on how to investigate further, see discord

@begetan

begetan commented Jul 16, 2021

I did a lot of snapshot sync tests recently with version 1.10.4.
In all configurations, I saw a Geth memory leak. It grows by around 1 GB per day. Geth on a test node with 8 GB RAM and --cache=3072 was killed by the OOM killer after 4-5 days of uptime.

I am now running a test node with the snapshot feature disabled, based on this ticket.

Here is the current measurement:

  • 4.7 GB after sync completed
  • 5.1 GB on day 1
  • 5.3 GB on day 2
  • 5.5 GB on day 3

I know we should wait until Geth grows to 7-9 GB before drawing any conclusions.

But I would suggest digging in this direction.

@jacobfrank250

I have been trying to sync, and geth will not run for more than 3 hours (often less) without stopping. I believe I am running out of memory. I only have 4 GB of RAM, but I thought this was the base requirement. I generated files as suggested by @MariusVanDerWijden, by running
./geth attach --exec debug.stacks() > file and
./geth attach --exec debug.writeMemProfile("memprofile")

Here are those files
file.txt
memprofile.txt

@holiman
Contributor

holiman commented Jul 16, 2021

Seems that many of these reports concern snap sync. I'm doing a benchmark run of 1.10.3 vs 1.10.5 now, to see if we can repro it in our internal environments.

If not, then another possibility might be that the "common ground" is the RPC load from being eth2 backends, and something has regressed in that department.

@holiman
Contributor

holiman commented Jul 17, 2021

During sync, 1.10.5 is indeed a bit higher in memory usage, which is a side effect of it being faster at downloading headers/block data. It achieves higher ingress speeds, and completes that phase sooner:

1.10.5 is green, and 1.10.3 is yellow:
Screenshot_2021-07-17 Dual Geth - Grafana

Screenshot_2021-07-17 Dual Geth - Grafana(1)

Although the usage is a bit higher, I don't think it is high enough to be the culprit behind the bug reports.

@riptl
Contributor

riptl commented Jul 19, 2021

@holiman We have counted ~20 OOM kill events on about a dozen nodes using 1.10.4 with --syncmode=snap --snapshot=true. The OOM hard limit was set to 11 GB. Nodes would allocate memory over about a week. All of these nodes were upgraded from older Geth versions that were initially synced with fast.
image

After searching for the root cause to no avail, this problem magically fixed itself two weeks later (still the same nodes on 1.10.4, no config changes).
image

We've spun up two 1.10.4 nodes as a test run:

--syncmode=snap --snapshot=true (Jul 12th)
9GB of memory used after 4.5 days of syncing (from scratch). About 1GB/day growth.

--syncmode=fast --snapshot=false (Jul 19th)
5.5 GB of memory used after 6 days of syncing (from scratch). Memory usage is stable, no OOM kills.

@begetan

begetan commented Jul 21, 2021

I have a concern about snapshot mode support, not about the initial sync.

I've changed the snapshot settings for the same synced node twice (without cleaning up blockchain data) and always saw the same behavior. A Geth process running with --syncmode=fast --snapshot=false has no memory leak, while a process without these options leaks 0.8-1 GB per day.

@agro1986

Here's my memsize scan using 1.10.4, snapshotted when geth is using 12.8 GB of memory (per sudo pmap $GETH_PID)

https://gist.github.com/agro1986/4195473c093fa78704d1bf19c9de9518

As you can see below, since I upgraded to 1.10.4 on July 1st, geth brings the available memory of my system down until it crashes or until I restart geth (at which point the available memory jumps back up).

Screen Shot 2021-07-24 at 19 45 29

I launch geth using this

/usr/local/bin/geth --http --datadir /somewhere/geth --cache 2048 --maxpeers 40 --pprof --pprof.addr=0.0.0.0

@fjl
Contributor

fjl commented Jul 29, 2021

Thanks everyone for providing all this memory usage data. We have been checking it out, and have enough reports now.

I've looked at a lot of these reports and found that the out-of-memory crashes cannot be explained using the information seen in memsize. In most reports, the top item is the 'cached trie nodes', which is expected.

@agro1986

As additional info: I updated from 1.10.4 -> 1.10.6 and the issue still persists.

@agro1986

agro1986 commented Aug 3, 2021

It seems that memory usage is better after pruning (1.10.6). You can see on the graph below that after the last restart I pruned, and since then the available memory doesn't drop as sharply as before. Might need to see how it behaves after several more days.

Screen Shot 2021-08-03 at 23 27 11
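For anyone wanting to try the same thing: the usual way to prune is the built-in offline state pruning, run with geth stopped first (a sketch; the datadir is the one from my launch command above):

    geth snapshot prune-state --datadir /somewhere/geth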

@bassoy

bassoy commented Aug 19, 2021

I have a concern about snapshot mode support, not about the initial sync.

I've changed the snapshot settings for the same synced node twice (without cleaning up blockchain data) and always saw the same behavior. A Geth process running with --syncmode=fast --snapshot=false has no memory leak, while a process without these options leaks 0.8-1 GB per day.

Same here:
image
Using geth version 1.10.6-stable-576681f2 with --metrics --metrics.expensive --pprof --http --cache 1024.

@bassoy

bassoy commented Aug 27, 2021

Thanks everyone for providing all this memory usage data. We have been checking it out, and have enough reports now.

I've looked at a lot of these reports and found that the out-of-memory crashes cannot be explained using the information seen in memsize. In most reports, the top item is the 'cached trie nodes', which is expected.

What geth configuration is necessary to get stable memory consumption? A linear increase of memory over time is not desirable.

@yorickdowne
Author

This has been hitting Bor ever since they rebased themselves on Geth 1.10.8. In their case it hits some, but not all, nodes and seems to be related to P2P. Keep an eye on maticnetwork/bor#179

@yorickdowne
Author

So from the Polygon/Bor side of things: It's a lot worse with many (150-200) peers, and considerably more mellow with few (<=50) peers. Hopefully this gives the Geth team something to look for. It should be possible to reproduce this issue more rapidly by setting maxpeers to 200.

@yorickdowne
Author

From testing on Bor, pprof may show 12-13 GiB while docker stats shows ~24-25 GiB. No wonder pprof didn't help find the issue.

Bor team is testing a fix: maticnetwork/bor@54d3cf4
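One way to see that gap on a running node is to compare the Go runtime's own view of memory with the container's (a sketch; on Bor the console binary is bor rather than geth):

    # Go runtime's view (heap in use, total memory obtained from the OS, etc.)
    geth attach --exec 'debug.memStats()'
    # versus the OS/cgroup view of the same process
    docker stats --no-stream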

@MariusVanDerWijden
Member

We fixed an issue in v1.10.9 here: 067084f. It should address a potential memory leak in the transaction pool. Are there any nodes still experiencing the memory leak post-1.10.9?

@bassoy

bassoy commented Oct 22, 2021

I am running 1.10.11-stable-7231b3ef and haven't yet experienced a spike or anything similar. Will let you know after a couple of days of running.

@bassoy

bassoy commented Oct 29, 2021

@MariusVanDerWijden

Using geth version 1.10.11-stable-7231b3ef in snapshot mode with --metrics --metrics.expensive --pprof --http --cache 1024: after one week of recording, a linear increase of about 1 GB/week in memory usage. Before, it was 1 GB every 2 days.

image

@StakeNoob

We fixed an issue in v1.10.9 here: 067084f. It should address a potential memory leak in the transaction pool. Are there any nodes still experiencing the memory leak post-1.10.9?

I am on Geth 1.10.10 and get an unclean shutdown every few hours; htop shows me it is because Geth takes all my RAM. I have 16 GB of DDR4-3300 on an Asus PN50. When I run with --pprof I get the following, but I cannot connect to the address to scan the node.

pprof
geth ram

@MariusVanDerWijden
Member

Do you have two geth nodes running? The bind: address already in use error only happens if some other process is already serving port 30303 (which is usually another instance of geth).

@StakeNoob

StakeNoob commented Nov 1, 2021

Do you have two geth nodes running? The bind: address already in use error only happens if some other process is already serving port 30303 (which is usually another instance of geth).

This is interesting, because I did have to rebuild the database after my SSD had less than 50 GB free for pruning. I followed instructions for removing the database and then letting it re-sync, and I used the Somer Esat guide for the initial install. Any tips to help figure this out?

@MariusVanDerWijden
Member

lsof -i :30303
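That is, something along these lines (pgrep is just a convenient way to list every running geth process):

    # show what is bound to the p2p port, and list all running geth processes
    sudo lsof -i :30303
    pgrep -a geth
    # more than one geth PID here would point to a second instance, e.g. started from a different datadir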

@StakeNoob

lsof -i :30303

This just returns to the command prompt and gives no information. It has got me thinking, though: the Somer Esat guide recommends a certain directory for geth, which is what I see when I check geth's status, but --pprof seems to show a different directory. Could this mean two instances of Geth after all?

--pprof geth
somer geth

@MariusVanDerWijden
Member

It seems that this issue was fixed in v1.10.9. A race condition resulted in transactions not getting deallocated and garbage collected correctly. If you are experiencing high memory usage with a version higher than v1.10.9, please open a new issue!
Thank you to everyone who helped us get to the bottom of this!
