TServer-level mem-tracker consumption could be much lower than tcmalloc reports. #2566
Adding another instance of this: we recently saw this in a different customer deployment, even on the master side. A seemingly benign network blip on the TS side triggered Raft group changes and then the corresponding TSHeartbeat requests to the master; the master server mem-tracker reported 100MB, while the total heap was 2GB.
Adapted from @ttyusupov, here is how I ended up reproducing this (and an OOM) easily in a 3-node cluster.
The flow is: start a batch workload; after a couple of minutes of running the workload, stop one TS; if we then also restart that TS after some 30-60s, it will end up OOMing. cc @rajukumaryb
Did some more experiments last night. I think much (all?) of the unexpected gap between the server and root mem-trackers might be accounted for by internal tcmalloc memory. Comparing our …
This lines up pretty well, with likely some allocations by our app not being tracked by our memtrackers. I think we should add all the free-list etc. stats from tcmalloc as a special memtracker under root:
Basically, the first 2 children in our root memtracker UI should be …
Source for tcmalloc: https://github.com/gperftools/gperftools/blob/master/src/tcmalloc.cc#L403
Relevant stats:
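As an illustration only (this is not YugabyteDB code, and which exact stats the comment above intended did not come through; the property names below are the standard gperftools `MallocExtension` ones, used here as an assumption), the allocator-internal counters that can explain such a gap can be read like this:

```cpp
// Sketch: read the tcmalloc counters that hold memory inside the allocator
// without it being charged to any application-level mem-tracker.
// Requires gperftools; link with -ltcmalloc.
#include <cstddef>
#include <cstdio>

#include <gperftools/malloc_extension.h>

static size_t Prop(const char* name) {
  size_t value = 0;
  // GetNumericProperty returns false for unknown properties; treat that as 0 here.
  MallocExtension::instance()->GetNumericProperty(name, &value);
  return value;
}

int main() {
  const size_t heap      = Prop("generic.heap_size");
  const size_t allocated = Prop("generic.current_allocated_bytes");
  const size_t ph_free   = Prop("tcmalloc.pageheap_free_bytes");
  const size_t ph_unmap  = Prop("tcmalloc.pageheap_unmapped_bytes");
  const size_t central   = Prop("tcmalloc.central_cache_free_bytes");
  const size_t transfer  = Prop("tcmalloc.transfer_cache_free_bytes");
  const size_t thread    = Prop("tcmalloc.thread_cache_free_bytes");

  // Heap size minus application allocations is (roughly) the sum of the
  // free lists, page-heap free/unmapped pages, and allocator metadata --
  // memory tcmalloc reports but no server-level mem-tracker accounts for.
  const size_t held_by_allocator = ph_free + ph_unmap + central + transfer + thread;
  std::printf("heap=%zu allocated=%zu held_by_tcmalloc=%zu remainder=%lld\n",
              heap, allocated, held_by_allocator,
              static_cast<long long>(heap) - static_cast<long long>(allocated) -
                  static_cast<long long>(held_by_allocator));
  return 0;
}
```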
Summary: TCMalloc stats are not automatically refreshed. Add poll from root MemTracker.
Test Plan: Manual - check if TCMalloc stats in `lynx 127.0.0.1:9000/mem-trackers` change with load.
Reviewers: sergei
Reviewed By: sergei
Subscribers: ybase, bogdan
Differential Revision: https://phabricator.dev.yugabyte.com/D7563
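Purely as a sketch of the idea in that revision (the names below are hypothetical and not the actual YugabyteDB MemTracker API): refresh the cached tcmalloc numbers on a timer so the values shown under root do not go stale between page loads.

```cpp
// Hypothetical sketch of periodically polling tcmalloc stats; the real fix
// wires such a refresh into the root MemTracker rather than a global.
#include <atomic>
#include <chrono>
#include <thread>

#include <gperftools/malloc_extension.h>

std::atomic<size_t> g_tcmalloc_allocated{0};  // hypothetical cached value

void PollTcmallocStats(std::atomic<bool>* stop,
                       std::chrono::milliseconds interval) {
  while (!stop->load()) {
    size_t allocated = 0;
    if (MallocExtension::instance()->GetNumericProperty(
            "generic.current_allocated_bytes", &allocated)) {
      // In the actual change this value would feed the root MemTracker so
      // that /mem-trackers reflects current load.
      g_tcmalloc_allocated.store(allocated);
    }
    std::this_thread::sleep_for(interval);
  }
}

int main() {
  std::atomic<bool> stop{false};
  std::thread poller(PollTcmallocStats, &stop, std::chrono::milliseconds(1000));
  std::this_thread::sleep_for(std::chrono::seconds(3));  // let it refresh a few times
  stop = true;
  poller.join();
  return 0;
}
```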
Re-checked on the latest master and found that the delta between …
Found the commit which changes the delta from ~30% to ~5%.
Previous commit: 4843cf2:
It might be related to space overhead in old …
During the scenario explained in #2563, I faced the following issue.
The last memory consumption metrics (for components occupying more than 100MB) before the read buffer overflow at 16:47:11 UTC were saved into Prometheus at 16:47:05 UTC:
We see only 3.26GB peak memory consumption in Prometheus before the read-buffer re-allocation error, but in the error message we see that consumption was ~6GB:
87.26% × 6.20 GiB ≈ 5.41 GiB ≈ 5.8 GB
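Spelling that out (assuming the 6.20 GiB here is the memory limit quoted in the error message and 87.26% the reported fraction of it consumed):

$$0.8726 \times 6.20\ \mathrm{GiB} \approx 5.41\ \mathrm{GiB} = 5.41 \times \tfrac{2^{30}}{10^{9}}\ \mathrm{GB} \approx 5.81\ \mathrm{GB}$$

$$5.81\ \mathrm{GB} - 3.26\ \mathrm{GB} \approx 2.55\ \mathrm{GB}\ \text{of consumption not visible in Prometheus}$$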