Isn't that already provided by Prometheus? We already have memory statistics about the nodes. |
The idea is to provide a finer-grained view of the allocations. The system-level view of jemalloc is too coarse for good correlations between other event occurrences and the total consumption; with this we can obtain jemalloc's internal stats synced in time to, e.g., channel fill snapshots. TL;DR: yes, this can be done at the OS level, but this is more useful for root cause analysis. |
How do you want to do a root cause analysis with this? You will just see that memory usage goes up, the same as the system will tell you. Are you saying that the system update interval is not good enough for tracking memory usage? |
It's good enough to detect OOM conditions, but not good enough to dig into the allocations. For the Linux client we use jemalloc, so you will get somewhat jagged / fuzzy correlations given that jemalloc uses multiple arenas: what it exposes to the system, and what a system-level metrics collector can therefore report, is only a rough representation of the true allocations. That gives little insight into how many allocations happen and at what granularity, and only a rough estimate of time-to-allocation correlations. Using the internal jemalloc stats avoids this. You can then subtract the memory consumed by, e.g., the channels (channel-based memory consumption reporting TBD), work with a less noisy residual memory allocation, and correlate it in Grafana with whatever is a reasonable hypothesis. That is helpful when digging into the root causes of OOM cases. |
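For reference, a minimal sketch of reading jemalloc's internal counters from Rust, assuming the `tikv-jemalloc-ctl` crate and jemalloc configured as the global allocator (the crate choice and setup are assumptions for illustration, not necessarily what this PR does):

```rust
use tikv_jemalloc_ctl::{epoch, stats};

/// Reads jemalloc's internal (allocated, resident) byte counters.
fn read_jemalloc_stats() -> tikv_jemalloc_ctl::Result<(usize, usize)> {
    // jemalloc caches its statistics; advancing the epoch refreshes them.
    epoch::advance()?;
    // Bytes currently allocated by the application.
    let allocated = stats::allocated::read()?;
    // Bytes in physically resident pages mapped by the allocator.
    let resident = stats::resident::read()?;
    Ok((allocated, resident))
}
```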
Hi @koute! How did you track down the issue in statement distribution? Would this PR have made it easier? |
Using my memory profiler, but using it obviously requires being able to reproduce the issue, and you also need to explicitly run Polkadot/Substrate with it hooked in.
Hmm... from what I can see, not really. This is still far too coarse to be useful for figuring out where exactly the issue originates. Simply being able to correlate memory growth with some other events we track in Prometheus is not that useful for a codebase of our size: even if it gives us some idea of the general area of the problem, it will usually still leave a million potential places where the problem could have originated, so you need to do a retest anyway to figure it out. It'd be nice to have some middle-ground solution between "we only have a rough graph of memory usage" and "we hook in a profiler and know about every allocation" which could be turned on by default for every node, but I don't think this is it. |
Nice, thanks @koute ! |
Looks good to me. Not sure it is going to help much in practice, though.
@koute the features you desire will be added in a separate PR; jemalloc does the heavy lifting there too (statistical profiling), but that is out of scope for this PR. |
Introduces new metrics that collect the total allocated and resident memory size in bytes and publish them to Prometheus. This is a small aid in tracking down OOM conditions, as it provides better data on allocations.
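A hedged sketch of what publishing these values could look like with the `prometheus` crate; the metric names, update interval, and background-thread loop are illustrative assumptions, not the PR's actual implementation:

```rust
use std::{thread, time::Duration};

use prometheus::{IntGauge, Registry};
use tikv_jemalloc_ctl::{epoch, stats};

/// Registers two gauges and spawns a thread that periodically copies
/// jemalloc's internal counters into them.
fn spawn_memory_metrics(registry: &Registry) -> prometheus::Result<()> {
    // Metric names are illustrative; the PR may use different ones.
    let allocated = IntGauge::new(
        "jemalloc_allocated_bytes",
        "Total bytes allocated by the application",
    )?;
    let resident = IntGauge::new(
        "jemalloc_resident_bytes",
        "Total bytes of physically resident memory mapped by jemalloc",
    )?;
    registry.register(Box::new(allocated.clone()))?;
    registry.register(Box::new(resident.clone()))?;

    thread::spawn(move || loop {
        // Refresh jemalloc's cached statistics before reading them.
        if epoch::advance().is_ok() {
            if let Ok(a) = stats::allocated::read() {
                allocated.set(a as i64);
            }
            if let Ok(r) = stats::resident::read() {
                resident.set(r as i64);
            }
        }
        thread::sleep(Duration::from_secs(5));
    });

    Ok(())
}
```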