Record memory usage after garbage collection #6963
Merged
Commits
4f4c701 Record memory usage after garbage collection (jack-berg)
59bf66d Check for null MemoryUsage (jack-berg)
244d551 Continue instead of return (jack-berg)
ad771a3 Merge branch 'main' of https://github.com/open-telemetry/opentelemetr… (jack-berg)
c4bde8a Rename to usage_after_last_gc (jack-berg)
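For context, a minimal sketch of the pattern the commit messages describe (record each pool's collection usage, check for a null MemoryUsage, and continue rather than return when a pool doesn't report one). The class name is hypothetical, and this is written against the public OpenTelemetry and JMX APIs rather than taken from the PR's diff:

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.Meter;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

class UsageAfterLastGc {
  private static final AttributeKey<String> POOL = AttributeKey.stringKey("pool");

  static void register(Meter meter) {
    meter
        .upDownCounterBuilder("process.runtime.jvm.memory.usage_after_last_gc")
        .setUnit("By")
        .setDescription("Measure of memory used after the most recent garbage collection")
        .buildWithCallback(
            measurement -> {
              for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                // getCollectionUsage() returns null for pools that don't track
                // post-GC usage; skip those pools and keep iterating rather than
                // returning early ("Continue instead of return").
                MemoryUsage usage = pool.getCollectionUsage();
                if (usage == null) {
                  continue;
                }
                measurement.record(usage.getUsed(), Attributes.of(POOL, pool.getName()));
              }
            });
  }
}
```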
Conversations
Reporting this metric as an absolute value might not be helpful when you have a fleet of hosts, each with JVMs with different max heap sizes. I understand that you wanted to align with JMX metric conventions, but I think there is more value in a normalized metric. Maybe we can add an extra metric, `process.runtime.jvm.memory.usage_after_gc_percentage`? Calculating the percentage after the metric is emitted will not be trivial or feasible in all backends for the case that you want to analyze data from multiple hosts.
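For illustration only (an assumption about how such a metric could be computed at the source, not something this PR implements), the per-pool arithmetic would look roughly like this, using the same JMX API the PR relies on:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

class UsageAfterGcPercentage {
  public static void main(String[] args) {
    for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
      MemoryUsage afterGc = pool.getCollectionUsage();
      // Both pieces must exist: collection usage may be null, and getMax()
      // returns -1 when the pool has no defined limit.
      if (afterGc == null || afterGc.getMax() < 0) {
        continue;
      }
      double percentage = 100.0 * afterGc.getUsed() / afterGc.getMax();
      System.out.printf("%s: %.1f%%%n", pool.getName(), percentage);
    }
  }
}
```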
Reminded me of open-telemetry/opentelemetry-specification#2392 (comment)
Is this really a challenge for backends? Feels like doing operations across metrics, like dividing one by another, is pretty standard, and probably table stakes for a metrics backend. For example, if you exported these metrics to Prometheus, it's trivial to compute a percentage with a query like `100 * process_runtime_jvm_memory_usage_after_gc / process_runtime_jvm_memory_limit`.
If we buy this argument, wouldn't we also want to report the percentage for usage and committed as well? And if we report a percentage instead of the absolute value, doesn't that stand to frustrate people on the other side who want to analyze their absolute figures instead of relative percentages?
Another argument against reporting utilization is that we don't actually have the data to report it, because not all memory pools report memory usage after GC, and not all memory pools report a limit. Here are the sets of `pool` attribute values that are reported for usage, usage_after_gc, and limit for an app running with the G1 garbage collector:

`process.runtime.jvm.memory.usage` reports values for 8 pools: `CodeHeap 'non-nmethods'`, `CodeHeap 'non-profiled nmethods'`, `CodeHeap 'profiled nmethods'`, `Compressed Class Space`, `G1 Eden Space`, `G1 Old Gen`, `G1 Survivor Space`, `Metaspace`.

`process.runtime.jvm.memory.usage_after_gc` reports values for 3 pools: `G1 Eden Space`, `G1 Old Gen`, `G1 Survivor Space`.

`process.runtime.jvm.memory.limit` reports values for 5 pools: `CodeHeap 'non-nmethods'`, `CodeHeap 'non-profiled nmethods'`, `CodeHeap 'profiled nmethods'`, `G1 Old Gen`, …

Notice how there aren't usage_after_gc or limit values for all the pools. Reporting utilization would limit the set of pools to those that have both usage_after_gc and limit values, which is only `G1 Old Gen`. The same argument applies to reporting utilization instead of usage.
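To reproduce these sets on your own JVM, a small self-contained sketch (class name assumed) that prints which pools report a plain usage, a post-GC usage, and a defined limit:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

class PoolReporting {
  public static void main(String[] args) {
    for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
      // Backs process.runtime.jvm.memory.usage.
      MemoryUsage usage = pool.getUsage();
      // Backs usage_after_gc; null for pools not managed by the collector.
      MemoryUsage afterGc = pool.getCollectionUsage();
      // Backs process.runtime.jvm.memory.limit; getMax() is -1 when undefined.
      boolean hasLimit = usage != null && usage.getMax() >= 0;
      System.out.printf(
          "%-35s usage=%b usage_after_gc=%b limit=%b%n",
          pool.getName(), usage != null, afterGc != null, hasLimit);
    }
  }
}
```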
If you have multiple JVMs (say hundreds) reporting with different `process_runtime_jvm_memory_limit` values because each JVM is using a different `-Xmx`, how do you do that without manually creating a query that will match every `process_runtime_jvm_memory_usage_after_gc` to the respective `process_runtime_jvm_memory_limit`? Moreover, what if the attributes that uniquely identify the metrics are not predictable?

Having said that, you made a good point about the lack of consistency in the memory pools. It does not make sense to report a normalized metric per memory pool.
If it's not possible to recognize which instance emitted particular metrics, then I'd say that your data is nonsense anyway - if you can't correlate metric instruments with each other, or with a particular deployment, the telemetry is basically useless.
I think we should operate under the assumption that resource attributes uniquely identify the metrics source - and if this is not true, it's a broader problem that needs to be fixed across the board.