-
Notifications
You must be signed in to change notification settings - Fork 523
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Translate OpenTelemetry System Metrics (CPU/Memory) #7090
Conversation
This pull request does not have a backport label. Could you fix it @jlvoiseux? 🙏
NOTE: |
CPU cores were previously counted with a map used as a set. This commit replaces that logic with a simple counter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for opening this @jlvoiseux! I left some ideas for how we might simplify things a bit, as well as for refactoring the existing translation logic to unify them.
As long as we keep the changes minimal, and limited to at most the metrics that are produced by Elastic APM agents, I'm happy to move ahead with this. For anything beyond that, I would like us to engage other teams working on adding metric definitions to ECS, such as was done in https://github.com/elastic/ecs/blob/main/rfcs/text/0005-host-metric-fields.md.
processor/otel/metrics.go
Outdated
metricName string | ||
} | ||
|
||
func (b *apmMetricBuilder) build(ms metricsets) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we could simplify this by splitting the aggregation and emitting of Elastic metrics? I'd also like it if we could refactor the existing code so we do all translation of OTel -> Elastic metrics in the same way. That probably means moving the switch out of metricsets.upsert.
Finally, I think it might be a bit more straightforward if we made apmMetricBuilder less generic. I'm imagining something like this:
type apmMetricsBuilder struct {
// System metrics
cpuCount int // from system.cpu.utilization's cpu attribute
nonIdleCPUUtilizationSum float64
freeMemoryBytes int64
usedMemoryBytes int64
// JVM metrics
jvmMemoryArea int64
jvmGCTime map[string]int64
jvmGCCount map[string]int64
jvmMemory map[jvmMemoryKey]int64
}
type jvmMemoryKey struct {
area string
type_ string
pool string // will be "" for non-pool specific memory metrics
}
// accumulate processes m, translating to and accumulating equivalent Elastic APM metrics in b.
func (b *apmMetricsBuilder) accumulate(m pdata.Metric) {
}
// emit upserts Elastic APM metrics into ms from information accumulated in b.
func (b *apmMetricsBuilder) emit(ms metricsets) {
}
You would iterate through all OTel metrics, upserting the original OTel metric and calling apmMetricsBuilder.accumulate
. Then finally, call apmMetricsBuilder.emit
to produce system.memory.total
and system.cpu.total.norm.pct
, as well as runtime.jvm.memory.area
etc.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a fantastic idea ! I will try to propose something that matches your vision in my next commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A proposal is available in the form of commit 39a0bdc
Hello @axw ; thank you for your feedback and suggestions !
As a summary, the exported metrics are now :
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates @jlvoiseux! It's looking much easier to follow now.
I've just left a couple of questions and a handful of minor style comments, otherwise it's looking great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates, just one more code change please :)
After that, I think this is ready. Please also update CHANGELOG.asciidoc and make sure CI passes (make check-full
will tell you if there are linting issues).
processor/otel/metrics.go
Outdated
ms.upsertOne( | ||
v.timestamp, fmt.Sprintf("jvm.memory.%s.%s", k.area, k.type_), pdata.NewAttributeMap(), | ||
model.MetricsetSample{Value: v.value}, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we had a bug before: we have been ignoring the pool attribute.
IIANM (according to https://www.elastic.co/guide/en/apm/agent/java/current/metrics.html#metrics-jvm), if k.pool != ""
then we should name the metric jvm.memory.<area>.pool.<type>
instead, and set the attribute "name" to k.pool
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you've resolved this, but I can't see where we're creating java.memory.<area>.pool.<type>
metrics. If you'd like, please leave a TODO and we can address it in a followup.
Otherwise, what we need to do is:
- in
accumulate
, set thejvmMemoryKey.pool
attribute - here in
emit
, emit eitherjvm.memory.<area>.<type>
orjvm.memory.<area>.pool.<type>
, depending on whetherjvmMemoryKey.pool
is non-empty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the heads-up ; I think I addressed the pool issue in c8129eb, in an identical fashion to what you describe, although I might have missed the point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I must have been looking at a stale commit. Looks perfect!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but we're still missing the JVM pool metrics. It was an existing bug, so please feel free to merge this with a TODO and we can address that later.
processor/otel/metrics.go
Outdated
ms.upsertOne( | ||
v.timestamp, fmt.Sprintf("jvm.memory.%s.%s", k.area, k.type_), pdata.NewAttributeMap(), | ||
model.MetricsetSample{Value: v.value}, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you've resolved this, but I can't see where we're creating java.memory.<area>.pool.<type>
metrics. If you'd like, please leave a TODO and we can address it in a followup.
Otherwise, what we need to do is:
- in
accumulate
, set thejvmMemoryKey.pool
attribute - here in
emit
, emit eitherjvm.memory.<area>.<type>
orjvm.memory.<area>.pool.<type>
, depending on whetherjvmMemoryKey.pool
is non-empty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, looks great!
Test PlanThis PR implements visualisation of system metrics in the APM UI when APM data is generated by OpenTelemetry agents. Three currently supported languages/frameworks are part of this test plan:
Prerequisites
.NET
JavaThis PR was implemented with the Opentelemetry Java agent v1.9.1. Since this PR was merged, the
Should we:
In case we choose the first option, here is how to test it:
NodeJSAt the time of writing this test plan, the OTLP exporters included with the Opentelemetry JS Agent do not support recent releases of the Otel Collector. As a result, following commit f227841, the PR cannot handle metrics sent by the Opentelemetry JS agent. This language should not be included in the test plan. An Otel instrumented fork of Opbeans-Node is available nonetheless. Once the OTLP exporters have been updated, this fork can be used in the same fashion as the others to test the PR without modifications, as the JS agent implements the Otel system metrics spec. |
confirmed for dotnet and java |
@axw Is it possible to add this changes for 7.x ? |
@ZEXSM Andrew can confirm it, but we are unlikely to back port such a big change to |
Indeed, since 8.0 was released, 7.x is in maintenance mode. Only bug fixes will be backported to the last minor of 7.x (meaning we'll only produce 7.17.x releases with bug fixes), so we won't be backporting this change. |
Motivation/summary
This Pull Request arose from a project aiming to instrument the Opbeans demos with the corresponding OpenTelemetry agents. As per issue #5796, OpenTelemetry system metrics are not handled, resulting in empty metrics graphs when using the APM UI with Otel-instrumented applications.
The changes are based on the OpenTelemetry specification for System Metrics, and serve two purposes:
When possible, translate the OpenTelemetry system metrics into its corresponding APM field :
system.memory.usage (state=free) -> system.memory.actual.free
system.memory.usage (state=used) -> system.memory.actual.used.bytes
Aggregate received OpenTelemetry metrics to compute the metrics required by the APM UI to display graphs, ie. derive
system.memory.total
andsystem.cpu.total.norm.pct
fromsystem.memory.usage
andsystem.cpu.utilization
Tests have been added, based on what was written to translate JVM metrics.
Questions/Elements to review
This work was done as a proof of concept, with the primary goals being to visualise data in the APM UI. I am uncertain as to the approach chosen for aggregation of missing metrics (use a map to aggregate occurrences and build missing metrics a posteriori) is fitting for a processor.
How to test
Testing beyond the unit tests provided below requires an Otel instrumented app. The Otel Node agent is a great candidate, as a package already implements the OpenTelemetry specification for system metrics. I will add the instrumentation example to this PR as soon as it is online.
Related issues
#5796