Translate OpenTelemetry System Metrics (CPU/Memory) #7090

jlvoiseux · 2022-01-18T13:41:09Z

Motivation/summary

This Pull Request arose from a project aiming to instrument the Opbeans demos with the corresponding OpenTelemetry agents. As per issue #5796, OpenTelemetry system metrics are not handled, resulting in empty metrics graphs when using the APM UI with Otel-instrumented applications.

The changes are based on the OpenTelemetry specification for System Metrics, and serve two purposes:

- When possible, translate each OpenTelemetry system metric into its corresponding APM field:
  - system.memory.usage (state=free) -> system.memory.actual.free
  - system.memory.usage (state=used) -> system.memory.actual.used.bytes
- Aggregate the received OpenTelemetry metrics to compute the metrics the APM UI requires to display graphs, i.e. derive system.memory.total and system.cpu.total.norm.pct from system.memory.usage and system.cpu.utilization, respectively.
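As an illustration of the two purposes listed above, here is a minimal Go sketch of the direct translation and of the derived memory total. The function names `translateMemoryUsage` and `memoryTotal` are hypothetical and do not come from the PR, and treating the total as free + used is an assumption about how the aggregation might work.

```go
package main

import "fmt"

// translateMemoryUsage maps the "state" attribute of an OTel
// system.memory.usage data point to the APM field named in the PR
// description. The boolean is false for states without a direct
// APM equivalent. (Hypothetical helper, for illustration only.)
func translateMemoryUsage(state string) (string, bool) {
	switch state {
	case "free":
		return "system.memory.actual.free", true
	case "used":
		return "system.memory.actual.used.bytes", true
	}
	return "", false
}

// memoryTotal derives system.memory.total from the free and used values.
// Assumption: the total is approximated as free + used.
func memoryTotal(freeBytes, usedBytes int64) int64 {
	return freeBytes + usedBytes
}

func main() {
	for _, state := range []string{"free", "used", "cached"} {
		name, ok := translateMemoryUsage(state)
		fmt.Printf("state=%s translated=%t field=%q\n", state, ok, name)
	}
	fmt.Println("system.memory.total =", memoryTotal(1024, 3072))
}
```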

Tests have been added, based on what was written to translate JVM metrics.

Questions/Elements to review

This work was done as a proof of concept, with the primary goal being to visualise data in the APM UI. I am uncertain whether the approach chosen for aggregating the missing metrics (use a map to aggregate occurrences and build the missing metrics a posteriori) is fitting for a processor.

How to test

Testing beyond the unit tests provided below requires an Otel-instrumented app. The Otel Node agent is a great candidate, as a package already implements the OpenTelemetry specification for system metrics. I will add the instrumentation example to this PR as soon as it is online.

Related issues

#5796

mergify · 2022-01-18T13:41:31Z

This pull request does not have a backport label. Could you fix it @jlvoiseux? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-7.x is the label to automatically backport to the 7.x branch.
backport-7.\d is the label to automatically backport to the 7.\d branch. \d is the digit

NOTE: backport-skip has been added to this pull request.

apmmachine · 2022-01-18T13:51:44Z

💚 Build Succeeded


Build stats

Start Time: 2022-02-14T01:45:26.584+0000
Duration: 64 min 31 sec

Test stats 🧪

Test	Results
Failed	0
Passed	5634
Skipped	19
Total	5653

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/hey-apm : Run the hey-apm benchmark.
/package : Generate and publish the docker images.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

axw

Thanks for opening this @jlvoiseux! I left some ideas for how we might simplify things a bit, as well as for refactoring the existing translation logic to unify them.

As long as we keep the changes minimal, and limited to at most the metrics that are produced by Elastic APM agents, I'm happy to move ahead with this. For anything beyond that, I would like us to engage other teams working on adding metric definitions to ECS, such as was done in https://github.com/elastic/ecs/blob/main/rfcs/text/0005-host-metric-fields.md.

axw · 2022-01-25T04:04:18Z

processor/otel/metrics.go

+	metricName string
+}
+
+func (b *apmMetricBuilder) build(ms metricsets) {


Do you think we could simplify this by splitting the aggregation and emitting of Elastic metrics? I'd also like it if we could refactor the existing code so we do all translation of OTel -> Elastic metrics in the same way. That probably means moving the switch out of metricsets.upsert.

Finally, I think it might be a bit more straightforward if we made apmMetricBuilder less generic. I'm imagining something like this:

type apmMetricsBuilder struct {
	// System metrics
	cpuCount                 int // from system.cpu.utilization's cpu attribute
	nonIdleCPUUtilizationSum float64
	freeMemoryBytes          int64
	usedMemoryBytes          int64

	// JVM metrics
	jvmMemoryArea int64
	jvmGCTime     map[string]int64
	jvmGCCount    map[string]int64
	jvmMemory     map[jvmMemoryKey]int64
}

type jvmMemoryKey struct {
	area  string
	type_ string
	pool  string // will be "" for non-pool specific memory metrics
}

// accumulate processes m, translating to and accumulating equivalent Elastic APM metrics in b.
func (b *apmMetricsBuilder) accumulate(m pdata.Metric) {
}

// emit upserts Elastic APM metrics into ms from information accumulated in b.
func (b *apmMetricsBuilder) emit(ms metricsets) {
}

You would iterate through all OTel metrics, upserting the original OTel metric and calling apmMetricsBuilder.accumulate. Then finally, call apmMetricsBuilder.emit to produce system.memory.total and system.cpu.total.norm.pct, as well as runtime.jvm.memory.area etc.
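To make the accumulate-then-emit flow concrete, here is a rough, self-contained Go sketch limited to the system metrics. It is not the code from the PR: the `point` type is a hypothetical stand-in for pdata.Metric, the output map stands in for metricsets, and counting one idle data point per logical CPU is an assumption about how the CPU count could be obtained.

```go
package main

import "fmt"

// point is a hypothetical, simplified stand-in for one OTel data point;
// the real processor works on pdata.Metric.
type point struct {
	name  string            // OTel metric name
	attrs map[string]string // data point attributes such as "state" or "cpu"
	value float64
}

// systemMetricsBuilder holds only the system-metrics part of the proposal.
type systemMetricsBuilder struct {
	cpuCount                 int
	nonIdleCPUUtilizationSum float64
	freeMemoryBytes          int64
	usedMemoryBytes          int64
}

// accumulate folds one data point into the builder's running state.
func (b *systemMetricsBuilder) accumulate(p point) {
	switch p.name {
	case "system.cpu.utilization":
		if p.attrs["state"] == "idle" {
			// Assumption: exactly one idle point per logical CPU,
			// so a plain counter is enough to count cores.
			b.cpuCount++
		} else {
			b.nonIdleCPUUtilizationSum += p.value
		}
	case "system.memory.usage":
		switch p.attrs["state"] {
		case "free":
			b.freeMemoryBytes += int64(p.value)
		case "used":
			b.usedMemoryBytes += int64(p.value)
		}
	}
}

// emit writes the derived Elastic APM metrics into out once all
// points have been accumulated.
func (b *systemMetricsBuilder) emit(out map[string]float64) {
	if total := b.freeMemoryBytes + b.usedMemoryBytes; total > 0 {
		out["system.memory.actual.free"] = float64(b.freeMemoryBytes)
		out["system.memory.total"] = float64(total)
	}
	if b.cpuCount > 0 {
		out["system.cpu.total.norm.pct"] = b.nonIdleCPUUtilizationSum / float64(b.cpuCount)
	}
}

func main() {
	points := []point{
		{"system.cpu.utilization", map[string]string{"cpu": "0", "state": "idle"}, 0.5},
		{"system.cpu.utilization", map[string]string{"cpu": "0", "state": "user"}, 0.25},
		{"system.cpu.utilization", map[string]string{"cpu": "0", "state": "system"}, 0.25},
		{"system.cpu.utilization", map[string]string{"cpu": "1", "state": "idle"}, 0.5},
		{"system.cpu.utilization", map[string]string{"cpu": "1", "state": "user"}, 0.5},
		{"system.memory.usage", map[string]string{"state": "free"}, 1024},
		{"system.memory.usage", map[string]string{"state": "used"}, 3072},
	}
	b := &systemMetricsBuilder{}
	for _, p := range points {
		b.accumulate(p)
	}
	out := map[string]float64{}
	b.emit(out)
	fmt.Println(out)
}
```

Keeping accumulation free of side effects on the output means the derived metrics are produced exactly once per batch, regardless of the order in which data points arrive.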

WDYT?

This is a fantastic idea! I will try to propose something that matches your vision in my next commit.

A proposal is available in the form of commit 39a0bdc

processor/otel/metrics.go

jlvoiseux · 2022-01-25T18:11:37Z

Hello @axw, thank you for your feedback and suggestions!
I have committed (39a0bdc) changes based on your accumulator/emitter structure proposal. It indeed makes the whole translation code a lot more readable. The changes in the commit pass all unit tests and also yield good results when tested end-to-end:

Opbeans-Java instrumented with OpenTelemetry: JVM metrics (the lack of host metrics is expected, as the Otel agent sends no related data)

Opbeans-Dotnet instrumented with OpenTelemetry: host metrics

Opbeans-Node instrumented with OpenTelemetry: host metrics

In summary, the exported metrics are now:

jvm.memory.{area}.{type}
jvm.gc.time
jvm.gc.count

system.memory.actual.free
system.memory.total

system.cpu.total.norm.pct

axw

Thanks for the updates @jlvoiseux! It's looking much easier to follow now.

I've just left a couple of questions and a handful of minor style comments, otherwise it's looking great.

processor/otel/metrics.go

axw

Thanks for the updates, just one more code change please :)

After that, I think this is ready. Please also update CHANGELOG.asciidoc and make sure CI passes (make check-full will tell you if there are linting issues).

axw · 2022-02-01T01:58:48Z

processor/otel/metrics.go

+		ms.upsertOne(
+			v.timestamp, fmt.Sprintf("jvm.memory.%s.%s", k.area, k.type_), pdata.NewAttributeMap(),
+			model.MetricsetSample{Value: v.value},
+		)


I believe we had a bug before: we have been ignoring the pool attribute.

IIANM (according to https://www.elastic.co/guide/en/apm/agent/java/current/metrics.html#metrics-jvm), if k.pool != "" then we should name the metric jvm.memory.<area>.pool.<type> instead, and set the attribute "name" to k.pool.

I see you've resolved this, but I can't see where we're creating jvm.memory.<area>.pool.<type> metrics. If you'd like, please leave a TODO and we can address it in a followup.

Otherwise, what we need to do is:

- in accumulate, set the jvmMemoryKey.pool attribute
- here in emit, emit either jvm.memory.<area>.<type> or jvm.memory.<area>.pool.<type>, depending on whether jvmMemoryKey.pool is non-empty
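The naming rule discussed in this thread can be sketched in Go as follows. This is an illustration under assumptions, not the code merged in the PR: `jvmMemoryMetric` is a hypothetical helper, and a plain string map stands in for the attribute map passed to upsertOne.

```go
package main

import "fmt"

// jvmMemoryKey mirrors the key type proposed earlier in the review.
type jvmMemoryKey struct {
	area  string
	type_ string
	pool  string // "" for non-pool specific memory metrics
}

// jvmMemoryMetric returns the Elastic APM metric name for k, plus the
// labels to attach. Pool-specific metrics get ".pool." in the name and
// carry the pool name in the "name" label. (Hypothetical helper.)
func jvmMemoryMetric(k jvmMemoryKey) (string, map[string]string) {
	if k.pool != "" {
		name := fmt.Sprintf("jvm.memory.%s.pool.%s", k.area, k.type_)
		return name, map[string]string{"name": k.pool}
	}
	return fmt.Sprintf("jvm.memory.%s.%s", k.area, k.type_), nil
}

func main() {
	keys := []jvmMemoryKey{
		{area: "heap", type_: "used"},
		{area: "heap", type_: "used", pool: "G1 Eden Space"},
	}
	for _, k := range keys {
		name, labels := jvmMemoryMetric(k)
		fmt.Println(name, labels)
	}
}
```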

Thanks for the heads-up. I think I addressed the pool issue in c8129eb, in the same way as you describe, although I might have missed the point.

accumulate:
https://github.com/jlvoiseux/apm-server/blob/a47157b0238ed8e887755bea59947753421ecc1f/processor/otel/metrics.go#L178-L189

emit:
https://github.com/jlvoiseux/apm-server/blob/a47157b0238ed8e887755bea59947753421ecc1f/processor/otel/metrics.go#L249-L262

Sorry, I must have been looking at a stale commit. Looks perfect!

axw

LGTM, but we're still missing the JVM pool metrics. It was an existing bug, so please feel free to merge this with a TODO and we can address that later.

axw

Thank you, looks great!

jlvoiseux · 2022-04-11T15:08:31Z

Test Plan

This PR implements visualisation of system metrics in the APM UI when APM data is generated by OpenTelemetry agents. Three currently supported languages/frameworks are part of this test plan:

- .NET: I implemented part of the OpenTelemetry system metrics specification myself to further test this PR.
- Java: The OpenTelemetry Java agent generates JVM-related metrics.
- NodeJS: The OpenTelemetry-JS agent can generate system metrics, provided that the related package is used.

Prerequisites

Start a cloud instance of the Elastic Stack with the 8.2.0 version of the APM integration. Most Opbeans do not support Fleet, and adding that functionality would require significant changes to the local docker-compose. As a consequence, we will use the cloud-focused docker-compose.
Retrieve the corresponding CLOUD_ID and credentials for the elastic user.

.NET

Clone my fork of Opbeans-Dotnet.
Switch to the opentelemetry-instrumentation branch.
Add a .env file to the repo with the following environment variables:

STACK_VERSION=8.2.0-SNAPSHOT
ELASTIC_CLOUD_ID=<CLOUD_ID>
ELASTIC_CLOUD_CREDENTIALS=elastic:<ELASTIC_PASSWORD>
APM_AGENT_TYPE=opentelemetry
ELASTIC_APM_SERVICE_NAME=opbeans-dotnet-otel

Run docker-compose -f docker-compose-elastic-cloud.yml up
Open your Elastic instance. The following fields should be populated:

- The system metrics diagrams should be drawn in the APM UI:

Java

This PR was implemented with the OpenTelemetry Java agent v1.9.1. Since this PR was merged, the Runtime-Metrics instrumentation of the OpenTelemetry Java agent has undergone the following changes:

- v1.10.1: The runtime.jvm.memory.area and runtime.jvm.memory.pool metrics now have the type Counter instead of Gauge.
- v1.13.0 (to be released): A specification has been written and implemented for JVM metrics. This specification is great news, as it will allow us to map the various pools to their corresponding areas.

Should we:

- Validate the PR for the Otel Java agent <= v1.9.1?
- Implement a fix in order to be compatible with the Otel Java agent >= v1.10.1?
- Wait for the specification to be released and validate Java metrics as part of a separate PR at that time?

In case we choose the first option, here is how to test it:

Clone my fork of Opbeans-Java.
Switch to the opentelemetry-instrumentation branch.
Add a .env file to the repo with the following environment variables:

STACK_VERSION=8.2.0-SNAPSHOT
ELASTIC_CLOUD_ID=<CLOUD_ID>
ELASTIC_CLOUD_CREDENTIALS=elastic:<ELASTIC_PASSWORD>
APM_AGENT_TYPE=opentelemetry
ELASTIC_APM_SERVICE_NAME=opbeans-java-otel

Run docker-compose -f docker-compose-elastic-cloud.yml up
Open your Elastic instance. The following fields should be populated:

- The system metrics diagrams should be drawn in the APM UI. To view GC metrics, select a time span of at least an hour:

NodeJS

At the time of writing this test plan, the OTLP exporters included with the OpenTelemetry JS agent do not support recent releases of the Otel Collector. As a result, following commit f227841, the PR cannot handle metrics sent by the OpenTelemetry JS agent. This language should not be included in the test plan.

An Otel-instrumented fork of Opbeans-Node is available nonetheless. Once the OTLP exporters have been updated, this fork can be used in the same fashion as the others to test the PR without modifications, as the JS agent implements the Otel system metrics spec.

stuartnelson3 · 2022-04-12T10:56:47Z

confirmed for dotnet and java

ZEXSM · 2022-05-07T13:43:32Z

@axw Is it possible to add these changes to 7.x?

marclop · 2022-05-10T07:14:41Z

@ZEXSM Andrew can confirm it, but we are unlikely to backport such a big change to 7.17, given that we're only backporting bug fixes to that branch.

axw · 2022-05-16T11:19:43Z

Indeed, since 8.0 was released, 7.x is in maintenance mode. Only bug fixes will be backported to the last minor of 7.x (meaning we'll only produce 7.17.x releases with bug fixes), so we won't be backporting this change.

jlvoiseux added 2 commits January 18, 2022 14:07

Translate and compute system CPU/Memory metrics

fd53398

Add CPU/Memory metrics translation tests

d3bd7ae

mergify bot added the backport-skip Skip notification from the automated backport with mergify label Jan 18, 2022

jlvoiseux added metrics OpenTelemetry and removed backport-skip Skip notification from the automated backport with mergify labels Jan 18, 2022

mergify bot added the backport-skip Skip notification from the automated backport with mergify label Jan 18, 2022

Optimise retrieval of number of CPU cores

0c18600

CPU cores were previously counted with a map used as a set. This commit replaces that logic with a simple counter.

axw reviewed Jan 25, 2022

Generalize translation method (accumulate + emit) to JVM metrics

39a0bdc

axw requested changes Jan 31, 2022

Remove type specification and CPU value threshold

8d501af

axw reviewed Feb 1, 2022

jlvoiseux and others added 4 commits February 8, 2022 14:30

Added pool name and updated changelog

c8129eb

Linting fixes

5d08ec8

Merge branch 'main' into otel-system-metrics

8f77cd4

Merge branch 'main' into otel-system-metrics

a47157b

jlvoiseux marked this pull request as ready for review February 10, 2022 23:30

axw reviewed Feb 11, 2022

axw approved these changes Feb 14, 2022

Merge branch 'main' into otel-system-metrics

72675f4

axw added the v8.2.0 label Feb 14, 2022

axw enabled auto-merge (squash) February 14, 2022 01:21

Adapt to new opentelemetry-collector API

1dfc930

axw merged commit 0eddee7 into elastic:main Feb 14, 2022

marclop added the test-plan label Mar 30, 2022

stuartnelson3 added the test-plan-ok label Apr 12, 2022

jlvoiseux mentioned this pull request May 19, 2022

Implement support for OTLP over HTTP (protobuf, binary) #8156

Merged

2 tasks
