Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Translate OpenTelemetry System Metrics (CPU/Memory) #7090

Merged
merged 11 commits into from
Feb 14, 2022

Conversation

jlvoiseux
Copy link
Contributor

Motivation/summary

This Pull Request arose from a project aiming to instrument the Opbeans demos with the corresponding OpenTelemetry agents. As per issue #5796, OpenTelemetry system metrics are not handled, resulting in empty metrics graphs when using the APM UI with Otel-instrumented applications.

The changes are based on the OpenTelemetry specification for System Metrics, and serve two purposes:

  • When possible, translate the OpenTelemetry system metrics into its corresponding APM field :
    system.memory.usage (state=free) -> system.memory.actual.free
    system.memory.usage (state=used) -> system.memory.actual.used.bytes

  • Aggregate received OpenTelemetry metrics to compute the metrics required by the APM UI to display graphs, ie. derive system.memory.total and system.cpu.total.norm.pct from system.memory.usage and system.cpu.utilization

Tests have been added, based on what was written to translate JVM metrics.

Questions/Elements to review

This work was done as a proof of concept, with the primary goals being to visualise data in the APM UI. I am uncertain as to the approach chosen for aggregation of missing metrics (use a map to aggregate occurrences and build missing metrics a posteriori) is fitting for a processor.

How to test

Testing beyond the unit tests provided below requires an Otel instrumented app. The Otel Node agent is a great candidate, as a package already implements the OpenTelemetry specification for system metrics. I will add the instrumentation example to this PR as soon as it is online.

Related issues

#5796

@mergify
Copy link
Contributor

mergify bot commented Jan 18, 2022

This pull request does not have a backport label. Could you fix it @jlvoiseux? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-7.x is the label to automatically backport to the 7.x branch.
  • backport-7./d is the label to automatically backport to the 7./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Jan 18, 2022
@jlvoiseux jlvoiseux added metrics OpenTelemetry and removed backport-skip Skip notification from the automated backport with mergify labels Jan 18, 2022
@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Jan 18, 2022
@apmmachine
Copy link
Contributor

apmmachine commented Jan 18, 2022

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2022-02-14T01:45:26.584+0000

  • Duration: 64 min 31 sec

Test stats 🧪

Test Results
Failed 0
Passed 5634
Skipped 19
Total 5653

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /hey-apm : Run the hey-apm benchmark.

  • /package : Generate and publish the docker images.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

CPU cores were previously counted with a map used as a set. This commit replaces that logic with a simple counter.
Copy link
Member

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for opening this @jlvoiseux! I left some ideas for how we might simplify things a bit, as well as for refactoring the existing translation logic to unify them.

As long as we keep the changes minimal, and limited to at most the metrics that are produced by Elastic APM agents, I'm happy to move ahead with this. For anything beyond that, I would like us to engage other teams working on adding metric definitions to ECS, such as was done in https://github.com/elastic/ecs/blob/main/rfcs/text/0005-host-metric-fields.md.

processor/otel/metrics.go Outdated Show resolved Hide resolved
metricName string
}

func (b *apmMetricBuilder) build(ms metricsets) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we could simplify this by splitting the aggregation and emitting of Elastic metrics? I'd also like it if we could refactor the existing code so we do all translation of OTel -> Elastic metrics in the same way. That probably means moving the switch out of metricsets.upsert.

Finally, I think it might be a bit more straightforward if we made apmMetricBuilder less generic. I'm imagining something like this:

type apmMetricsBuilder struct {
    // System metrics
    cpuCount int // from system.cpu.utilization's cpu attribute
    nonIdleCPUUtilizationSum float64
    freeMemoryBytes int64
    usedMemoryBytes int64

    // JVM metrics
    jvmMemoryArea int64
    jvmGCTime map[string]int64
    jvmGCCount map[string]int64
    jvmMemory map[jvmMemoryKey]int64
}

type jvmMemoryKey struct {
    area string
    type_ string
    pool string // will be "" for non-pool specific memory metrics
}

// accumulate processes m, translating to and accumulating equivalent Elastic APM metrics in b.
func (b *apmMetricsBuilder) accumulate(m pdata.Metric) {
}

// emit upserts Elastic APM metrics into ms from information accumulated in b.
func (b *apmMetricsBuilder) emit(ms metricsets) {
}

You would iterate through all OTel metrics, upserting the original OTel metric and calling apmMetricsBuilder.accumulate. Then finally, call apmMetricsBuilder.emit to produce system.memory.total and system.cpu.total.norm.pct, as well as runtime.jvm.memory.area etc.

WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fantastic idea ! I will try to propose something that matches your vision in my next commit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A proposal is available in the form of commit 39a0bdc

processor/otel/metrics.go Outdated Show resolved Hide resolved
@jlvoiseux
Copy link
Contributor Author

Hello @axw ; thank you for your feedback and suggestions !
I have committed (39a0bdc) changes based on your accumulator/emitter structure proposal. It indeed makes the whole translation code a lot more readable. The changes in the commit pass all unit tests and also yield good results when tested end-to-end :

  • Opbeans-Java instrumented with Open Telemetry : JVM Metrics (the lack of host metrics is normal, as no related data is sent by the Otel agent)

image

  • Opbeans-Dotnet instrumented with Open Telemetry : Host metrics

image

  • Opbeans-Node instrumented with Open Telemetry : Host metrics

image

As a summary, the exported metrics are now :

jvm.memory.{area}.{type}
jvm.gc.time
jvm.gc.count
system.memory.actual.free
system.memory.total
system.cpu.total.norm.pct

Copy link
Member

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates @jlvoiseux! It's looking much easier to follow now.

I've just left a couple of questions and a handful of minor style comments, otherwise it's looking great.

processor/otel/metrics.go Outdated Show resolved Hide resolved
processor/otel/metrics.go Outdated Show resolved Hide resolved
processor/otel/metrics.go Outdated Show resolved Hide resolved
processor/otel/metrics.go Outdated Show resolved Hide resolved
processor/otel/metrics.go Outdated Show resolved Hide resolved
processor/otel/metrics.go Outdated Show resolved Hide resolved
Copy link
Member

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates, just one more code change please :)

After that, I think this is ready. Please also update CHANGELOG.asciidoc and make sure CI passes (make check-full will tell you if there are linting issues).

Comment on lines 248 to 251
ms.upsertOne(
v.timestamp, fmt.Sprintf("jvm.memory.%s.%s", k.area, k.type_), pdata.NewAttributeMap(),
model.MetricsetSample{Value: v.value},
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we had a bug before: we have been ignoring the pool attribute.

IIANM (according to https://www.elastic.co/guide/en/apm/agent/java/current/metrics.html#metrics-jvm), if k.pool != "" then we should name the metric jvm.memory.<area>.pool.<type> instead, and set the attribute "name" to k.pool.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you've resolved this, but I can't see where we're creating java.memory.<area>.pool.<type> metrics. If you'd like, please leave a TODO and we can address it in a followup.

Otherwise, what we need to do is:

  • in accumulate, set the jvmMemoryKey.pool attribute
  • here in emit, emit either jvm.memory.<area>.<type> or jvm.memory.<area>.pool.<type>, depending on whether jvmMemoryKey.pool is non-empty

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the heads-up ; I think I addressed the pool issue in c8129eb, in an identical fashion to what you describe, although I might have missed the point.

accumulate:
https://github.com/jlvoiseux/apm-server/blob/a47157b0238ed8e887755bea59947753421ecc1f/processor/otel/metrics.go#L178-L189

emit:
https://github.com/jlvoiseux/apm-server/blob/a47157b0238ed8e887755bea59947753421ecc1f/processor/otel/metrics.go#L249-L262

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I must have been looking at a stale commit. Looks perfect!

@jlvoiseux jlvoiseux marked this pull request as ready for review February 10, 2022 23:30
Copy link
Member

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but we're still missing the JVM pool metrics. It was an existing bug, so please feel free to merge this with a TODO and we can address that later.

Comment on lines 248 to 251
ms.upsertOne(
v.timestamp, fmt.Sprintf("jvm.memory.%s.%s", k.area, k.type_), pdata.NewAttributeMap(),
model.MetricsetSample{Value: v.value},
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you've resolved this, but I can't see where we're creating java.memory.<area>.pool.<type> metrics. If you'd like, please leave a TODO and we can address it in a followup.

Otherwise, what we need to do is:

  • in accumulate, set the jvmMemoryKey.pool attribute
  • here in emit, emit either jvm.memory.<area>.<type> or jvm.memory.<area>.pool.<type>, depending on whether jvmMemoryKey.pool is non-empty

Copy link
Member

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, looks great!

@axw axw added the v8.2.0 label Feb 14, 2022
@axw axw enabled auto-merge (squash) February 14, 2022 01:21
@axw axw merged commit 0eddee7 into elastic:main Feb 14, 2022
@jlvoiseux
Copy link
Contributor Author

jlvoiseux commented Apr 11, 2022

Test Plan

This PR implements visualisation of system metrics in the APM UI when APM data is generated by OpenTelemetry agents. Three currently supported languages/frameworks are part of this test plan:

  • .NET: I implemented part of the OpenTelemetry system metrics specification myself to further test this PR.
  • Java: The OpenTelemetry Java agent generates JVM-related metrics
  • NodeJS: The OpenTelemetry-JS agent can generate system metrics, provided that the related package is used

Prerequisites

  • Start a cloud instance of the Elastic stack with the 8.2.0 version of the APM integration. Most Opbeans do not support Fleet, and adding that functionality would require important changes to the local docker-compose. As a consequence, we will use the cloud-focused docker-compose.
  • Retrieve the corresponding CLOUD_ID and credentials for the elastic user.

.NET

  • Clone my fork of Opbeans-Dotnet.
  • Switch to the opentelemetry-instrumentation branch.
  • Add a .env file to the repo with the following environment variables:
STACK_VERSION=8.2.0-SNAPSHOT
ELASTIC_CLOUD_ID=<CLOUD_ID>
ELASTIC_CLOUD_CREDENTIALS=elastic:<ELASTIC_PASSWORD>
APM_AGENT_TYPE=opentelemetry
ELASTIC_APM_SERVICE_NAME=opbeans-dotnet-otel
  • Run docker-compose -f docker-compose-elastic-cloud.yml up
  • Open your Elastic instance. The following fields should be populated:

image

- The system metrics diagrams should be drawn in the APM UI:

image

Java

This PR was implemented with the Opentelemetry Java agent v1.9.1. Since this PR was merged, the Runtime-Metrics instrumentation of the Opentelemetry Java agent underwent the following changes:

  • v1.10.1: The runtime.jvm.memory.area and runtime.jvm.memory.pool now have the type Counter instead of Gauge
  • v1.13.0 (To be released): A specification has been written and implemented for JVM metrics. This specification is great news, has it will allow us to map the various pool to their corresponding area.

Should we:

  • Validate the PR for the Otel Java agent <= v1.9.1 ?
  • Implement a fix in order to be compatible with the Otel Java agent >= v1.10.1 ?
  • Wait for the specification to be released and validate Java metrics as part of a separate PR at the time?

In case we choose the first option, here is how to test it:

  • Clone my fork of Opbeans-Java.
  • Switch to the opentelemetry-instrumentation branch.
  • Add a .env file to the repo with the following environment variables:
STACK_VERSION=8.2.0-SNAPSHOT
ELASTIC_CLOUD_ID=<CLOUD_ID>
ELASTIC_CLOUD_CREDENTIALS=elastic:<ELASTIC_PASSWORD>
APM_AGENT_TYPE=opentelemetry
ELASTIC_APM_SERVICE_NAME=opbeans-java-otel
  • Run docker-compose -f docker-compose-elastic-cloud.yml up
  • Open your Elastic instance. The following fields should be populated:

image

- The system metrics diagrams should be drawn in the APM UI. To view GC metrics, select a time span of at least an hour:

image

NodeJS

At the time of writing this test plan, the OTLP exporters included with the Opentelemetry JS Agent do not support recent releases of the Otel Collector. As a result, following commit f227841, the PR cannot handle metrics sent by the Opentelemetry JS agent. This language should not be included in the test plan.

An Otel instrumented fork of Opbeans-Node is available nonetheless. Once the OTLP exporters have been updated, this fork can be used in the same fashion as the others to test the PR without modifications, as the JS agent implements the Otel system metrics spec.

@stuartnelson3
Copy link
Contributor

confirmed for dotnet and java

@ZEXSM
Copy link

ZEXSM commented May 7, 2022

@axw Is it possible to add this changes for 7.x ?

@marclop
Copy link
Contributor

marclop commented May 10, 2022

@ZEXSM Andrew can confirm it, but we are unlikely to back port such a big change to 7.17 given we're only backporting bug fixes to that branch.

@axw
Copy link
Member

axw commented May 16, 2022

Indeed, since 8.0 was released, 7.x is in maintenance mode. Only bug fixes will be backported to the last minor of 7.x (meaning we'll only produce 7.17.x releases with bug fixes), so we won't be backporting this change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip Skip notification from the automated backport with mergify metrics OpenTelemetry test-plan test-plan-ok v8.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants