Visualize JVM health indicators collected by the OpenTelemetry Java Agent #4919
@cyrille-leclerc I'm relabelling this as "enhancement", since we previously never intended to translate metrics. I think it's reasonable to do, but I haven't yet checked feasibility.
@axw @sqren @felixbarny Any thoughts on doing this in apm-server versus handling it elsewhere, such as an ingest pipeline or the UI? I see the benefit of handling it via apm-server, but I'm also concerned about putting that responsibility there. The benefit of an ingest pipeline is that it could be updated out of cycle with the stack, eventually via a simple package update. I think the UI suffers from the same complexity as putting this in the server, but I'm interested in your thoughts.
@graphaelli IMO it's a job for our ingest pipeline. The server doesn't need to perform any computation on the metrics, so there's no real benefit to doing it there. I expect doing it in the ingest pipeline would help keep the UI code cleaner and more efficient, but @sqren would know better. I'd like to pick up on this:
I'd rather like to think of the package as just another part of each release, and keep them in sync. IMO, ideally the package would be bundled with Kibana, just like we bundle APM Server with Elastic Agent. That way, when we upgrade Kibana to X.Y.Z, we get exactly version X.Y.Z of the package installed, and the APM UI code can assume things about the data streams installed. See also elastic/kibana#93194 (comment). Maybe let's continue discussing there?
One caveat about the pipeline approach: metric labels get combined with other attributes, as can be seen in the example in the description. This muddies the waters a little bit about where exactly the labels (which also need to be translated) come from -- are they metric dimensions, or are they just some attributes of the application? This is a bit clearer in the APM Server code. Not a big deal though.
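To make the caveat concrete, here is a hypothetical sketch of the stored document shape (field values and the `deployment` attribute are made up for illustration): metric dimensions and ordinary application attributes both land under `labels`, so nothing structural distinguishes them.

```python
# Hypothetical document, illustrating the caveat above: metric dimensions
# and plain application attributes both end up under `labels`.
doc = {
    "runtime.jvm.memory.area": 41943040,
    "labels": {
        "area": "heap",          # metric dimension (from OpenTelemetry)
        "type": "used",          # metric dimension (from OpenTelemetry)
        "deployment": "canary",  # ordinary application attribute (made up)
    },
}

# A translating pipeline must hard-code which keys are dimensions of this
# particular metric; the document itself doesn't say.
DIMENSIONS = {"area", "type"}
attributes = {k: v for k, v in doc["labels"].items() if k not in DIMENSIONS}
```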
Another consideration is whether we should store the metrics in the original format, the translated format, or both.
I think we should rule out storing only in the translated format as it seems like a rather intrusive approach with the potential for confusion. It's also not a model that works well going forward. What if we want to translate more metrics? It could be considered a breaking change. I don't have a strong opinion about options 1 and 2, both have their pros and cons, but I gravitate towards just storing the metrics in the original format and doing a fallback in the UI for OTel.
I don't have any strong opinions on this. I'm leaning towards option 2 to keep things consistent, but if needed this can be handled by the UI.
Good point about surprising users by changing metric names. My preference would be option 2, store both by default. We can always provide guidance to users to add another ingest processor to drop duplicates if they don't want or need them.
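The "store both, optionally drop duplicates" option can be sketched as follows. This is a Python simulation of what the `set` processor plus an optional user-added `remove` processor would do for one metric (`runtime.jvm.gc.collection` → `jvm.gc.time`, taken from the pipeline discussed in this thread); it is not the actual ingest pipeline.

```python
def translate_gc_time(doc: dict, keep_original: bool = True) -> dict:
    """Mimic copying runtime.jvm.gc.collection to jvm.gc.time (option 2:
    store both), with an optional user-added step that drops the
    duplicated original metric."""
    out = dict(doc)
    if "runtime.jvm.gc.collection" in out:
        out["jvm.gc.time"] = out["runtime.jvm.gc.collection"]
        if not keep_original:
            del out["runtime.jvm.gc.collection"]
    return out
```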
On the original format, is it desired to have […]? Maybe the improvement is to have full ECS mapping for more attributes that are today handled as […].

Reference in Prometheus: […]
@cyrille-leclerc I think it's desirable to have […]
Another complication is that we can't just rename the metric, as the Java agent handles labels/dimensions differently than the OTel metrics. The OTel metrics have `area` and `type` as labels.
But we're using them as part of the metric name itself.
Both work fine and there's nothing obviously wrong with either approach. In the philosophy followed by our agent, we'd only create a label for dimensions that are not well-known upfront. An example is […]. The Prometheus metrics module for Hotspot has yet another variation. It falls in between the Java agent and the OTel agent by using a label for `area`, but `used`, `committed`, and `max` are part of the metric name.
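To summarize the three conventions side by side, here is an illustrative sketch. The OTel and Elastic APM names come from this thread; the Prometheus Hotspot name is the well-known `jvm_memory_bytes_used{area=...}` form, and the helper function is a hypothetical illustration of folding labels into a metric name, not real agent code.

```python
# Same measurement, three naming conventions (illustrative):
#
#   OTel Java agent:      runtime.jvm.memory.area{area="heap", type="used"}
#   Elastic APM agent:    jvm.memory.heap.used
#   Prometheus Hotspot:   jvm_memory_bytes_used{area="heap"}   (in between)

def to_elastic_name(otel_name: str, labels: dict) -> str:
    """Fold the OTel `area`/`type` labels into the Elastic APM metric name."""
    if otel_name == "runtime.jvm.memory.area":
        return f"jvm.memory.{labels['area']}.{labels['type']}"
    raise ValueError(f"no mapping defined for {otel_name}")
```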
@axw thanks, this plan looks good to me. I see that #4714 is marked in Zube as "backlog" rather than "ready". What's needed to move #4714 forward?

@felixbarny interesting lack of consistency on attributes versus metric name. I like the guideline on when to use labels in the Prometheus docs, and I don't find reasons to use labels here: https://prometheus.io/docs/practices/instrumentation/#use-labels
Yes, I also like their guidelines. But it almost seems like they violate their own rules here :)
A 25th hour in each day would be a good start ;) For 7.13 I think we can manage to translate the attributes for which there are existing analogues in our agents, e.g. […]
@axw I wanted to understand if additional specification work was needed :-)
Does it mean that we/I should prepare this mapping?
@cyrille-leclerc I've responded on #4714.
@felixbarny @sqren and I just discussed this, and we will go with the duplicating ingest pipeline approach for now.
I could do with some help defining the mapping.

The OpenTelemetry Java metrics code is at https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/runtime-metrics/library/src/main/java/io/opentelemetry/instrumentation/runtimemetrics

Our agent's metrics are documented at https://www.elastic.co/guide/en/apm/agent/java/current/metrics.html#metrics-jvm

Here's the mapping I've got so far, in an ingest pipeline:

```yaml
apm_opentelemetry_metrics:
  description: Populate Elastic APM metric fields from well-known OpenTelemetry metric counterparts
  processors:
    - set:
        # Copy `runtime.jvm.memory.area` (OpenTelemetry) to `jvm.memory.{area}.{type}` (Elastic APM).
        field: jvm.memory.{{labels.area}}.{{labels.type}}
        copy_from: runtime.jvm.memory.area
        if: ctx.runtime?.jvm?.memory?.area != null && ctx.labels?.area != null && ctx.labels?.type != null
    - set:
        # Copy `runtime.jvm.gc.collection` (OpenTelemetry) to `jvm.gc.time` (Elastic APM).
        # Both are defined in milliseconds.
        field: jvm.gc.time
        copy_from: runtime.jvm.gc.collection
        if: ctx.runtime?.jvm?.gc?.collection != null && ctx.labels?.gc != null
    - set:
        # Copy `labels.gc` (OpenTelemetry) to `labels.name` (Elastic APM), for jvm.gc.time.
        field: labels.name
        copy_from: labels.gc
        override: false # don't replace existing labels.name field, if any
        if: ctx.labels?.gc != null
```

A couple of discrepancies:
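The conditionals in the three `set` processors can be sanity-checked with a small simulation. This is a rough Python sketch of what the pipeline does to a document, useful only for reasoning about the logic; the real thing runs as Painless conditions inside an Elasticsearch ingest pipeline.

```python
def apply_pipeline(doc: dict) -> dict:
    """Rough Python equivalent of the three `set` processors."""
    out = {**doc}
    labels = dict(out.get("labels", {}))
    runtime_jvm = out.get("runtime", {}).get("jvm", {})

    # set jvm.memory.{{labels.area}}.{{labels.type}} from runtime.jvm.memory.area
    mem = runtime_jvm.get("memory", {}).get("area")
    if mem is not None and "area" in labels and "type" in labels:
        out.setdefault("jvm", {}).setdefault("memory", {}) \
           .setdefault(labels["area"], {})[labels["type"]] = mem

    # set jvm.gc.time from runtime.jvm.gc.collection
    gc = runtime_jvm.get("gc", {}).get("collection")
    if gc is not None and "gc" in labels:
        out.setdefault("jvm", {})["gc"] = {"time": gc}
        # set labels.name from labels.gc, with override: false
        labels.setdefault("name", labels["gc"])

    if labels:
        out["labels"] = labels
    return out
```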
Thanks @axw,
👍 on both.
Many thanks @felixbarny!
The PR has been merged. The metric […]
Is it worth looking for both […]?
I think it does make sense to look for both. It doesn't seem to add a lot of complexity, and currently there's no released version that includes the updated metrics. An important thing to note: […]
Once the metric names become stable we might want to require a specific version, depending on how complex it would be to support a wider range of versions. I.e. if there are not a lot of changes, we may just support all metric names, even from older versions. But if there's a lot of churn in the metric names and it would be complex to maintain that, I think it's fair to require a specific version.
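The "look for both old and new names" fallback can be sketched as a candidate-list lookup. The helper and the concrete field names below are hypothetical placeholders for illustration, not the actual before/after names from the renaming PR.

```python
def first_present(doc: dict, candidate_names: list):
    """Return the first metric value found among candidate field names,
    preferring earlier (newer) names — a sketch of the fallback that
    supports both pre- and post-rename documents."""
    for name in candidate_names:
        if name in doc:
            return doc[name]
    return None

# Prefer the new name, fall back to the old one (names are made up):
value = first_present(
    {"example.metric.old_name": 7.0},
    ["example.metric.new_name", "example.metric.old_name"],
)
```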
Unfortunately this is going to slip until after 7.13.0, as we found some issues with the approach we've taken: #4986 (comment), #4986 (comment). I think this will end up moving to 7.14.0.
APM Server version (`apm-server version`):

Description of the problem including expected versus actual behavior:

Steps to reproduce:
https://github.com/cyrille-leclerc/my-shopping-cart/tree/39d22e715d96e15d5540ea28eca007de9d3edab8

JVM Runtime Metrics reported by the OpenTelemetry Collector Prometheus Exporter:

Sample Elastic APM Metrics document for `runtime.jvm.memory.area{area="non_heap",type="used"}`: