
Visualize JVM health indicators collected by the OpenTelemetry Java Agent #4919

Closed
cyrille-leclerc opened this issue Mar 4, 2021 · 24 comments · Fixed by #4986

@cyrille-leclerc
Contributor

APM Server version (apm-server version):

Description of the problem including expected versus actual behavior:

Steps to reproduce:

https://github.com/cyrille-leclerc/my-shopping-cart/tree/39d22e715d96e15d5540ea28eca007de9d3edab8

JVM Runtime Metrics reported by the OpenTelemetry Collector Prometheus Exporter

# HELP runtime_jvm_gc_collection Time spent in a given JVM garbage collector in milliseconds.
# TYPE runtime_jvm_gc_collection counter
runtime_jvm_gc_collection{gc="G1 Old Generation"} 0
runtime_jvm_gc_collection{gc="G1 Young Generation"} 22
# HELP runtime_jvm_memory_area Bytes of a given JVM memory area.
# TYPE runtime_jvm_memory_area counter
runtime_jvm_memory_area{area="heap",type="committed"} 5.6623104e+07
runtime_jvm_memory_area{area="heap",type="max"} 4.294967296e+09
runtime_jvm_memory_area{area="heap",type="used"} 1.4552776e+07
runtime_jvm_memory_area{area="non_heap",type="committed"} 4.1222144e+07
runtime_jvm_memory_area{area="non_heap",type="used"} 3.7732072e+07
# HELP runtime_jvm_memory_pool Bytes of a given JVM memory pool.
# TYPE runtime_jvm_memory_pool counter
runtime_jvm_memory_pool{pool="CodeHeap 'non-nmethods'",type="committed"} 2.555904e+06
runtime_jvm_memory_pool{pool="CodeHeap 'non-nmethods'",type="max"} 5.840896e+06
runtime_jvm_memory_pool{pool="CodeHeap 'non-nmethods'",type="used"} 1.288832e+06
runtime_jvm_memory_pool{pool="CodeHeap 'non-profiled nmethods'",type="committed"} 2.555904e+06
runtime_jvm_memory_pool{pool="CodeHeap 'non-profiled nmethods'",type="max"} 1.22908672e+08
runtime_jvm_memory_pool{pool="CodeHeap 'non-profiled nmethods'",type="used"} 1.288448e+06
runtime_jvm_memory_pool{pool="CodeHeap 'profiled nmethods'",type="committed"} 6.750208e+06
runtime_jvm_memory_pool{pool="CodeHeap 'profiled nmethods'",type="max"} 1.22908672e+08
runtime_jvm_memory_pool{pool="CodeHeap 'profiled nmethods'",type="used"} 6.70784e+06
runtime_jvm_memory_pool{pool="Compressed Class Space",type="committed"} 3.407872e+06
runtime_jvm_memory_pool{pool="Compressed Class Space",type="max"} 1.073741824e+09
runtime_jvm_memory_pool{pool="Compressed Class Space",type="used"} 3.266448e+06
runtime_jvm_memory_pool{pool="G1 Eden Space",type="committed"} 2.9360128e+07
runtime_jvm_memory_pool{pool="G1 Eden Space",type="used"} 2.097152e+06
runtime_jvm_memory_pool{pool="G1 Old Gen",type="committed"} 2.5165824e+07
runtime_jvm_memory_pool{pool="G1 Old Gen",type="max"} 4.294967296e+09
runtime_jvm_memory_pool{pool="G1 Old Gen",type="used"} 1.0550784e+07
runtime_jvm_memory_pool{pool="G1 Survivor Space",type="committed"} 2.097152e+06
runtime_jvm_memory_pool{pool="G1 Survivor Space",type="used"} 1.90484e+06
runtime_jvm_memory_pool{pool="Metaspace",type="committed"} 2.5952256e+07
runtime_jvm_memory_pool{pool="Metaspace",type="used"} 2.5180024e+07
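
Each line of the exposition above follows the Prometheus text format: `name{labels} value`. A minimal parsing sketch (hypothetical helper, not part of the collector; it ignores escaping, `# HELP`/`# TYPE` lines, timestamps, and commas inside label values):

```python
import re

# Match one Prometheus text-format sample line: name, optional {labels}, value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)

def parse_sample(line):
    """Parse a single sample line into (name, labels dict, float value)."""
    m = SAMPLE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = {}
    if m.group("labels"):
        # Naive split: breaks on commas inside quoted label values.
        for pair in m.group("labels").split(","):
            key, _, val = pair.partition("=")
            labels[key.strip()] = val.strip().strip('"')
    return m.group("name"), labels, float(m.group("value"))

name, labels, value = parse_sample(
    'runtime_jvm_memory_area{area="non_heap",type="used"} 3.7732072e+07'
)
```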

Sample Elastic APM Metrics document for runtime.jvm.memory.area{area="non_heap",type="used"}

{
  "_index": "apm-7.12.0-metric-000001",
  "_type": "_doc",
  "_id": "ug4U_ncBlMla0ZqF00Vo",
  "_version": 1,
  "_score": null,
  "_source": {
    "agent": {
      "name": "opentelemetry/java",
      "version": "0.17.0"
    },
    "process": {
      "pid": 4990
    },
    "runtime": {
      "jvm": {
        "memory": {
          "area": 37609512
        }
      }
    },
    "processor": {
      "name": "metric",
      "event": "metric"
    },
    "labels": {
      "process_runtime_description": "AdoptOpenJDK OpenJDK 64-Bit Server VM 15.0.2+7",
      "area": "non_heap",
      "os_type": "DARWIN",
      "process_executable_path": "/Library/Java/JavaVirtualMachines/adoptopenjdk-15.jdk/Contents/Home:bin:java",
      "process_runtime_name": "OpenJDK Runtime Environment",
      "os_description": "Mac OS X 10.16",
      "service_namespace": "com-shoppingcart",
      "process_command_line": "/Library/Java/JavaVirtualMachines/adoptopenjdk-15.jdk/Contents/Home:bin:java -javaagent:./../.otel/opentelemetry-javaagent-all-0.17.0.jar -Dio.opentelemetry.auto.slf4j.simpleLogger.defaultLogLevel=info",
      "type": "used",
      "process_runtime_version": "15.0.2+7",
      "telemetry_auto_version": "0.17.0"
    },
    "observer": {
      "hostname": "MacBook-Pro.localdomain",
      "id": "f9c7166f-0f12-4fc4-8c1d-3c137d9b382a",
      "type": "apm-server",
      "ephemeral_id": "6e8d9a14-4392-49f5-b73a-e2df869bfc33",
      "version": "7.12.0",
      "version_major": 7
    },
    "@timestamp": "2021-03-04T16:31:52.663Z",
    "ecs": {
      "version": "1.6.0"
    },
    "service": {
      "environment": "staging",
      "name": "monitor",
      "language": {
        "name": "java"
      },
      "version": "1.0-SNAPSHOT"
    },
    "event": {
      "ingested": "2021-03-04T16:31:53.702349Z"
    }
  },
  "fields": {
    "event.ingested": [
      "2021-03-04T16:31:53.702Z"
    ],
    "@timestamp": [
      "2021-03-04T16:31:52.663Z"
    ]
  },
  "sort": [
    1614875512663
  ]
}
@axw
Member

axw commented Mar 5, 2021

@cyrille-leclerc I'm relabelling this as "enhancement", since we previously never intended to translate metrics. I think it's reasonable to do, but I haven't yet checked feasibility.

@axw axw added enhancement and removed bug labels Mar 5, 2021
@graphaelli
Member

@axw @sqren @felixbarny Any thoughts on doing this in apm server versus handling it elsewhere, such as an ingest pipeline or the UI? I see the benefit of handling it via apm server, but I'm also concerned about putting that responsibility there. The benefit of an ingest pipeline is that it could be updated out of cycle with the stack, eventually via a simple package update. I think the UI suffers from the same complexity as putting this in the server, but I'm interested in your thoughts.

@axw
Member

axw commented Mar 6, 2021

@graphaelli IMO it's a job for our ingest pipeline. The server doesn't need to perform any computation on the metrics, so there's no real benefit to doing it there. I expect doing it in the ingest pipeline would help keep the UI code cleaner and more efficient, but @sqren would know better.

I'd like to pick on this:

The benefit of ingest pipeline is that it could be updated out of cycle with the stack, eventually via a simple package update.

I'd rather like to think of the package as just another part of each release, and keep them in sync. IMO, ideally the package would be bundled with Kibana, just like we bundle APM Server with Elastic Agent. That way when we upgrade Kibana to X.Y.Z, we get exactly version X.Y.Z of the package installed, and the APM UI code can assume things about the data streams installed.

See also elastic/kibana#93194 (comment). Maybe let's continue discussing there?

@axw
Member

axw commented Mar 6, 2021

One caveat about the pipeline approach: metric labels get combined with other attributes, as can be seen in the example in the description. This muddies the waters a little bit about where exactly the labels (which also need to be translated) come from -- are they metric dimensions, or are they just some attributes of the application? This is a bit clearer in the APM Server code. Not a big deal though.

@felixbarny
Member

Another consideration is whether we should store the metrics in the original format, the translated format, or both.

  • Store only original format
    • Changes in UI needed to fall back to OTel metrics
    • 👍 No duplication
    • 👍 We store the metrics as documented by the OTel agent
    • 👎 No custom dashboards for both OTel and Elastic APM services
  • Store both the original and the translated format
    • Transform in APM Server or ingest pipeline, no changes in UI needed
    • 👍 We store the metrics as documented by the OTel agent
    • 👍 Custom dashboards for both OTel and Elastic APM services
    • 👎 Duplication of metrics
  • Store only the translated format
    • Transform in APM Server or ingest pipeline, no changes in UI needed
    • 👍 No duplication
    • 👍 Custom dashboards for both OTel and Elastic APM services
    • 👎 May be confusing for OTel users (why is the metric that OTel documents not present?)

I think we should rule out storing only in the translated format as it seems like a rather intrusive approach with the potential for confusion. It's also not a model that works well going forward. What if we want to translate more metrics? It could be considered a breaking change.

I don't have a strong opinion about options 1 and 2, both have their pros and cons, but I gravitate towards just storing the metrics in the original format and doing a fallback in the UI for OTel.

@sorenlouv
Member

I don't have any strong opinions on this. I'm leaning towards option 2 to keep things consistent, but if needed this can be handled by the UI.

@axw
Member

axw commented Mar 9, 2021

Good point about surprising users by changing metric names.

My preference would be option 2, store both by default. We can always provide guidance to users to add another ingest processor to drop duplicates if they don't want or need them.

@cyrille-leclerc
Contributor Author

On the original format, is it desired to have area="non_heap" and type="used" handled as labels the same way we handle process_executable_path or os_description?

Maybe the improvement is to have full ECS mapping for more attributes that are today handled as labels.

Reference in Prometheus:

runtime_jvm_memory_area{area="non_heap",type="used"} 3.7732072e+07

@axw
Member

axw commented Mar 10, 2021

@cyrille-leclerc I think it's desirable to have area="non_heap" and type="used" as labels, but not for process_executable_path etc. to be labels. The latter should be translated to ECS (#4714), we just haven't handled them yet. We record unhandled attributes as labels so they're still available to users as a fallback.

@felixbarny
Member

Another complication is that we can't just rename the metric, as the Java agent handles labels/dimensions differently from the OTel metrics.

The OTel metrics have area and type as a label.

runtime_jvm_memory_area{area="non_heap",type="used"}

But we're using it as part of the metric name itself.

jvm.memory.heap.used
jvm.memory.heap.committed
jvm.memory.heap.max
jvm.memory.non_heap.used
jvm.memory.non_heap.committed
jvm.memory.non_heap.max

Both work fine and there's nothing obviously wrong with either approach. In the philosophy followed by our agent, we'd only create a label for dimensions that are not well-known upfront. An example is the jvm.gc.count{name="<memory manager name>"} metric (docs). The names of the memory managers (GC algorithms) depend on the JVM version and settings. The UI would split the chart by creating a graph for each value of the label, by performing a terms aggregation on labels.name.
For heap and non_heap, all the values are well-known upfront. This makes it easier to show heap and non-heap metrics in different charts and to calculate the percentage of the usage, for example. We don't report metrics per non-heap memory pool (new gen, old gen, permgen etc) but if we did, we'd do it as a label/dimension.

The Prometheus metrics module for Hotspot has yet another variation. It falls in between the Java agent and the OTel agent by using a label for `area`, while `used`, `committed`, and `max` are part of the metric name:

jvm_memory_bytes_used{area="heap"} 2000000
jvm_memory_bytes_committed{area="nonheap"} 200000
jvm_memory_bytes_max{area="nonheap"} 2000000
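
To make the difference between the conventions concrete, here is a hypothetical sketch (names taken from the examples above, helper function invented for illustration) of translating an OTel label-based sample into the Elastic Java agent's flat metric name:

```python
# Translate an OTel-style metric (name + labels) into the Elastic Java agent's
# flat naming, e.g. runtime.jvm.memory.area{area=heap,type=used}
# -> jvm.memory.heap.used. Illustration only, not an actual mapping table.
def to_elastic_name(otel_name, labels):
    if otel_name == "runtime.jvm.memory.area":
        return f"jvm.memory.{labels['area']}.{labels['type']}"
    raise KeyError(f"no known translation for {otel_name}")

flat = to_elastic_name(
    "runtime.jvm.memory.area", {"area": "non_heap", "type": "used"}
)
```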

@cyrille-leclerc
Contributor Author

cyrille-leclerc commented Mar 10, 2021

@axw thanks, this plan looks good to me. I see that #4714 is marked in Zube as "backlog" rather than "ready". What's needed to move #4714 forward?

@felixbarny interesting lack of consistency on attributes versus metric name. I like the guideline on when to use labels in the Prometheus docs, and I don't find reasons to use labels for area:heap/non_heap or for type:used/committed :-) I don't see reasons to add/average/sum these metrics. I'm not even sure the rule of thumb "no part of a metric name should ever be procedurally generated (use labels instead)" resonates much with me for this case :-)

https://prometheus.io/docs/practices/instrumentation/#use-labels

@felixbarny
Member

Yes, I also like their guidelines. But it almost seems like they violate their own rules here :)

@axw
Member

axw commented Mar 11, 2021

@axw thanks, this plan looks good to me. I see that #4714 is marked in Zube as "backlog" rather than "ready". What's needed to move #4714 forward?

A 25th hour in each day would be a good start ;)

For 7.13 I think we can manage to translate the attributes for which there are existing analogues in our agents, e.g. process_executable_path, process_runtime_name, process_runtime_version. Other attributes may have to wait. Let's discuss further on that issue if needed.

@cyrille-leclerc
Contributor Author

A 25th hour in each day would be a good start ;)

@axw I wanted to understand if additional specification work was needed :-)

For 7.13 I think we can manage to translate the attributes for which there are existing analogues in our agents, e.g. process_executable_path, process_runtime_name, process_runtime_version. Other attributes may have to wait. Let's discuss further on that issue if needed.

Does it mean that we/I should prepare this mapping?
Should we align with @urso to have the same mapping used in the logs exporter for Elasticsearch and in the OTLP intake we have for traces and metrics?

@axw
Member

axw commented Mar 16, 2021

@cyrille-leclerc I've responded on #4714.

@axw
Member

axw commented Mar 16, 2021

@felixbarny @sqren and I just discussed this, and we will go with the duplicating ingest pipeline approach for now.

@axw
Member

axw commented Mar 22, 2021

I could do with some help defining the mapping.

The OpenTelemetry Java metrics code is at https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/runtime-metrics/library/src/main/java/io/opentelemetry/instrumentation/runtimemetrics

Our agent's metrics are documented at https://www.elastic.co/guide/en/apm/agent/java/current/metrics.html#metrics-jvm

Here's the mapping I've got so far, in an ingest pipeline:

apm_opentelemetry_metrics:                                                                               
  description: Populate Elastic APM metric fields from well-known OpenTelemetry metric counterparts      
  processors:                                                                                            
  - set:                                                                                                 
      # Copy `runtime.jvm.memory.area` (OpenTelemetry) to `jvm.memory.{area}.{type}` (Elastic APM).      
      field: jvm.memory.{{labels.area}}.{{labels.type}}                                                  
      copy_from: runtime.jvm.memory.area                                                                 
      if: ctx.runtime?.jvm?.memory?.area != null && ctx.labels?.area != null && ctx.labels?.type != null 
  - set:                                                                                                 
      # Copy `runtime.jvm.gc.collection` (OpenTelemetry) to `jvm.gc.time` (Elastic APM).                 
      # Both are defined in milliseconds.                                                                
      field: jvm.gc.time                                                                                 
      copy_from: runtime.jvm.gc.collection                                                               
      if: ctx.runtime?.jvm?.gc?.collection != null && ctx.labels?.gc != null                             
  - set:                                                                                                 
      # Copy `labels.gc` (OpenTelemetry) to `labels.name` (Elastic APM), for jvm.gc.time.                
      field: labels.name                                                                                 
      copy_from: labels.gc                                                                               
      override: false # don't replace existing labels.name field, if any                                 
      if: ctx.labels?.gc != null                                                                         

A couple of discrepancies:

  • There do not appear to be any OpenTelemetry metrics corresponding to our jvm.gc.count or jvm.gc.alloc.
  • There do not appear to be any Elastic APM metrics corresponding to OpenTelemetry's runtime.jvm.memory.pool.
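
The logic of the two `set` processors above can be sanity-checked without an Elasticsearch cluster by emulating them in a few lines of Python (a rough sketch only; real ingest-node `set` semantics, mustache template rendering, and Painless conditions differ in detail):

```python
# Emulate the pipeline's two copy steps on a metric document shaped like the
# sample in the issue description. Rough approximation for illustration.
def apply_pipeline(doc):
    labels = doc.get("labels", {})
    mem = doc.get("runtime", {}).get("jvm", {}).get("memory", {})
    # set: runtime.jvm.memory.area -> jvm.memory.{area}.{type}
    if mem.get("area") is not None and "area" in labels and "type" in labels:
        doc.setdefault("jvm", {}).setdefault("memory", {}) \
           .setdefault(labels["area"], {})[labels["type"]] = mem["area"]
    gc = doc.get("runtime", {}).get("jvm", {}).get("gc", {})
    # set: runtime.jvm.gc.collection -> jvm.gc.time, labels.gc -> labels.name
    if gc.get("collection") is not None and "gc" in labels:
        doc.setdefault("jvm", {})["gc"] = {"time": gc["collection"]}
        labels.setdefault("name", labels["gc"])  # override: false
    return doc

mem_doc = apply_pipeline({
    "runtime": {"jvm": {"memory": {"area": 37609512}}},
    "labels": {"area": "non_heap", "type": "used"},
})
gc_doc = apply_pipeline({
    "runtime": {"jvm": {"gc": {"collection": 22}}},
    "labels": {"gc": "G1 Young Generation"},
})
```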

@cyrille-leclerc
Contributor Author

Thanks @axw ,
Questions for @felixbarny:

@felixbarny
Member

👍 on both.
I can create a PR for OTel.

@cyrille-leclerc
Contributor Author

many thanks @felixbarny !

@felixbarny
Member

The PR has been merged. The metric runtime.jvm.gc.collection has been renamed to runtime.jvm.gc.time and runtime.jvm.gc.count has been added.

@axw
Member

axw commented Mar 24, 2021

Is it worth looking for both runtime.jvm.gc.collection and runtime.jvm.gc.time, or should we require users to upgrade their Java SDK?

@felixbarny
Member

I think it does make sense to look for both. It doesn't seem to add a lot of complexity, and currently there's no released version that includes the updated metrics.

An important thing to note:

we haven't declared stability of metric names, so it's ok for us to rename existing metrics

Once the metric names become stable we might want to require a specific version, depending on how complex it would be to support a wider range of versions. I.e. if there are not a lot of changes, we may just support all metric names, even from older versions. But if there's a lot of churn in the metric names and it would be complex to maintain that, I think it's fair to require a specific version.

@axw
Member

axw commented May 18, 2021

Unfortunately this is going to slip until after 7.13.0, as we found some issues with the approach we've taken: #4986 (comment) #4986 (comment)

I think this will end up moving to 7.14.0.
