
Visualize JVM health indicators collected by the OpenTelemetry Java Agent #4919

Closed
cyrille-leclerc opened this issue Mar 4, 2021 · 24 comments · Fixed by #4986

@cyrille-leclerc
Contributor

APM Server version (apm-server version):

Description of the problem including expected versus actual behavior:

Steps to reproduce:

https://github.com/cyrille-leclerc/my-shopping-cart/tree/39d22e715d96e15d5540ea28eca007de9d3edab8

JVM Runtime Metrics reported by the OpenTelemetry Collector Prometheus Exporter

# HELP runtime_jvm_gc_collection Time spent in a given JVM garbage collector in milliseconds.
# TYPE runtime_jvm_gc_collection counter
runtime_jvm_gc_collection{gc="G1 Old Generation"} 0
runtime_jvm_gc_collection{gc="G1 Young Generation"} 22
# HELP runtime_jvm_memory_area Bytes of a given JVM memory area.
# TYPE runtime_jvm_memory_area counter
runtime_jvm_memory_area{area="heap",type="committed"} 5.6623104e+07
runtime_jvm_memory_area{area="heap",type="max"} 4.294967296e+09
runtime_jvm_memory_area{area="heap",type="used"} 1.4552776e+07
runtime_jvm_memory_area{area="non_heap",type="committed"} 4.1222144e+07
runtime_jvm_memory_area{area="non_heap",type="used"} 3.7732072e+07
# HELP runtime_jvm_memory_pool Bytes of a given JVM memory pool.
# TYPE runtime_jvm_memory_pool counter
runtime_jvm_memory_pool{pool="CodeHeap 'non-nmethods'",type="committed"} 2.555904e+06
runtime_jvm_memory_pool{pool="CodeHeap 'non-nmethods'",type="max"} 5.840896e+06
runtime_jvm_memory_pool{pool="CodeHeap 'non-nmethods'",type="used"} 1.288832e+06
runtime_jvm_memory_pool{pool="CodeHeap 'non-profiled nmethods'",type="committed"} 2.555904e+06
runtime_jvm_memory_pool{pool="CodeHeap 'non-profiled nmethods'",type="max"} 1.22908672e+08
runtime_jvm_memory_pool{pool="CodeHeap 'non-profiled nmethods'",type="used"} 1.288448e+06
runtime_jvm_memory_pool{pool="CodeHeap 'profiled nmethods'",type="committed"} 6.750208e+06
runtime_jvm_memory_pool{pool="CodeHeap 'profiled nmethods'",type="max"} 1.22908672e+08
runtime_jvm_memory_pool{pool="CodeHeap 'profiled nmethods'",type="used"} 6.70784e+06
runtime_jvm_memory_pool{pool="Compressed Class Space",type="committed"} 3.407872e+06
runtime_jvm_memory_pool{pool="Compressed Class Space",type="max"} 1.073741824e+09
runtime_jvm_memory_pool{pool="Compressed Class Space",type="used"} 3.266448e+06
runtime_jvm_memory_pool{pool="G1 Eden Space",type="committed"} 2.9360128e+07
runtime_jvm_memory_pool{pool="G1 Eden Space",type="used"} 2.097152e+06
runtime_jvm_memory_pool{pool="G1 Old Gen",type="committed"} 2.5165824e+07
runtime_jvm_memory_pool{pool="G1 Old Gen",type="max"} 4.294967296e+09
runtime_jvm_memory_pool{pool="G1 Old Gen",type="used"} 1.0550784e+07
runtime_jvm_memory_pool{pool="G1 Survivor Space",type="committed"} 2.097152e+06
runtime_jvm_memory_pool{pool="G1 Survivor Space",type="used"} 1.90484e+06
runtime_jvm_memory_pool{pool="Metaspace",type="committed"} 2.5952256e+07
runtime_jvm_memory_pool{pool="Metaspace",type="used"} 2.5180024e+07
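
Each line of the exposition above follows the Prometheus text format: `name{labels} value`. A minimal parsing sketch (hypothetical helper, not part of the collector; it ignores escaping, `# HELP`/`# TYPE` lines, timestamps, and commas inside label values):

```python
import re

# Match one Prometheus text-format sample line: name, optional {labels}, value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)

def parse_sample(line):
    """Parse a single sample line into (name, labels dict, float value)."""
    m = SAMPLE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = {}
    if m.group("labels"):
        # Naive split: breaks on commas inside quoted label values.
        for pair in m.group("labels").split(","):
            key, _, val = pair.partition("=")
            labels[key.strip()] = val.strip().strip('"')
    return m.group("name"), labels, float(m.group("value"))

name, labels, value = parse_sample(
    'runtime_jvm_memory_area{area="non_heap",type="used"} 3.7732072e+07'
)
```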

Sample Elastic APM Metrics document for runtime.jvm.memory.area{area="non_heap",type="used"}

{
  "_index": "apm-7.12.0-metric-000001",
  "_type": "_doc",
  "_id": "ug4U_ncBlMla0ZqF00Vo",
  "_version": 1,
  "_score": null,
  "_source": {
    "agent": {
      "name": "opentelemetry/java",
      "version": "0.17.0"
    },
    "process": {
      "pid": 4990
    },
    "runtime": {
      "jvm": {
        "memory": {
          "area": 37609512
        }
      }
    },
    "processor": {
      "name": "metric",
      "event": "metric"
    },
    "labels": {
      "process_runtime_description": "AdoptOpenJDK OpenJDK 64-Bit Server VM 15.0.2+7",
      "area": "non_heap",
      "os_type": "DARWIN",
      "process_executable_path": "/Library/Java/JavaVirtualMachines/adoptopenjdk-15.jdk/Contents/Home:bin:java",
      "process_runtime_name": "OpenJDK Runtime Environment",
      "os_description": "Mac OS X 10.16",
      "service_namespace": "com-shoppingcart",
      "process_command_line": "/Library/Java/JavaVirtualMachines/adoptopenjdk-15.jdk/Contents/Home:bin:java -javaagent:./../.otel/opentelemetry-javaagent-all-0.17.0.jar -Dio.opentelemetry.auto.slf4j.simpleLogger.defaultLogLevel=info",
      "type": "used",
      "process_runtime_version": "15.0.2+7",
      "telemetry_auto_version": "0.17.0"
    },
    "observer": {
      "hostname": "MacBook-Pro.localdomain",
      "id": "f9c7166f-0f12-4fc4-8c1d-3c137d9b382a",
      "type": "apm-server",
      "ephemeral_id": "6e8d9a14-4392-49f5-b73a-e2df869bfc33",
      "version": "7.12.0",
      "version_major": 7
    },
    "@timestamp": "2021-03-04T16:31:52.663Z",
    "ecs": {
      "version": "1.6.0"
    },
    "service": {
      "environment": "staging",
      "name": "monitor",
      "language": {
        "name": "java"
      },
      "version": "1.0-SNAPSHOT"
    },
    "event": {
      "ingested": "2021-03-04T16:31:53.702349Z"
    }
  },
  "fields": {
    "event.ingested": [
      "2021-03-04T16:31:53.702Z"
    ],
    "@timestamp": [
      "2021-03-04T16:31:52.663Z"
    ]
  },
  "sort": [
    1614875512663
  ]
}
@axw
Member

axw commented Mar 5, 2021

@cyrille-leclerc I'm relabelling this as "enhancement", since we previously never intended to translate metrics. I think it's reasonable to do, but I haven't yet checked feasibility.

@axw axw added enhancement and removed bug labels Mar 5, 2021
@graphaelli
Member

@axw @sqren @felixbarny Any thoughts on doing this in apm server versus handling it elsewhere, such as an ingest pipeline or the UI? I see the benefit of handling it via apm server, but I'm also concerned about putting that responsibility there. The benefit of an ingest pipeline is that it could be updated out of cycle with the stack, eventually via a simple package update. I think the UI suffers from the same complexity as putting this in the server, but I'm interested in your thoughts.

@axw
Member

axw commented Mar 6, 2021

@graphaelli IMO it's a job for our ingest pipeline. The server doesn't need to perform any computation on the metrics, so there's no real benefit to doing it there. I expect doing it in the ingest pipeline would help keep the UI code cleaner and more efficient, but @sqren would know better.

I'd like to pick on this:

The benefit of ingest pipeline is that it could be updated out of cycle with the stack, eventually via a simple package update.

I'd rather like to think of the package as just another part of each release, and keep them in sync. IMO, ideally the package would be bundled with Kibana, just like we bundle APM Server with Elastic Agent. That way when we upgrade Kibana to X.Y.Z, we get exactly version X.Y.Z of the package installed, and the APM UI code can assume things about the data streams installed.

See also elastic/kibana#93194 (comment). Maybe let's continue discussing there?

@axw
Member

axw commented Mar 6, 2021

One caveat about the pipeline approach: metric labels get combined with other attributes, as can be seen in the example in the description. This muddies the waters a little bit about where exactly the labels (which also need to be translated) come from -- are they metric dimensions, or are they just some attributes of the application? This is a bit clearer in the APM Server code. Not a big deal though.

@felixbarny
Member

Another consideration is whether we should store the metrics in the original format, the translated format, or both.

  • Store only original format
    • Changes in UI needed to fall back to OTel metrics
    • 👍 No duplication
    • 👍 We store the metrics as documented by the OTel agent
    • 👎 No custom dashboards for both OTel and Elastic APM services
  • Store both the original and the translated format
    • Transform in APM Server or ingest pipeline, no changes in UI needed
    • 👍 We store the metrics as documented by the OTel agent
    • 👍 Custom dashboards for both OTel and Elastic APM services
    • 👎 Duplication of metrics
  • Store only the translated format
    • Transform in APM Server or ingest pipeline, no changes in UI needed
    • 👍 No duplication
    • 👍 Custom dashboards for both OTel and Elastic APM services
    • 👎 May be confusing for OTel users (why is the metric that OTel documents not present?)

I think we should rule out storing only in the translated format as it seems like a rather intrusive approach with the potential for confusion. It's also not a model that works well going forward. What if we want to translate more metrics? It could be considered a breaking change.

I don't have a strong opinion about options 1 and 2, both have their pros and cons, but I gravitate towards just storing the metrics in the original format and doing a fallback in the UI for OTel.

@sorenlouv
Member

I don't have any strong opinions on this. I'm leaning towards option 2 to keep things consistent, but if needed this can be handled by the UI.

@axw
Member

axw commented Mar 9, 2021

Good point about surprising users by changing metric names.

My preference would be option 2, store both by default. We can always provide guidance to users to add another ingest processor to drop duplicates if they don't want or need them.

@cyrille-leclerc
Contributor Author

On the original format, is it desired to have area="non_heap" and type="used" handled as labels the same way we handle process_executable_path or os_description?

Maybe the improvement is to have full ECS mapping for more attributes that are today handled as labels.

Reference in Prometheus:

runtime_jvm_memory_area{area="non_heap",type="used"} 3.7732072e+07

@axw
Member

axw commented Mar 10, 2021

@cyrille-leclerc I think it's desirable to have area="non_heap" and type="used" as labels, but not for process_executable_path etc. to be labels. The latter should be translated to ECS (#4714), we just haven't handled them yet. We record unhandled attributes as labels so they're still available to users as a fallback.

@felixbarny
Member

Another complication is that we can't just rename the metric, as the Java agent handles labels/dimensions differently from the OTel metrics.

The OTel metrics have area and type as a label.

runtime_jvm_memory_area{area="non_heap",type="used"}

But we're using it as part of the metric name itself.

jvm.memory.heap.used
jvm.memory.heap.committed
jvm.memory.heap.max
jvm.memory.non_heap.used
jvm.memory.non_heap.committed
jvm.memory.non_heap.max

Both work fine and there's nothing obviously wrong with either approach. In the philosophy followed by our agent, we'd only create a label for dimensions that are not well-known upfront. An example is the jvm.gc.count{name="<memory manager name>"} metric (docs). The names of the memory managers (GC algorithms) depend on the JVM version and settings. The UI would split the chart by creating a graph for each value of the label, by performing a terms aggregation on labels.name.
For heap and non_heap, all the values are well-known upfront. This makes it easier to show heap and non-heap metrics in different charts and to calculate the percentage of the usage, for example. We don't report metrics per non-heap memory pool (new gen, old gen, permgen etc) but if we did, we'd do it as a label/dimension.

The Prometheus metrics module for Hotspot has yet another variation. It falls in between the Java agent and the OTel agent by using a label for `area`, while `used`, `committed`, and `max` are part of the metric name:

jvm_memory_bytes_used{area="heap"} 2000000
jvm_memory_bytes_committed{area="nonheap"} 200000
jvm_memory_bytes_max{area="nonheap"} 2000000
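
To make the difference between the conventions concrete, here is a hypothetical sketch (names taken from the examples above, helper function invented for illustration) of translating an OTel label-based sample into the Elastic Java agent's flat metric name:

```python
# Translate an OTel-style metric (name + labels) into the Elastic Java agent's
# flat naming, e.g. runtime.jvm.memory.area{area=heap,type=used}
# -> jvm.memory.heap.used. Illustration only, not an actual mapping table.
def to_elastic_name(otel_name, labels):
    if otel_name == "runtime.jvm.memory.area":
        return f"jvm.memory.{labels['area']}.{labels['type']}"
    raise KeyError(f"no known translation for {otel_name}")

flat = to_elastic_name(
    "runtime.jvm.memory.area", {"area": "non_heap", "type": "used"}
)
```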

@cyrille-leclerc
Contributor Author

cyrille-leclerc commented Mar 10, 2021

@axw thanks, this plan looks good to me. I see that #4714 is marked in Zube as "backlog" rather than "ready". What's needed to move #4714 forward?

@felixbarny interesting lack of consistency on attributes versus metric name. I like the guideline on when to use labels in the Prometheus docs, and I don't find reasons to use labels for area:heap/non_heap or for type:used/committed :-) I don't see reasons to add/average/sum these metrics. I'm not even sure the rule of thumb "no part of a metric name should ever be procedurally generated (use labels instead)" resonates much with me for this case :-)

https://prometheus.io/docs/practices/instrumentation/#use-labels

@felixbarny
Member

Yes, I also like their guidelines. But it almost seems like they violate their own rules here :)

@axw
Member

axw commented Mar 11, 2021

@axw thanks, this plan looks good to me. I see that #4714 is marked in Zube as "backlog" rather than "ready". What's needed to move #4714 forward?

A 25th hour in each day would be a good start ;)

For 7.13 I think we can manage to translate the attributes for which there are existing analogues in our agents, e.g. process_executable_path, process_runtime_name, process_runtime_version. Other attributes may have to wait. Let's discuss further on that issue if needed.

@cyrille-leclerc
Contributor Author

A 25th hour in each day would be a good start ;)

@axw I wanted to understand if additional specification work was needed :-)

For 7.13 I think we can manage to translate the attributes for which there are existing analogues in our agents, e.g. process_executable_path, process_runtime_name, process_runtime_version. Other attributes may have to wait. Let's discuss further on that issue if needed.

Does it mean that we/I should prepare this mapping?
Should we align with @urso to have the same mapping used in the logs exporter for Elasticsearch and in the OTLP intake we have for traces and metrics?

@axw
Member

axw commented Mar 16, 2021

@cyrille-leclerc I've responded on #4714.

@axw
Member

axw commented Mar 16, 2021

@felixbarny @sqren and I just discussed this, and we will go with the duplicating ingest pipeline approach for now.

@axw
Member

axw commented Mar 22, 2021

I could do with some help defining the mapping.

The OpenTelemetry Java metrics code is at https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/runtime-metrics/library/src/main/java/io/opentelemetry/instrumentation/runtimemetrics

Our agent's metrics are documented at https://www.elastic.co/guide/en/apm/agent/java/current/metrics.html#metrics-jvm

Here's the mapping I've got so far, in an ingest pipeline:

apm_opentelemetry_metrics:                                                                               
  description: Populate Elastic APM metric fields from well-known OpenTelemetry metric counterparts      
  processors:                                                                                            
  - set:                                                                                                 
      # Copy `runtime.jvm.memory.area` (OpenTelemetry) to `jvm.memory.{area}.{type}` (Elastic APM).      
      field: jvm.memory.{{labels.area}}.{{labels.type}}                                                  
      copy_from: runtime.jvm.memory.area                                                                 
      if: ctx.runtime?.jvm?.memory?.area != null && ctx.labels?.area != null && ctx.labels?.type != null 
  - set:                                                                                                 
      # Copy `runtime.jvm.gc.collection` (OpenTelemetry) to `jvm.gc.time` (Elastic APM).                 
      # Both are defined in milliseconds.                                                                
      field: jvm.gc.time                                                                                 
      copy_from: runtime.jvm.gc.collection                                                               
      if: ctx.runtime?.jvm?.gc?.collection != null && ctx.labels?.gc != null                             
  - set:                                                                                                 
      # Copy `labels.gc` (OpenTelemetry) to `labels.name` (Elastic APM), for jvm.gc.time.                
      field: labels.name                                                                                 
      copy_from: labels.gc                                                                               
      override: false # don't replace existing labels.name field, if any                                 
      if: ctx.labels?.gc != null                                                                         

A couple of discrepancies:

  • There do not appear to be any OpenTelemetry metrics corresponding to our jvm.gc.count or jvm.gc.alloc.
  • There do not appear to be any Elastic APM metrics corresponding to OpenTelemetry's runtime.jvm.memory.pool.
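
The logic of the two `set` processors above can be sanity-checked without an Elasticsearch cluster by emulating them in a few lines of Python (a rough sketch only; real ingest-node `set` semantics, mustache template rendering, and Painless conditions differ in detail):

```python
# Emulate the pipeline's two copy steps on a metric document shaped like the
# sample in the issue description. Rough approximation for illustration.
def apply_pipeline(doc):
    labels = doc.get("labels", {})
    mem = doc.get("runtime", {}).get("jvm", {}).get("memory", {})
    # set: runtime.jvm.memory.area -> jvm.memory.{area}.{type}
    if mem.get("area") is not None and "area" in labels and "type" in labels:
        doc.setdefault("jvm", {}).setdefault("memory", {}) \
           .setdefault(labels["area"], {})[labels["type"]] = mem["area"]
    gc = doc.get("runtime", {}).get("jvm", {}).get("gc", {})
    # set: runtime.jvm.gc.collection -> jvm.gc.time, labels.gc -> labels.name
    if gc.get("collection") is not None and "gc" in labels:
        doc.setdefault("jvm", {})["gc"] = {"time": gc["collection"]}
        labels.setdefault("name", labels["gc"])  # override: false
    return doc

mem_doc = apply_pipeline({
    "runtime": {"jvm": {"memory": {"area": 37609512}}},
    "labels": {"area": "non_heap", "type": "used"},
})
gc_doc = apply_pipeline({
    "runtime": {"jvm": {"gc": {"collection": 22}}},
    "labels": {"gc": "G1 Young Generation"},
})
```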

@cyrille-leclerc
Contributor Author

Thanks @axw ,
Questions for @felixbarny:

@felixbarny
Member

👍 on both.
I can create a PR for OTel.

@cyrille-leclerc
Contributor Author

many thanks @felixbarny !

@felixbarny
Member

The PR has been merged. The metric runtime.jvm.gc.collection has been renamed to runtime.jvm.gc.time and runtime.jvm.gc.count has been added.

@axw
Member

axw commented Mar 24, 2021

Is it worth looking for both runtime.jvm.gc.collection and runtime.jvm.gc.time, or should we require users to upgrade their Java SDK?

@felixbarny
Member

I think it does make sense to look for both. It doesn't seem to add a lot of complexity, and currently there's no released version that includes the updated metrics.

An important thing to note:

we haven't declared stability of metric names, so it's ok for us to rename existing metrics

Once the metric names become stable we might want to require a specific version, depending on how complex it would be to support a wider range of versions. I.e. if there are not a lot of changes, we may just support all metric names, even from older versions. But if there's a lot of churn in the metric names and it would be complex to maintain that, I think it's fair to require a specific version.

@axw
Member

axw commented May 18, 2021

Unfortunately this is going to slip until after 7.13.0, as we found some issues with the approach we've taken: #4986 (comment) #4986 (comment)

I think this will end up moving to 7.14.0.
