
Metric queueSize twice in Prometheus output #4382

Closed
cbos opened this issue Apr 14, 2022 · 4 comments · Fixed by #4386
Labels
Bug Something isn't working

Comments

cbos commented Apr 14, 2022

The Prometheus endpoint produces invalid output.

The Java application is started with:
OTEL_METRICS_EXPORTER=prometheus
OTEL_TRACES_EXPORTER=otlp
OTEL_LOGS_EXPORTER=otlp

After a while, the produced output contains this information:

# .... lines above this removed ....
# TYPE queueSize gauge
# HELP queueSize The number of logs queued
queueSize{logProcessorType="BatchLogProcessor"} 0.0 1649922810502
# TYPE processedLogs_total counter
# HELP processedLogs_total The number of logs processed by the BatchLogProcessor. [dropped=true if they were dropped due to high throughput]
processedLogs_total{dropped="false",logProcessorType="BatchLogProcessor"} 11.0 1649922810502
# TYPE runtime_jvm_gc_count_total counter
# HELP runtime_jvm_gc_count_total The number of collections that have occurred for a given JVM garbage collector.
runtime_jvm_gc_count_total{gc="Copy"} 230.0 1649922810502
runtime_jvm_gc_count_total{gc="MarkSweepCompact"} 13.0 1649922810502
# TYPE runtime_jvm_gc_time_total counter
# HELP runtime_jvm_gc_time_total Time spent in a given JVM garbage collector in milliseconds.
runtime_jvm_gc_time_total{gc="Copy"} 4150.0 1649922810502
runtime_jvm_gc_time_total{gc="MarkSweepCompact"} 6399.0 1649922810502
# TYPE otlp_exporter_seen_total counter
# HELP otlp_exporter_seen_total 
otlp_exporter_seen_total{type="log"} 11.0 1649922810502
otlp_exporter_seen_total{type="span"} 9371.0 1649922810502
# TYPE otlp_exporter_exported_total counter
# HELP otlp_exporter_exported_total 
otlp_exporter_exported_total{success="true",type="log"} 11.0 1649922810502
otlp_exporter_exported_total{success="true",type="span"} 9371.0 1649922810502
# TYPE processedSpans_total counter
# HELP processedSpans_total The number of spans processed by the BatchSpanProcessor. [dropped=true if they were dropped due to high throughput]
processedSpans_total{dropped="false",spanProcessorType="BatchSpanProcessor"} 9371.0 1649922810502
# TYPE queueSize gauge
# HELP queueSize The number of spans queued
queueSize{spanProcessorType="BatchSpanProcessor"} 1.0 1649922810502

We read the Prometheus endpoint with Telegraf, and we get this error:

[inputs.prometheus] Error in plugin: error reading metrics for http://localhost:9088/metrics: reading text format failed: text format parsing error in line 115: second TYPE line for metric name "queueSize", or TYPE reported after samples

The queueSize metric appears twice in the output, once for logs and once for spans.

This should be grouped together, like this:

# TYPE queueSize gauge
# HELP queueSize The number of logs queued
queueSize{logProcessorType="BatchLogProcessor"} 0.0 1649922810502
queueSize{spanProcessorType="BatchSpanProcessor"} 1.0 1649922810502

But it now appears as two separate metrics, which is not valid.

cbos added the Bug (Something isn't working) label Apr 14, 2022
@mateuszrzeszutek
Member

I believe that's a problem with how BatchSpanProcessor and BatchLogProcessor use the metrics API (same instrument name, different description) - @anuraaga @jkwatson, can you move this issue over to the SDK repo?

@jkwatson jkwatson transferred this issue from open-telemetry/opentelemetry-java-instrumentation Apr 14, 2022
@jkwatson
Contributor

@jack-berg duplicate async callbacks here. What's the right solution to this?

@jack-berg
Member

jack-berg commented Apr 14, 2022

@mateuszrzeszutek is sort of correct. BatchLogProcessor adds queueSize under the io.opentelemetry.sdk.logs meter, while BatchSpanProcessor adds queueSize under the io.opentelemetry.sdk.traces meter. This is perfectly acceptable in the OTel data model but presents problems in Prometheus.
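
For illustration, a minimal, hypothetical sketch of how the collision arises (not the actual SDK internals): two meters each register an observable gauge named queueSize with a different description, which is valid in the OTel data model but yields two TYPE/HELP blocks for the same metric name in the Prometheus exposition format.

    import io.opentelemetry.api.metrics.Meter;
    import io.opentelemetry.sdk.metrics.SdkMeterProvider;

    public class QueueSizeCollision {
      public static void main(String[] args) {
        SdkMeterProvider meterProvider = SdkMeterProvider.builder().build();

        // A meter scoped to the log SDK registers queueSize with one description...
        Meter logsMeter = meterProvider.get("io.opentelemetry.sdk.logs");
        logsMeter
            .gaugeBuilder("queueSize")
            .setDescription("The number of logs queued")
            .buildWithCallback(measurement -> measurement.record(0));

        // ...and a meter scoped to the trace SDK registers the same instrument
        // name with a different description.
        Meter tracesMeter = meterProvider.get("io.opentelemetry.sdk.traces");
        tracesMeter
            .gaugeBuilder("queueSize")
            .setDescription("The number of spans queued")
            .buildWithCallback(measurement -> measurement.record(1));
      }
    }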

The spec is unclear on how meter name / version manifest in the Prometheus data model. The closed spec issue #2035 sheds some light on the discussion that took place around this issue, but gives no definitive answer.

We could resolve this in the short term by treating the BatchLogProcessor and BatchSpanProcessor instruments as part of the same meter, using the same description, and adding an attribute for the type of data being processed.

This is an important issue to address in a general sense though: an application with two instrumented HTTP clients recording http.client.duration would produce the same problem.

@jack-berg
Member

You can also get around this in the short term by configuring the view API to drop metrics named queueSize:

    SdkMeterProvider.builder()
        .registerView(
            InstrumentSelector.builder().setName("queueSize").build(),
            View.builder().setAggregation(Aggregation.drop()).build())
        .build();
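
Note that a view registered this way applies to the whole SdkMeterProvider, so the queueSize data is dropped from all readers, not just the Prometheus endpoint; it sidesteps the parse error at the cost of losing the queue-depth measurement until the underlying instrument registration is fixed.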
