[connector/spanmetrics] - Spanmetrics connector is not producing correct metrics #32043
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Can you give an example to explain which part of the connector you think is not correct?
@ramanjaneyagupta I would expect the counters to grow from collector start (with AGGREGATION_TEMPORALITY_CUMULATIVE); this includes the histogram buckets https://prometheus.io/docs/concepts/metric_types/#histogram. If you still believe the produced series to be incorrect, please share more example data.
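As an aside, with cumulative temporality the raw counters are expected to grow monotonically from collector start, so a rate query is usually what you want to graph. A hedged sketch (the metric and label names `calls_total`, `service_name`, and `span_name` assume the spanmetrics connector defaults):

```promql
# Request rate per service/span, derived from the cumulative calls_total counter.
# Label names (service_name, span_name) assume spanmetrics connector defaults.
sum by (service_name, span_name) (rate(calls_total[5m]))
```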
Hi @portertech and @Frapschen, attached are screenshots; it seems this AGGREGATION_TEMPORALITY_CUMULATIVE has some issue. My application metrics show different numbers (mostly correct, as they match my test results), but the span metrics show different numbers for the same HTTP calls. I would like to get more details, but at this point I am not sure what else needs to be debugged/verified. Please let me know any specific things you are expecting and I will try to get them. Thanks!
@ramanjaneyagupta The query results of the span metrics are very strange. I think the calls_total metric should never go down; can you query calls_total without any functions?
One guess here is that Prometheus is scraping each of your collectors, but the collectors are reporting the same series, so Prometheus jumps around between them. I think #32042 has a similar issue (except that one is a remote write, so the symptoms are slightly different). A clue is the shape of the graphs. Here's an illustration of what I mean: I've circled each disjoint section of the series. Each color represents a different collector that Prometheus is scraping. The series starts off with a scrape at the red collector. Then it switches to the orange collector. Then it scrapes the red collector again. Then it goes to the yellow collector, etc. There's a similar pattern in your other graph. Once again, this is just a guess. If you can confirm that each of these collectors is exporting the same series with the exact same labels, that would probably confirm this theory.
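One way to check this guess is to query the raw series while keeping Prometheus's per-target labels, so each collector's copy shows up separately. A sketch (the `job` label value is an assumption about your scrape config):

```promql
# Raw calls_total series, one per scrape target; if the only differing label
# across the results is "instance", the collectors are exporting identical series.
calls_total{job="otel-gateway"}
```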
Hi, we are running OTel Collectors as a gateway. All agent collectors and deployments running on the VMs and Kubernetes send their data to this central gateway (a set of collectors). In the gateway we calculate span metrics, and in another layer apply tail sampling before sending the data to our storage. So yes, in the gateway, server 1 and server 2 may receive parts of similar data from different instances or at different timestamps. If I am running as a gateway, is there any better way to calculate these metrics?
I don't know if this is the right venue for this discussion - this is turning into more of a question on how to deploy the collector / how to make this work with your datastore rather than a bug report in the collector codebase. The CNCF slack might be a better venue for such a discussion. That being said - you may find the recommendations in https://opentelemetry.io/docs/collector/scaling/#how-to-scale helpful - there are several recommendations there for workloads similar to yours.
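For this kind of fan-in topology, one pattern from the scaling docs is to put a routing tier in front of the spanmetrics tier, so all spans for a given service land on the same collector instance and its counters stay consistent. A hedged sketch using the contrib loadbalancing exporter (the DNS hostname is hypothetical, and `routing_key: service` support depends on your distribution and version):

```yaml
exporters:
  loadbalancing:
    routing_key: service        # keep each service's spans on one backend collector
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: spanmetrics-tier.example.internal  # hypothetical DNS name

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```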
But it is related to span metrics, and it clearly shows that in gateway mode the metrics are not calculated properly. Or at least, I think it needs better documentation on configuring spanmetrics when it runs in gateway mode, as configuring spanmetrics with the current documentation does not work properly in gateway mode.
Agreed that better docs would be helpful here! @ramanjaneyagupta since this is a more general issue, not only pertaining to the spanmetrics connector but to any other component that produces telemetry without necessarily handling tags, I have filed open-telemetry/opentelemetry.io/issues/4368 and I am going to close this issue in favor of that one. We can work with the documentation team to improve the Collector documentation about this. |
Hi @mx-psi and @ankitpatel96, sorry to reopen the issue here. As I am thinking there is some problem with span metrics, I tried a couple of approaches:
I am seeing correct results with Tempo Metrics Generator but not with Span Metrics Connector. |
Same issue. When I use Tempo's metrics generator, the metrics are correct, but when I use the spanmetrics connector at the layer-2 OTel Collector, the metrics are strange.
Hi @mx-psi, @ankitpatel96 any update on this issue? |
I have no issue now; see below.
Component(s)
connector/spanmetrics
What happened?
Description
Setup: Agents -> Gateway(OtelCollectors) -> Storage.
The gateway contains multiple servers which calculate the span metrics and export them to Prometheus.
Steps to Reproduce
Set up the spanmetrics connector and the Prometheus exporter.
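A minimal configuration matching these steps might look like the following (the endpoint and receiver choice are illustrative, not taken from the reporter's actual config):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

connectors:
  spanmetrics:

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"   # scraped by Prometheus

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics]
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheus]
```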
Expected Result
The spanmetrics connector should calculate metrics properly (calls_total, counts, histograms, etc.).
Actual Result
Spanmetrics is not producing correct results (some of the metrics keep increasing).
Collector version
v0.96.0
Environment information
Environment
OS: Linux (RHEL)
OpenTelemetry Collector configuration
Log output
Additional context
No response