[connector/spanmetrics] - Spanmetrics connector is not producing correct metrics #32043

Closed
ramanjaneyagupta opened this issue Mar 29, 2024 · 14 comments
Labels
connector/spanmetrics, documentation (Improvements or additions to documentation), needs triage (New item requiring triage), waiting for author

Comments

@ramanjaneyagupta

Component(s)

connector/spanmetrics

What happened?

Description

Setup: Agents -> Gateway (OTel Collectors) -> Storage.
The gateway consists of multiple servers that calculate the span metrics and export them to Prometheus.

Steps to Reproduce

Set up the spanmetrics connector and the Prometheus exporter.

  1. Start the collector and wait for some time for metrics to be generated.
  2. The behaviour is the same after a restart.

Expected Result

The spanmetrics connector should calculate the metrics correctly (calls_total, counts, histograms, etc.).

Actual Result

The spanmetrics connector is not producing correct results (some of the metrics just keep increasing).

Collector version

v0.96.0

Environment information

Environment

OS: Linux (RHEL)

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      http:
      grpc:

exporters:
  prometheus:
    endpoint: "localhost:8889"
    namespace: gateway.spanmetrics
    resource_to_telemetry_conversion:
      enabled: true 

connectors:
  spanmetrics:
    dimensions:
      - name: http.method
      - name: http.status_code
      - name: k8s.namespace.name
    exemplars:
      enabled: true
    dimensions_cache_size: 1000
    aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
    metrics_flush_interval: 15s
    resource_metrics_key_attributes:
      - service.name
      - telemetry.sdk.language
      - telemetry.sdk.name
processors:
  resourcedetection/system:
    detectors: ["system"]
    system:
      hostname_sources: ["os"]
  # batch is referenced in the traces and metrics pipelines below
  batch:
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resourcedetection/system, batch]
      exporters: [spanmetrics,tracebackend]
    metrics:
      receivers: [spanmetrics]
      processors: [resourcedetection/system, batch]
      exporters: [prometheus]

Log output

A few minutes after starting the OTel collector, all counter metrics and histograms keep increasing.

Additional context

No response

@ramanjaneyagupta added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Mar 29, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@Frapschen
Contributor

Frapschen commented Apr 2, 2024

Can you give an example to explain which part of the connector you think is not correct?

The log output you posted is not enough to debug with.

@portertech
Contributor

@ramanjaneyagupta I would expect the counters to grow from collector start (with AGGREGATION_TEMPORALITY_CUMULATIVE); this includes the histogram buckets (https://prometheus.io/docs/concepts/metric_types/#histogram). If you still believe the produced series are incorrect, please share more example data.

@ramanjaneyagupta
Author

ramanjaneyagupta commented Apr 5, 2024

Hi @portertech and @Frapschen, attached are the screenshots; it seems this AGGREGATION_TEMPORALITY_CUMULATIVE has some issue. My application metrics show different numbers (mostly correct, as they match my test results), but the span metrics show different numbers for the same HTTP calls. I would gather more details, but at this point I am not sure what else is needed to debug/verify. Please let me know any specific things you are expecting and I will try to get them. Thanks!
From App: Sum Metric
IMG_6014
From Span Metrics
IMG_6015

Rate for the same: from App
IMG_6016

Rate for the same: from Span Metrics
IMG_6017

@Frapschen
Contributor

@ramanjaneyagupta The query results for the span metrics are very strange; the calls_total metric should never go down. Can you query the calls_total metric without any functions?

@ankitpatel96
Contributor

ankitpatel96 commented Apr 17, 2024

One guess here is that Prometheus is scraping each of your collectors - but the collectors are reporting the same series, so Prometheus jumps around between them. I think #32042 has a similar issue (except that one uses remote write, so the symptoms are slightly different).

A clue is the shape of the graphs:
Your graphs seem to show a Prometheus series that is bouncing between different sum metrics - you can see that the metric seems to jump from collector to collector. Each collector has a different current sum, and Prometheus jumps from one collector's value to another's.

Here's an illustration of what I mean:
320056925-57f46ae1-98b2-4836-a1be-96aab9222f62

I've circled each disjoint section of the series. Each color represents a different collector that prometheus is scraping. The series starts off with a scrape at the red collector. Then, it switches to the orange collector. Then, it scrapes the red collector again. Then, it goes to the yellow collector etc. There's a similar pattern in your other graph.

Once again - this is just a guess. If you can confirm that each of these collectors is exporting the same series with the exact same labels, that would probably confirm this theory.
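
One way to check this (and to keep the series distinguishable) is to tag each gateway instance with its own resource attribute before exporting the generated metrics. A minimal sketch using the resource processor - the attribute name collector.name is hypothetical, and it assumes HOSTNAME is set on each gateway host:

processors:
  resource/collector-id:
    attributes:
      - key: collector.name      # hypothetical attribute; any per-instance unique value works
        value: ${env:HOSTNAME}   # assumes HOSTNAME identifies the gateway instance
        action: insert

service:
  pipelines:
    metrics:
      receivers: [spanmetrics]
      processors: [resource/collector-id, batch]
      exporters: [prometheus]

With resource_to_telemetry_conversion enabled on the prometheus exporter, the attribute becomes a metric label, so each collector exports its own calls_total series and queries then have to sum or rate across that label rather than expecting a single cumulative series.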

@ramanjaneyagupta
Author

ramanjaneyagupta commented Apr 18, 2024

Hi, we are running the OTel collectors as a gateway. All agent collectors and deployments running on VMs and Kubernetes send their data to this central gateway (a set of collectors); in the gateway we calculate the span metrics, and in another layer we apply tail sampling before sending the data to our storage.

So yes, in the gateway, server 1 and server 2 may receive parts of similar data from different instances or at different timestamps.

So if I am running as a gateway, is there a better way to calculate these metrics?

@ankitpatel96
Contributor

ankitpatel96 commented Apr 25, 2024

I don't know if this is the right venue for this discussion - it is turning into more of a question about how to deploy the collector and how to make this work with your datastore, rather than a bug report in the collector codebase. The CNCF Slack might be a better venue for such a discussion. That said, you may find the recommendations in https://opentelemetry.io/docs/collector/scaling/#how-to-scale helpful - there are several recommendations there for workloads similar to yours.
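
One pattern along the lines of those recommendations is a two-tier layout: a first tier that only load-balances by service name, and a second tier that runs the spanmetrics connector, so all spans of a given service are aggregated by exactly one collector. A rough sketch of the first-tier config using the loadbalancing exporter, with hypothetical layer-2 hostnames:

exporters:
  loadbalancing:
    routing_key: "service"     # route all spans of a service to the same backend
    protocol:
      otlp:
        tls:
          insecure: true       # assumes plain OTLP/gRPC between the tiers
    resolver:
      static:
        hostnames:             # hypothetical layer-2 collector endpoints
          - spanmetrics-collector-1:4317
          - spanmetrics-collector-2:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]

The layer-2 collectors would then run the spanmetrics connector and the prometheus exporter as in the original configuration, each owning a disjoint set of services.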

@ramanjaneyagupta
Author

But it is related to span metrics, and it clearly shows that in gateway mode the connector is not calculating metrics properly. Or at least, I think the documentation needs to better explain how to configure spanmetrics when it runs in gateway mode.

Configuring spanmetrics with the current documentation does not work properly when running in gateway mode.

@mx-psi added the question (Further information is requested) and documentation (Improvements or additions to documentation) labels and removed the bug (Something isn't working) and question (Further information is requested) labels on Apr 26, 2024
@mx-psi
Member

mx-psi commented Apr 26, 2024

But it is related to span metrics, and it clearly shows that in gateway mode the connector is not calculating metrics properly. Or at least, I think the documentation needs to better explain how to configure spanmetrics when it runs in gateway mode.

Configuring spanmetrics with the current documentation does not work properly when running in gateway mode.

Agreed that better docs would be helpful here!

@ramanjaneyagupta since this is a more general issue, not only pertaining to the spanmetrics connector but to any other component that produces telemetry without necessarily handling tags, I have filed open-telemetry/opentelemetry.io/issues/4368 and I am going to close this issue in favor of that one. We can work with the documentation team to improve the Collector documentation about this.

@mx-psi closed this as not planned on Apr 26, 2024
@ramanjaneyagupta
Author

Hi @mx-psi and @ankitpatel96, sorry to reopen the issue here - I still think there is some problem with span metrics.

I tried a couple of setups:

  1. Multiple agents -> Gateway (collectors) (spanmetrics) -> storage
    Prometheus Screenshot:
    WhatsApp Image 2024-04-26 at 4 39 45 PM

  2. Multiple Agents -> Gateway(collectors) service name based load balancing -> Layer2 (spanmetrics) -> storage
    Prometheus Screenshot:
    WhatsApp Image 2024-04-26 at 4 39 45 PM (1)

  3. Multiple Agents -> Gateway(Collectors) -> Tempo(Metrics Generator).
    Prometheus Screenshot:
    WhatsApp Image 2024-04-26 at 4 39 45 PM (2)

I am seeing correct results with the Tempo metrics generator but not with the spanmetrics connector.

@pingping95

Same issue.

When I use Tempo's metrics generator, the metrics are correct, but when I use the spanmetrics connector at the layer-2 OTel collector, the metrics look strange.

@vaibhhavv

Hi @mx-psi, @ankitpatel96, any update on this issue?
I am also facing the same issue: when we use the Tempo metrics generator, we see a graph with a monotonic increase, but with the spanmetrics connector the graph is not monotonic and behaves like the 2nd setup shown by @ramanjaneyagupta in the comment above.

@pingping95

@vaibhhavv

I have no issue now. See #33136 (comment).
