Histogram distributions are not being exported to Datadog #6129

Closed
emilgelman opened this issue Nov 4, 2021 · 4 comments
Labels: bug (Something isn't working) · data:metrics (Metric related issues) · exporter/datadog (Datadog components)

Comments

@emilgelman

Using otel/opentelemetry-collector-contrib:0.38.0
I'm trying to export a Histogram instrument using HistogramDistribution, and it looks like it's not being exported to Datadog.

Collector configuration:

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:55681


exporters:
  datadog:
    use_resource_metadata: false
    env: "local"
    send_metadata: false
    api:
      key: ${DD_API_KEY}
      site: datadoghq.eu
    metrics:
      send_monotonic_counter: false
      histograms:
        mode: distributions
  logging:
    loglevel: debug
service:
  pipelines:
    metrics:
      receivers: [ otlp ]
      exporters: [ datadog, logging ]

Example to reproduce:

package main

import (
	"context"
	"go.opentelemetry.io/otel/sdk/metric/aggregator/histogram"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
	"go.opentelemetry.io/otel/metric"
	"go.opentelemetry.io/otel/metric/global"
	controller "go.opentelemetry.io/otel/sdk/metric/controller/basic"
	processor "go.opentelemetry.io/otel/sdk/metric/processor/basic"
	"go.opentelemetry.io/otel/sdk/metric/selector/simple"
)

func main() {
	ctx := context.Background()
	client := otlpmetrichttp.NewClient(otlpmetrichttp.WithInsecure(), otlpmetrichttp.WithEndpoint("0.0.0.0:55681"))
	exp, err := otlpmetric.New(ctx, client)
	if err != nil {
		log.Fatalf("Failed to create the collector exporter: %v", err)
	}
	defer func() {
		ctx, cancel := context.WithTimeout(ctx, time.Second)
		defer cancel()
		if err := exp.Shutdown(ctx); err != nil {
			otel.Handle(err)
		}
	}()

	pusher := controller.New(
		processor.NewFactory(
			simple.NewWithHistogramDistribution(histogram.WithExplicitBoundaries([]float64{0, 5000, 10000, 15000, 20000})),
			exp,
		),
		controller.WithExporter(exp),
		controller.WithCollectTimeout(time.Second),
	)
	global.SetMeterProvider(pusher)

	if err := pusher.Start(ctx); err != nil {
		log.Fatalf("could not start metric controoler: %v", err)
	}

	defer func() {
		ctx, cancel := context.WithTimeout(ctx, time.Second)
		defer cancel()
		// pushes any last exports to the receiver
		if err := pusher.Stop(ctx); err != nil {
			otel.Handle(err)
		}
	}()

	meter := global.Meter("test-meter")
	h := metric.Must(meter).NewFloat64Histogram("test.histograms.distributions")

	for i := 0; i < 5; i++ {
		h.Record(ctx, float64(5000*i))
	}
}

stdout exporter output:

2021-11-04T10:47:32.519Z        DEBUG   loggingexporter/logging_exporter.go:66  ResourceMetrics #0
Resource labels:
     -> service.name: STRING(unknown_service)
     -> telemetry.sdk.language: STRING(go)
     -> telemetry.sdk.name: STRING(opentelemetry)
     -> telemetry.sdk.version: STRING(1.1.0)
InstrumentationLibraryMetrics #0
InstrumentationLibrary test-meter 
Metric #0
Descriptor:
     -> Name: test.histograms.distributions
     -> Description: 
     -> Unit: 
     -> DataType: Histogram
     -> AggregationTemporality: AGGREGATION_TEMPORALITY_CUMULATIVE
HistogramDataPoints #0
StartTimestamp: 2021-11-04 10:47:32.487976 +0000 UTC
Timestamp: 2021-11-04 10:47:32.488012 +0000 UTC
Count: 5
Sum: 50000.000000
ExplicitBounds #0: 0.000000
ExplicitBounds #1: 5000.000000
ExplicitBounds #2: 10000.000000
ExplicitBounds #3: 15000.000000
ExplicitBounds #4: 20000.000000
Buckets #0, Count: 0
Buckets #1, Count: 1
Buckets #2, Count: 1
Buckets #3, Count: 1
Buckets #4, Count: 1
Buckets #5, Count: 1

@mx-psi thanks for looking into it!

@emilgelman emilgelman added the bug Something isn't working label Nov 4, 2021
@mx-psi mx-psi added exporter/datadog Datadog components data:metrics Metric related issues labels Nov 4, 2021
@mx-psi
Member

mx-psi commented Nov 5, 2021

Hey there @emilgelman, thanks for the report!

Similar to what we discussed on #5378 (comment), OpenTelemetry instrumentation libraries generate cumulative histograms by default. Since Datadog is delta-based and we can't currently indicate when a 'reset' happens, we calculate the difference between fields once we have at least two points of a metric and report that difference (in your example as a distribution; in the case of #5378, as individual counters).

The important point here is that we don't report the first point we get for a cumulative metric like this one (if we did, we could create spikes on Datadog webapp graphs whenever the Collector restarts and we lose state). In your example, only one metric point reaches the Datadog exporter, so it gets stored but not reported. You need to send more than one point for these metrics to show up in the webapp.
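
To make this concrete, here is a minimal Go sketch of the idea, assuming a simple per-metric cache of the last cumulative value (this is not the exporter's actual code; cumulativeToDelta and consume are made-up names):

```go
// Minimal sketch of the cumulative-to-delta behaviour described above (this
// is not the exporter's actual code; cumulativeToDelta and consume are
// made-up names). The first cumulative value of a metric is only stored;
// every later value is reported as the difference from the previous one.
package main

import "fmt"

type cumulativeToDelta struct {
	prev map[string]float64 // last cumulative value seen, per metric key
}

// consume returns (delta, true) when a delta can be reported, or (0, false)
// for the first point of a metric, which is stored but not reported.
func (c *cumulativeToDelta) consume(key string, cumulative float64) (float64, bool) {
	old, seen := c.prev[key]
	c.prev[key] = cumulative
	if !seen {
		return 0, false
	}
	return cumulative - old, true
}

func main() {
	c := &cumulativeToDelta{prev: map[string]float64{}}
	// Cumulative sums as produced by the SDK in the example: 50k, 100k, 150k...
	for _, sum := range []float64{50000, 100000, 150000} {
		if delta, ok := c.consume("test.histograms.distributions", sum); ok {
			fmt.Println("report delta:", delta) // 50000, then 50000 again
		} else {
			fmt.Println("first point: stored, not reported")
		}
	}
}
```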

If I apply the following diff to your code:

diff for sending several points
--- old_main.go	2021-11-05 13:15:55.000000000 +0100
+++ main.go	2021-11-05 13:17:53.000000000 +0100
@@ -39,6 +39,7 @@
 		),
 		controller.WithExporter(exp),
 		controller.WithCollectTimeout(time.Second),
+		controller.WithCollectPeriod(time.Millisecond*100),
 	)
 	global.SetMeterProvider(pusher)
 
@@ -58,7 +59,10 @@
 	meter := global.Meter("test-meter")
 	h := metric.Must(meter).NewFloat64Histogram("test.histograms.distributions")
 
-	for i := 0; i < 5; i++ {
-		h.Record(ctx, float64(5000*i))
+	for j := 0; j < 10; j++ {
+		for i := 0; i < 5; i++ {
+			h.Record(ctx, float64(5000*i))
+		}
+		time.Sleep(1 * time.Second)
 	}
 }

It generates several points for this metric:

logging exporter output with updated code
2021-11-05T11:59:16.209Z   DEBUG   loggingexporter/logging_exporter.go:66  ResourceMetrics #0
Resource labels:
     -> service.name: STRING(unknown_service:main)
     -> telemetry.sdk.language: STRING(go)
     -> telemetry.sdk.name: STRING(opentelemetry)
     -> telemetry.sdk.version: STRING(1.0.1)
InstrumentationLibraryMetrics #0
InstrumentationLibrary test-meter 
Metric #0
Descriptor:
     -> Name: test.histograms.distributions
     -> Description: 
     -> Unit: 
     -> DataType: Histogram
     -> AggregationTemporality: AGGREGATION_TEMPORALITY_CUMULATIVE
HistogramDataPoints #0
StartTimestamp: 2021-11-05 11:59:15.9769341 +0000 UTC
Timestamp: 2021-11-05 11:59:16.0777179 +0000 UTC
Count: 5
Sum: 50000.000000
ExplicitBounds #0: 0.000000
ExplicitBounds #1: 5000.000000
ExplicitBounds #2: 10000.000000
ExplicitBounds #3: 15000.000000
ExplicitBounds #4: 20000.000000
Buckets #0, Count: 0
Buckets #1, Count: 1
Buckets #2, Count: 1
Buckets #3, Count: 1
Buckets #4, Count: 1
Buckets #5, Count: 1

2021-11-05T11:59:17.012Z   INFO    loggingexporter/logging_exporter.go:56  MetricsExporter {"#metrics": 1}
2021-11-05T11:59:17.013Z   DEBUG   loggingexporter/logging_exporter.go:66  ResourceMetrics #0
Resource labels:
     -> service.name: STRING(unknown_service:main)
     -> telemetry.sdk.language: STRING(go)
     -> telemetry.sdk.name: STRING(opentelemetry)
     -> telemetry.sdk.version: STRING(1.0.1)
InstrumentationLibraryMetrics #0
InstrumentationLibrary test-meter 
Metric #0
Descriptor:
     -> Name: test.histograms.distributions
     -> Description: 
     -> Unit: 
     -> DataType: Histogram
     -> AggregationTemporality: AGGREGATION_TEMPORALITY_CUMULATIVE
HistogramDataPoints #0
StartTimestamp: 2021-11-05 11:59:15.9769341 +0000 UTC
Timestamp: 2021-11-05 11:59:16.9778853 +0000 UTC
Count: 10
Sum: 100000.000000
ExplicitBounds #0: 0.000000
ExplicitBounds #1: 5000.000000
ExplicitBounds #2: 10000.000000
ExplicitBounds #3: 15000.000000
ExplicitBounds #4: 20000.000000
Buckets #0, Count: 0
Buckets #1, Count: 2
Buckets #2, Count: 2
Buckets #3, Count: 2
Buckets #4, Count: 2
Buckets #5, Count: 2

All the points reported for this example will have a sum of 50k and a count of 5 (so an average of 10k), and every bucket will have a count of 1 except for the first bucket (corresponding to (-inf, 0)).

After applying this diff, I can see the distribution metric test.histograms.distributions on the webapp with matching values. Two points about the webapp worth noting, to make things a bit clearer:

  • You can enable percentile metrics for individual distribution metrics; these are estimated from the bucket counts and may have a small relative error (we don't get the actual data, after all). A rough illustration of this kind of estimation follows this list.
  • Sum and average values are also approximated as of v0.38.0, but we are looking into reusing the values sent in the payload. As for min and max, OTLP Histogram points will gain these fields in the future (Add min/max fields to HistogramDataPoint, opentelemetry-proto#279), so the accuracy of these values will improve in future versions.
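
A minimal Go sketch of how a percentile can be estimated from explicit bucket counts, assuming simple linear interpolation inside the bucket that contains the target rank (this is an illustration only, not Datadog's actual algorithm; estimatePercentile is a made-up helper):

```go
// Rough illustration of estimating a percentile from explicit bucket counts
// (not Datadog's actual algorithm; estimatePercentile is a made-up helper
// that linearly interpolates inside the bucket containing the target rank).
package main

import "fmt"

// estimatePercentile takes bucket upper bounds and per-bucket counts, where
// len(counts) == len(bounds)+1 and the last bucket is (bounds[n-1], +inf),
// and returns an estimate for percentile p in [0, 100].
func estimatePercentile(bounds []float64, counts []uint64, p float64) float64 {
	var total uint64
	for _, c := range counts {
		total += c
	}
	target := p / 100 * float64(total)

	var cum float64
	lower := bounds[0] // crude: treat the underflow bucket as the first bound
	for i, c := range counts {
		upper := lower // crude: collapse the overflow bucket to its lower bound
		if i < len(bounds) {
			upper = bounds[i]
		}
		if c > 0 && cum+float64(c) >= target {
			frac := (target - cum) / float64(c)
			return lower + frac*(upper-lower)
		}
		cum += float64(c)
		lower = upper
	}
	return lower
}

func main() {
	bounds := []float64{0, 5000, 10000, 15000, 20000}
	counts := []uint64{0, 1, 1, 1, 1, 1}                // bucket counts from the example above
	fmt.Println(estimatePercentile(bounds, counts, 50)) // rough p50 estimate
}
```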

Lastly, note that v0.37.1 included a bug where the aggregation temporality of cumulative histograms was misidentified as delta. This explains why you could see the points in that version.

Hopefully this makes things clearer. I'll keep this issue open in case you have further questions; let me know if anything in the explanation doesn't make sense or if the output in the webapp doesn't match your expectations.

@emilgelman
Author

Thank you @mx-psi for the detailed response.
I've tried to reproduce your example but unfortunately did not get the expected results.

logging exporter output
Resource labels:
   -> service.name: STRING(unknown_service:___go_build_chen)
   -> telemetry.sdk.language: STRING(go)
   -> telemetry.sdk.name: STRING(opentelemetry)
   -> telemetry.sdk.version: STRING(1.1.0)
InstrumentationLibraryMetrics #0
InstrumentationLibrary test-meter
Metric #0
Descriptor:
   -> Name: test.histograms.distributions.new2
   -> Description:
   -> Unit:
   -> DataType: Histogram
   -> AggregationTemporality: AGGREGATION_TEMPORALITY_CUMULATIVE
HistogramDataPoints #0
StartTimestamp: 2021-11-13 15:22:12.422802 +0000 UTC
Timestamp: 2021-11-13 15:22:22.425582 +0000 UTC
Count: 50
Sum: 500000.000000
ExplicitBounds #0: 0.000000
ExplicitBounds #1: 5000.000000
ExplicitBounds #2: 10000.000000
ExplicitBounds #3: 15000.000000
ExplicitBounds #4: 20000.000000
Buckets #0, Count: 0
Buckets #1, Count: 10
Buckets #2, Count: 10
Buckets #3, Count: 10
Buckets #4, Count: 10
Buckets #5, Count: 10

No metric named test.histograms.distributions.new2 is visible in the webapp. Is there anything else we could try?
Attaching complete code again just for reference:

code to reproduce

package main

import (
  "context"
  "go.opentelemetry.io/otel/sdk/metric/aggregator/histogram"
  "log"
  "time"

  "go.opentelemetry.io/otel"
  "go.opentelemetry.io/otel/exporters/otlp/otlpmetric"
  "go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
  "go.opentelemetry.io/otel/metric"
  "go.opentelemetry.io/otel/metric/global"
  controller "go.opentelemetry.io/otel/sdk/metric/controller/basic"
  processor "go.opentelemetry.io/otel/sdk/metric/processor/basic"
  "go.opentelemetry.io/otel/sdk/metric/selector/simple"
)

func main() {
  ctx := context.Background()
  client := otlpmetrichttp.NewClient(otlpmetrichttp.WithInsecure(), otlpmetrichttp.WithEndpoint("0.0.0.0:55681"))
  exp, err := otlpmetric.New(ctx, client)
  if err != nil {
  	log.Fatalf("Failed to create the collector exporter: %v", err)
  }
  defer func() {
  	ctx, cancel := context.WithTimeout(ctx, time.Second)
  	defer cancel()
  	if err := exp.Shutdown(ctx); err != nil {
  		otel.Handle(err)
  	}
  }()

  pusher := controller.New(
  	processor.NewFactory(
  		simple.NewWithHistogramDistribution(histogram.WithExplicitBoundaries([]float64{0, 5000, 10000, 15000, 20000})),
  		exp,
  	),
  	controller.WithExporter(exp),
  	controller.WithCollectTimeout(time.Millisecond*100),
  )
  global.SetMeterProvider(pusher)

  if err := pusher.Start(ctx); err != nil {
  	log.Fatalf("could not start metric controoler: %v", err)
  }

  defer func() {
  	ctx, cancel := context.WithTimeout(ctx, time.Second)
  	defer cancel()
  	// pushes any last exports to the receiver
  	if err := pusher.Stop(ctx); err != nil {
  		otel.Handle(err)
  	}
  }()

  meter := global.Meter("test-meter")
  h := metric.Must(meter).NewFloat64Histogram("test.histograms.distributions.new2")

  for j := 0; j < 10; j++ {
  	for i := 0; i < 5; i++ {
  		h.Record(ctx, float64(5000*i))
  	}
  	time.Sleep(1 * time.Second)
  }
}

@mx-psi
Member

mx-psi commented Nov 17, 2021

@emilgelman Thanks for having a look. The controller has two different settings with very similar names: controller.WithCollectPeriod and controller.WithCollectTimeout. The one you need to set to ensure the instrumentation library produces several points is controller.WithCollectPeriod, so in your code you can configure the pusher as follows:

  pusher := controller.New(
  	processor.NewFactory(
  		simple.NewWithHistogramDistribution(histogram.WithExplicitBoundaries([]float64{0, 5000, 10000, 15000, 20000})),
  		exp,
  	),
  	controller.WithExporter(exp),
  	controller.WithCollectTimeout(time.Millisecond*100),
  	// ↓↓ Add this line ↓↓
  	controller.WithCollectPeriod(time.Millisecond*100),
  )

That should produce several points. Let me know if that helps!

@emilgelman
Author

My bad, @mx-psi. Setting WithCollectPeriod does produce points, thank you for your help!

garypen added a commit to apollographql/router that referenced this issue Jul 11, 2023
Add a new configuration option (Temporality) to the otlp metrics
configuration.

This may be useful to fix problems with metrics when being processed by
datadog which tends to expect Delta, rather than Cumulative,
aggregations.

See:
 - open-telemetry/opentelemetry-collector-contrib#6129
 - DataDog/documentation#15840

for more details.
garypen added a commit to apollographql/router that referenced this issue Jul 11, 2023
Add a new configuration option (Temporality) to the otlp metrics
configuration.

This may be useful to fix problems with metrics when being processed by
datadog which tends to expect Delta, rather than Cumulative,
aggregations.

See:
 - open-telemetry/opentelemetry-collector-contrib#6129
 - DataDog/documentation#15840

for more details.

<!-- start metadata -->

**Checklist**

Complete the checklist (and note appropriate exceptions) before a final
PR is raised.

- [x] Changes are compatible[^1]
- [x] Documentation[^2] completed
- [x] Performance impact assessed and acceptable
- Tests added and passing[^3]
    - [ ] Unit Tests
    - [ ] Integration Tests
    - [ ] Manual Tests

**Exceptions**

*Note any exceptions here*

**Notes**

[^1]. It may be appropriate to bring upcoming changes to the attention
of other (impacted) groups. Please endeavour to do this before seeking
PR approval. The mechanism for doing this will vary considerably, so use
your judgement as to how and when to do this.
[^2]. Configuration is an important part of many changes. Where
applicable please try to document configuration examples.
[^3]. Tick whichever testing boxes are applicable. If you are adding
Manual Tests:
- please document the manual testing (extensively) in the Exceptions.
- please raise a separate issue to automate the test and label it (or
ask for it to be labeled) as `manual test`

---------

Co-authored-by: Bryn Cooke <BrynCooke@gmail.com>
bnjjj added a commit to apollographql/router that referenced this issue Jul 13, 2023
> **Note**
>
> When approved, this PR will merge into **the `1.24.0` branch** which will — upon being approved itself — merge into `main`.
>
> **Things to review in this PR**:
> - Changelog correctness (There is a preview below, but it is not necessarily the most up to date. See the _Files Changed_ for the true reality.)
> - Version bumps
> - That it targets the right release branch (`1.24.0` in this case!).
>
---
## 🚀 Features

### Add support for delta aggregation to otlp metrics ([PR #3412](#3412))

Add a new configuration option (Temporality) to the otlp metrics
configuration.

This may be useful to fix problems with metrics when being processed by
datadog which tends to expect Delta, rather than Cumulative,
aggregations.

See:
 - open-telemetry/opentelemetry-collector-contrib#6129
 - DataDog/documentation#15840

for more details.

By [@garypen](https://github.com/garypen) in
#3412

## 🐛 Fixes

### Fix error handling for subgraphs ([Issue #3141](#3141))

The GraphQL spec is rather light on what should happen when we process responses from subgraphs. The current behaviour within the Router was inconsistently short-circuiting response processing, producing confusing errors.
> #### Processing the response
>
> If the response uses a non-200 status code and the media type of the response payload is application/json then the client MUST NOT rely on the body to be a well-formed GraphQL response since the source of the response may not be the server but instead some intermediary such as API gateways, proxies, firewalls, etc.

The logic has been simplified and made consistent using the following rules (a rough sketch in Go follows the list):
1. If the content type of the response is not `application/json` or `application/graphql-response+json` then we won't try to parse.
2. If an HTTP status is not 2xx it will always be attached as a GraphQL error.
3. If the response type is `application/json`, the status is not 2xx, and the body is not valid GraphQL, the entire subgraph response will be attached as an error.
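
A rough Go sketch of these three rules, purely for illustration (the Router itself is written in Rust; handleSubgraphResponse, tryParseGraphQL and the error strings are hypothetical):

```go
// Rough sketch of the three rules above (hypothetical Go; the Router itself
// is written in Rust, and handleSubgraphResponse, tryParseGraphQL and the
// error strings are made up for illustration).
package main

import (
	"fmt"
	"strings"
)

type subgraphResult struct {
	data   string   // parsed GraphQL response body, if any
	errors []string // GraphQL errors attached by the router
}

func handleSubgraphResponse(status int, contentType, body string) subgraphResult {
	var res subgraphResult
	is2xx := status >= 200 && status < 300
	isJSON := strings.HasPrefix(contentType, "application/json") ||
		strings.HasPrefix(contentType, "application/graphql-response+json")

	// Rule 2: a non-2xx HTTP status is always attached as a GraphQL error.
	if !is2xx {
		res.errors = append(res.errors, fmt.Sprintf("HTTP fetch failed: status %d", status))
	}

	// Rule 1: don't try to parse unrecognised content types.
	if !isJSON {
		return res
	}

	if parsed, ok := tryParseGraphQL(body); ok {
		res.data = parsed
		return res
	}

	// Rule 3: JSON content type, non-2xx status, body not valid GraphQL:
	// attach the entire subgraph response as an error.
	if !is2xx {
		res.errors = append(res.errors, "invalid subgraph response: "+body)
	}
	return res
}

// tryParseGraphQL is a stand-in for real GraphQL response validation.
func tryParseGraphQL(body string) (string, bool) {
	if strings.Contains(body, `"data"`) {
		return body, true
	}
	return "", false
}

func main() {
	fmt.Println(handleSubgraphResponse(502, "text/html", "<html>Bad gateway</html>"))
	fmt.Println(handleSubgraphResponse(200, "application/json", `{"data":{}}`))
}
```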

By [@BrynCooke](https://github.com/BrynCooke) in
#3328

## 🛠 Maintenance

### chore: router-bridge 0.3.0+v2.4.8 -> =0.3.1+2.4.9 ([PR #3407](#3407))

Updates `router-bridge` from ` = "0.3.0+v2.4.8"` to ` = "0.3.1+v2.4.9"`; note that with this PR, this dependency is now pinned to an exact version. This version update started failing tests because of a minor ordering change, and it was not immediately clear why the test was failing. Pinning this dependency (which we own) allows us to bring in the update only at the proper time and makes test failures caused by the update easier to identify.

By [@EverlastingBugstopper](https://github.com/EverlastingBugstopper) in
#3407

### remove the compiler from Query ([Issue #3373](#3373))

The `Query` object caches information extracted from the query that is
used to format responses. It was carrying an `ApolloCompiler` instance,
but now we don't really need it anymore, since it is now cached at the
query analysis layer. We also should not carry it in the supergraph
request and execution request, because that makes the builders hard to
manipulate for plugin authors. Since we are not exposing the compiler in
the public API yet, we move it inside the context's private entries,
where it will be easily accessible from internal code.

By [@Geal](https://github.com/Geal) in
#3367

### move AllowOnlyHttpPostMutationsLayer at the supergraph service level ([PR #3374](#3374), [PR #3410](#3410))

Now that we have access to a compiler in supergraph requests, we don't need to look into the query plan to know if a request contains mutations.

By [@Geal](https://github.com/Geal) in
#3374 &
#3410

### update opentelemetry to 0.19.0 ([Issue #2878](#2878))


We've updated the following opentelemetry related crates:

```
opentelemetry 0.18.0 -> 0.19.0
opentelemetry-datadog 0.6.0 -> 0.7.0
opentelemetry-http 0.7.0 -> 0.8.0
opentelemetry-jaeger 0.17.0 -> 0.18.0
opentelemetry-otlp 0.11.0 -> 0.12.0
opentelemetry-semantic-conventions 0.10.0 -> 0.11.0
opentelemetry-zipkin 0.16.0 -> 0.17.0
opentelemetry-prometheus 0.11.0 -> 0.12.0
tracing-opentelemetry 0.18.0 -> 0.19.0
```

This allows us to close a number of opentelemetry related issues.

Note:

The prometheus specification mandates a naming format and, unfortunately, the router had two metrics which weren't compliant. The otel upgrade enforces the specification, so the affected metrics are now renamed (see below).

The two affected metrics in the router were:

apollo_router_cache_hit_count -> apollo_router_cache_hit_count_total
apollo_router_cache_miss_count -> apollo_router_cache_miss_count_total

If you are monitoring these metrics via prometheus, please update your
dashboards with this name change.

By [@garypen](https://github.com/garypen) in
#3421

### Synthesize defer labels without RNG or collisions ([PR #3381](#3381) and [PR #3423](#3423))

The `@defer` directive accepts a `label` argument, but it is optional. To more accurately handle deferred responses, the Router internally rewrites queries to add labels on the `@defer` directive where they are missing. Responses eventually receive the reverse treatment to look as expected by the client.

This was previously done by generating random strings, handling collisions with existing labels, and maintaining a `HashSet` of which labels had been synthesized. Instead, we now add a prefix to pre-existing labels and generate new labels without it. When processing a response, the absence of that prefix indicates a synthetic label.
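
A toy Go sketch of that prefix scheme, purely for illustration (the Router is written in Rust; the prefix value and the rewriteLabel/synthesizeLabel/restoreLabel helpers are made up):

```go
// Toy sketch of the prefix scheme described above (hypothetical Go, not the
// Router's Rust implementation; the prefix and helper names are made up).
// User-provided @defer labels are rewritten with a prefix, synthesized labels
// are generated without it, so collisions are impossible and the response
// side can tell them apart without shared state.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const userPrefix = "u_" // hypothetical prefix

// rewriteLabel prefixes a label the client provided in the query.
func rewriteLabel(userLabel string) string { return userPrefix + userLabel }

// synthesizeLabel creates a label for a @defer that had none; it can never
// collide with a rewritten label because it lacks the prefix.
func synthesizeLabel(counter int) string { return strconv.Itoa(counter) }

// restoreLabel maps a label seen in a deferred response back to what the
// client expects: the original label, or "" when it was synthesized.
func restoreLabel(label string) (original string, synthetic bool) {
	if strings.HasPrefix(label, userPrefix) {
		return strings.TrimPrefix(label, userPrefix), false
	}
	return "", true
}

func main() {
	fmt.Println(rewriteLabel("myLabel"))   // u_myLabel
	fmt.Println(synthesizeLabel(0))        // 0
	fmt.Println(restoreLabel("u_myLabel")) // myLabel false
	fmt.Println(restoreLabel("0"))         // (empty original) true
}
```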

By [@SimonSapin](https://github.com/SimonSapin) and
[@o0Ignition0o](https://github.com/o0Ignition0o) in
#3381 and
#3423

### Move subscription event execution at the execution service level ([PR #3395](#3395))

In order to prepare some future integration I moved the execution loop
for subscription events at the execution_service level.

By [@bnjjj](https://github.com/bnjjj) in
#3395

## 📚 Documentation

### Document claim augmentation via coprocessors ([Issue #3102](#3102))

Claims augmentation is a common use case where user information from the
JWT claims is used to look up more context like roles from databases,
before sending it to subgraphs. This can be done with subgraphs, but it
was not documented yet, and there was confusion on the order in which
the plugins were called. This clears the confusion and provides an
example configuration.

By [@Geal](https://github.com/Geal) in
#3386