Histogram distributions are not being exported to Datadog #6129

Closed
emilgelman opened this issue Nov 4, 2021 · 4 comments
Labels: bug (Something isn't working) · data:metrics (Metric related issues) · exporter/datadog (Datadog components)

Comments

@emilgelman

Using otel/opentelemetry-collector-contrib:0.38.0
I'm trying to export a Histogram instrument using HistogramDistribution, and it looks like it's not being exported to Datadog.

Collector configuration:

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:55681


exporters:
  datadog:
    use_resource_metadata: false
    env: "local"
    send_metadata: false
    api:
      key: ${DD_API_KEY}
      site: datadoghq.eu
    metrics:
      send_monotonic_counter: false
      histograms:
        mode: distributions
  logging:
    loglevel: debug
service:
  pipelines:
    metrics:
      receivers: [ otlp ]
      exporters: [ datadog, logging ]

Example to reproduce:

package main

import (
	"context"
	"go.opentelemetry.io/otel/sdk/metric/aggregator/histogram"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
	"go.opentelemetry.io/otel/metric"
	"go.opentelemetry.io/otel/metric/global"
	controller "go.opentelemetry.io/otel/sdk/metric/controller/basic"
	processor "go.opentelemetry.io/otel/sdk/metric/processor/basic"
	"go.opentelemetry.io/otel/sdk/metric/selector/simple"
)

func main() {
	ctx := context.Background()
	client := otlpmetrichttp.NewClient(otlpmetrichttp.WithInsecure(), otlpmetrichttp.WithEndpoint("0.0.0.0:55681"))
	exp, err := otlpmetric.New(ctx, client)
	if err != nil {
		log.Fatalf("Failed to create the collector exporter: %v", err)
	}
	defer func() {
		ctx, cancel := context.WithTimeout(ctx, time.Second)
		defer cancel()
		if err := exp.Shutdown(ctx); err != nil {
			otel.Handle(err)
		}
	}()

	pusher := controller.New(
		processor.NewFactory(
			simple.NewWithHistogramDistribution(histogram.WithExplicitBoundaries([]float64{0, 5000, 10000, 15000, 20000})),
			exp,
		),
		controller.WithExporter(exp),
		controller.WithCollectTimeout(time.Second),
	)
	global.SetMeterProvider(pusher)

	if err := pusher.Start(ctx); err != nil {
		log.Fatalf("could not start metric controoler: %v", err)
	}

	defer func() {
		ctx, cancel := context.WithTimeout(ctx, time.Second)
		defer cancel()
		// pushes any last exports to the receiver
		if err := pusher.Stop(ctx); err != nil {
			otel.Handle(err)
		}
	}()

	meter := global.Meter("test-meter")
	h := metric.Must(meter).NewFloat64Histogram("test.histograms.distributions")

	for i := 0; i < 5; i++ {
		h.Record(ctx, float64(5000*i))
	}
}

stdout exporter output:

2021-11-04T10:47:32.519Z        DEBUG   loggingexporter/logging_exporter.go:66  ResourceMetrics #0
Resource labels:
     -> service.name: STRING(unknown_service)
     -> telemetry.sdk.language: STRING(go)
     -> telemetry.sdk.name: STRING(opentelemetry)
     -> telemetry.sdk.version: STRING(1.1.0)
InstrumentationLibraryMetrics #0
InstrumentationLibrary test-meter 
Metric #0
Descriptor:
     -> Name: test.histograms.distributions
     -> Description: 
     -> Unit: 
     -> DataType: Histogram
     -> AggregationTemporality: AGGREGATION_TEMPORALITY_CUMULATIVE
HistogramDataPoints #0
StartTimestamp: 2021-11-04 10:47:32.487976 +0000 UTC
Timestamp: 2021-11-04 10:47:32.488012 +0000 UTC
Count: 5
Sum: 50000.000000
ExplicitBounds #0: 0.000000
ExplicitBounds #1: 5000.000000
ExplicitBounds #2: 10000.000000
ExplicitBounds #3: 15000.000000
ExplicitBounds #4: 20000.000000
Buckets #0, Count: 0
Buckets #1, Count: 1
Buckets #2, Count: 1
Buckets #3, Count: 1
Buckets #4, Count: 1
Buckets #5, Count: 1

@mx-psi thanks for looking into it!

@emilgelman emilgelman added the bug Something isn't working label Nov 4, 2021
@mx-psi mx-psi added exporter/datadog Datadog components data:metrics Metric related issues labels Nov 4, 2021
@mx-psi
Member

mx-psi commented Nov 5, 2021

Hey there @emilgelman, thanks for the report!

Similar to what we discussed on #5378 (comment), OpenTelemetry instrumentation libraries generate cumulative histograms by default. Since Datadog is delta-based and we can't currently indicate when a 'reset' happens, we calculate the difference between fields once we have at least two points of a metric and report that difference (in your example as a distribution; in the case of #5378, as individual counters).

The important point here is that we don't report the first point we get for a cumulative metric like this one (if we did, we could create spikes on Datadog webapp graphs whenever the Collector restarts and we lose state). In your example, only one metric point reaches the Datadog exporter, so it gets stored but not reported. You need to send more than one point for these metrics to show up in the webapp.
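
To make this concrete, here is a minimal Go sketch of the idea, assuming a simple per-metric cache of the last cumulative value (this is not the exporter's actual code; cumulativeToDelta and consume are made-up names):

```go
// Minimal sketch of the cumulative-to-delta behaviour described above (this
// is not the exporter's actual code; cumulativeToDelta and consume are
// made-up names). The first cumulative value of a metric is only stored;
// every later value is reported as the difference from the previous one.
package main

import "fmt"

type cumulativeToDelta struct {
	prev map[string]float64 // last cumulative value seen, per metric key
}

// consume returns (delta, true) when a delta can be reported, or (0, false)
// for the first point of a metric, which is stored but not reported.
func (c *cumulativeToDelta) consume(key string, cumulative float64) (float64, bool) {
	old, seen := c.prev[key]
	c.prev[key] = cumulative
	if !seen {
		return 0, false
	}
	return cumulative - old, true
}

func main() {
	c := &cumulativeToDelta{prev: map[string]float64{}}
	// Cumulative sums as produced by the SDK in the example: 50k, 100k, 150k...
	for _, sum := range []float64{50000, 100000, 150000} {
		if delta, ok := c.consume("test.histograms.distributions", sum); ok {
			fmt.Println("report delta:", delta) // 50000, then 50000 again
		} else {
			fmt.Println("first point: stored, not reported")
		}
	}
}
```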

If I apply the following diff to your code:

diff for sending several points
--- old_main.go	2021-11-05 13:15:55.000000000 +0100
+++ main.go	2021-11-05 13:17:53.000000000 +0100
@@ -39,6 +39,7 @@
 		),
 		controller.WithExporter(exp),
 		controller.WithCollectTimeout(time.Second),
+		controller.WithCollectPeriod(time.Millisecond*100),
 	)
 	global.SetMeterProvider(pusher)
 
@@ -58,7 +59,10 @@
 	meter := global.Meter("test-meter")
 	h := metric.Must(meter).NewFloat64Histogram("test.histograms.distributions")
 
-	for i := 0; i < 5; i++ {
-		h.Record(ctx, float64(5000*i))
+	for j := 0; j < 10; j++ {
+		for i := 0; i < 5; i++ {
+			h.Record(ctx, float64(5000*i))
+		}
+		time.Sleep(1 * time.Second)
 	}
 }

It generates several points for this metric:

logging exporter output with updated code
2021-11-05T11:59:16.209Z   DEBUG   loggingexporter/logging_exporter.go:66  ResourceMetrics #0
Resource labels:
     -> service.name: STRING(unknown_service:main)
     -> telemetry.sdk.language: STRING(go)
     -> telemetry.sdk.name: STRING(opentelemetry)
     -> telemetry.sdk.version: STRING(1.0.1)
InstrumentationLibraryMetrics #0
InstrumentationLibrary test-meter 
Metric #0
Descriptor:
     -> Name: test.histograms.distributions
     -> Description: 
     -> Unit: 
     -> DataType: Histogram
     -> AggregationTemporality: AGGREGATION_TEMPORALITY_CUMULATIVE
HistogramDataPoints #0
StartTimestamp: 2021-11-05 11:59:15.9769341 +0000 UTC
Timestamp: 2021-11-05 11:59:16.0777179 +0000 UTC
Count: 5
Sum: 50000.000000
ExplicitBounds #0: 0.000000
ExplicitBounds #1: 5000.000000
ExplicitBounds #2: 10000.000000
ExplicitBounds #3: 15000.000000
ExplicitBounds #4: 20000.000000
Buckets #0, Count: 0
Buckets #1, Count: 1
Buckets #2, Count: 1
Buckets #3, Count: 1
Buckets #4, Count: 1
Buckets #5, Count: 1

2021-11-05T11:59:17.012Z   INFO    loggingexporter/logging_exporter.go:56  MetricsExporter {"#metrics": 1}
2021-11-05T11:59:17.013Z   DEBUG   loggingexporter/logging_exporter.go:66  ResourceMetrics #0
Resource labels:
     -> service.name: STRING(unknown_service:main)
     -> telemetry.sdk.language: STRING(go)
     -> telemetry.sdk.name: STRING(opentelemetry)
     -> telemetry.sdk.version: STRING(1.0.1)
InstrumentationLibraryMetrics #0
InstrumentationLibrary test-meter 
Metric #0
Descriptor:
     -> Name: test.histograms.distributions
     -> Description: 
     -> Unit: 
     -> DataType: Histogram
     -> AggregationTemporality: AGGREGATION_TEMPORALITY_CUMULATIVE
HistogramDataPoints #0
StartTimestamp: 2021-11-05 11:59:15.9769341 +0000 UTC
Timestamp: 2021-11-05 11:59:16.9778853 +0000 UTC
Count: 10
Sum: 100000.000000
ExplicitBounds #0: 0.000000
ExplicitBounds #1: 5000.000000
ExplicitBounds #2: 10000.000000
ExplicitBounds #3: 15000.000000
ExplicitBounds #4: 20000.000000
Buckets #0, Count: 0
Buckets #1, Count: 2
Buckets #2, Count: 2
Buckets #3, Count: 2
Buckets #4, Count: 2
Buckets #5, Count: 2

All the points reported for this example will have a sum of 50k and a count of 5 (so an average of 10k), and every bucket will have a count of 1 except for the first bucket (corresponding to (-inf, 0)).

After applying this diff, I can see the distribution metric test.histograms.distributions on the webapp with matching values. Two points about the webapp worth noting, to make things a bit clearer:

  • You can enable percentile metrics for individual distribution metrics; these are estimated from the bucket counts and may have a small relative error (we don't get the actual data, after all). A rough illustration of this kind of estimation follows this list.
  • Sum and average values are also approximated as of v0.38.0, but we are looking into reusing the values sent in the payload. As for min and max, OTLP Histogram points will gain these fields in the future (Add min/max fields to HistogramDataPoint, opentelemetry-proto#279), so the accuracy of these values will improve in future versions.
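
A minimal Go sketch of how a percentile can be estimated from explicit bucket counts, assuming simple linear interpolation inside the bucket that contains the target rank (this is an illustration only, not Datadog's actual algorithm; estimatePercentile is a made-up helper):

```go
// Rough illustration of estimating a percentile from explicit bucket counts
// (not Datadog's actual algorithm; estimatePercentile is a made-up helper
// that linearly interpolates inside the bucket containing the target rank).
package main

import "fmt"

// estimatePercentile takes bucket upper bounds and per-bucket counts, where
// len(counts) == len(bounds)+1 and the last bucket is (bounds[n-1], +inf),
// and returns an estimate for percentile p in [0, 100].
func estimatePercentile(bounds []float64, counts []uint64, p float64) float64 {
	var total uint64
	for _, c := range counts {
		total += c
	}
	target := p / 100 * float64(total)

	var cum float64
	lower := bounds[0] // crude: treat the underflow bucket as the first bound
	for i, c := range counts {
		upper := lower // crude: collapse the overflow bucket to its lower bound
		if i < len(bounds) {
			upper = bounds[i]
		}
		if c > 0 && cum+float64(c) >= target {
			frac := (target - cum) / float64(c)
			return lower + frac*(upper-lower)
		}
		cum += float64(c)
		lower = upper
	}
	return lower
}

func main() {
	bounds := []float64{0, 5000, 10000, 15000, 20000}
	counts := []uint64{0, 1, 1, 1, 1, 1}                // bucket counts from the example above
	fmt.Println(estimatePercentile(bounds, counts, 50)) // rough p50 estimate
}
```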

Lastly, note that v0.37.1 included a bug where the aggregation temporality of cumulative histograms was misidentified as delta. This explains why you could see the points in that version.

Hopefully this makes things clearer. I'll keep this issue open in case you have further questions; let me know if anything in the explanation doesn't make sense or if the output in the webapp doesn't match your expectations.

@emilgelman
Author

Thank you @mx-psi for the detailed response.
I've tried to reproduce your example but unfortunately did not get the expected results.

logging exporter output
Resource labels:
   -> service.name: STRING(unknown_service:___go_build_chen)
   -> telemetry.sdk.language: STRING(go)
   -> telemetry.sdk.name: STRING(opentelemetry)
   -> telemetry.sdk.version: STRING(1.1.0)
InstrumentationLibraryMetrics #0
InstrumentationLibrary test-meter
Metric #0
Descriptor:
   -> Name: test.histograms.distributions.new2
   -> Description:
   -> Unit:
   -> DataType: Histogram
   -> AggregationTemporality: AGGREGATION_TEMPORALITY_CUMULATIVE
HistogramDataPoints #0
StartTimestamp: 2021-11-13 15:22:12.422802 +0000 UTC
Timestamp: 2021-11-13 15:22:22.425582 +0000 UTC
Count: 50
Sum: 500000.000000
ExplicitBounds #0: 0.000000
ExplicitBounds #1: 5000.000000
ExplicitBounds #2: 10000.000000
ExplicitBounds #3: 15000.000000
ExplicitBounds #4: 20000.000000
Buckets #0, Count: 0
Buckets #1, Count: 10
Buckets #2, Count: 10
Buckets #3, Count: 10
Buckets #4, Count: 10
Buckets #5, Count: 10

No metric named test.histograms.distributions.new2 is visible in the webapp. Is there anything else we could try?
Attaching complete code again just for reference:

code to reproduce

package main

import (
  "context"
  "go.opentelemetry.io/otel/sdk/metric/aggregator/histogram"
  "log"
  "time"

  "go.opentelemetry.io/otel"
  "go.opentelemetry.io/otel/exporters/otlp/otlpmetric"
  "go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
  "go.opentelemetry.io/otel/metric"
  "go.opentelemetry.io/otel/metric/global"
  controller "go.opentelemetry.io/otel/sdk/metric/controller/basic"
  processor "go.opentelemetry.io/otel/sdk/metric/processor/basic"
  "go.opentelemetry.io/otel/sdk/metric/selector/simple"
)

func main() {
  ctx := context.Background()
  client := otlpmetrichttp.NewClient(otlpmetrichttp.WithInsecure(), otlpmetrichttp.WithEndpoint("0.0.0.0:55681"))
  exp, err := otlpmetric.New(ctx, client)
  if err != nil {
  	log.Fatalf("Failed to create the collector exporter: %v", err)
  }
  defer func() {
  	ctx, cancel := context.WithTimeout(ctx, time.Second)
  	defer cancel()
  	if err := exp.Shutdown(ctx); err != nil {
  		otel.Handle(err)
  	}
  }()

  pusher := controller.New(
  	processor.NewFactory(
  		simple.NewWithHistogramDistribution(histogram.WithExplicitBoundaries([]float64{0, 5000, 10000, 15000, 20000})),
  		exp,
  	),
  	controller.WithExporter(exp),
  	controller.WithCollectTimeout(time.Millisecond*100),
  )
  global.SetMeterProvider(pusher)

  if err := pusher.Start(ctx); err != nil {
  	log.Fatalf("could not start metric controoler: %v", err)
  }

  defer func() {
  	ctx, cancel := context.WithTimeout(ctx, time.Second)
  	defer cancel()
  	// pushes any last exports to the receiver
  	if err := pusher.Stop(ctx); err != nil {
  		otel.Handle(err)
  	}
  }()

  meter := global.Meter("test-meter")
  h := metric.Must(meter).NewFloat64Histogram("test.histograms.distributions.new2")

  for j := 0; j < 10; j++ {
  	for i := 0; i < 5; i++ {
  		h.Record(ctx, float64(5000*i))
  	}
  	time.Sleep(1 * time.Second)
  }
}

@mx-psi
Member

mx-psi commented Nov 17, 2021

@emilgelman Thanks for having a look. The controller has two different settings with very similar names: controller.WithCollectPeriod and controller.WithCollectTimeout. The one you need to set to ensure the instrumentation library produces several points is controller.WithCollectPeriod, so in your code you can configure the pusher as follows:

  pusher := controller.New(
  	processor.NewFactory(
  		simple.NewWithHistogramDistribution(histogram.WithExplicitBoundaries([]float64{0, 5000, 10000, 15000, 20000})),
  		exp,
  	),
  	controller.WithExporter(exp),
  	controller.WithCollectTimeout(time.Millisecond*100),
  	// ↓↓ Add this line ↓↓
  	controller.WithCollectPeriod(time.Millisecond*100),
  )

That should produce several points. Let me know if that helps!

@emilgelman
Author

My bad, @mx-psi. Setting WithCollectPeriod does produce points, thank you for your help!

garypen added a commit to apollographql/router that referenced this issue Jul 11, 2023
Add a new configuration option (Temporality) to the otlp metrics
configuration.

This may be useful to fix problems with metrics when being processed by
datadog which tends to expect Delta, rather than Cumulative,
aggregations.

See:
 - open-telemetry/opentelemetry-collector-contrib#6129
 - DataDog/documentation#15840

for more details.
garypen added a commit to apollographql/router that referenced this issue Jul 11, 2023
Add a new configuration option (Temporality) to the otlp metrics
configuration.

This may be useful to fix problems with metrics when being processed by
datadog which tends to expect Delta, rather than Cumulative,
aggregations.

See:
 - open-telemetry/opentelemetry-collector-contrib#6129
 - DataDog/documentation#15840

for more details.

<!-- start metadata -->

**Checklist**

Complete the checklist (and note appropriate exceptions) before a final
PR is raised.

- [x] Changes are compatible[^1]
- [x] Documentation[^2] completed
- [x] Performance impact assessed and acceptable
- Tests added and passing[^3]
    - [ ] Unit Tests
    - [ ] Integration Tests
    - [ ] Manual Tests

**Exceptions**

*Note any exceptions here*

**Notes**

[^1]. It may be appropriate to bring upcoming changes to the attention
of other (impacted) groups. Please endeavour to do this before seeking
PR approval. The mechanism for doing this will vary considerably, so use
your judgement as to how and when to do this.
[^2]. Configuration is an important part of many changes. Where
applicable please try to document configuration examples.
[^3]. Tick whichever testing boxes are applicable. If you are adding
Manual Tests:
- please document the manual testing (extensively) in the Exceptions.
- please raise a separate issue to automate the test and label it (or
ask for it to be labeled) as `manual test`

---------

Co-authored-by: Bryn Cooke <BrynCooke@gmail.com>
bnjjj added a commit to apollographql/router that referenced this issue Jul 13, 2023
> **Note**
>
> When approved, this PR will merge into **the `1.24.0` branch** which will — upon being approved itself — merge into `main`.
>
> **Things to review in this PR**:
> - Changelog correctness (There is a preview below, but it is not necessarily the most up to date. See the _Files Changed_ for the true reality.)
> - Version bumps
> - That it targets the right release branch (`1.24.0` in this case!).
>
---
## 🚀 Features

### Add support for delta aggregation to otlp metrics ([PR #3412](#3412))

Add a new configuration option (Temporality) to the otlp metrics
configuration.

This may be useful to fix problems with metrics when being processed by
datadog which tends to expect Delta, rather than Cumulative,
aggregations.

See:
 - open-telemetry/opentelemetry-collector-contrib#6129
 - DataDog/documentation#15840

for more details.

By [@garypen](https://github.com/garypen) in
#3412

## 🐛 Fixes

### Fix error handling for subgraphs ([Issue #3141](#3141))

The GraphQL spec is rather light on what should happen when we process responses from subgraphs. The current behaviour within the Router was inconsistently short-circuiting response processing, producing confusing errors.
> #### Processing the response
>
> If the response uses a non-200 status code and the media type of the response payload is application/json then the client MUST NOT rely on the body to be a well-formed GraphQL response since the source of the response may not be the server but instead some intermediary such as API gateways, proxies, firewalls, etc.

The logic has been simplified and made consistent using the following rules (a rough sketch in Go follows the list):
1. If the content type of the response is not `application/json` or `application/graphql-response+json` then we won't try to parse.
2. If an HTTP status is not 2xx it will always be attached as a GraphQL error.
3. If the response type is `application/json`, the status is not 2xx, and the body is not valid GraphQL, the entire subgraph response will be attached as an error.
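
A rough Go sketch of these three rules, purely for illustration (the Router itself is written in Rust; handleSubgraphResponse, tryParseGraphQL and the error strings are hypothetical):

```go
// Rough sketch of the three rules above (hypothetical Go; the Router itself
// is written in Rust, and handleSubgraphResponse, tryParseGraphQL and the
// error strings are made up for illustration).
package main

import (
	"fmt"
	"strings"
)

type subgraphResult struct {
	data   string   // parsed GraphQL response body, if any
	errors []string // GraphQL errors attached by the router
}

func handleSubgraphResponse(status int, contentType, body string) subgraphResult {
	var res subgraphResult
	is2xx := status >= 200 && status < 300
	isJSON := strings.HasPrefix(contentType, "application/json") ||
		strings.HasPrefix(contentType, "application/graphql-response+json")

	// Rule 2: a non-2xx HTTP status is always attached as a GraphQL error.
	if !is2xx {
		res.errors = append(res.errors, fmt.Sprintf("HTTP fetch failed: status %d", status))
	}

	// Rule 1: don't try to parse unrecognised content types.
	if !isJSON {
		return res
	}

	if parsed, ok := tryParseGraphQL(body); ok {
		res.data = parsed
		return res
	}

	// Rule 3: JSON content type, non-2xx status, body not valid GraphQL:
	// attach the entire subgraph response as an error.
	if !is2xx {
		res.errors = append(res.errors, "invalid subgraph response: "+body)
	}
	return res
}

// tryParseGraphQL is a stand-in for real GraphQL response validation.
func tryParseGraphQL(body string) (string, bool) {
	if strings.Contains(body, `"data"`) {
		return body, true
	}
	return "", false
}

func main() {
	fmt.Println(handleSubgraphResponse(502, "text/html", "<html>Bad gateway</html>"))
	fmt.Println(handleSubgraphResponse(200, "application/json", `{"data":{}}`))
}
```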

By [@BrynCooke](https://github.com/BrynCooke) in
#3328

## 🛠 Maintenance

### chore: router-bridge 0.3.0+v2.4.8 -> =0.3.1+2.4.9 ([PR #3407](#3407))

Updates `router-bridge` from ` = "0.3.0+v2.4.8"` to ` = "0.3.1+v2.4.9"`; note that with this PR, this dependency is now pinned to an exact version. This version update started failing tests because of a minor ordering change, and it was not immediately clear why the test was failing. Pinning this dependency (which we own) allows us to bring in the update only at the proper time and makes test failures caused by the update easier to identify.

By [@EverlastingBugstopper](https://github.com/EverlastingBugstopper) in
#3407

### remove the compiler from Query ([Issue #3373](#3373))

The `Query` object caches information extracted from the query that is
used to format responses. It was carrying an `ApolloCompiler` instance,
but now we don't really need it anymore, since it is now cached at the
query analysis layer. We also should not carry it in the supergraph
request and execution request, because that makes the builders hard to
manipulate for plugin authors. Since we are not exposing the compiler in
the public API yet, we move it inside the context's private entries,
where it will be easily accessible from internal code.

By [@Geal](https://github.com/Geal) in
#3367

### move AllowOnlyHttpPostMutationsLayer at the supergraph service level ([PR #3374](#3374), [PR #3410](#3410))

Now that we have access to a compiler in supergraph requests, we don't need to look into the query plan to know if a request contains mutations.

By [@Geal](https://github.com/Geal) in
#3374 &
#3410

### update opentelemetry to 0.19.0 ([Issue #2878](#2878))


We've updated the following opentelemetry related crates:

```
opentelemetry 0.18.0 -> 0.19.0
opentelemetry-datadog 0.6.0 -> 0.7.0
opentelemetry-http 0.7.0 -> 0.8.0
opentelemetry-jaeger 0.17.0 -> 0.18.0
opentelemetry-otlp 0.11.0 -> 0.12.0
opentelemetry-semantic-conventions 0.10.0 -> 0.11.0
opentelemetry-zipkin 0.16.0 -> 0.17.0
opentelemetry-prometheus 0.11.0 -> 0.12.0
tracing-opentelemetry 0.18.0 -> 0.19.0
```

This allows us to close a number of opentelemetry related issues.

Note:

The prometheus specification mandates a naming format and, unfortunately, the router had two metrics which weren't compliant. The otel upgrade enforces the specification, so the affected metrics are now renamed (see below).

The two affected metrics in the router were:

apollo_router_cache_hit_count -> apollo_router_cache_hit_count_total
apollo_router_cache_miss_count -> apollo_router_cache_miss_count_total

If you are monitoring these metrics via prometheus, please update your
dashboards with this name change.

By [@garypen](https://github.com/garypen) in
#3421

### Synthesize defer labels without RNG or collisions ([PR #3381](#3381) and [PR #3423](#3423))

The `@defer` directive accepts a `label` argument, but it is optional. To more accurately handle deferred responses, the Router internally rewrites queries to add labels on the `@defer` directive where they are missing. Responses eventually receive the reverse treatment to look as expected by the client.

This was previously done by generating random strings, handling collisions with existing labels, and maintaining a `HashSet` of which labels had been synthesized. Instead, we now add a prefix to pre-existing labels and generate new labels without it. When processing a response, the absence of that prefix indicates a synthetic label.
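
A toy Go sketch of that prefix scheme, purely for illustration (the Router is written in Rust; the prefix value and the rewriteLabel/synthesizeLabel/restoreLabel helpers are made up):

```go
// Toy sketch of the prefix scheme described above (hypothetical Go, not the
// Router's Rust implementation; the prefix and helper names are made up).
// User-provided @defer labels are rewritten with a prefix, synthesized labels
// are generated without it, so collisions are impossible and the response
// side can tell them apart without shared state.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const userPrefix = "u_" // hypothetical prefix

// rewriteLabel prefixes a label the client provided in the query.
func rewriteLabel(userLabel string) string { return userPrefix + userLabel }

// synthesizeLabel creates a label for a @defer that had none; it can never
// collide with a rewritten label because it lacks the prefix.
func synthesizeLabel(counter int) string { return strconv.Itoa(counter) }

// restoreLabel maps a label seen in a deferred response back to what the
// client expects: the original label, or "" when it was synthesized.
func restoreLabel(label string) (original string, synthetic bool) {
	if strings.HasPrefix(label, userPrefix) {
		return strings.TrimPrefix(label, userPrefix), false
	}
	return "", true
}

func main() {
	fmt.Println(rewriteLabel("myLabel"))   // u_myLabel
	fmt.Println(synthesizeLabel(0))        // 0
	fmt.Println(restoreLabel("u_myLabel")) // myLabel false
	fmt.Println(restoreLabel("0"))         // (empty original) true
}
```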

By [@SimonSapin](https://github.com/SimonSapin) and
[@o0Ignition0o](https://github.com/o0Ignition0o) in
#3381 and
#3423

### Move subscription event execution at the execution service level ([PR #3395](#3395))

In order to prepare some future integration I moved the execution loop
for subscription events at the execution_service level.

By [@bnjjj](https://github.com/bnjjj) in
#3395

## 📚 Documentation

### Document claim augmentation via coprocessors ([Issue #3102](#3102))

Claims augmentation is a common use case where user information from the
JWT claims is used to look up more context like roles from databases,
before sending it to subgraphs. This can be done with subgraphs, but it
was not documented yet, and there was confusion on the order in which
the plugins were called. This clears the confusion and provides an
example configuration.

By [@Geal](https://github.com/Geal) in
#3386