[exporter/prometheusremotewrite] - out of order sample #11438

Closed
amoscatelli opened this issue Jun 21, 2022 · 8 comments
Labels
bug, closed as inactive, comp:prometheus, data:metrics, question, Stale

Comments

@amoscatelli

Describe the bug
Under load, the prometheusremotewrite exporter (in the otel/opentelemetry-collector-contrib:0.53.0 Docker image) occasionally drops metrics (from different source applications) because of an "out of order sample" error.

This seems similar to:
open-telemetry/opentelemetry-collector#2315

Steps to reproduce
Run the OTel Collector Docker image.
Configure the Collector to send metrics to a remote Prometheus using prometheusremotewrite.
Send metrics to the Collector from applications.

What did you expect to see?
No errors and no dropped metrics.

What did you see instead?
Errors with dropped metrics.

What version did you use?
opentelemetry-collector-contrib:0.53.0

What config did you use?

receivers:
  otlp:
    protocols:
      http:
        cors:
          allowed_origins: [
            #OMITTED
          ]

processors:
    memory_limiter:
        check_interval: 5s
        limit_mib: 448
        spike_limit_mib: 64
    batch:
        send_batch_size: 48
        send_batch_max_size: 48
        timeout: 15s

exporters:
    otlp:
        endpoint: tempo-eu-west-0.grafana.net:443
        headers:
            authorization: Basic #OMITTED
            
    prometheusremotewrite:
        endpoint: https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push
        headers:
            authorization: Basic #OMITTED
            
    loki:
        endpoint: https://logs-prod-eu-west-0.grafana.net/loki/api/v1/push
        headers:
            authorization: Basic #OMITTED
        format: json
        labels:
            attributes:
                container_name: ""
                source: ""
            resource:
                host.name: "hostname"

extensions:
    health_check:

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

Environment
I really think it doesn't matter.

Additional context

Some logs:

2022-06-21T15:25:11.648Z        error   exporterhelper/queued_retry.go:183      Exporting failed. The error is not retryable. Dropping data.    {"kind": "exporter", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = <nil>: user=470820: err: out of order sample. timestamp=2022-06-21T15:25:10.988Z, series={__name__=\"http_client_duration_bucket\", http_flavor=\"1.1\", http_method=\"PUT\", http_status_code=\"200\", job=\"iam-test\", le=\"500\", net_peer_name=\"169.254.169.254\"}\n", "dropped_items": 48}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/queued_retry.go:183
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/metrics.go:132
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/queued_retry_inmemory.go:119
go.opentelemetry.io/collector/exporter/exporterhelper/internal.consumerFunc.consume
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/internal/bounded_memory_queue.go:82
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func2
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/internal/bounded_memory_queue.go:69
2022-06-21T15:42:25.297Z        error   exporterhelper/queued_retry.go:183      Exporting failed. The error is not retryable. Dropping data.    {"kind": "exporter", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = <nil>: user=470820: err: out of order sample. timestamp=2022-06-21T15:42:19.612Z, series={__name__=\"process_runtime_jvm_memory_committed\", job=\"optoplus-services-cn\", pool=\"G1 Old Gen\", type=\"heap\"}\n", "dropped_items": 48}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/queued_retry.go:183
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/metrics.go:132
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/queued_retry_inmemory.go:119
go.opentelemetry.io/collector/exporter/exporterhelper/internal.consumerFunc.consume
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/internal/bounded_memory_queue.go:82
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func2
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/internal/bounded_memory_queue.go:69
2022-06-21T15:42:39.240Z        error   exporterhelper/queued_retry.go:183      Exporting failed. The error is not retryable. Dropping data.    {"kind": "exporter", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = <nil>: user=470820: err: out of order sample. timestamp=2022-06-21T15:42:19.612Z, series={__name__=\"process_runtime_jvm_memory_init\", job=\"optoplus-services-cn\", pool=\"CodeHeap 'non-profiled nmethods'\", type=\"non_heap\"}\n", "dropped_items": 48}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/queued_retry.go:183
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/metrics.go:132
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/queued_retry_inmemory.go:119
go.opentelemetry.io/collector/exporter/exporterhelper/internal.consumerFunc.consume
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/internal/bounded_memory_queue.go:82
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func2
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/internal/bounded_memory_queue.go:69
@amoscatelli amoscatelli added the bug Something isn't working label Jun 21, 2022
@mx-psi mx-psi changed the title prometheusremotewrite - out of order sample [exporter/prometheusremotewrite] - out of order sample Jun 22, 2022
@mx-psi mx-psi transferred this issue from open-telemetry/opentelemetry-collector Jun 22, 2022
@mx-psi mx-psi added comp:prometheus Prometheus related issues data:metrics Metric related issues labels Jun 22, 2022
@github-actions
Contributor

github-actions bot commented Nov 9, 2022

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@github-actions github-actions bot added the Stale label Nov 9, 2022
@Aneurysm9
Member

Environment
I really think it doesn't matter.

I really think it does.

Are you on EKS using Fargate for compute? Have you not set CPU requests or limits? We have recently seen that configuration result in dropped metrics because only 0.25 vCPU is allocated by default.

Are you on ECS, running the collector as a sidecar and shipping it metrics that don't have sufficiently identifying resource attributes, without using the ECS resource detector plugin to ensure they're available? We've also seen that result in this error.

It's also possible that you are sending OTLP metrics with sufficiently identifying resource attributes, but your PRW exporter configuration doesn't include resource_to_telemetry_conversion, so all of those resource attributes are left behind and do not provide disambiguating dimensions once converted to PRW.

TL;DR: there are many ways to get "out of order sample" errors and without more information about your deployment environment we probably can't help you.
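
For reference, enabling that conversion is a single exporter setting; a minimal sketch based on the config pasted above (credentials still omitted):

exporters:
    prometheusremotewrite:
        endpoint: https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push
        headers:
            authorization: Basic #OMITTED
        resource_to_telemetry_conversion:
            # converts OTLP resource attributes into metric labels
            enabled: true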

@Aneurysm9 Aneurysm9 added question Further information is requested and removed Stale labels Nov 15, 2022
@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@github-actions github-actions bot added the Stale label Jan 16, 2023
@amoscatelli
Author

amoscatelli commented Jan 16, 2023


I am using AWS Beanstalk for both the collector and the applications being traced.
The applications are instrumented with agents (Java, Node SDK, plain JavaScript), mostly with minimal/default configuration; essentially I only configure the endpoints for traces/logs/spans.

The collector configuration is pasted above.

I am not aware of any explicit CPU limit ...

Are you suggesting that a too-slow vCPU may cause this issue?

Thank you for your support.

@jpettit

jpettit commented Feb 8, 2023

FWIW I'm seeing this as well on 0.68.0 (it's been happening since 0.55.0 IIRC) and it's a bit perplexing. Example logs:

2023-02-08T21:57:15.839Z	error	exporterhelper/queued_retry.go:394	Exporting failed. The error is not retryable. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = %!w(<nil>): user=123: err: out of order sample. timestamp=2023-02-08T21:57:15.513Z, series={__name__=\"target_info\", field1=\"value1\", source=\"otel-collector\"}\n", "dropped_items": 11}

@Aneurysm9
I've tried enabling resource_to_telemetry_conversion and it doesn't seem to fix the issue. However, I'm not confident we're attaching resource attributes that provide disambiguation to begin with. What would be the expectation around resource attributes for frontend metrics in this case?

@Aneurysm9
Member

@amoscatelli I don't see any resource identifiers on the metrics in your error message. Can you try enabling resource_to_telemetry on the PRW exporter? Assuming you're feeding this from an OTel SDK there should be a service name and instance ID in the resource. There have also been changes to how resource information is translated by the Prometheus exporters since v0.53.0, please try with a more recent release.

series={__name__=\"http_client_duration_bucket\", http_flavor=\"1.1\", http_method=\"PUT\", http_status_code=\"200\", job=\"iam-test\", le=\"500\", net_peer_name=\"169.254.169.254\"}
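
For illustration only: if the SDK resource also carried service.instance.id, the exported series would gain an instance label that disambiguates producers, roughly like this (the instance value here is hypothetical):

series={__name__="http_client_duration_bucket", http_flavor="1.1", http_method="PUT", http_status_code="200", instance="iam-test-host-1", job="iam-test", le="500", net_peer_name="169.254.169.254"}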

@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@github-actions github-actions bot added the Stale label Apr 11, 2023
@github-actions
Contributor

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned Jun 10, 2023