[exporter/prometheusremotewrite] - out of order sample #11438

Closed
amoscatelli opened this issue Jun 21, 2022 · 8 comments
Labels
bug, closed as inactive, comp:prometheus, data:metrics, question, Stale

Comments

@amoscatelli

Describe the bug
Under load, the prometheusremotewrite exporter (in the otel/opentelemetry-collector-contrib:0.53.0 Docker image) occasionally drops metrics (from different source applications) because of an "out of order sample" error.

This seems similar to:
open-telemetry/opentelemetry-collector#2315

Steps to reproduce
Run the OTel Collector Docker image.
Configure the Collector to send metrics to a remote Prometheus using prometheusremotewrite.
Send metrics to the Collector from applications.

What did you expect to see?
No errors and no dropped metrics.

What did you see instead?
Errors with dropped metrics.

What version did you use?
opentelemetry-collector-contrib:0.53.0

What config did you use?

receivers:
  otlp:
    protocols:
      http:
        cors:
          allowed_origins: [
            #OMITTED
          ]

processors:
    memory_limiter:
        check_interval: 5s
        limit_mib: 448
        spike_limit_mib: 64
    batch:
        send_batch_size: 48
        send_batch_max_size: 48
        timeout: 15s

exporters:
    otlp:
        endpoint: tempo-eu-west-0.grafana.net:443
        headers:
            authorization: Basic #OMITTED
            
    prometheusremotewrite:
        endpoint: https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push
        headers:
            authorization: Basic #OMITTED
            
    loki:
        endpoint: https://logs-prod-eu-west-0.grafana.net/loki/api/v1/push
        headers:
            authorization: Basic #OMITTED
        format: json
        labels:
            attributes:
                container_name: ""
                source: ""
            resource:
                host.name: "hostname"

extensions:
    health_check:

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

Environment
I really think it doesn't matter.

Additional context

Some logs:

2022-06-21T15:25:11.648Z        error   exporterhelper/queued_retry.go:183      Exporting failed. The error is not retryable. Dropping data.    {"kind": "exporter", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = <nil>: user=470820: err: out of order sample. timestamp=2022-06-21T15:25:10.988Z, series={__name__=\"http_client_duration_bucket\", http_flavor=\"1.1\", http_method=\"PUT\", http_status_code=\"200\", job=\"iam-test\", le=\"500\", net_peer_name=\"169.254.169.254\"}\n", "dropped_items": 48}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/queued_retry.go:183
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/metrics.go:132
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/queued_retry_inmemory.go:119
go.opentelemetry.io/collector/exporter/exporterhelper/internal.consumerFunc.consume
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/internal/bounded_memory_queue.go:82
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func2
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/internal/bounded_memory_queue.go:69
2022-06-21T15:42:25.297Z        error   exporterhelper/queued_retry.go:183      Exporting failed. The error is not retryable. Dropping data.    {"kind": "exporter", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = <nil>: user=470820: err: out of order sample. timestamp=2022-06-21T15:42:19.612Z, series={__name__=\"process_runtime_jvm_memory_committed\", job=\"optoplus-services-cn\", pool=\"G1 Old Gen\", type=\"heap\"}\n", "dropped_items": 48}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/queued_retry.go:183
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/metrics.go:132
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/queued_retry_inmemory.go:119
go.opentelemetry.io/collector/exporter/exporterhelper/internal.consumerFunc.consume
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/internal/bounded_memory_queue.go:82
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func2
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/internal/bounded_memory_queue.go:69
2022-06-21T15:42:39.240Z        error   exporterhelper/queued_retry.go:183      Exporting failed. The error is not retryable. Dropping data.    {"kind": "exporter", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = <nil>: user=470820: err: out of order sample. timestamp=2022-06-21T15:42:19.612Z, series={__name__=\"process_runtime_jvm_memory_init\", job=\"optoplus-services-cn\", pool=\"CodeHeap 'non-profiled nmethods'\", type=\"non_heap\"}\n", "dropped_items": 48}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/queued_retry.go:183
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/metrics.go:132
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/queued_retry_inmemory.go:119
go.opentelemetry.io/collector/exporter/exporterhelper/internal.consumerFunc.consume
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/internal/bounded_memory_queue.go:82
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func2
        go.opentelemetry.io/collector@v0.53.0/exporter/exporterhelper/internal/bounded_memory_queue.go:69
@amoscatelli amoscatelli added the bug Something isn't working label Jun 21, 2022
@mx-psi mx-psi changed the title prometheusremotewrite - out of order sample [exporter/prometheusremotewrite] - out of order sample Jun 22, 2022
@mx-psi mx-psi transferred this issue from open-telemetry/opentelemetry-collector Jun 22, 2022
@mx-psi mx-psi added comp:prometheus Prometheus related issues data:metrics Metric related issues labels Jun 22, 2022
@github-actions
Contributor

github-actions bot commented Nov 9, 2022

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@github-actions github-actions bot added the Stale label Nov 9, 2022
@Aneurysm9
Member

Environment
I really think it doesn't matter.

I really think it does.

Are you on EKS using Fargate for compute? Have you not set CPU requests or limits? We have recently seen that configuration result in dropped metrics because only 0.25 vCPU is allocated by default.

Are you on ECS, running the collector as a sidecar and shipping it metrics that don't have sufficiently identifying resource attributes, without using the ECS resource detector plugin to ensure they're available? We've also seen that result in this error.

It's also possible that you are sending OTLP metrics with sufficiently identifying resource attributes, but your PRW exporter configuration doesn't include resource_to_telemetry_conversion, so all of those resource attributes are left behind and do not provide disambiguating dimensions once converted to PRW.

TL;DR: there are many ways to get "out of order sample" errors and without more information about your deployment environment we probably can't help you.
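
For reference, enabling that conversion is a single exporter setting; a minimal sketch based on the config pasted above (credentials still omitted):

exporters:
    prometheusremotewrite:
        endpoint: https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push
        headers:
            authorization: Basic #OMITTED
        resource_to_telemetry_conversion:
            # converts OTLP resource attributes into metric labels
            enabled: true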

@Aneurysm9 Aneurysm9 added question Further information is requested and removed Stale labels Nov 15, 2022
@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@github-actions github-actions bot added the Stale label Jan 16, 2023
@amoscatelli
Author

amoscatelli commented Jan 16, 2023


I am using AWS Beanstalk for both the collector and the applications being traced.
The applications are instrumented with agents (Java, Node SDK, plain JavaScript), mostly with minimal/default configuration; essentially I only configure the endpoints for traces/logs/spans.

The collector configuration is pasted above.

I am not aware of any explicit CPU limit ...

Are you suggesting that a too-slow vCPU may cause this issue?

Thank you for your support.

@jpettit

jpettit commented Feb 8, 2023

FWIW I'm seeing this as well on 0.68.0 (it's been happening since 0.55.0 IIRC) and it's a bit perplexing. Example logs:

2023-02-08T21:57:15.839Z	error	exporterhelper/queued_retry.go:394	Exporting failed. The error is not retryable. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = %!w(<nil>): user=123: err: out of order sample. timestamp=2023-02-08T21:57:15.513Z, series={__name__=\"target_info\", field1=\"value1\", source=\"otel-collector\"}\n", "dropped_items": 11}

@Aneurysm9
I've tried enabling resource_to_telemetry_conversion and it doesn't seem to fix the issue. However, I'm not confident we're attaching resource attributes that provide disambiguation to begin with. What would be the expectation around resource attributes for frontend metrics in this case?

@Aneurysm9
Member

@amoscatelli I don't see any resource identifiers on the metrics in your error message. Can you try enabling resource_to_telemetry on the PRW exporter? Assuming you're feeding this from an OTel SDK there should be a service name and instance ID in the resource. There have also been changes to how resource information is translated by the Prometheus exporters since v0.53.0, please try with a more recent release.

series={__name__=\"http_client_duration_bucket\", http_flavor=\"1.1\", http_method=\"PUT\", http_status_code=\"200\", job=\"iam-test\", le=\"500\", net_peer_name=\"169.254.169.254\"}
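
For illustration only: if the SDK resource also carried service.instance.id, the exported series would gain an instance label that disambiguates producers, roughly like this (the instance value here is hypothetical):

series={__name__="http_client_duration_bucket", http_flavor="1.1", http_method="PUT", http_status_code="200", instance="iam-test-host-1", job="iam-test", le="500", net_peer_name="169.254.169.254"}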

@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@github-actions github-actions bot added the Stale label Apr 11, 2023
@github-actions
Contributor

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned Jun 10, 2023