Fix the “out of order samples” and “duplicate sample for timestamp” issues #10
Comments
I think this is due to the exporter sending samples in parallel without any sharding. The Prometheus remote write code goes to some lengths to ensure samples for a specific series are delivered in order, while allowing parallelism between series - see the queue manager code: https://github.com/prometheus/prometheus/blob/6bc67805332dad9345f9d11069f321203bf89f8d/storage/remote/queue_manager.go#L883 The easiest way to fix this would be to reuse all of that code, which also guarantees the exporter benefits as we improve it upstream. In particular, we're in the process of redesigning and rewriting it so that we can guarantee metadata, exemplars and histograms are all delivered atomically.
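For illustration, here is a minimal sketch of the sharding idea described above (simplified stand-in code, not the actual queue_manager.go implementation): each series is routed to a shard by hashing its label set, so samples for one series are always sent by the same worker and stay in order, while different series are still flushed in parallel.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

type sample struct {
	series    string // canonical label-set string, e.g. `up{job="api"}`
	timestamp int64
	value     float64
}

// shardFor maps a series to a fixed shard, so all of its samples go
// through the same queue and keep their order.
func shardFor(series string, numShards int) int {
	h := fnv.New32a()
	h.Write([]byte(series))
	return int(h.Sum32()) % numShards
}

func main() {
	const numShards = 4
	queues := make([]chan sample, numShards)
	var wg sync.WaitGroup
	for i := range queues {
		queues[i] = make(chan sample, 16)
		wg.Add(1)
		go func(id int, q <-chan sample) { // one sender goroutine per shard
			defer wg.Done()
			for s := range q {
				// A real exporter would batch these into a remote-write request here.
				fmt.Printf("shard %d sending %s ts=%d val=%g\n", id, s.series, s.timestamp, s.value)
			}
		}(i, queues[i])
	}

	samples := []sample{
		{`up{job="a"}`, 1, 1}, {`up{job="b"}`, 1, 1},
		{`up{job="a"}`, 2, 1}, {`up{job="b"}`, 2, 0},
	}
	for _, s := range samples {
		// The same series always hashes to the same shard, preserving per-series order.
		queues[shardFor(s.series, numShards)] <- s
	}
	for _, q := range queues {
		close(q)
	}
	wg.Wait()
}
```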
We are producing samples in non-chronological order due to the bug explained at open-telemetry/opentelemetry-collector#2315 (comment).
…t of order errors (#2941) Ensures that before a prompb.WriteRequest is created, the contained TimeSeries have their Sample values sorted chronologically by timestamp, to avoid the out of order errors reported by Prometheus. Thanks to @rakyll for a reproducer and for diagnosing the problem, which helped distill the issue from a very complex setup that required expensive Kubernetes clusters with many replicas; essentially, the problem became more apparent as the number of TimeSeries grew. It is important to note that with this bug, when a large number of replicas are being scraped, scraping stalls and takes a long time, so targets scraped in a round-robin fashion experience staleness when there are many of them. This is yet another reason for setups to adopt a push model as opposed to scraping endpoints. Fixes #2315 Fixes open-telemetry/prometheus-interoperability-spec#10
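The gist of the fix is to sort each series' samples by timestamp before the write request is assembled. A minimal sketch of that approach follows, using simplified local types standing in for prompb.TimeSeries and prompb.Sample (field names and units mirror prompb, but this is illustrative, not the actual exporter code):

```go
package main

import (
	"fmt"
	"sort"
)

type sample struct {
	value     float64
	timestamp int64 // milliseconds since epoch, as in prompb.Sample
}

type timeSeries struct {
	labels  map[string]string
	samples []sample
}

// sortSamples orders every series' samples chronologically in place,
// so the receiver never sees a sample older than one it already ingested.
func sortSamples(series []timeSeries) {
	for i := range series {
		s := series[i].samples
		sort.Slice(s, func(a, b int) bool { return s[a].timestamp < s[b].timestamp })
	}
}

func main() {
	ts := []timeSeries{{
		labels:  map[string]string{"__name__": "http_requests_total"},
		samples: []sample{{10, 3000}, {5, 1000}, {7, 2000}},
	}}
	sortSamples(ts)
	fmt.Println(ts[0].samples) // [{5 1000} {7 2000} {10 3000}]
}
```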
Fix the “out of order samples” and “duplicate sample for timestamp” issues in the Prometheus RW exporter.
OTel Components:
https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/prometheusremotewriteexporter
https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/awsprometheusremotewriteexporter
Also see open-telemetry/opentelemetry-collector#2315