exporter/prometheusremotewriter: sort Sample by Timestamp to avoid out of order errors #2941
Conversation
Codecov Report
@@            Coverage Diff            @@
##             main    #2941   +/-   ##
=======================================
  Coverage   91.66%   91.66%
=======================================
  Files         312      312
  Lines       15312    15317    +5
=======================================
+ Hits        14036    14041    +5
  Misses        870      870
  Partials      406      406
Continue to review full report at Codecov.
Force-pushed from 4c6e148 to 3f40ffc
Force-pushed from 3f40ffc to 8b97422
Thanks much for this PR! I've been testing the exporter with this change and can't reproduce the bug anymore. I'm using relabel_configs to include the pod name in the samples, to avoid the duplicates that might be coming from the replicas of my app.
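As an aside, here is a hypothetical relabel_configs sketch of that approach (the job name and service discovery setup are assumptions, not the actual configuration used above): copying the discovered Kubernetes pod name into a label keeps samples from different replicas as distinct series.

```yaml
scrape_configs:
  - job_name: my-app                 # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Copy the discovered pod name into a "pod" label so samples from
      # different replicas do not collide as duplicate series.
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
        action: replace
```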
I spoke too soon; filed #2949 as a follow-up.
Ensures that before a prompb.WriteRequest is created, the TimeSeries
it contains have their Sample values sorted chronologically by
timestamp, so that Prometheus does not report out-of-order errors.
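For illustration, a minimal sketch of the idea (not the exporter's actual code), assuming the exporter accumulates series in a map[string]*prompb.TimeSeries before assembling the request; the sortSamples helper and the map layout are hypothetical.

```go
package main

import (
	"fmt"
	"sort"

	"github.com/prometheus/prometheus/prompb"
)

// sortSamples orders each TimeSeries' samples by ascending Timestamp so
// the remote-write receiver does not reject them as out of order.
func sortSamples(tsMap map[string]*prompb.TimeSeries) {
	for _, ts := range tsMap {
		sort.Slice(ts.Samples, func(i, j int) bool {
			return ts.Samples[i].Timestamp < ts.Samples[j].Timestamp
		})
	}
}

func main() {
	tsMap := map[string]*prompb.TimeSeries{
		`up{instance="a"}`: {
			Samples: []prompb.Sample{
				{Value: 1, Timestamp: 3000},
				{Value: 1, Timestamp: 1000},
				{Value: 0, Timestamp: 2000},
			},
		},
	}
	sortSamples(tsMap)
	// The samples are now chronological and safe to place into a
	// prompb.WriteRequest.
	fmt.Println(tsMap[`up{instance="a"}`].Samples)
}
```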
Thanks to @rakyll for a reproducer and for diagnosing the problem, which
helped distill the issue from a very complex setup that required
expensive Kubernetes clusters with many replicas; essentially, the
problem became more apparent as the number of TimeSeries grew.
It is important to note that the presence of such a bug signifies that
when a large number of replicas are being scraped, scraping stalls and
takes a long time, which means that targets scraped in a round-robin
fashion experience staleness. This might be yet another reason for
setups to adopt a push model rather than scraping endpoints.
Fixes #2315
Fixes open-telemetry/prometheus-interoperability-spec#10