
Prometheus: unexpected out-of-order errors when writing metrics from an Alloy cluster with exemplars and native histograms #1117

Open
thampiotr opened this issue Jun 24, 2024 · 2 comments
Labels
bug (Something isn't working), needs-attention

Comments

@thampiotr
Contributor

What's wrong?

When remote-writing metrics with the prometheus.remote_write component from a cluster of Alloy instances to a backend that has out-of-order (OOO) ingestion enabled (with a sufficiently large time window), and with exemplars and/or native histograms enabled, users can observe an increase in remote_write errors that correlates with restarts of, or the addition of new instances to, the Alloy cluster.

Upon closer inspection, the errors are of the form:

ts=2024-06-19T17:38:16.946175Z level=error msg="non-recoverable error" ... url=(...) count=5 exemplarCount=1 err="server returned HTTP status 400 Bad Request: send data to ingesters: failed pushing to ingester ...: user=...: err: out of order exemplar. timestamp=2024-06-19T17:35:15Z, series=a_test_total{...}, exemplar={...}"

Or a similar error mentioning an out-of-order sample; for example, with a Mimir backend:

server returned HTTP status 400 Bad Request: send data to ingesters: failed pushing to ingester ingester-zone-a-9: user=9960: the sample has been rejected because another sample with a more recent timestamp has already been ingested and out-of-order samples are not allowed (err-mimir-sample-out-of-order).

If you look at the details of the "sample-out-of-order" error above and check which metrics are failing, they turn out to be samples for native histograms.

Root cause and additional findings

I have been able to verify that the issue relates almost exclusively to exemplars and native histograms: when I disabled them, the errors went away and the success rate went to 100% in a cluster writing over 1.5 million samples per second.

After discussing with engineers closer to the topic, we believe the root cause is missing upstream support for out-of-order ingestion of exemplars and native histograms (tracked in the upstream Prometheus issues linked in the comment below).

Note that in practice some backends will still process all the regular samples and only drop the exemplars and native histograms, but the remote-write client will report incorrect error counts because the protocol surfaces limited information when a batch is only partially successful. See this issue in Prometheus for details.

Possible workarounds

Users who want to avoid these errors are advised to disable exemplars and native histograms until support for OOO ingestion of these is added to Prometheus.
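As a minimal sketch of what that could look like on the Alloy side, assuming the send_exemplars and send_native_histograms attributes of the prometheus.remote_write endpoint block (the URL is a placeholder):

```river
// Sketch: a prometheus.remote_write component that does not send exemplars or
// native histograms, as a workaround for the OOO errors. The endpoint URL is a
// placeholder and should be replaced with the real remote_write endpoint.
prometheus.remote_write "default" {
  endpoint {
    url = "https://prometheus.example.com/api/v1/write"

    // Do not send exemplars with remote_write requests.
    send_exemplars = false

    // Do not send native histogram samples.
    send_native_histograms = false
  }
}
```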

We could consider splitting exemplars and native histograms into a separate pipeline, but that would require some work in Alloy to support such filtering, and it would only be needed until the upstream issues are resolved.

Steps to reproduce

  • Have a Prometheus backend set up with OOO ingestion enabled and a reasonably large window (e.g. 10 minutes)
  • Have Alloy scrape and remote_write with exemplars and native histograms enabled and present (a minimal configuration sketch is shown after this list)
  • Run Alloy in a cluster and add / restart instances
  • Observe batches of remote_write requests failing with OOO errors relating to either exemplar or native histogram samples. Note that the time delay is smaller than the OOO window configured in the backend, indicating that something went wrong. This can be seen as a low success rate on the official Alloy dashboard or in the error logs.
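A minimal sketch of the Alloy side of this setup, with a placeholder scrape target and remote_write URL (the backend's OOO window has to be configured separately, and scraping native histograms may need additional scrape settings depending on the Alloy version):

```river
// Sketch of an Alloy pipeline for reproducing the issue. The target address
// and endpoint URL are placeholders. Clustering must also be enabled when
// running Alloy (e.g. with the --cluster.enabled flag).
prometheus.scrape "test_app" {
  targets    = [{"__address__" = "test-app:8080"}]
  forward_to = [prometheus.remote_write.default.receiver]

  // Distribute scrape targets across the instances of the Alloy cluster.
  clustering {
    enabled = true
  }
}

prometheus.remote_write "default" {
  endpoint {
    url = "https://prometheus.example.com/api/v1/write"

    // Send exemplars and native histograms, which is what triggers the
    // out-of-order errors described above.
    send_exemplars         = true
    send_native_histograms = true
  }
}
```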

System information

Likely every OS

Software version

v1.1.1 and likely every previous version

Configuration

No response

Logs

No response

@thampiotr added the bug (Something isn't working) label on Jun 24, 2024
@thampiotr
Contributor Author

Note: this issue can be used to track the status of the upstream issues (prometheus/prometheus#11220 and prometheus/prometheus#13577) - there's not much we can do right now to mitigate this.

Contributor

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!
