receive: Create proposal for backfiling on remote write #2599

bwplotka · 2020-05-13T15:13:01Z

As per our discussion here we decided to enable Remote Write backfilling.

Thoughts so far:

Use cases: Lagging behind clusters (currently only 2h is allowed, same with Cortex), clock skew for clusters, forgotten metrics, a batch job with old dataset, monitoring remote site. Artificial data.
Something inefficient for a start is ok. We can aim for rare cases only. Someday as something continuous.
Easiest would be to just open new TSDB on this case.

Help wanted, but the topic is extremely difficult. The design must-have apriori.

cc @gouthamve @pracucci @brancz @RichiH, @tomwilkie @squat

vflopes · 2020-06-10T12:34:34Z

As the discussion prometheus/prometheus#535 evolved to something quite complex (but I see it isn't so easy, initially I though that this was a Prometheus feature that I just didn't knew about 😄 ), would this enhancement be a Thanos ad hoc solution to backfill historical data into Prometheus?

I'm building an analytical application that reads historical usage data from customers to do some calculations and identify potential optimizations based on this historical data set of metrics. We're writing some exporters, and we wouldn't mind to write directly any Prometheus file format, or push into a Thanos endpoint that can backfill data, but I see no docs on how to do this "manually". Any references here that can help me to understand how to do this? Something using gRPC client streaming would be very nice ❤️! But this is an implementation detail that we can do by ourselves.

Anyway, just to be in sync with the community priorities, we can live without backfiling for the next months as we're currently developing features and researching solutions, but can I expect the ETA to be at most at the end of this year? If you guys need, I can help to implement once you've decided how the architecture should be!

And thanks for proactively open this issue, I think this is a really good way to start solving the backfill issue that the community want so much judging by the number of reacts and discussions in related threads to this subject.

bwplotka · 2020-06-10T17:22:08Z

Nice! Welcome to the community 👋

So actually you might be interested in those discussions:

Initial quick ideas for enabling some kind of backfill via remote read to TSDB (works for both Thanos, Cortex)
Discussion around extending Prom/Thanos for Analytic cases

We can't promise anything, but it looks like some backfilling options are coming pretty soon ! (: Currently it's possible, but it requires some magic in Go (:

stale · 2020-07-10T17:51:47Z

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale · 2020-07-17T20:04:14Z

Closing for now as promised, let us know if you need this to be reopened! 🤗

itzg · 2020-07-25T15:35:00Z

This use case is very important for us since even some normal amount of upstream metrics batch ingest will easily result in metrics with over an hour but within 2 hours of lag. The linked discussion in Cortex is very helpful, but ultimately doesn't help until it is implemented here in Thanos :) Can we have this issue re-opened so it doesn't get forgotten?

brancz · 2020-07-27T05:29:52Z

This has also been discussed at the Prometheus dev summit. Once Prometheus implements it, we should probably follow the same strategy in Thanos. I believe it's unlikely that that will happen in the receive component, but that's a detail I believe.

svenwltr · 2020-07-31T15:13:54Z

Hello. Not quite sure if this issue is exactly the right one, but coming from #2490 I have an use case. Please tell me, if there is an ticket that fits better.

We are currently implementing an anomaly detection with Thanos. It works well so far, but it is a quite complex query which needs data from up to 4 weeks ago. Due to the complexity it is more readable and has a better performance when reusing calculations by intermediate recording rules. Of course we do not have data from 4 weeks ago yet, because we just started to write the recording rules. Therefore we would need to wait full 4 weeks to be sure that the rules are working properly.

With backfilling we might be able to retrospectively calculate the recording rules and see the result immediately.

Of course we can inline most of the queries and skip the immediate recording rules, but this requires a lot of resources. Also it is impossible, when using a counter which does not have a rate recording rule yet, because something like avg_over_time(rate(foobar[5m])[1w] does not work.

Here is an example rule file:

The version without intermediate recording rules

groups:
  - name: anomaly-detection-1m
    interval: 30s
    rules:
    - record: rule_action:wafsc_evaluations:rate1m:seasonal_prediction
      expr: >
        quantile(0.5,
          label_replace(
            avg_over_time(rule_action:wafsc_evaluations:rate1m[4h] offset 166h)
            + avg_over_time(rule_action:wafsc_evaluations:rate1m[1w])
            - avg_over_time(rule_action:wafsc_evaluations:rate1m[1w] offset 1w)
            , "offset", "1w", "", "")
          or
          label_replace(
            avg_over_time(rule_action:wafsc_evaluations:rate1m[4h] offset 334h)
            + avg_over_time(rule_action:wafsc_evaluations:rate1m[1w])
            - avg_over_time(rule_action:wafsc_evaluations:rate1m[1w] offset 2w)
            , "offset", "2w", "", "")
          or
          label_replace(
            avg_over_time(rule_action:wafsc_evaluations:rate1m[4h] offset 502h)
            + avg_over_time(rule_action:wafsc_evaluations:rate1m[1w])
            - avg_over_time(rule_action:wafsc_evaluations:rate1m[1w] offset 3w)
            , "offset", "3w", "", "")
        ) without (offset)

    - record: rule_action:wafsc_evaluations:rate1m:seasonal_prediction:z_score
      expr: >
        (
          rule_action:wafsc_evaluations:rate1m
          - rule_action:wafsc_evaluations:rate1m:seasonal_prediction
        ) / stddev_over_time(rule_action:wafsc_evaluations:rate1m[1w])

The optimized version

groups:
  - name: anomaly-detection-1m
    interval: 30s
    rules:
    - record: rule_action:wafsc_evaluations:rate1m:avg_over_time_1w
      expr: avg_over_time(rule_action:wafsc_evaluations:rate1m[1w])

    - record: rule_action:wafsc_evaluations:rate1m:stddev_over_time_1w
      expr: stddev_over_time(rule_action:wafsc_evaluations:rate1m[1w])

    - record: rule_action:wafsc_evaluations:rate1m:z_score
      expr: >
        (
          rule_action:wafsc_evaluations:rate1m -
          rule_action:wafsc_evaluations:rate1m:avg_over_time_1w
        ) / rule_action:wafsc_evaluations:rate1m:stddev_over_time_1w

    - record: rule_action:wafsc_evaluations:rate1m:seasonal_prediction
      expr: >
        quantile(0.5,
          label_replace(
            avg_over_time(rule_action:wafsc_evaluations:rate1m[4h] offset 166h)
            + rule_action:wafsc_evaluations:rate1m:avg_over_time_1w
            - rule_action:wafsc_evaluations:rate1m:avg_over_time_1w offset 1w
            , "offset", "1w", "", "")
          or
          label_replace(
            avg_over_time(rule_action:wafsc_evaluations:rate1m[4h] offset 334h)
            + rule_action:wafsc_evaluations:rate1m:avg_over_time_1w
            - rule_action:wafsc_evaluations:rate1m:avg_over_time_1w offset 2w
            , "offset", "2w", "", "")
          or
          label_replace(
            avg_over_time(rule_action:wafsc_evaluations:rate1m[4h] offset 502h)
            + rule_action:wafsc_evaluations:rate1m:avg_over_time_1w
            - rule_action:wafsc_evaluations:rate1m:avg_over_time_1w offset 3w
            , "offset", "3w", "", "")
        ) without (offset)

    - record: rule_action:wafsc_evaluations:rate1m:seasonal_prediction:z_score
      expr: >
        (
          rule_action:wafsc_evaluations:rate1m
          - rule_action:wafsc_evaluations:rate1m:seasonal_prediction
        ) / rule_action:wafsc_evaluations:rate1m:stddev_over_time_1w

brancz · 2020-08-03T08:43:06Z

There is work on retroactive rule evaluations happening on Prometheus already. Once that's figured out there we'll probably implement the same mechanism in Thanos. We look at backfilling data more as a thing for retrofitting non Prometheus data into the system, retroactive rule evaluations may make use of the same infrastructure but should be a first class feature, at least eventually.

stale · 2020-09-02T10:48:00Z

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale · 2020-09-09T19:06:53Z

Closing for now as promised, let us know if you need this to be reopened! 🤗

sepich · 2020-09-13T08:18:59Z

For those who found this issue via search: thanks to @bwplotka and @dipack95 work on prometheus side it is now possible to import custom data in prometheus text format into Thanos via https://github.com/sepich/thanos-kit/ import command.
(It is currently reading all the imported data to memory, hope the feature would be implemented better in upstream version. Unfortunately we need this "yesterday" so can't wait anymore ;)

bwplotka · 2022-05-04T12:49:41Z

Currently, we have lots of backfill solutions that rely on block upload. We should invest in them and make them better.

BUT, this issue is for remote write backfill, which is in dev and design by the amazing Grafana team!. PTAL there.

yeya24 · 2022-11-13T03:48:32Z

Since the OOO feature has been merged into main, I will close this issue. Feel free to reopen if you think it is not addressed.

bwplotka added feature request/improvement difficulty: hard help wanted labels May 13, 2020

bwplotka mentioned this issue May 23, 2020

Receive: option to backfill data #2490

Closed

kakkoyun added the component: receive label Jun 10, 2020

stale bot added the stale label Jul 10, 2020

stale bot closed this as completed Jul 17, 2020

brancz reopened this Jul 27, 2020

stale bot removed the stale label Jul 27, 2020

stale bot added the stale label Sep 2, 2020

stale bot closed this as completed Sep 9, 2020

bwplotka reopened this Mar 11, 2022

stale bot removed the stale label Mar 11, 2022

bwplotka added the dont-go-stale Label for important issues which tells the stalebot not to close them label May 4, 2022

yeya24 mentioned this issue Oct 30, 2022

Receive: Add parameter to set out-of-order time window #5839

Merged

2 tasks

yeya24 closed this as completed Nov 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

receive: Create proposal for backfiling on remote write #2599

receive: Create proposal for backfiling on remote write #2599

bwplotka commented May 13, 2020 •

edited

Loading

vflopes commented Jun 10, 2020

bwplotka commented Jun 10, 2020 •

edited

Loading

stale bot commented Jul 10, 2020

stale bot commented Jul 17, 2020

itzg commented Jul 25, 2020 •

edited

Loading

brancz commented Jul 27, 2020

svenwltr commented Jul 31, 2020

The version without intermediate recording rules

The optimized version

brancz commented Aug 3, 2020

stale bot commented Sep 2, 2020

stale bot commented Sep 9, 2020

sepich commented Sep 13, 2020

bwplotka commented May 4, 2022

yeya24 commented Nov 13, 2022

receive: Create proposal for backfiling on remote write #2599

receive: Create proposal for backfiling on remote write #2599

Comments

bwplotka commented May 13, 2020 • edited Loading

vflopes commented Jun 10, 2020

bwplotka commented Jun 10, 2020 • edited Loading

stale bot commented Jul 10, 2020

stale bot commented Jul 17, 2020

itzg commented Jul 25, 2020 • edited Loading

brancz commented Jul 27, 2020

svenwltr commented Jul 31, 2020

The version without intermediate recording rules

The optimized version

brancz commented Aug 3, 2020

stale bot commented Sep 2, 2020

stale bot commented Sep 9, 2020

sepich commented Sep 13, 2020

bwplotka commented May 4, 2022

yeya24 commented Nov 13, 2022

bwplotka commented May 13, 2020 •

edited

Loading

bwplotka commented Jun 10, 2020 •

edited

Loading

itzg commented Jul 25, 2020 •

edited

Loading