
receive: Create proposal for backfilling on remote write #2599

Closed
bwplotka opened this issue May 13, 2020 · 13 comments
Labels
component: receive, difficulty: hard, dont-go-stale, feature request/improvement, help wanted

Comments

@bwplotka
Member

bwplotka commented May 13, 2020

As per our discussion here, we decided to enable Remote Write backfilling.

Thoughts so far:

  • Use cases: clusters lagging behind (currently only ~2h of lag is allowed, same as with Cortex), clock skew between clusters, forgotten metrics, batch jobs with old datasets, monitoring a remote site, artificial data.
  • Something inefficient is OK for a start; we can aim for rare cases only, and maybe someday make it something continuous.
  • The easiest option would be to just open a new TSDB for this case (a rough sketch of that idea is below).

Help wanted, but the topic is extremely difficult. A design is a must-have up front.

cc @gouthamve @pracucci @brancz @RichiH, @tomwilkie @squat
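
For illustration only, a minimal sketch of the "open a new TSDB" idea, assuming the Go tsdb package from prometheus/prometheus (the Open signature shown matches roughly the 2.4x API and changes between versions); the directory, metric name, and timestamps are made up:

package main

import (
	"context"
	"time"

	kitlog "github.com/go-kit/log"
	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/tsdb"
)

func main() {
	// Open a dedicated TSDB just for late samples, separate from the live 2h head.
	opts := tsdb.DefaultOptions()
	db, err := tsdb.Open("/var/thanos/backfill-tsdb", kitlog.NewNopLogger(), nil, opts, nil)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// A fresh head accepts its first sample at any timestamp, so old data can go in.
	app := db.Appender(context.Background())
	ts := time.Now().Add(-72 * time.Hour).UnixMilli() // three days old
	if _, err := app.Append(0, labels.FromStrings("__name__", "backfilled_metric", "job", "batch"), ts, 42); err != nil {
		panic(err)
	}
	if err := app.Commit(); err != nil {
		panic(err)
	}
	// Blocks cut from this TSDB could then be shipped and uploaded like any other block.
}

The hard parts this issue asks for are everything around such a sketch: routing old remote-write samples into it, compaction, and upload to object storage.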

@vflopes

vflopes commented Jun 10, 2020

As the discussion prometheus/prometheus#535 evolved into something quite complex (though I see it isn't so easy; initially I thought this was a Prometheus feature that I just didn't know about 😄), would this enhancement be a Thanos ad hoc solution to backfill historical data into Prometheus?

I'm building an analytical application that reads customers' historical usage data to do some calculations and identify potential optimizations based on this historical set of metrics. We're writing some exporters, and we wouldn't mind writing any Prometheus file format directly, or pushing into a Thanos endpoint that can backfill data, but I see no docs on how to do this "manually". Any references here that could help me understand how to do this? Something using gRPC client streaming would be very nice ❤️! But that's an implementation detail we can handle ourselves.
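
For reference, the "manual" push is the standard Prometheus remote-write payload: a snappy-compressed prompb.WriteRequest POSTed to the receive endpoint. A rough sketch in Go (the receive URL is made up, and with today's behavior samples older than the roughly 2h head window are rejected, which is exactly why this issue exists):

package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"

	"github.com/gogo/protobuf/proto"
	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

func main() {
	// One series with one (old) sample, encoded in the remote-write protobuf format.
	wr := &prompb.WriteRequest{
		Timeseries: []prompb.TimeSeries{{
			Labels: []prompb.Label{
				{Name: "__name__", Value: "customer_usage_bytes"},
				{Name: "customer", Value: "example"},
			},
			Samples: []prompb.Sample{{
				Timestamp: time.Now().Add(-48 * time.Hour).UnixMilli(), // two days old
				Value:     42,
			}},
		}},
	}

	raw, err := proto.Marshal(wr)
	if err != nil {
		panic(err)
	}
	body := snappy.Encode(nil, raw)

	req, err := http.NewRequest(http.MethodPost,
		"http://thanos-receive.example.com:19291/api/v1/receive", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Encoding", "snappy")
	req.Header.Set("Content-Type", "application/x-protobuf")
	req.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	// Expect an "out of bounds" style rejection for old timestamps until backfilling lands.
	fmt.Println("status:", resp.Status)
}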

Anyway, just to be in sync with the community's priorities: we can live without backfilling for the next few months, as we're currently developing features and researching solutions, but can I expect the ETA to be, at the latest, the end of this year? If you need it, I can help implement this once you've decided what the architecture should be!

And thanks for proactively opening this issue; I think this is a really good way to start solving the backfill problem that the community wants so much, judging by the number of reactions and discussions in related threads on this subject.

@bwplotka
Member Author

bwplotka commented Jun 10, 2020

Nice! Welcome to the community 👋

So actually you might be interested in those discussions:

We can't promise anything, but it looks like some backfilling options are coming pretty soon! (: It's currently possible, but it requires some magic in Go (: (a rough sketch of that magic is below).
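
To give an idea of that "magic in Go": one can cut a TSDB block of historical samples with tsdb.NewBlockWriter and then upload the resulting block directory to the object store bucket. A rough sketch, assuming the Go tsdb package from prometheus/prometheus (exact signatures vary by version) with made-up paths and metric names:

package main

import (
	"context"
	"fmt"
	"time"

	kitlog "github.com/go-kit/log"
	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/tsdb"
)

func main() {
	ctx := context.Background()

	// A block writer that cuts persisted blocks of up to 24h into the given directory.
	bw, err := tsdb.NewBlockWriter(kitlog.NewNopLogger(), "/tmp/backfill-blocks", (24 * time.Hour).Milliseconds())
	if err != nil {
		panic(err)
	}
	defer bw.Close(ctx)

	// Append a month-old series, one sample per minute.
	app := bw.Appender(ctx)
	start := time.Now().Add(-30 * 24 * time.Hour)
	for i := 0; i < 100; i++ {
		ts := start.Add(time.Duration(i) * time.Minute).UnixMilli()
		if _, err := app.Append(0, labels.FromStrings("__name__", "historical_metric"), ts, float64(i)); err != nil {
			panic(err)
		}
	}
	if err := app.Commit(); err != nil {
		panic(err)
	}

	// Flush persists the block; its ULID-named directory can then be uploaded
	// to the Thanos object store bucket (plus Thanos metadata as needed).
	id, err := bw.Flush(ctx)
	if err != nil {
		panic(err)
	}
	fmt.Println("wrote block", id)
}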

@stale

stale bot commented Jul 10, 2020

Hello 👋 Looks like there was no activity on this issue for the last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity for the next week, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

@stale stale bot added the stale label Jul 10, 2020
@stale

stale bot commented Jul 17, 2020

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Jul 17, 2020
@itzg

itzg commented Jul 25, 2020

This use case is very important for us, since even a normal amount of upstream batch metrics ingestion will easily result in metrics that lag by more than an hour but less than two hours. The linked discussion in Cortex is very helpful, but ultimately it doesn't help until it is implemented here in Thanos :) Can we have this issue re-opened so it doesn't get forgotten?

@brancz
Member

brancz commented Jul 27, 2020

This has also been discussed at the Prometheus dev summit. Once Prometheus implements it, we should probably follow the same strategy in Thanos. I believe it's unlikely that it will happen in the receive component, but that's a detail.

@brancz brancz reopened this Jul 27, 2020
@stale stale bot removed the stale label Jul 27, 2020
@svenwltr

Hello. I'm not quite sure if this issue is exactly the right one, but coming from #2490 I have a use case. Please tell me if there is a ticket that fits better.

We are currently implementing anomaly detection with Thanos. It works well so far, but it is quite a complex query which needs data from up to 4 weeks ago. Due to the complexity, it is more readable and performs better when calculations are reused via intermediate recording rules. Of course, we do not have data from 4 weeks ago yet, because we only just started writing the recording rules. Therefore we would need to wait a full 4 weeks to be sure that the rules are working properly.

With backfilling we might be able to retrospectively calculate the recording rules and see the result immediately.

Of course we can inline most of the queries and skip the intermediate recording rules, but this requires a lot of resources. It is also impossible when using a counter which does not have a rate recording rule yet, because something like avg_over_time(rate(foobar[5m])[1w:]) does not work.

Here are example rule files:

The version without intermediate recording rules:

groups:
  - name: anomaly-detection-1m
    interval: 30s
    rules:
    - record: rule_action:wafsc_evaluations:rate1m:seasonal_prediction
      expr: >
        quantile(0.5,
          label_replace(
            avg_over_time(rule_action:wafsc_evaluations:rate1m[4h] offset 166h)
            + avg_over_time(rule_action:wafsc_evaluations:rate1m[1w])
            - avg_over_time(rule_action:wafsc_evaluations:rate1m[1w] offset 1w)
            , "offset", "1w", "", "")
          or
          label_replace(
            avg_over_time(rule_action:wafsc_evaluations:rate1m[4h] offset 334h)
            + avg_over_time(rule_action:wafsc_evaluations:rate1m[1w])
            - avg_over_time(rule_action:wafsc_evaluations:rate1m[1w] offset 2w)
            , "offset", "2w", "", "")
          or
          label_replace(
            avg_over_time(rule_action:wafsc_evaluations:rate1m[4h] offset 502h)
            + avg_over_time(rule_action:wafsc_evaluations:rate1m[1w])
            - avg_over_time(rule_action:wafsc_evaluations:rate1m[1w] offset 3w)
            , "offset", "3w", "", "")
        ) without (offset)

    - record: rule_action:wafsc_evaluations:rate1m:seasonal_prediction:z_score
      expr: >
        (
          rule_action:wafsc_evaluations:rate1m
          - rule_action:wafsc_evaluations:rate1m:seasonal_prediction
        ) / stddev_over_time(rule_action:wafsc_evaluations:rate1m[1w])

The optimized version:

groups:
  - name: anomaly-detection-1m
    interval: 30s
    rules:
    - record: rule_action:wafsc_evaluations:rate1m:avg_over_time_1w
      expr: avg_over_time(rule_action:wafsc_evaluations:rate1m[1w])

    - record: rule_action:wafsc_evaluations:rate1m:stddev_over_time_1w
      expr: stddev_over_time(rule_action:wafsc_evaluations:rate1m[1w])

    - record: rule_action:wafsc_evaluations:rate1m:z_score
      expr: >
        (
          rule_action:wafsc_evaluations:rate1m -
          rule_action:wafsc_evaluations:rate1m:avg_over_time_1w
        ) / rule_action:wafsc_evaluations:rate1m:stddev_over_time_1w

    - record: rule_action:wafsc_evaluations:rate1m:seasonal_prediction
      expr: >
        quantile(0.5,
          label_replace(
            avg_over_time(rule_action:wafsc_evaluations:rate1m[4h] offset 166h)
            + rule_action:wafsc_evaluations:rate1m:avg_over_time_1w
            - rule_action:wafsc_evaluations:rate1m:avg_over_time_1w offset 1w
            , "offset", "1w", "", "")
          or
          label_replace(
            avg_over_time(rule_action:wafsc_evaluations:rate1m[4h] offset 334h)
            + rule_action:wafsc_evaluations:rate1m:avg_over_time_1w
            - rule_action:wafsc_evaluations:rate1m:avg_over_time_1w offset 2w
            , "offset", "2w", "", "")
          or
          label_replace(
            avg_over_time(rule_action:wafsc_evaluations:rate1m[4h] offset 502h)
            + rule_action:wafsc_evaluations:rate1m:avg_over_time_1w
            - rule_action:wafsc_evaluations:rate1m:avg_over_time_1w offset 3w
            , "offset", "3w", "", "")
        ) without (offset)

    - record: rule_action:wafsc_evaluations:rate1m:seasonal_prediction:z_score
      expr: >
        (
          rule_action:wafsc_evaluations:rate1m
          - rule_action:wafsc_evaluations:rate1m:seasonal_prediction
        ) / rule_action:wafsc_evaluations:rate1m:stddev_over_time_1w

@brancz
Member

brancz commented Aug 3, 2020

There is work on retroactive rule evaluation happening in Prometheus already. Once that's figured out there, we'll probably implement the same mechanism in Thanos. We look at backfilling data more as a way to retrofit non-Prometheus data into the system; retroactive rule evaluation may make use of the same infrastructure, but it should be a first-class feature, at least eventually.

@stale

stale bot commented Sep 2, 2020

Hello 👋 Looks like there was no activity on this issue for the last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity for the next week, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

@stale stale bot added the stale label Sep 2, 2020
@stale

stale bot commented Sep 9, 2020

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Sep 9, 2020
@sepich
Contributor

sepich commented Sep 13, 2020

For those who found this issue via search: thanks to @bwplotka's and @dipack95's work on the Prometheus side, it is now possible to import custom data in the Prometheus text format into Thanos via the import command of https://github.com/sepich/thanos-kit/.
(It currently reads all of the imported data into memory; I hope the feature will be implemented better in the upstream version. Unfortunately we need this "yesterday", so we can't wait any longer ;)

@bwplotka bwplotka reopened this Mar 11, 2022
@stale stale bot removed the stale label Mar 11, 2022
@bwplotka
Member Author

bwplotka commented May 4, 2022

Currently, we have lots of backfill solutions that rely on block upload. We should invest in them and make them better.

BUT this issue is about remote-write backfill, which is in development and design by the amazing Grafana team! PTAL there.

@bwplotka bwplotka added the dont-go-stale Label for important issues which tells the stalebot not to close them label May 4, 2022
@yeya24
Contributor

yeya24 commented Nov 13, 2022

Since the OOO (out-of-order ingestion) feature has been merged into main, I will close this issue. Feel free to reopen it if you think it is not addressed.
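
For anyone landing here later: the OOO support builds on the TSDB's out-of-order time window. As a hedged illustration of the underlying knob in Go (the OutOfOrderTimeWindow option was added to the Prometheus TSDB in 2.39; how Receive exposes it as a flag may differ by Thanos version, so check the current docs):

package main

import (
	"time"

	kitlog "github.com/go-kit/log"
	"github.com/prometheus/prometheus/tsdb"
)

func main() {
	// Allow samples up to 1h older than the newest in-head sample to be ingested,
	// instead of rejecting everything outside the current ~2h head window.
	opts := tsdb.DefaultOptions()
	opts.OutOfOrderTimeWindow = time.Hour.Milliseconds()

	db, err := tsdb.Open("/var/thanos/receive-tsdb", kitlog.NewNopLogger(), nil, opts, nil)
	if err != nil {
		panic(err)
	}
	defer db.Close()
	// Appenders from db now accept moderately late samples arriving over remote write.
}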

@yeya24 yeya24 closed this as completed Nov 13, 2022