re:dash should support incremental scheduled jobs #35

Closed
fbertsch opened this issue Feb 3, 2017 · 6 comments · Fixed by #339

Comments

@fbertsch

fbertsch commented Feb 3, 2017

A lot of our data in Presto is partitioned by submission_date. When we run queries over this data, the result is often partitioned by submission_date as well (for example, MAU/DAU/WAU). That means the results for historical data don't change, but we recompute them anyway.

Incremental jobs would solve this. The results from yesterday would be joined with the results from today. The query would have to have $SUBMISSION_DATE in it somewhere, which re:dash would automatically fill in with $YESTERDAY. It would be up to the query writer to ensure correctness; i.e. that the query is idempotent.
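A minimal sketch of the idea above, not re:dash code. The function names, the `YYYYMMDD` date format, and the row layout (dicts with a `submission_date` key) are all assumptions made for illustration. It fills in `$SUBMISSION_DATE` with yesterday's date and merges a new day's rows into the stored ones so that re-running the same day overwrites rather than duplicates:

```python
from datetime import date, timedelta
from string import Template


def render_incremental_query(query_text, today):
    """Fill $SUBMISSION_DATE with yesterday's date (YYYYMMDD is an assumed format)."""
    yesterday = today - timedelta(days=1)
    # safe_substitute leaves any other $-prefixed text in the query untouched.
    return Template(query_text).safe_substitute(
        SUBMISSION_DATE=yesterday.strftime("%Y%m%d")
    )


def merge_results(stored_rows, new_rows, key="submission_date"):
    """Merge new_rows into stored_rows, one partition at a time.

    Stored rows whose partition also appears in new_rows are dropped first,
    so running the same day twice gives the same merged result.
    """
    new_keys = {row[key] for row in new_rows}
    kept = [row for row in stored_rows if row[key] not in new_keys]
    return sorted(kept + new_rows, key=lambda row: row[key])
```

With a query such as `SELECT ... WHERE submission_date = '$SUBMISSION_DATE'`, a run on Feb 3, 2017 would query the `20170202` partition. The merge step only guards against duplicate rows; the query writer still has to make sure the query itself is idempotent.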

@washort

washort commented Feb 28, 2017

In #28, Arik mentioned that an upcoming feature will allow queries based on existing query results. Let's wait until that lands before digging into this.

@rafrombrc rafrombrc added this to the 6 milestone Jun 21, 2017
@alison985 alison985 removed this from the 6 milestone Jul 13, 2017
@rafrombrc rafrombrc added this to the 12 milestone Nov 29, 2017
@rafrombrc rafrombrc added ready and removed blocked labels Nov 29, 2017
@rafrombrc rafrombrc modified the milestones: 12, 13 Jan 10, 2018
@washort

washort commented Jan 17, 2018

I misunderstood this - I thought outputs from previous queries were needed as inputs to new queries. Just appending new results to old ones should be simple -- the major question is whether we need to handle removing results or whether they should just accumulate indefinitely.

@fbertsch
Author

@washort we should certainly remove old results. Here are two options:

  • Require a LIMIT clause on every incremental query. This would also require users to sort the results, but would be less work on the re:dash side.
  • Include a variable configuration for every incremental query that limits results to the N most recent runs. The re:dash DB would then need to keep a monotonically increasing id with every result set, in order to just include the most recent N (and remove those prior).
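The second option could look roughly like the sketch below. It is an in-memory stand-in rather than the actual re:dash schema; the class name `IncrementalResultStore` and the `max_runs` setting are hypothetical. Each appended result set gets a monotonically increasing id, and anything older than the N most recent runs is dropped:

```python
from itertools import count


class IncrementalResultStore:
    """In-memory stand-in for stored result sets (hypothetical, not the re:dash DB)."""

    def __init__(self, max_runs):
        if max_runs < 1:
            raise ValueError("max_runs must be at least 1")
        self.max_runs = max_runs
        self._next_id = count(1)  # monotonically increasing run id
        self._runs = []           # list of (run_id, rows), oldest first

    def append(self, rows):
        """Store one run's rows and prune runs beyond the N most recent."""
        run_id = next(self._next_id)
        self._runs.append((run_id, rows))
        self._runs = self._runs[-self.max_runs:]
        return run_id

    def rows(self):
        """Return the rows of all retained runs, oldest run first."""
        return [row for _, run_rows in self._runs for row in run_rows]
```

Because pruning is keyed on the run id rather than on the query text, this does not require a LIMIT clause or any particular ordering in the query itself.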

@washort

washort commented Jan 24, 2018

I don't see how a LIMIT clause would help, since that would just affect the results from a single run. An N-most-recent setting for the scheduler seems reasonable and I'll start there.

@fbertsch
Author

@washort some other thoughts that came up regarding this feature:

  • How will it deal with failure/rescheduling, when the new day's query fails? When will the query hard fail (after e.g. 3 failures), and will it email the job owner?
  • How will the data be backfilled? For example: I create a new query for Firefox DAU, but the incremental query is only for today. I need the past 6 months to show as well. Would this be possible?

Not all of these need to be answered right now; we can improve them incrementally.
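One possible shape for the two questions above, as a sketch under stated assumptions: `run_query` is a hypothetical callable that runs the incremental query for one day, and the limit of 3 failures is just the example figure from the first bullet. Backfill is treated as running the same incremental query once per historical day:

```python
from datetime import date, timedelta


def dates_to_backfill(start, end):
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def run_with_retries(run_query, day, max_failures=3):
    """Call run_query(day) until it succeeds or has failed max_failures times.

    Returns (rows, None) on success, or (None, last_error) on a hard failure,
    at which point a caller could notify the job owner.
    """
    last_error = None
    for _ in range(max_failures):
        try:
            return run_query(day), None
        except Exception as exc:  # a real scheduler would catch narrower errors
            last_error = exc
    return None, last_error
```

A six-month backfill would then loop over `dates_to_backfill(start, end)` and call `run_with_retries` for each day, recording any day that hard-fails so it can be retried later.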

jezdez pushed a commit that referenced this issue May 13, 2019
jezdez pushed a commit that referenced this issue May 15, 2019
washort pushed a commit that referenced this issue Jun 10, 2019
jezdez pushed a commit that referenced this issue Jun 13, 2019
washort pushed a commit that referenced this issue Jun 27, 2019
washort pushed a commit that referenced this issue Jun 28, 2019
emtwo pushed a commit that referenced this issue Jul 15, 2019
emtwo pushed a commit that referenced this issue Jul 17, 2019
jezdez pushed a commit that referenced this issue Aug 12, 2019
jezdez pushed a commit that referenced this issue Aug 14, 2019
jezdez pushed a commit that referenced this issue Aug 19, 2019
washort pushed a commit that referenced this issue Sep 16, 2019
emtwo pushed a commit that referenced this issue Nov 5, 2019
jezdez pushed a commit that referenced this issue Jan 16, 2020