-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
schedule retry based on schedule on recurring tasks #83682
Merged
gmmorris
merged 5 commits into
elastic:master
from
gmmorris:task-manager/retry-by-schedule
Nov 19, 2020
Merged
Changes from 1 commit
Commits
Show all changes
5 commits
Select commit
Hold shift + click to select a range
cadcab0
schedule retry based on schedule on recurring tasks
gmmorris e1ba694
Merge branch 'master' into task-manager/retry-by-schedule
kibanamachine 75e221b
use max between timeout and schedule when calculating retryAt
gmmorris 6452100
Merge branch 'master' into task-manager/retry-by-schedule
gmmorris 9c0b9c3
Merge branch 'task-manager/retry-by-schedule' of github.com:gmmorris/…
gmmorris File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious what your thoughts are on my comment here: #83490 (comment).
The only concern I had in regards to it is when an alert runs every 5 seconds or less and may get retried too early (in case it takes longer than interval to run).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If someone is running a task that takes longer than X seconds, and is scheduled to run on an X second interval ... I dunno, all bets are off?
I'd guess we will eventually make the retry stuff more complicated anyway, eventually, (eg, add exponential backoff?) so we'll be in this again ... later, with even more interesting cases!
For now, the fix in this PR seems right to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My scenario would be the task running usually fast but taking longer when Elasticsearch is under pressure. I don't have strong preference either way but wanted to make sure we discuss such possibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I went back and forth on this...
It wouldn't break the timeout logic as best as I can tell, but it has a potential pit fall - like Patrick noted, if a timeout is longer than the schedule, it will always trump the schedule. It also occurred to me that if we ever introduce a default timeout and that is higher than a schedule specified by the user, then it will win.
I decided to leave it an open question for now rather than introduce a new behaviour that I'm unsure about.
That said, having read you scenario @mikecote I'm less sure about it... 🤔
So, here's the thing: if a timeout is specified, then we should allow the task to run that long before retrying, so we should use the higher of both. But it's worth noting - alerting tasks do not have a timeout, and we don't allow implementers to specify one, so alerting tasks will use the default timeout of
5m
, which means theirretryAt
will never be their schedule...Perhaps we need to change the default timeout behaviour to equal the schedule when it's specified, otherwise it will be
5m
.The end result would be:
retryAt
of5m
(default timeout).retryAt
of the specifiedtimeout
(plus the added attempts *5m
).retryAt
of the specifiedschedule
, so an overruning10s
task will short circuit at10s
. But also means a1h
task that stalled and just never completes will retry on its next scheduled time.retryAt
ofMath.max(task.timeout, task.schedule)
, so a fast task that's got a long timeout (such asschedule:10s
andtimeout:30s
will be allowed to run for30s
before retrying, but aschedule:5m
task with a shortertimeout:30
will only retry after5m
.Perhaps we should also add support for specified
timeout
in alert definitions?Does that sound right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking "no timeout = default timeout" in the scenarios above. So any task with a schedule would fall into option 4 otherwise fall into option 2.
The reason I'm thinking this path is because of "schedules" being a recommended next run (startAt + schedule) and if the task took longer to run than the schedule, it gets re-scheduled to run right away. Similar to how we don't do true cron like scheduling.
Note / question: This may want to do some sort of
Max.max(now, intervalFromDate(...))
otherwise it could re-schedule itself at the a higher level in the queue to run again if the task took longer than the interval to run.This is definitely something we should support at some point, if it's easy to add at the same time, I'm +1 on it otherwise an issue is perfectly fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍