Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

schedule retry based on schedule on recurring tasks #83682

Merged
merged 5 commits into from
Nov 19, 2020
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -260,7 +260,7 @@ export class TaskManagerRunner implements TaskRunner {
startedAt: now,
attempts,
retryAt: this.instance.schedule
? intervalFromNow(this.definition.timeout)!
? intervalFromNow(this.instance.schedule!.interval)!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious what your thoughts are on my comment here: #83490 (comment).

It seems like the retryAt should be the next scheduled run or the timeout window (whichever is greater) but that may break the timeout logic..

The only concern I had in regards to it is when an alert runs every 5 seconds or less and may get retried too early (in case it takes longer than interval to run).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If someone is running a task that takes longer than X seconds, and is scheduled to run on an X second interval ... I dunno, all bets are off?

I'd guess we will eventually make the retry stuff more complicated anyway, eventually, (eg, add exponential backoff?) so we'll be in this again ... later, with even more interesting cases!

For now, the fix in this PR seems right to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If someone is running a task that takes longer than X seconds

My scenario would be the task running usually fast but taking longer when Elasticsearch is under pressure. I don't have strong preference either way but wanted to make sure we discuss such possibility.

Copy link
Contributor Author

@gmmorris gmmorris Nov 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I went back and forth on this...

It wouldn't break the timeout logic as best as I can tell, but it has a potential pit fall - like Patrick noted, if a timeout is longer than the schedule, it will always trump the schedule. It also occurred to me that if we ever introduce a default timeout and that is higher than a schedule specified by the user, then it will win.
I decided to leave it an open question for now rather than introduce a new behaviour that I'm unsure about.

That said, having read you scenario @mikecote I'm less sure about it... 🤔

So, here's the thing: if a timeout is specified, then we should allow the task to run that long before retrying, so we should use the higher of both. But it's worth noting - alerting tasks do not have a timeout, and we don't allow implementers to specify one, so alerting tasks will use the default timeout of 5m, which means their retryAt will never be their schedule...

Perhaps we need to change the default timeout behaviour to equal the schedule when it's specified, otherwise it will be 5m.

The end result would be:

  1. A task type with no timeout or schedule will have a retryAt of 5m (default timeout).
  2. A task type with a timeout will have a retryAt of the specified timeout (plus the added attempts * 5m ).
  3. A task type with a schedule will have a retryAt of the specified schedule, so an overruning 10s task will short circuit at 10s. But also means a 1h task that stalled and just never completes will retry on its next scheduled time.
  4. A task with both a timeout and a schedule will have a retryAt of Math.max(task.timeout, task.schedule), so a fast task that's got a long timeout (such as schedule:10s and timeout:30s will be allowed to run for 30s before retrying, but a schedule:5m task with a shorter timeout:30 will only retry after 5m.

Perhaps we should also add support for specified timeout in alert definitions?

Does that sound right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking "no timeout = default timeout" in the scenarios above. So any task with a schedule would fall into option 4 otherwise fall into option 2.

The reason I'm thinking this path is because of "schedules" being a recommended next run (startAt + schedule) and if the task took longer to run than the schedule, it gets re-scheduled to run right away. Similar to how we don't do true cron like scheduling.

Note / question: This may want to do some sort of Max.max(now, intervalFromDate(...)) otherwise it could re-schedule itself at the a higher level in the queue to run again if the task took longer than the interval to run.

Perhaps we should also add support for specified timeout in alert definitions?

This is definitely something we should support at some point, if it's easy to add at the same time, I'm +1 on it otherwise an issue is perfectly fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

: this.getRetryDelay({
attempts,
// Fake an error. This allows retry logic when tasks keep timing out
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -115,6 +115,17 @@ export class SampleTaskManagerFixturePlugin
},
}),
},
sampleRecurringTaskWhichHangs: {
title: 'Sample Recurring Task that Hangs for a minute',
description: 'A sample task that Hangs for a minute on each run.',
maxAttempts: 3,
timeout: '60s',
createTaskRunner: () => ({
async run() {
return await new Promise((resolve) => {});
},
}),
},
sampleOneTimeTaskTimingOut: {
title: 'Sample One-Time Task that Times Out',
description: 'A sample task that times out each run.',
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -260,6 +260,28 @@ export default function ({ getService }: FtrProviderContext) {
});
});

it('should schedule the retry of recurring tasks to run at the next schedule when they time out', async () => {
const intervalInMinutes = 30;
const intervalInMilliseconds = intervalInMinutes * 60 * 1000;
const task = await scheduleTask({
taskType: 'sampleRecurringTaskWhichHangs',
schedule: { interval: `${intervalInMinutes}m` },
params: {},
});

await retry.try(async () => {
const [scheduledTask] = (await currentTasks()).docs;
expect(scheduledTask.id).to.eql(task.id);
const retryAt = Date.parse(scheduledTask.retryAt!);
expect(isNaN(retryAt)).to.be(false);

const buffer = 10000; // 10 second buffer
const retryDelay = retryAt - Date.parse(task.runAt);
expect(retryDelay).to.be.greaterThan(intervalInMilliseconds - buffer);
expect(retryDelay).to.be.lessThan(intervalInMilliseconds + buffer);
});
});

it('should reschedule if task returns runAt', async () => {
const nextRunMilliseconds = _.random(60000, 200000);
const count = _.random(1, 20);
Expand Down