Adding exponential backoff to error/retry (#1162) #1172

Closed · wants to merge 1 commit

Conversation

mathewc (Member) commented Feb 1, 2017

Addresses #760 and #1162.

}
// Exponential backoff on retries with
// a maximum wait of 5 minutes
await Utility.DelayWithBackoffAsync(attempt, CancellationToken.None, max: TimeSpan.FromMinutes(5));
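
For context, a minimal sketch of what a backoff helper along these lines might look like. This is not the repo's actual Utility.DelayWithBackoffAsync; the power-of-two formula, the min/max parameter names, and the zero wait on the first attempt are assumptions based on this thread.

// Sketch only -- not necessarily the actual Utility.DelayWithBackoffAsync implementation.
// Requires System, System.Threading, and System.Threading.Tasks.
public static async Task DelayWithBackoffAsync(int attempt, CancellationToken cancellationToken,
    TimeSpan? min = null, TimeSpan? max = null)
{
    // assumed formula: (2^attempt - 1) seconds, so the first attempt has no wait
    TimeSpan delay = TimeSpan.FromSeconds(Math.Pow(2, attempt) - 1);

    // clamp the computed delay to the optional bounds
    if (min.HasValue && delay < min.Value)
    {
        delay = min.Value;
    }
    if (max.HasValue && delay > max.Value)
    {
        delay = max.Value;
    }

    if (delay > TimeSpan.Zero)
    {
        await Task.Delay(delay, cancellationToken);
    }
}

As the review comment below points out, the caller should pass the host's _cancellationToken rather than CancellationToken.None so a pending delay can be abandoned on shutdown.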
Member

Pass the _cancellationToken here.

Member Author

Ah, I didn't know there was one. Done.

@@ -82,9 +82,12 @@ public void Delete_SendsExpectedNotification()
// 1 trace per attempt + 1 trace per failed attempt
int expectedTracesBeforeRecovery = (expectedNumberOfAttempts * 2) - 1;
// Before + recovery trace
- int expectedTracesAfterRecovery = expectedTracesBeforeRecovery + 1;
+ int expectedTracesAfterRecovery = expectedTracesBeforeRecovery + 2;
Member

Assuming the difference in the trace counts here is because of the new trace you've added, correct?

Member Author

We now log even on the first interval where there is no wait, whereas previously we only logged if we were delaying.
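
As a worked example of the counts (my own illustration of one reading of the change, assuming expectedNumberOfAttempts = 3):

// illustration only, assuming expectedNumberOfAttempts = 3
int expectedNumberOfAttempts = 3;
// 1 trace per attempt plus 1 trace per failed attempt; the final attempt
// succeeds, so: (3 * 2) - 1 = 5 traces before recovery
int expectedTracesBeforeRecovery = (expectedNumberOfAttempts * 2) - 1;
// recovery now adds two traces rather than one (the recovery trace plus the
// trace now emitted even on the zero-wait first interval): 5 + 2 = 7
int expectedTracesAfterRecovery = expectedTracesBeforeRecovery + 2;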

@@ -38,6 +41,70 @@ public class TestPocoEx : TestPoco

public class UtilityTests
{
[Fact]
Member

Would probably be good to have a test to ensure min and max are properly applied (although that's pretty straightforward).

Member Author

See below

Member

Ahh... missed it.
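
For illustration, here is a sketch of the kind of min/max clamping test suggested above. The PR's own UtilityTests (which the author points to) are the authoritative version; the min parameter name and the exact clamping behavior are assumptions.

// Hypothetical test sketch only -- the PR's actual tests cover this.
[Fact]
public async Task DelayWithBackoffAsync_AppliesMinAndMax()
{
    // keep the bounds small so the test runs quickly
    TimeSpan min = TimeSpan.FromMilliseconds(100);
    TimeSpan max = TimeSpan.FromMilliseconds(500);

    var sw = System.Diagnostics.Stopwatch.StartNew();
    // attempt 0 would normally have no wait; min should force a small delay
    await Utility.DelayWithBackoffAsync(0, CancellationToken.None, min: min, max: max);
    Assert.True(sw.Elapsed >= min);

    sw.Restart();
    // a large attempt number would otherwise wait for minutes; max should cap it
    await Utility.DelayWithBackoffAsync(20, CancellationToken.None, min: min, max: max);
    Assert.True(sw.Elapsed < TimeSpan.FromSeconds(5));
}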

{
if (_restartDelayTokenSource != null && !_restartDelayTokenSource.IsCancellationRequested)
{
// if we're currently awaiting an error/restart delay
Member

Can you clarify this here? So any file changes under the script root path will halt restart? Is this just to avoid conflicts when actual script file changes are made? What about conditions that would not trigger a host restart?

Member Author

Any such changes will break out of the delay, resuming the restart immediately. This is to allow the delay to back off in steady-state app scenarios, while retaining development-time responsiveness.
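
A minimal sketch of that behavior, assuming a file-watcher callback wired to the _restartDelayTokenSource field shown above (the handler name and wiring are illustrative, not necessarily the PR's exact code):

// illustrative only -- handler name and wiring are assumptions
private void OnScriptFileChanged(object sender, FileSystemEventArgs e)
{
    if (_restartDelayTokenSource != null && !_restartDelayTokenSource.IsCancellationRequested)
    {
        // we're currently awaiting an error/restart backoff delay; a file just
        // changed, so cancel the delay and let the restart proceed immediately
        _restartDelayTokenSource.Cancel();
    }
}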

Member Author (@mathewc, Feb 1, 2017)

The theory is that any host errors are due to invalid user files. If you haven't touched them in a while, we'll back off. When you finally go fix the issue, we restart immediately. I was also considering weaving IsDebug mode into this, but that might complicate things unnecessarily.

Of course a host error could also be due to an external service being down.


if (wait > 0)
{
Trace($"Next recovery attempt in {wait} seconds...", TraceLevel.Warning);
Member

A little sad about losing this.... but I'll live.

Member Author

Yeah, I couldn't think of a good way to keep it. In the end, though, we really only need logs that let us know we're retrying, which we still have.

brettsam (Member) left a comment

A couple of comments.

}
}
while (!_stopped && !cancellationToken.IsCancellationRequested);
}

private Task CreateRestartBackoffDelay(int consecutiveErrorCount)
{
_restartDelayTokenSource = new CancellationTokenSource();
Member

Should you dispose this CTS right after you're done with it here? Or wrap this in a using statement?

Member Author

I think this guy has to stay alive for the lifetime of the token we've pulled out of it.
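
In other words, a sketch of the lifetime concern (assuming the delay awaits a token from this source; not necessarily the PR's exact code): wrapping the source in a using block would dispose it while the delay is still observing its token and while the file-change path above may still need to cancel it.

// sketch only -- illustrates why the CTS is kept in a field rather than disposed here
private Task CreateRestartBackoffDelay(int consecutiveErrorCount)
{
    _restartDelayTokenSource = new CancellationTokenSource();

    // the delay observes this token for its full duration, and a script file
    // change can cancel it through the field, so the source must stay alive
    return Utility.DelayWithBackoffAsync(consecutiveErrorCount, _restartDelayTokenSource.Token,
        max: TimeSpan.FromMinutes(5));
}

Disposing the source once the returned task completes (or when it is replaced on the next restart) would address the dispose concern without invalidating the in-flight token.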

- // a rapid restart cycle
- Task.Delay(_config.RestartInterval).GetAwaiter().GetResult();
+ // attempt restarts using an exponential backoff strategy
+ CreateRestartBackoffDelay(consecutiveErrorCount).GetAwaiter().GetResult();
Member

Can we (verbosely) log the number of consecutive errors here?

I'm thinking about when we try to debug with the logs -- seeing a large delay may be confusing to some. Seeing something that says "Consecutive error count: 3. Backing off restart." may help explain why there's a gap in the timestamps between "A ScriptHost error has occurred." and the actual restart of the host.

Member Author

I'll add the attempt count to the above "Starting Host" log message.
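
Something along these lines, presumably (the message text and identifiers are illustrative, not the actual change):

// hypothetical shape of the updated startup trace
Trace($"Starting Host (Attempt={consecutiveErrorCount + 1})", TraceLevel.Info);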
