Adding exponential backoff to error/retry (#1162) #1172
Conversation
}
// Exponential backoff on retries with
// a maximum wait of 5 minutes
await Utility.DelayWithBackoffAsync(attempt, CancellationToken.None, max: TimeSpan.FromMinutes(5));
Pass the _cancellationToken here.
Ah, I didn't know there was one. Done.
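For context, a backoff helper with this shape might look roughly like the sketch below. The power-of-two growth, the zero-wait first attempt, the parameter names, and the ComputeBackoff split are illustrative assumptions, not the PR's actual implementation.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class BackoffUtility
{
    // Delay for a given 1-based attempt: no wait on the first attempt,
    // then 1, 2, 4, 8, ... seconds, clamped to the optional min/max bounds.
    public static TimeSpan ComputeBackoff(int attempt, TimeSpan? min = null, TimeSpan? max = null)
    {
        double seconds = attempt <= 1 ? 0 : Math.Pow(2, attempt - 2);
        TimeSpan delay = TimeSpan.FromSeconds(seconds);

        if (min.HasValue && delay < min.Value)
        {
            delay = min.Value;
        }
        if (max.HasValue && delay > max.Value)
        {
            delay = max.Value;
        }

        return delay;
    }

    public static async Task DelayWithBackoffAsync(int attempt, CancellationToken cancellationToken,
        TimeSpan? min = null, TimeSpan? max = null)
    {
        TimeSpan delay = ComputeBackoff(attempt, min, max);
        if (delay > TimeSpan.Zero)
        {
            try
            {
                // cancelling the token ends the wait immediately
                await Task.Delay(delay, cancellationToken);
            }
            catch (TaskCanceledException)
            {
                // treat cancellation as "stop waiting", not as an error
            }
        }
    }
}
```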
@@ -82,9 +82,12 @@ public void Delete_SendsExpectedNotification()
// 1 trace per attempt + 1 trace per failed attempt
int expectedTracesBeforeRecovery = (expectedNumberOfAttempts * 2) - 1;
// Before + recovery trace
- int expectedTracesAfterRecovery = expectedTracesBeforeRecovery + 1;
+ int expectedTracesAfterRecovery = expectedTracesBeforeRecovery + 2;
Assuming the difference in the trace counts here is because of the new trace you've added, correct?
We now log even on the first interval where there is no wait, whereas previously we only logged if we were delaying.
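For example, with expectedNumberOfAttempts = 3 that works out to (3 * 2) - 1 = 5 traces before recovery and 5 + 2 = 7 after; the extra trace relative to the old + 1 is the log that now fires even on the zero-wait first interval.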
@@ -38,6 +41,70 @@ public class TestPocoEx : TestPoco

public class UtilityTests
{
[Fact]
It would probably be good to have a test to ensure min and max are properly applied (although that's pretty straightforward).
See below
Ahh... missed it.
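As an aside, a min/max test along the lines suggested above could look roughly like this, written against the hypothetical ComputeBackoff helper sketched earlier rather than the PR's actual UtilityTests:

```csharp
using System;
using Xunit;

public class BackoffUtilityTests
{
    [Theory]
    [InlineData(1)]
    [InlineData(4)]
    [InlineData(100)]
    public void ComputeBackoff_RespectsMinAndMax(int attempt)
    {
        TimeSpan min = TimeSpan.FromSeconds(2);
        TimeSpan max = TimeSpan.FromMinutes(1);

        TimeSpan delay = BackoffUtility.ComputeBackoff(attempt, min, max);

        Assert.True(delay >= min, "delay should never drop below min");
        Assert.True(delay <= max, "delay should never exceed max");
    }
}
```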
{
if (_restartDelayTokenSource != null && !_restartDelayTokenSource.IsCancellationRequested)
{
// if we're currently awaiting an error/restart delay
Can you clarify this here? So any file changes under the script root path will halt restart? Is this just to avoid conflicts when actual script file changes are made? What about conditions that would not trigger a host restart?
Any such changes will break out of the delay, resuming the restart immediately. This is to allow the delay to back off in steady-state app scenarios, while retaining development-time responsiveness.
The theory is that any host errors are due to invalid user files. If you haven't touched them in a while, we'll back off. When you finally go fix the issue, we restart immediately. I was also considering weaving IsDebug mode into this, but that might complicate things unnecessarily.
Of course a host error could also be due to an external service being down.
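A minimal sketch of how a file change might cut the backoff short, assuming a FileSystemWatcher callback on the host class and the _restartDelayTokenSource field shown in the diff (the handler name and event wiring are assumptions):

```csharp
// Hypothetical file watcher callback: if we're sitting in an error/restart
// backoff delay, a change under the script root cancels it so the host
// restarts immediately and picks up the (presumably fixed) files.
private void OnScriptFileChanged(object sender, System.IO.FileSystemEventArgs e)
{
    if (_restartDelayTokenSource != null && !_restartDelayTokenSource.IsCancellationRequested)
    {
        _restartDelayTokenSource.Cancel();
    }
}
```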

if (wait > 0)
{
Trace($"Next recovery attempt in {wait} seconds...", TraceLevel.Warning);
A little sad about losing this.... but I'll live.
Yeah, I couldn't think of a good way to keep it. In the end, we really only need logs that let us know we're retrying though, which we still have.
a couple comments
}
}
while (!_stopped && !cancellationToken.IsCancellationRequested);
}

private Task CreateRestartBackoffDelay(int consecutiveErrorCount)
{
_restartDelayTokenSource = new CancellationTokenSource();
Should you dispose this CTS right after you're done with it here? Or wrap this in a using statement?
I think this guy has to stay alive for the lifetime of the token we've pulled out of it.
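To make the lifetime point concrete, here is a rough sketch of the shape being discussed (not the exact PR code): the returned delay task holds the token, and a later file change may cancel it, so the CancellationTokenSource can't be disposed inside this method or wrapped in a using block.

```csharp
private Task CreateRestartBackoffDelay(int consecutiveErrorCount)
{
    _restartDelayTokenSource = new CancellationTokenSource();

    // exponential backoff keyed off the number of consecutive host errors;
    // cancelling the token (e.g. on a script file change) ends the delay early
    return Utility.DelayWithBackoffAsync(consecutiveErrorCount, _restartDelayTokenSource.Token);
}
```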
- // a rapid restart cycle
- Task.Delay(_config.RestartInterval).GetAwaiter().GetResult();
+ // attempt restarts using an exponential backoff strategy
+ CreateRestartBackoffDelay(consecutiveErrorCount).GetAwaiter().GetResult();
Can we (verbosely) log the number of consecutive errors here?
I'm thinking about when we try to debug with the logs -- seeing a large delay may be confusing to some. Seeing something that says "Consecutive error count: 3. Backing off restart." may help explain why there's a gap in the timestamps between "A ScriptHost error has occurred." and the actual restart of the host.
I'll add the attempt to the above "Starting Host" log message
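Purely as an illustration of that suggestion (the message text and helper method are hypothetical, not what the PR ends up logging):

```csharp
// Hypothetical: include the consecutive error count in the startup trace so a
// long gap before "Starting Host" is explainable as backoff rather than a hang.
private void TraceHostStart(int consecutiveErrorCount)
{
    Trace($"Starting Host (ConsecutiveErrors={consecutiveErrorCount})", TraceLevel.Info);
}
```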
Addresses #760 and #1162.