Adding exponential backoff to error/retry (#1162) #1172
Conversation
}
// Exponential backoff on retries with
// a maximum wait of 5 minutes
await Utility.DelayWithBackoffAsync(attempt, CancellationToken.None, max: TimeSpan.FromMinutes(5));
Pass the _cancellationToken here.
Ah, I didn't know there was one. Done.
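For context, a backoff helper with this shape might look roughly like the sketch below. The power-of-two growth, the zero-wait first attempt, the parameter names, and the ComputeBackoff split are illustrative assumptions, not the PR's actual implementation.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class BackoffUtility
{
    // Delay for a given 1-based attempt: no wait on the first attempt,
    // then 1, 2, 4, 8, ... seconds, clamped to the optional min/max bounds.
    public static TimeSpan ComputeBackoff(int attempt, TimeSpan? min = null, TimeSpan? max = null)
    {
        double seconds = attempt <= 1 ? 0 : Math.Pow(2, attempt - 2);
        TimeSpan delay = TimeSpan.FromSeconds(seconds);

        if (min.HasValue && delay < min.Value)
        {
            delay = min.Value;
        }
        if (max.HasValue && delay > max.Value)
        {
            delay = max.Value;
        }

        return delay;
    }

    public static async Task DelayWithBackoffAsync(int attempt, CancellationToken cancellationToken,
        TimeSpan? min = null, TimeSpan? max = null)
    {
        TimeSpan delay = ComputeBackoff(attempt, min, max);
        if (delay > TimeSpan.Zero)
        {
            try
            {
                // cancelling the token ends the wait immediately
                await Task.Delay(delay, cancellationToken);
            }
            catch (TaskCanceledException)
            {
                // treat cancellation as "stop waiting", not as an error
            }
        }
    }
}
```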
@@ -82,9 +82,12 @@ public void Delete_SendsExpectedNotification()
// 1 trace per attempt + 1 trace per failed attempt
int expectedTracesBeforeRecovery = (expectedNumberOfAttempts * 2) - 1;
// Before + recovery trace
- int expectedTracesAfterRecovery = expectedTracesBeforeRecovery + 1;
+ int expectedTracesAfterRecovery = expectedTracesBeforeRecovery + 2;
Assuming the difference in the trace counts here is because of the new trace you've added, correct?
We now log even on the first interval where there is no wait, whereas previously we only logged if we were delaying.
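For example, with expectedNumberOfAttempts = 3 that works out to (3 * 2) - 1 = 5 traces before recovery and 5 + 2 = 7 after; the extra trace relative to the old + 1 is the log that now fires even on the zero-wait first interval.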
@@ -38,6 +41,70 @@ public class TestPocoEx : TestPoco

public class UtilityTests
{
[Fact]
It would probably be good to have a test to ensure min and max are properly applied (although that's pretty straightforward).
See below
Ahh... missed it.
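As an aside, a min/max test along the lines suggested above could look roughly like this, written against the hypothetical ComputeBackoff helper sketched earlier rather than the PR's actual UtilityTests:

```csharp
using System;
using Xunit;

public class BackoffUtilityTests
{
    [Theory]
    [InlineData(1)]
    [InlineData(4)]
    [InlineData(100)]
    public void ComputeBackoff_RespectsMinAndMax(int attempt)
    {
        TimeSpan min = TimeSpan.FromSeconds(2);
        TimeSpan max = TimeSpan.FromMinutes(1);

        TimeSpan delay = BackoffUtility.ComputeBackoff(attempt, min, max);

        Assert.True(delay >= min, "delay should never drop below min");
        Assert.True(delay <= max, "delay should never exceed max");
    }
}
```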
{
if (_restartDelayTokenSource != null && !_restartDelayTokenSource.IsCancellationRequested)
{
// if we're currently awaiting an error/restart delay
Can you clarify this here? So any file changes under the script root path will halt restart? Is this just to avoid conflicts when actual script file changes are made? What about conditions that would not trigger a host restart?
Any such changes will break out of the delay, resuming the restart immediately. This is to allow the delay to back off in steady-state app scenarios, while retaining development-time responsiveness.
The theory is that any host errors are due to invalid user files. If you haven't touched them in a while, we'll back off. When you finally go fix the issue, we restart immediately. I was also considering weaving IsDebug mode into this, but that might complicate things unnecessarily.
Of course a host error could also be due to an external service being down.
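A minimal sketch of how a file change might cut the backoff short, assuming a FileSystemWatcher callback on the host class and the _restartDelayTokenSource field shown in the diff (the handler name and event wiring are assumptions):

```csharp
// Hypothetical file watcher callback: if we're sitting in an error/restart
// backoff delay, a change under the script root cancels it so the host
// restarts immediately and picks up the (presumably fixed) files.
private void OnScriptFileChanged(object sender, System.IO.FileSystemEventArgs e)
{
    if (_restartDelayTokenSource != null && !_restartDelayTokenSource.IsCancellationRequested)
    {
        _restartDelayTokenSource.Cancel();
    }
}
```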

if (wait > 0)
{
Trace($"Next recovery attempt in {wait} seconds...", TraceLevel.Warning);
A little sad about losing this.... but I'll live.
Yeah, I couldn't think of a good way to keep it. In the end, we really only need logs that let us know we're retrying though, which we still have.
a couple comments
}
}
while (!_stopped && !cancellationToken.IsCancellationRequested);
}

private Task CreateRestartBackoffDelay(int consecutiveErrorCount)
{
_restartDelayTokenSource = new CancellationTokenSource();
Should you dispose this CTS right after you're done with it here? Or wrap this in a using statement?
I think this guy has to stay alive for the lifetime of the token we've pulled out of it.
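To make the lifetime point concrete, here is a rough sketch of the shape being discussed (not the exact PR code): the returned delay task holds the token, and a later file change may cancel it, so the CancellationTokenSource can't be disposed inside this method or wrapped in a using block.

```csharp
private Task CreateRestartBackoffDelay(int consecutiveErrorCount)
{
    _restartDelayTokenSource = new CancellationTokenSource();

    // exponential backoff keyed off the number of consecutive host errors;
    // cancelling the token (e.g. on a script file change) ends the delay early
    return Utility.DelayWithBackoffAsync(consecutiveErrorCount, _restartDelayTokenSource.Token);
}
```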
- // a rapid restart cycle
- Task.Delay(_config.RestartInterval).GetAwaiter().GetResult();
+ // attempt restarts using an exponential backoff strategy
+ CreateRestartBackoffDelay(consecutiveErrorCount).GetAwaiter().GetResult();
Can we (verbosely) log the number of consecutive errors here?
I'm thinking about when we try to debug with the logs -- seeing a large delay may be confusing to some. Seeing something that says "Consecutive error count: 3. Backing off restart." may help explain why there's a gap in the timestamps between "A ScriptHost error has occurred." and the actual restart of the host.
I'll add the attempt to the above "Starting Host" log message
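Purely as an illustration of that suggestion (the message text and helper method are hypothetical, not what the PR ends up logging):

```csharp
// Hypothetical: include the consecutive error count in the startup trace so a
// long gap before "Starting Host" is explainable as backoff rather than a hang.
private void TraceHostStart(int consecutiveErrorCount)
{
    Trace($"Starting Host (ConsecutiveErrors={consecutiveErrorCount})", TraceLevel.Info);
}
```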
Addresses #760 and #1162.