
Host Startup Partial Mode (#2497) #1647

Closed
wants to merge 1 commit into from

Conversation

mathewc
Member

@mathewc mathewc commented Mar 28, 2018

Core changes required to address Functions issue Azure/azure-functions-host#2497. In Functions, we'll set this new mode to true. The default behavior for Functions will be the same as we have now, in that we allow the host to start partially; however, we'll no longer ignore listener startup failures. Listeners will now be retried in the background.

While this feature is being added primarily for Azure Functions usage, it can be generally useful. E.g. in a continuous WebJob with many different functions, the back-end service for one listener might be down for a bit, preventing that listener from starting, while all the other functions can run fine. With this feature, rather than the host not being able to start at all until ALL listeners can start (i.e. the standard continuous WebJobs restart loop), most functions can start running immediately while we wait for the one to recover.
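For illustration, a minimal sketch of how a plain WebJobs host could opt into the new mode via the AllowPartialHostStartup property added in this PR; the surrounding host setup is just standard boilerplate and not part of this change:

```csharp
using Microsoft.Azure.WebJobs;

class Program
{
    static void Main()
    {
        var config = new JobHostConfiguration
        {
            // New in this PR: allow the host to start even if some listeners
            // fail to start; failed listeners are retried in the background.
            AllowPartialHostStartup = true
        };

        var host = new JobHost(config);
        host.RunAndBlock();
    }
}
```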

@@ -109,7 +112,13 @@ internal class FunctionIndexer
}
catch (FunctionIndexingException fex)
{
if (_allowPartialHostStartup)
{
fex.Handled = true;
Member Author

When this mode is set, we'll now be defaulting these to handled. In Functions, we're currently setting this manually (code here). We'll be able to remove that in favor of setting the new host config property.

@mathewc mathewc force-pushed the mathewc-work-v2.x branch from 027bbe8 to 215dea4 Compare March 28, 2018 22:35
using System.Threading;
using System.Threading.Tasks;

namespace Microsoft.Azure.WebJobs.Host
Member Author

These utility methods and tests for exponential backoff are copied from an Azure Functions internal helper I also wrote.


var validators = new Action<string>[]
{
p => Assert.Equal(p, "The listener for function 'testfunc' was unable to start."),
Member Author

These are the logs users will see while this is going on in the background.

@mathewc mathewc requested a review from brettsam March 28, 2018 22:37
/// - Functions listener startup failures will be retried in the background
/// until they start.
/// </remarks>
public bool AllowPartialHostStartup { get; set; }
Member Author
@mathewc mathewc Mar 28, 2018

New public surface area. Still considering the right name for this.

// if we get to here, the listener exception was handled and
// we're in partial startup mode, so we spin up a background
// task that retries starting the listener
Task taskIgnore = Task.Run(() => RetryStartWithBackoffAsync(cancellationToken), cancellationToken);
Member

Just curious, what does the additional Task.Run(..) achieve here? How is it different from just
`Task taskIgnore = RetryStartWithBackoffAsync(cancellationToken);`?

Member Author

My understanding is that invoking directly (fire and forget) will capture the current synchronization context and resume on that context. Task.Run will just start and complete on a thread pool thread without capturing/restoring a context.
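A small self-contained sketch of the distinction being described; RetryStartWithBackoffAsync here is just a hypothetical stand-in for the real retry loop:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class FireAndForgetSketch
{
    // Hypothetical stand-in for the background listener retry loop.
    static async Task RetryStartWithBackoffAsync(CancellationToken cancellationToken)
    {
        await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
    }

    static void StartRetryInBackground(CancellationToken cancellationToken)
    {
        // Direct invocation: runs synchronously up to the first await on the
        // calling thread, and continuations resume on the captured
        // SynchronizationContext if one is present.
        Task direct = RetryStartWithBackoffAsync(cancellationToken);

        // Task.Run: queues the work to the thread pool, so the loop starts
        // and completes on pool threads without capturing an ambient context.
        Task pooled = Task.Run(() => RetryStartWithBackoffAsync(cancellationToken), cancellationToken);
    }
}
```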

string message = $"Retrying to start listener for function '{_descriptor.ShortName}' (Attempt {attempt})";
_trace.Info(message);
_logger?.LogInformation(message);

Member

I wonder, is this going to be really noisy in the case of misconfigured functions where we never successfully start the listener? If so, can we use some sort of category here to make it easy to filter the noise out?

Member Author

What do you mean by "misconfigured"? Many issues will be indexing-time failures (e.g. the app setting for Connection doesn't exist, and in many cases even an invalid connection string for bindings like queue that do up-front validation). If the function is truly broken, the backoff will max out at once every 2 minutes, which won't be very noisy. But yes, there will be an initial flurry of retries.

I'm open to putting them in a category.
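For context, a rough sketch of the backoff shape described above. This is illustrative only: it borrows the DelayWithBackoffAsync signature shown later in this diff and the 2-minute cap mentioned above, but the body is an assumption, not the PR's actual implementation (nor the SDK's RandomizedExponentialBackoffStrategy):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class BackoffSketch
{
    private static readonly Random _random = new Random();

    // Illustrative body only: the delay grows as unit * 2^exponent with a
    // little random jitter, clamped to [min, max] (e.g. a 2-minute ceiling).
    public static async Task DelayWithBackoffAsync(int exponent, CancellationToken cancellationToken,
        TimeSpan? unit = null, TimeSpan? min = null, TimeSpan? max = null)
    {
        TimeSpan unitValue = unit ?? TimeSpan.FromSeconds(1);
        TimeSpan minValue = min ?? TimeSpan.Zero;
        TimeSpan maxValue = max ?? TimeSpan.FromMinutes(2);

        double seconds = unitValue.TotalSeconds * Math.Pow(2, exponent);
        seconds *= 0.8 + (_random.NextDouble() * 0.4); // +/- 20% jitter
        seconds = Math.Max(minValue.TotalSeconds, Math.Min(seconds, maxValue.TotalSeconds));

        await Task.Delay(TimeSpan.FromSeconds(seconds), cancellationToken);
    }
}
```

With these assumed defaults, retries would land at roughly 1s, 2s, 4s, 8s, ... and settle at the 2-minute ceiling after a handful of attempts, matching the "initial flurry, then quiet" behavior described.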

Member Author

I'm thinking we don't use a separate category for these for now and wait to see if it's actually a problem; it's always easy to add later. I started creating a new logger instance for these under a different category and it seemed unnecessarily artificial.

// swallow and log exceptions since at this point we're in a
// retry loop running in the background
_trace.Error(ex.Message, ex);
_logger?.LogError(0, ex, ex.Message);
Member

Same as above: wondering whether this will generate lots of log noise.

Member Author

Noise for us, you mean? Or for customers? I don't think it is noise for customers; they should be seeing these.

@paulbatum
Member

It's a bit hard for me to see from the diff: is there any public surface area change here? It looks like the flag is on internal types, but maybe I'm misunderstanding.

@mathewc
Member Author

mathewc commented Mar 28, 2018

The new public surface area is the new JobHostConfiguration property; it's a new feature that people can use.

@mathewc mathewc force-pushed the mathewc-work-v2.x branch from 215dea4 to 0656285 Compare March 28, 2018 23:18
@mathewc mathewc requested a review from fabiocav March 28, 2018 23:18
@paulbatum
Member

Ahh yes that makes sense, thanks.

Member

@brettsam brettsam left a comment

/// <param name="min">The minimum delay.</param>
/// <param name="max">The maximum delay.</param>
/// <returns>A <see cref="Task"/> that completes after the computed backoff delay.</returns>
public static async Task DelayWithBackoffAsync(int exponent, CancellationToken cancellationToken, TimeSpan? unit = null, TimeSpan? min = null, TimeSpan? max = null)
Member

We already have a RandomizedExponentialBackoffStrategy -- can you use that existing logic here?

Member Author

Yes, I'll move to that and remove the new code I added for this. I had considered that previously but thought the core backoff stuff was tied too deeply to the TaskSeriesTimer stuff (I was wrong).

@mathewc mathewc force-pushed the mathewc-work-v2.x branch from 0656285 to e7f69bb Compare March 29, 2018 23:46