
Host Startup Partial Mode (#2497) #1647

Closed
wants to merge 1 commit into from

Conversation

mathewc
Member

@mathewc mathewc commented Mar 28, 2018

Core changes required to address Functions issue Azure/azure-functions-host#2497. In Functions, we'll set this new mode to true. The default behavior for Functions will be the same as we have now, in that we allow the host to start partially; however, we'll no longer ignore listener startup failures. Listeners will now be retried in the background.

While this feature is being added primarily for Azure Functions usage, it can be generally useful. E.g. in a continuous WebJob with many different functions, the back-end service for one listener might be down for a bit, preventing that listener from starting, while all the other functions can run fine. With this feature, rather than the host not being able to start at all until ALL listeners can start (i.e. the standard continuous WebJobs restart loop), most functions can start running immediately while we wait for the one to recover.
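For illustration, a minimal sketch of how a plain WebJobs host could opt into the new mode via the AllowPartialHostStartup property added in this PR; the surrounding host setup is just standard boilerplate and not part of this change:

```csharp
using Microsoft.Azure.WebJobs;

class Program
{
    static void Main()
    {
        var config = new JobHostConfiguration
        {
            // New in this PR: allow the host to start even if some listeners
            // fail to start; failed listeners are retried in the background.
            AllowPartialHostStartup = true
        };

        var host = new JobHost(config);
        host.RunAndBlock();
    }
}
```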

@@ -109,7 +112,13 @@ internal class FunctionIndexer
}
catch (FunctionIndexingException fex)
{
if (_allowPartialHostStartup)
{
fex.Handled = true;
Member Author

When this mode is set, we'll now be defaulting these to handled. In Functions, we're currently setting this manually (code here). We'll be able to remove that in favor of setting the new host config property.

@mathewc mathewc force-pushed the mathewc-work-v2.x branch from 027bbe8 to 215dea4 Compare March 28, 2018 22:35
using System.Threading;
using System.Threading.Tasks;

namespace Microsoft.Azure.WebJobs.Host
Member Author

These utility methods and tests for exponential backoff are copied from an Azure Functions internal helper I also wrote.


var validators = new Action<string>[]
{
p => Assert.Equal(p, "The listener for function 'testfunc' was unable to start."),
Member Author

These are the logs users will see while this is going on in the background.

@mathewc mathewc requested a review from brettsam March 28, 2018 22:37
/// - Functions listener startup failures will be retried in the background
/// until they start.
/// </remarks>
public bool AllowPartialHostStartup { get; set; }
Member Author
@mathewc mathewc Mar 28, 2018

New public surface area. Still considering the right name for this.

// if we get to here, the listener exception was handled and
// we're in partial startup mode, so we spin up a background
// task that retries starting the listener
Task taskIgnore = Task.Run(() => RetryStartWithBackoffAsync(cancellationToken), cancellationToken);
Member

Just curious, what does the additional Task.Run(..) achieve here? How is it different from just
`Task taskIgnore = RetryStartWithBackoffAsync(cancellationToken);`?

Member Author

My understanding is that invoking directly (fire and forget) will capture the current synchronization context and resume on that context. Task.Run will just start and complete on a thread pool thread without capturing/restoring a context.
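A small self-contained sketch of the distinction being described; RetryStartWithBackoffAsync here is just a hypothetical stand-in for the real retry loop:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class FireAndForgetSketch
{
    // Hypothetical stand-in for the background listener retry loop.
    static async Task RetryStartWithBackoffAsync(CancellationToken cancellationToken)
    {
        await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
    }

    static void StartRetryInBackground(CancellationToken cancellationToken)
    {
        // Direct invocation: runs synchronously up to the first await on the
        // calling thread, and continuations resume on the captured
        // SynchronizationContext if one is present.
        Task direct = RetryStartWithBackoffAsync(cancellationToken);

        // Task.Run: queues the work to the thread pool, so the loop starts
        // and completes on pool threads without capturing an ambient context.
        Task pooled = Task.Run(() => RetryStartWithBackoffAsync(cancellationToken), cancellationToken);
    }
}
```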

string message = $"Retrying to start listener for function '{_descriptor.ShortName}' (Attempt {attempt})";
_trace.Info(message);
_logger?.LogInformation(message);

Member

I wonder, is this going to be really noisy in the case of misconfigured functions where we never successfully start the listener? If so, can we use some sort of category here to make it easy to filter the noise out?

Member Author

What do you mean by "misconfigured"? Many issues will be indexing-time failures (e.g. the app setting for Connection doesn't exist, and in many cases even an invalid connection string for bindings like queue that do up-front validation). If the function is truly broken, the backoff will max out at once every 2 minutes, which won't be very noisy. But yes, there will be an initial flurry of retries.

I'm open to putting them in a category.
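For context, a rough sketch of the backoff shape described above. This is illustrative only: it borrows the DelayWithBackoffAsync signature shown later in this diff and the 2-minute cap mentioned above, but the body is an assumption, not the PR's actual implementation (nor the SDK's RandomizedExponentialBackoffStrategy):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class BackoffSketch
{
    private static readonly Random _random = new Random();

    // Illustrative body only: the delay grows as unit * 2^exponent with a
    // little random jitter, clamped to [min, max] (e.g. a 2-minute ceiling).
    public static async Task DelayWithBackoffAsync(int exponent, CancellationToken cancellationToken,
        TimeSpan? unit = null, TimeSpan? min = null, TimeSpan? max = null)
    {
        TimeSpan unitValue = unit ?? TimeSpan.FromSeconds(1);
        TimeSpan minValue = min ?? TimeSpan.Zero;
        TimeSpan maxValue = max ?? TimeSpan.FromMinutes(2);

        double seconds = unitValue.TotalSeconds * Math.Pow(2, exponent);
        seconds *= 0.8 + (_random.NextDouble() * 0.4); // +/- 20% jitter
        seconds = Math.Max(minValue.TotalSeconds, Math.Min(seconds, maxValue.TotalSeconds));

        await Task.Delay(TimeSpan.FromSeconds(seconds), cancellationToken);
    }
}
```

With these assumed defaults, retries would land at roughly 1s, 2s, 4s, 8s, ... and settle at the 2-minute ceiling after a handful of attempts, matching the "initial flurry, then quiet" behavior described.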

Member Author

I'm thinking we don't use a separate category for these for now and wait to see if it's actually a problem; it's always easy to add later. I started creating a new logger instance for these under a different category and it seemed unnecessarily artificial.

// swallow and log exceptions since at this point we're in a
// retry loop running in the background
_trace.Error(ex.Message, ex);
_logger?.LogError(0, ex, ex.Message);
Member

Same as above: wondering whether this will generate lots of log noise.

Member Author

Noise for us, you mean? Or for customers? I don't think it is noise for customers; they should be seeing these.

@paulbatum
Member

It's a bit hard for me to see from the diff: is there any public surface area change here? It looks like the flag is on internal types, but maybe I'm misunderstanding.

@mathewc
Member Author

mathewc commented Mar 28, 2018

The new public surface area is the new JobHostConfiguration property; it's a new feature that people can use.

@mathewc mathewc force-pushed the mathewc-work-v2.x branch from 215dea4 to 0656285 Compare March 28, 2018 23:18
@mathewc mathewc requested a review from fabiocav March 28, 2018 23:18
@paulbatum
Member

Ahh yes that makes sense, thanks.

Member

@brettsam brettsam left a comment

/// <param name="min">The minimum delay.</param>
/// <param name="max">The maximum delay.</param>
/// <returns>A <see cref="Task"/> that completes after the computed backoff delay.</returns>
public static async Task DelayWithBackoffAsync(int exponent, CancellationToken cancellationToken, TimeSpan? unit = null, TimeSpan? min = null, TimeSpan? max = null)
Member

We already have a RandomizedExponentialBackoffStrategy -- can you use that existing logic here?

Member Author

Yes, I'll move to that and remove the new code I added for this. I had considered that previously but thought the core backoff stuff was tied too deeply to the TaskSeriesTimer stuff (I was wrong).

@mathewc mathewc force-pushed the mathewc-work-v2.x branch from 0656285 to e7f69bb Compare March 29, 2018 23:46