Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Partial Host Startup support for transient startup failures (#2497) #2606

Closed
wants to merge 1 commit into from

Conversation

mathewc
Copy link
Member

@mathewc mathewc commented Mar 30, 2018

Fix for #2497. See details in PR for SDK feature here: Azure/azure-webjobs-sdk#1647.

The main impact for Functions is that now listener level exceptions will be retried behind the scenes. E.g. you could see these changes in action if you modify the EventHubTrigger sample function by changing the 'path' property of the trigger binding to an invalid value. Previously we would log the error and start the host with that function disabled. Now, we'll still allow the host to start, but we'll also continue trying to start that function's listener in the background, exponentially backing off. While the host is up and retrying, if you went and added that path to your EventHub the listener would start and the function would be active.

As the issue indicates this is intended to address transient listener failures. However we will now be retrying all listener exceptions, meaning if a function is permanently broken and the failure is a listener level failure, we'll retry. Most broken functions will result in indexing failures not listener failures, so hopefully this won't generate too much background activity in general. People should be fixing (or disabling) their broken functions anyways.

Another change here is that the new host.json config setting 'allowPartialHostStartup' can now be set by the user, so people can turn this behavior off if they wanted too.

@@ -1795,9 +1804,6 @@ private void HandleHostError(Exception exception)
// Also notify the invoker so the error can also be written to the function
// log file
NotifyInvoker(functionException.MethodName, functionException);

// Mark the error as handled so execution will continue with this function disabled
functionException.Handled = true;
Copy link
Member Author

@mathewc mathewc Mar 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer need to set this - the host level knob replaces this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming we already had some tests validating this, and they just continue passing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. However I just added a bit to that test to verify the background retries.

This test was a ServiceBus listener test, and it turned out that the retries exposed a bug in ServiceBusListener state management. See supporting PR https://github.com/Azure/azure-webjobs-sdk/pull/1650/files

@mathewc mathewc requested review from paulbatum and brettsam March 30, 2018 01:40
@mathewc mathewc changed the title Partial Host support for transient indexing failures (#2497) Partial Host Startup support for transient indexing failures (#2497) Mar 30, 2018
// Default AllowHostPartialStartup to true, but allow it
// to be overridden by config
hostConfig.AllowPartialHostStartup = true;
JToken allowPartialHostStartup = (JToken)config["allowPartialHostStartup"];
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users can now turn this feature off completely if they wish.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a 'full' test where we turn this off with a 'bad' binding? What happens? Does the host throw an exception and the ScriptHostManager goes into it's backoff/retry loop?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this off, you'll get normal SDK behavior. The host will fail to start and will enter the ScriptHostManager host start loop, as it would for any other unhandled host level error.

@mathewc mathewc changed the title Partial Host Startup support for transient indexing failures (#2497) Partial Host Startup support for transient startup failures (#2497) Mar 30, 2018
Copy link
Member

@brettsam brettsam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had one question about whether we need a new test. Otherwise, looks good.

@@ -1795,9 +1804,6 @@ private void HandleHostError(Exception exception)
// Also notify the invoker so the error can also be written to the function
// log file
NotifyInvoker(functionException.MethodName, functionException);

// Mark the error as handled so execution will continue with this function disabled
functionException.Handled = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming we already had some tests validating this, and they just continue passing?

// Default AllowHostPartialStartup to true, but allow it
// to be overridden by config
hostConfig.AllowPartialHostStartup = true;
JToken allowPartialHostStartup = (JToken)config["allowPartialHostStartup"];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a 'full' test where we turn this off with a 'bad' binding? What happens? Does the host throw an exception and the ScriptHostManager goes into it's backoff/retry loop?

@mathewc mathewc closed this Mar 30, 2018
@mathewc mathewc deleted the partial_startup branch May 18, 2018 00:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants