
Transient indexing failures can inadvertently disable functions #2497

Closed
cgillum opened this issue Mar 9, 2018 · 13 comments


cgillum commented Mar 9, 2018

This was a finding based on an issue raised by a customer: Azure/Azure-Functions#709. Here is effectively what we observed:

  1. The host restarts (for whatever reason - it's not actually important)
  2. While indexing an Event Hub trigger function, there is a timeout when accessing an Event Hubs API.
  3. The host's indexer skips the Event Hub trigger function because of the error and continues the startup process.
  4. Customer's Event Hub trigger stops running for hours until they manually restart the function app.

At a high level, this feels like a design flaw in the host startup path. If a function cannot be indexed, we should prevent the host from starting and notify the owner. Swallowing these kinds of transient failures could result in downtime.

Investigative information

  • Timestamp: 2018-03-07 07:06:53 (UTC)
  • Function App version (1.0 or 2.0-beta): 1.0
  • Function App name: Search for instance ID '8c75169cff7e98bbc4fd1e6e30cd707d'.
  • Function name(s) (as appropriate): send_to_router
  • Invocation ID: N/A
  • Region: UK South (LN1)

/CC @mathewc


mathewc commented Mar 16, 2018

In the WebJobs core SDK, JobHost startup is all or nothing: if any listener fails to start, the host fails to start and nothing runs. In the context of a continuous WebJob deployment, the continuous job will try to start up periodically, eventually succeeding once the transient issue is gone.

For Azure Functions, we did work in the core SDK to allow such listener startup exceptions to be handled by the Functions runtime: on startup, the Functions runtime host marks listener failures as handled, and host startup continues for the other functions. Let's call this partial host startup. We log the fact that the affected functions aren't running. This work was done primarily to address completely broken functions (e.g. invalid/missing connection string, etc.); we wanted to allow other functions to run even if a few are broken. However, that didn't take into account the possibility of transient errors.
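The partial-startup behavior described above can be sketched as follows. This is an illustrative Python sketch, not the actual WebJobs SDK code; the function name and the `(name, start)` listener pairs are hypothetical:

```python
def start_host_partially(listeners):
    """Partial host startup sketch: try to start every function listener,
    treat each failure as 'handled', log it, and keep going so the
    remaining functions still run. This is where a transient error can
    silently leave a function disabled."""
    running, disabled = [], []
    for name, start in listeners:
        try:
            start()
            running.append(name)
        except Exception as exc:
            # Failure is swallowed (the "Handled = true" path); the
            # function simply stays off until the app is restarted.
            print(f"Listener for '{name}' failed to start: {exc}")
            disabled.append(name)
    return running, disabled
```

A listener that hits a transient Event Hubs timeout here ends up in `disabled` exactly like a permanently broken one, which is the flaw this issue describes.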

To address the transient failure issue, when the host is in partial startup with some function listeners not started, I propose we introduce retries with backoff on the affected listeners. Proposed design for core SDK:

  • To the existing FunctionListenerException, in addition to its Handled bool property (which we set to true to ignore these errors), we add another property, e.g. Retry, allowing the handler to specify that the exception should be retried.
  • We'd then modify the Functions runtime to set the Retry property, causing the host to start up partially and failed listeners to retry in the background until they start.
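A minimal sketch of the per-listener background retry loop this proposes, assuming exponential backoff with a cap and jitter. The function name, parameters, and defaults here are hypothetical, not the actual SDK implementation:

```python
import random
import time


def retry_listener_start(start, base_delay=2.0, max_delay=300.0):
    """Background retry sketch for a single failed listener: keep calling
    its start routine, doubling the wait between attempts up to a cap,
    until the transient error clears and the listener starts."""
    attempt = 0
    while True:
        try:
            start()
            return attempt  # number of failed attempts before success
        except Exception:
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay))  # backoff + jitter
            attempt += 1
```

In the proposal, one such loop would run in the background per failed listener after partial host startup, so healthy functions keep running while the affected ones recover on their own.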

In addition, we'll turn off partial mode completely in cases where the function app is read-only, e.g. when using RUN_FROM_ZIP or when the app is precompiled.

@davidebbo
Contributor

> In addition, we'll turn off partial mode completely in cases where the function app is read-only

So what would be the behavior in that case?


mathewc commented Mar 16, 2018

In that case, the behavior is similar to the core SDK behavior I mentioned above: the Functions host will fail startup and enter its existing host startup retry loop (exponential backoff). The difference compared to partial startup is that the restart happens at the host level rather than at the individual function level.

Alternatively, we could have consistent behavior and apply the partial-host-plus-retries logic in all cases.

@paulbatum paulbatum modified the milestones: Sprint 20, Backlog Mar 21, 2018
@bryceg

bryceg commented Mar 26, 2018

We seem to be experiencing this frequently with our functions that trigger off Event Hubs. They periodically just stop triggering and go silent. Restarting the App Service or enabling/disabling the function a few times gets it working again, but this keeps causing production downtime for us.

@sergsalo

Any suggestions for what can be done as a workaround?


mathewc commented Mar 28, 2018

I can't think of a workaround, unfortunately. I have a PR out now with a fix, which we'll try to get out soon.


mathewc commented Mar 30, 2018

Fix is in and will go out in the next release.

@mathewc mathewc closed this as completed Mar 30, 2018
@paulbatum
Member

@mathewc Reopening, as I only see a merge for V1 right now. Is there a separate issue tracking this change for V2? I couldn't find one. If there is, feel free to re-close this.

@paulbatum paulbatum reopened this Mar 30, 2018

mathewc commented Mar 30, 2018

I will be merging to dev in the next day or so, as per my normal workflow.

@paulbatum
Member

That's fine. I just don't want issues closed until the change is in both branches (or tracked by two separate issues).

@davidebbo
Contributor

And increasingly, we should consider making such fixes in v2 only, if they're non-critical.

@sergsalo

We are running v1 Azure Functions, which is experiencing these issues.


mathewc commented Apr 30, 2018

Changes for v2 have been merged to Dev. For v1, we've already released the changes as part of https://github.com/Azure/azure-functions-host/releases/tag/v1.0.11702

@mathewc mathewc closed this as completed Apr 30, 2018
@ghost ghost locked as resolved and limited conversation to collaborators Jan 1, 2020