
Transient indexing failures can inadvertently disable functions #2497

Closed
cgillum opened this issue Mar 9, 2018 · 13 comments


cgillum commented Mar 9, 2018

This was a finding based on an issue raised by a customer: Azure/Azure-Functions#709. Here is effectively what we observed:

  1. The host restarts (for whatever reason - it's not actually important)
  2. While indexing an Event Hub trigger function, there is a timeout when accessing an Event Hubs API.
  3. The host's indexer skips the Event Hub trigger function because of the error and continues the startup process.
  4. Customer's Event Hub trigger stops running for hours until they manually restart the function app.

At a high level, this feels like a design flaw in the host startup path. If a function cannot be indexed, we should prevent the host from starting and notify the owner. Swallowing these kinds of transient failures could result in downtime.

Investigative information

  • Timestamp: 2018-03-07 07:06:53 (UTC)
  • Function App version (1.0 or 2.0-beta): 1.0
  • Function App name: Search for instance ID '8c75169cff7e98bbc4fd1e6e30cd707d'.
  • Function name(s) (as appropriate): send_to_router
  • Invocation ID: N/A
  • Region: UK South (LN1)

/CC @mathewc


mathewc commented Mar 16, 2018

In the WebJobs core SDK, JobHost startup is all or nothing: if any listener fails to start, the host fails to start and nothing runs. In the context of a continuous WebJob deployment, the continuous job will try to start up periodically, eventually succeeding once the transient issue is gone.

For Azure Functions, we did work in the core SDK to allow such listener startup exceptions to be handled by the Functions runtime: on startup, the Functions runtime host marks listener failures as handled, and host startup continues for the other functions. Let's call this partial host startup. We log the fact that the affected functions aren't running. This work was done primarily to address completely broken functions (e.g. invalid/missing connection string, etc.); we wanted to allow other functions to run even if a few are broken. However, that didn't take into account the possibility of transient errors.
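The partial-startup behavior described above can be sketched as follows. This is an illustrative Python sketch, not the actual WebJobs SDK code; the function name and the `(name, start)` listener pairs are hypothetical:

```python
def start_host_partially(listeners):
    """Partial host startup sketch: try to start every function listener,
    treat each failure as 'handled', log it, and keep going so the
    remaining functions still run. This is where a transient error can
    silently leave a function disabled."""
    running, disabled = [], []
    for name, start in listeners:
        try:
            start()
            running.append(name)
        except Exception as exc:
            # Failure is swallowed (the "Handled = true" path); the
            # function simply stays off until the app is restarted.
            print(f"Listener for '{name}' failed to start: {exc}")
            disabled.append(name)
    return running, disabled
```

A listener that hits a transient Event Hubs timeout here ends up in `disabled` exactly like a permanently broken one, which is the flaw this issue describes.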

To address the transient failure issue, when the host is in partial startup with some function listeners not started, I propose we introduce retries with backoff on the affected listeners. Proposed design for core SDK:

  • To the existing FunctionListenerException, in addition to its Handled bool property (which we set to true to ignore these errors), we add another property, e.g. Retry, allowing the handler to specify that the exception should be retried.
  • We'd then modify the Functions runtime to set the Retry property, causing the host to start up partially and failed listeners to retry in the background until they start.
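A minimal sketch of the per-listener background retry loop this proposes, assuming exponential backoff with a cap and jitter. The function name, parameters, and defaults here are hypothetical, not the actual SDK implementation:

```python
import random
import time


def retry_listener_start(start, base_delay=2.0, max_delay=300.0):
    """Background retry sketch for a single failed listener: keep calling
    its start routine, doubling the wait between attempts up to a cap,
    until the transient error clears and the listener starts."""
    attempt = 0
    while True:
        try:
            start()
            return attempt  # number of failed attempts before success
        except Exception:
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay))  # backoff + jitter
            attempt += 1
```

In the proposal, one such loop would run in the background per failed listener after partial host startup, so healthy functions keep running while the affected ones recover on their own.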

In addition, we'll turn off partial mode completely in cases where the function app is read-only, e.g. when using RUN_FROM_ZIP or when the app is precompiled.

@davidebbo
Contributor

> In addition, we'll turn off partial mode completely in cases where the function app is read-only

So what would be the behavior in that case?


mathewc commented Mar 16, 2018

In that case, the behavior is similar to the core SDK behavior I mentioned above: the Functions host will fail startup and enter its existing host startup retry loop (exponential backoff). The difference compared to partial startup is that the restart happens at the host level rather than at the individual function level.

Alternatively, we could have consistent behavior and apply the partial-host-plus-retries logic in all cases.

@paulbatum paulbatum modified the milestones: Sprint 20, Backlog Mar 21, 2018
@bryceg

bryceg commented Mar 26, 2018

We seem to be experiencing this frequently with our functions that trigger off Event Hubs. They periodically just stop triggering and go silent. Restarting the App Service or enabling/disabling the function a few times gets it working again, but this keeps causing production downtime for us.

@sergsalo

Any suggestions for what can be done as a workaround?


mathewc commented Mar 28, 2018

I can't think of a workaround, unfortunately. I have a PR out now with a fix, which we'll try to get out soon.


mathewc commented Mar 30, 2018

Fix is in and will go out in the next release.

@mathewc mathewc closed this as completed Mar 30, 2018
@paulbatum
Member

@mathewc Reopening, as I only see a merge for V1 right now. Is there a separate issue tracking this change for V2? I couldn't find one. If there is, feel free to re-close this.

@paulbatum paulbatum reopened this Mar 30, 2018

mathewc commented Mar 30, 2018

I will be merging to dev in the next day or so, as per my normal workflow.

@paulbatum
Member

That's fine. I just don't want issues closed until the change is in both branches (or tracked by two separate issues).

@davidebbo
Contributor

And increasingly, we should consider making such fixes in v2 only, if they're non-critical.

@sergsalo

We are running v1 Azure Functions, which is experiencing these issues.


mathewc commented Apr 30, 2018

Changes for v2 have been merged to Dev. For v1, we've already released the changes as part of https://github.com/Azure/azure-functions-host/releases/tag/v1.0.11702

@mathewc mathewc closed this as completed Apr 30, 2018
@ghost ghost locked as resolved and limited conversation to collaborators Jan 1, 2020