Transient indexing failures can inadvertently disable functions #2497
Comments
In the WebJobs core SDK, JobHost startup is all or nothing - if any listeners fail to start, the host fails to start and nothing runs. In the context of a continuous WebJob deployment, the continuous job will try to start up periodically, eventually succeeding once the transient issue is gone. For Azure Functions, we did work in the core SDK to allow such listener startup exceptions to be handled by the Functions runtime - on startup, the Functions runtime host will mark listener failures as handled, and host startup for other functions will continue. Let's call this partial host startup. We log the fact that the affected functions aren't running. This work was done primarily to address completely broken functions (e.g. an invalid or missing connection string). We wanted to allow other functions to run even if a few are broken. However, that didn't take into account the possibility of transient errors. To address the transient failure issue, when the host is in partial startup with some function listeners not started, I propose we introduce retries with backoff on the affected listeners. Proposed design for core SDK:
In addition, we'll turn off partial mode completely in cases where the function app is read-only, e.g. when using RUN_FROM_ZIP or when the app is precompiled. |
So what would be the behavior in that case? |
In that case, the behavior is similar to the core SDK behavior I mentioned above. The functions host will fail startup and enter its existing host startup retry loop (exponential backoff). The difference in this case compared to partial startup is that the restart is at the host level, rather than at the individual function level. Alternatively, we could have consistent behavior and do the partial host + retries logic in all cases. |
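For readers following along, the per-listener retry idea discussed above can be sketched roughly as follows. This is only an illustrative Python model, not the host's actual C# implementation: the names `start_host`, `retry_failed`, and `backoff_delays`, the delay values, and the attempt count are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Iterable, List


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 60.0, attempts: int = 5) -> List[float]:
    """Exponential backoff schedule, capped at `cap` seconds (values are illustrative)."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def start_host(listeners: Dict[str, Callable[[], None]]) -> Dict[str, Exception]:
    """Partial startup: try every listener and record failures instead of aborting."""
    failed: Dict[str, Exception] = {}
    for name, start in listeners.items():
        try:
            start()
        except Exception as exc:  # treated as "handled" so other functions keep running
            failed[name] = exc
    return failed


def retry_failed(listeners: Dict[str, Callable[[], None]],
                 failed_names: Iterable[str],
                 delays: Iterable[float],
                 sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Retry only the listeners that failed; return the names that never started."""
    pending = list(failed_names)
    for delay in delays:
        if not pending:
            break
        sleep(delay)
        still_failing = []
        for name in pending:
            try:
                listeners[name]()
            except Exception:
                still_failing.append(name)
        pending = still_failing
    return pending
```

In this model a listener that fails for a transient reason recovers on a later attempt without restarting the whole host, while a genuinely broken one (for example, a missing connection string) stays in the returned list so it can be logged. The host-level alternative would instead rerun all of `start_host` on each backoff interval.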
We seem to be experiencing this frequently with our functions that trigger off Event Hubs. They will periodically just stop triggering and go silent. Restarting the app service or disabling and re-enabling the function a few times gets it working again, but this keeps causing production downtime for us. |
Any suggestions on what can be done as a workaround? |
I can't think of any workaround unfortunately. I have a PR out now with a fix, which we'll try to get out soon. |
Fix is in and will go out in the next release. |
@mathewc Reopening as I only see a merge for V1 right now, is there a separate issue tracking this change for V2? I couldn't find it. If so feel free to reclose this. |
I will be merging to dev in the next day or so, as per my normal workflow. |
That's fine, I just don't want to have issues closed until the change is in both branches (or tracked by two separate issues). |
And more and more, we should consider just doing v2 if non-critical. |
We are running v1 Azure Functions, which experience these issues. |
Changes for v2 have been merged to Dev. For v1, we've already released the changes as part of https://github.com/Azure/azure-functions-host/releases/tag/v1.0.11702 |
This was a finding based on an issue raised by a customer: Azure/Azure-Functions#709. Here is effectively what we observed:
At a high level, this feels like a design flaw in the host startup path. If a function cannot be indexed, we should prevent the host from starting and notify the owner. Swallowing these kinds of transient failures could result in downtime.
Investigative information
/CC @mathewc