Configurable Retry Logic - Preview 1 #693
Conversation
Where is the proposal?
There's been some previous conversation involving @scoriani in #416. As I wrote there (#416 (comment)), I'm personally quite sceptical of the idea of a database driver attempting to internally handle resiliency; the various complexities around what to do when something fails in the middle of a transaction point towards resiliency being handled at a higher layer. But I'm definitely interested in hearing views to the contrary - any input on how the concerns raised in #416 are to be addressed?
The main aim of this PR is to realize the best predefined retry solutions and to ensure this design can support them, in addition to supporting customers' customized approaches.
I'm still very confused about how this approach will actually work, especially around transactions - how can any sort of driver retry logic handle things, regardless of whether the logic is user-injected or not? I can't see how any meaningful retry strategy can work without an additional layer on top of the driver, at which point any sort of driver support seems unneeded (or rather, unusable). I may be missing the big picture here (or just misunderstanding what's being attempted)... would it be possible to get some sort of short description of how this concept works around transactions?
@roji The push is coming from application migrations from on-premise databases to cloud databases. There is much desire to get something out and provide an infrastructure for extending retry scenarios without major application re-writes. While cloud services offer uptimes like 99.99%, "outages" of a few seconds, due to failover for example, can occur at almost any time. Many of those applications being migrated were designed for a database that used to sit right next to it and rarely suffered from little "blips" like this, so designing for them was overkill. Transaction scenarios are probably not all going to be solved in this implementation. But depending on the error returned, it may even be safe to retry an operation within a transaction. I think providing retry capabilities at least gives customers potential options short of a major refactor. This will be off by default and give customers new knobs to test with.
@David-Engel I do understand why people are looking for this - it's very tempting to speak of turning a knob and getting 100% resiliency in the face of transient errors, without any code changes. Unfortunately, I think it's simply not possible to do this at such a low level in the stack (ADO.NET driver), and that resiliency is a problem that must be faced at a higher level up the stack (e.g. by something like EF or the application itself, via something like Polly). Here are a few of the problematic scenarios I'm thinking about:
There are other similar issues - this is just off the top of my head. Now, IMHO this is one of those cases where providing a very partial solution which does not cover many cases does more harm than good; people see an official "SqlClient resiliency" knob and get a very false sense of confidence, when in fact they may still be vulnerable to serious issues and data corruption in many cases. I guess I'd investigate exactly what this would cover, and what not. For example, if SQL Azure transient exceptions do keep the transaction usable (i.e. allow retrying a single failed command), then maybe this makes sense, if the un-covered cases are clearly documented etc. If you've already done such an investigation and I'm unaware, then apologies... But without some knowledge of what exactly this will address, I think this could cause more problems than it solves.
@roji I pretty much agree with you. I think a lot of scenarios will end up being more complex than can be covered by a feature at the level of SqlClient. However, this effort is being pushed from higher up. So we are going to try to move it along to see what works and doesn't work. Please note, this feature will be under a context switch and be considered preview so that users can begin trying it and seeing where it falls down and where it helps. There are currently no plans to make it a GA feature in an upcoming release. Much of the groundwork and investigation was done previously by @scoriani. He might be able to answer your questions specific to transactions during transient errors against Azure SQL.
Understood. It will certainly be interesting to see how this evolves and what it covers. I'd also be interested in further info from @scoriani or others on what can actually be covered and what can't.
Apologies for my absence in this thread despite multiple references. @roji, make no mistake, your concerns are extremely valid and shared. And yes, we've been recommending for quite some time that customers implement retry strategies higher in the stack, through tools like Polly, TFH, etc., in order to account for app-specific logic and requirements; but the reality is that only a small fraction of them can afford anything beyond connection-string or other simple changes in their codebases.
Force-pushed from aa74fe5 to b6a1a6c:
# Conflicts:
#   src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlCommand.cs
#   src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlConnection.cs
Force-pushed from b6a1a6c to 9b344fc.
Thanks for your thoughts and for the added details @scoriani. I think we're all more or less aligned with regards to the partiality of the proposed solution, and I understand the customer demand here. A few last comments:
That is certainly an ambitious undertaking... And if the goal really is robust resiliency which does not require application awareness, then I agree that the server-side investments would be a prerequisite. FWIW I still think the goal itself is problematic; I know very little about Azure SQL, but I was made to understand that at least earlier iterations shifted the bulk of the reliability/resiliency problem to the client; to me this seems to be the problematic part. I really do believe that resiliency either needs to be handled at the server side (without the driver or application needing to know about it), or by the application. FWIW I'm not aware of other database/cloud products/services where the bulk of resiliency is shifted to the driver in the way this proposal does it - this seems somewhat unique. But I definitely haven't done a detailed analysis on this.
OK. I hope the basic coverage this can provide will be sufficient, and that those customers understand the intricacies of what can and can't be done here. Good luck, and I'll be curious to follow this!
Thanks @roji, and please keep the great feedback coming!
In my experience, this is true for most database services, as high availability is implemented with remarkably similar designs everywhere. Just as a quick example:
Some additional thoughts on the overall design: I see automatic retry being problematic for most applications if it is just an on/off thing for all queries. Some other knobs to consider:
CC: @scoriani
Force-pushed from fd7c1ed to 5f9cca9.
* Skip retry in a transactional command
* Skip retry in unauthorized SQL statements
* Apply a custom transient-faults list
* Configurable retry provider factory for pre-defined strategies
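As an aside for reviewers, the "unauthorized SQL statements" filter above maps to a caller-supplied predicate over the command text. A minimal sketch, assuming the SqlRetryLogicOption and AuthorizedSqlCondition names from this PR's proposed surface (the read-only predicate is purely illustrative):

```csharp
using System;
using Microsoft.Data.SqlClient;

// Illustrative only: restrict retries to read-only statements, so
// data-modifying commands are never implicitly re-executed.
var options = new SqlRetryLogicOption
{
    NumberOfTries = 3,
    DeltaTime = TimeSpan.FromSeconds(1),
    MaxTimeInterval = TimeSpan.FromSeconds(10),
    // Statements failing this predicate are "unauthorized" and run only once.
    AuthorizedSqlCondition = sql =>
        sql != null && sql.TrimStart().StartsWith("SELECT", StringComparison.OrdinalIgnoreCase)
};
var provider = SqlConfigurableRetryFactory.CreateFixedRetryProvider(options);
```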
Force-pushed from 5f9cca9 to f0ada60.
temporary for netcore
# Conflicts:
#   src/Microsoft.Data.SqlClient/netcore/src/Microsoft.Data.SqlClient.csproj
#   src/Microsoft.Data.SqlClient/netfx/src/Microsoft.Data.SqlClient.csproj
Force-pushed from 63e7dc3 to 0a576cf.
}
// </Snippet1>
// <Snippet2>
private static void RetryCommand(SqlRetryLogicBaseProvider provider)
Can you add some comments to distinguish these private static void RetryCommand(SqlRetryLogicBaseProvider provider) samples?
You can find the description in the related SqlCommand.xml file. There is a common explanation for this series of samples. Feel free to suggest changes here if you need something more.
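For readers without SqlCommand.xml at hand, one such sample looks roughly like the sketch below; this is a hypothetical reconstruction, not the snippet under review, and s_connectionString plus the query are placeholders:

```csharp
using Microsoft.Data.SqlClient;

internal static class RetrySamples
{
    // Placeholder connection string; the real samples read it from configuration.
    private static readonly string s_connectionString =
        "Server=.;Database=master;Integrated Security=true;";

    // Hypothetical reconstruction of one RetryCommand sample; the real
    // snippets live in the SqlCommand.xml documentation file.
    private static void RetryCommand(SqlRetryLogicBaseProvider provider)
    {
        using (var connection = new SqlConnection(s_connectionString))
        {
            connection.Open();
            using (var command = connection.CreateCommand())
            {
                command.CommandText = "SELECT 1;"; // placeholder query
                // Attach the provider so execution is retried on transient faults.
                command.RetryLogicProvider = provider;
                command.ExecuteScalar();
            }
        }
    }
}
```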
|10053|Could not convert the data value due to reasons other than sign mismatch or overflow.|
|997|A connection was successfully established with the server, but then an error occurred during the login process. (provider: Named Pipes Provider, error: 0 - Overlapped I/O operation is in progress)|
|233|A connection was successfully established with the server, but then an error occurred during the login process. (provider: Shared Memory Provider, error: 0 - No process is on the other end of the pipe.) (Microsoft SQL Server, Error: 233)|
|64||
No descriptions for 64, 20, and 0?
Thank you for bringing it up. I couldn't find an acceptable description for these error numbers, especially error number 0.
Any suggestions @David-Engel @cheenamalhotra?
@DavoudEshtehari Did these error numbers come from Silvano?
Except for number 997, all of them came from Silvano.
For error 0, the exceptions below are seen (researched, though it may not be conclusive):
Microsoft.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - No such host is known.)
This is retriable, as the message "No such host is known" is a known retriable error.
A connection was successfully established with the server, but then an error occurred during the login process. (provider: SSL Provider, error: 0 - The target principal name is incorrect.)
This is an error generated by the domain controller during authentication. More info: https://docs.microsoft.com/en-us/troubleshoot/windows-server/identity/target-principal-name-is-incorrect-when-replicating-data
No info on retry options.
They are server timeout errors, which are also retriable.
Summarizing: error 0 is a very generic number and should ideally not be retried on. But looking at the errors that arise with it, it may be okay to retry.
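If these numbers are deemed worth retrying on, callers could also opt in explicitly via a custom transient-faults list rather than the driver's default one; a minimal sketch, assuming the SqlRetryLogicOption.TransientErrors property and SqlConfigurableRetryFactory from this PR's proposed surface:

```csharp
using System;
using Microsoft.Data.SqlClient;

// Opt in to the ambiguous error numbers discussed above explicitly,
// instead of relying on the driver's default transient-error list.
var options = new SqlRetryLogicOption
{
    NumberOfTries = 3,
    DeltaTime = TimeSpan.FromSeconds(2),
    MaxTimeInterval = TimeSpan.FromSeconds(30),
    TransientErrors = new[] { 0, 20, 64, 233, 997, 10053 }
};
var provider = SqlConfigurableRetryFactory.CreateIncrementalRetryProvider(options);
```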
Co-authored-by: Karina Zhou <v-jizho2@microsoft.com>
+ improvement
This PR is created for discussion and to share the progress of the 'Configurable Retry Logic' proposed by Silvano Coriani.
In the first step, it is designed to support customers' custom provider implementations.
Overview
An application that communicates with elements running in a service, especially in the cloud, must be sensitive to the transient faults that can occur in this environment. These faults are typically self-correcting, and if the action that triggered a fault is repeated after a suitable delay, it is likely to succeed.
Since this feature is at the preview stage, the following application context switch must be enabled:
AppContext.SetSwitch("Switch.Microsoft.Data.SqlClient.EnableRetryLogic", true);
Goal
Apply the retry policy to opening a connection and executing a command, in the expectation that they will eventually succeed.
Proposal
There are two sections:
1. Core API
Generally, to support retry logic on an operation, implementations of the following types are required:
Enumerator: generates a sequence of time intervals.
SqlRetryIntervalBaseEnumerator
Retry logic: retrieves the next time interval, with respect to the number of retries, if a transient condition occurs.
SqlRetryLogicBase
Retry provider: applies retry logic to an operation.
SqlRetryLogicBaseProvider
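As a rough illustration of how these three pieces compose, the sketch below wraps a connection open in a provider via Execute; the Execute<TResult>(object sender, Func<TResult> function) shape is an assumption based on the proposed surface:

```csharp
using System;
using Microsoft.Data.SqlClient;

static class CoreApiUsage
{
    // A provider (backed by a retry logic and its interval enumerator)
    // wraps an arbitrary operation and re-invokes it on transient faults.
    public static void OpenWithRetry(SqlRetryLogicBaseProvider provider, string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            provider.Execute(connection, () =>
            {
                connection.Open();
                return true; // Execute<TResult> requires a return value
            });
        }
    }
}
```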
2. Configurable retry logic
An implementation built on the core API is included in the driver.
This implementation is supposed to cover all or a subset of transient faults, and it will improve alongside the discussions here.
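A minimal end-to-end sketch of using the driver-supplied implementation, assuming the SqlConfigurableRetryFactory factory and the RetryLogicProvider property from the proposed surface (the connection string is a placeholder):

```csharp
using System;
using Microsoft.Data.SqlClient;

class BuiltInRetryExample
{
    static void Main()
    {
        // Preview gate: retry logic stays off unless this switch is enabled.
        AppContext.SetSwitch("Switch.Microsoft.Data.SqlClient.EnableRetryLogic", true);

        var provider = SqlConfigurableRetryFactory.CreateExponentialRetryProvider(
            new SqlRetryLogicOption
            {
                NumberOfTries = 5,
                DeltaTime = TimeSpan.FromSeconds(1),
                MaxTimeInterval = TimeSpan.FromSeconds(20)
            });

        using (var connection = new SqlConnection("Server=.;Database=master;Integrated Security=true;"))
        {
            connection.RetryLogicProvider = provider; // Open() is retried per the strategy
            connection.Open();
        }
    }
}
```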
Notes & Thoughts
Each environment has its own configuration and probable faults, so it is the customer's responsibility to monitor and detect them, in order to derive measurable configuration values, before using the retry logic. It is strongly recommended not to include all transient errors without investigating the environment.
Customers can implement their own retry logic against the core API if the built-in retry logic does not cover their requirements.
The best practice is to apply a specific configuration to each specific transient error, according to the fault's cause. This approach yields the best performance, but it increases complexity and decreases usability.
On the other hand, to simplify retry configuration for the user, we must sacrifice some performance and apply a common configuration to all active transient errors (see the sketch after these notes).
Despite the vital role of transactions, in the beginning a transaction will be treated as a regular connection, since proper support will probably require work from the SQL teams.
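To make the per-fault versus common-configuration trade-off above concrete, here is a minimal sketch; the factory and option names are assumptions matching the proposed surface, and the error-number grouping is purely illustrative:

```csharp
using System;
using Microsoft.Data.SqlClient;

static class RetryConfigurations
{
    // Per-fault approach: a tight, fast policy for deadlocks and a slower,
    // longer policy for failovers. Best performance, more complexity.
    public static readonly SqlRetryLogicBaseProvider DeadlockProvider =
        SqlConfigurableRetryFactory.CreateFixedRetryProvider(new SqlRetryLogicOption
        {
            NumberOfTries = 3,
            DeltaTime = TimeSpan.FromMilliseconds(200),
            MaxTimeInterval = TimeSpan.FromSeconds(2),
            TransientErrors = new[] { 1205 } // deadlock victim
        });

    public static readonly SqlRetryLogicBaseProvider FailoverProvider =
        SqlConfigurableRetryFactory.CreateExponentialRetryProvider(new SqlRetryLogicOption
        {
            NumberOfTries = 5,
            DeltaTime = TimeSpan.FromSeconds(5),
            MaxTimeInterval = TimeSpan.FromSeconds(60),
            TransientErrors = new[] { 4060, 40613 } // database unavailable / failover
        });

    // Common-configuration approach: one middle-of-the-road policy for all
    // active transient errors. Simpler, but not optimal for any single fault.
    public static readonly SqlRetryLogicBaseProvider CommonProvider =
        SqlConfigurableRetryFactory.CreateIncrementalRetryProvider(new SqlRetryLogicOption
        {
            NumberOfTries = 4,
            DeltaTime = TimeSpan.FromSeconds(2),
            MaxTimeInterval = TimeSpan.FromSeconds(30)
        });
}
```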
Related discussions
Resources