WIP: Add retry logic to connection open and command execution #307

Closed
wants to merge 14 commits

Conversation

scoriani

  • Add basic retry logic capabilities to SqlClient core for now, derived from the Transient Fault Handling Application Block.
  • Add new connection string attributes and related validations to govern retry strategy and behavior.
  • Add a RetryPolicy property to SqlConnection and SqlCommand that exposes which retry strategy has been selected.
  • RetryPolicy also exposes an event that is triggered any time a retry operation occurs.
  • Add a diagnostic event that can be used to trace retries from a listener.
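A rough usage sketch of the surface these bullets describe. The connection string keywords match the constants added in this diff; the "Retrying" event name and its handler signature are assumptions, not confirmed API.

```csharp
// Hypothetical usage based on the PR description; the Retrying event name
// and handler signature are assumptions.
using System;
using Microsoft.Data.SqlClient;

class RetryUsageSketch
{
    static void Main()
    {
        var connectionString =
            "Server=tcp:myserver.example.com;Database=mydb;Integrated Security=true;" +
            "RetryStrategy=FixedInterval;RetryCount=3;RetryInterval=10";

        using (var connection = new SqlConnection(connectionString))
        {
            // RetryPolicy is exposed on SqlConnection (and SqlCommand) per the
            // description; the event subscription below is illustrative only.
            connection.RetryPolicy.Retrying += (sender, args) =>
                Console.WriteLine("Transient failure detected; retrying.");

            connection.Open();
        }
    }
}
```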

@dnfclas commented Nov 14, 2019

CLA assistant check
All CLA requirements met.

internal const string RetryStrategy = "None";
internal const int RetryCount = 3;

internal const int RetryInterval = 10;
Contributor

This seems to be a time interval. Suffix the variable names with the unit of time.

Contributor

Applicable to all internal members.

Author

Sounds good!

@@ -679,6 +696,19 @@ internal static partial class DbConnectionStringKeywords
internal const string WorkstationID = "Workstation ID";
internal const string ConnectRetryCount = "ConnectRetryCount";
internal const string ConnectRetryInterval = "ConnectRetryInterval";

// Retry Logic
internal const string RetryStrategy = "RetryStrategy";
Contributor

These are a lot of connection string options, and they don't seem descriptive enough for someone to look at a keyword and get an idea of where it might apply. Are these meant for connections only? Should they be prefixed with Connect?

Author

I agree there are many, but different retry strategies effectively require different attributes. For example, "FixedInterval" only needs a "RetryInterval" attribute, while "Incremental" needs a "RetryIncrement" as well, and "ExponentialBackoff" needs min, max, and delta values. I intentionally did not prefix them with Connect to avoid any confusion with the existing Connection Resiliency parameters.
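For illustration, the attribute combinations described here might look as follows; only RetryStrategy, RetryCount, RetryInterval, and RetryIncrement are named in this thread or diff, so the min/max/delta keyword spellings are assumptions, not a confirmed surface.

```csharp
// Assumed keyword spellings; the exponential backoff keywords in particular
// are illustrative only.
const string fixedInterval =
    "RetryStrategy=FixedInterval;RetryCount=5;RetryInterval=10";

const string incremental =
    "RetryStrategy=Incremental;RetryCount=5;RetryInterval=10;RetryIncrement=5";

const string exponentialBackoff =
    "RetryStrategy=ExponentialBackoff;RetryCount=5;" +
    "RetryMinInterval=1;RetryMaxInterval=30;RetryDeltaInterval=5";
```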

/// </summary>
/// <param name="ex">The exception object to be verified.</param>
/// <returns>true if the specified exception is considered as transient; otherwise, false.</returns>
bool IsTransient(Exception ex);
Contributor

Do we anticipate anything apart from SqlException being retryable?

Author

As of now, we're retrying on selected SqlException and TimeoutException types.
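A minimal sketch of that kind of check; the error numbers below are illustrative only (commonly cited Azure SQL transient errors), not the list shipped by this PR.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Data.SqlClient;

static class TransientCheckSketch
{
    // Illustrative subset of error numbers often treated as transient for
    // Azure SQL; the real list is defined elsewhere in this PR.
    private static readonly HashSet<int> s_retriableErrors =
        new HashSet<int> { 4060, 40197, 40501, 40613 };

    public static bool IsTransient(Exception ex)
    {
        if (ex is TimeoutException)
        {
            return true;
        }

        if (ex is SqlException sqlException)
        {
            foreach (SqlError error in sqlException.Errors)
            {
                if (s_retriableErrors.Contains(error.Number))
                {
                    return true;
                }
            }
        }

        return false;
    }
}
```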

/// Provides a generic version of the <see cref="RetryPolicy"/> class.
/// </summary>
/// <typeparam name="T">The type that implements the <see cref="ITransientErrorDetectionStrategy"/> interface that is responsible for detecting transient conditions.</typeparam>
public class RetryPolicy<T> : RetryPolicy where T : ITransientErrorDetectionStrategy, new()
Contributor

Can we rename this to SqlRetryPolicy, and apply a similar naming pattern to all the public classes, to be in line with the other classes this library offers?

@@ -38,6 +38,10 @@ internal static class SqlClientDiagnosticListenerExtensions
public const string SqlAfterRollbackTransaction = SqlClientPrefix + nameof(WriteTransactionRollbackAfter);
public const string SqlErrorRollbackTransaction = SqlClientPrefix + nameof(WriteTransactionRollbackError);

// Retry Logic
public const string SqlRetryOpenConnection = SqlClientPrefix + nameof(WriteConnectionOpenRetry);

Contributor

Nit: Extra new line.

internal RetryPolicy _retryPolicy = null;

/// <include file='../../../../../../../doc/snippets/Microsoft.Data.SqlClient/SqlCommand.xml' path='docs/members[@name="SqlCommand"]/RetryPolicy/*'/>
public RetryPolicy RetryPolicy
Contributor

Can be public RetryPolicy RetryPolicy => Connection.RetryPolicy;


if (retryCount > 0)
{
Task.Delay(delay).Wait();
Contributor

Wouldn't this potentially need a new thread for execution, requiring more threads from the pool and causing application scalability issues?

Author

Would you recommend Thread.Sleep instead? How would it compare performance-wise? Technically speaking, these events should be relatively rare, so I'm not sure how much difference this would make.

Contributor

Task.Delay() will spin up a task which may potentially run on a different thread and utilize another thread from the thread pool, and Wait() would simply block until the delay completes. If this method is only called synchronously, then Thread.Sleep is the way to go, which delays the same thread instead of spinning up another one.

If something goes wrong and causes a lot of retries, I can imagine a bunch of threads being taken from the thread pool just to wait on nothing, causing further delays in queuing up .NET tasks. This could potentially impact the scalability of the feature.
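For clarity, the two synchronous options being compared, as a sketch (`delay` stands in for whatever interval the strategy computed):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class DelaySketch
{
    public static void PauseBeforeRetry(TimeSpan delay)
    {
        // What the PR currently does: schedules a timer-backed task, then
        // blocks the calling thread until it completes.
        Task.Delay(delay).Wait();

        // Suggested alternative for a purely synchronous path: block the
        // current thread directly, with no task involved.
        // Thread.Sleep(delay);
    }
}
```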

Author

Makes sense, thanks!

Reviewer

While reading this, it occurred to me that integrating a circuit breaker pattern could be a follow-up ask for this feature.
These exceptions will often be encountered by multiple running threads, all trying to, e.g., reconnect to an (Azure) SQL database. Having each and every thread go through the same retry logic might even (in the case of resource starvation) aggravate the problem. Having a single circuit breaker per distinct connection string could be an interesting topic to research at that point.
Wdyt?

Reviewer

I think this wait should be interruptible through cancellation.

Unrelated to that: Task.Delay never uses a thread for anything. It starts a timer and completes the Task based on that. The Wait does use a thread, though.
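One way to make the synchronous wait cancellable, as suggested above; a sketch, not the PR's code:

```csharp
using System;
using System.Threading;

static class CancellableDelaySketch
{
    public static void PauseBeforeRetry(TimeSpan delay, CancellationToken cancellationToken)
    {
        // WaitOne returns true if the token's wait handle was signaled before
        // the delay elapsed, i.e. cancellation was requested mid-wait.
        if (cancellationToken.WaitHandle.WaitOne(delay))
        {
            cancellationToken.ThrowIfCancellationRequested();
        }
    }
}
```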

Author

That's a really interesting point, thanks for the suggestion! We're planning to make this aspect completely extensible; we may start by introducing a "default behavior" while leaving room for more complex scenarios like the one you mentioned. I'll add this to our list.

@@ -934,6 +947,12 @@ override public object ExecuteScalar()
statistics = SqlStatistics.StartTimer(Statistics);
SqlDataReader ds;
ds = RunExecuteReader(0, RunBehavior.ReturnImmediately, returnStream: true);
Contributor

I think you meant to replace this line with the if/else. Otherwise it gets executed twice.
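Presumably the intended shape is something like the following; this is a guess at the fix, not the PR's actual diff, and ExecuteAction is the name used by the Transient Fault Handling Application Block this work derives from, so it may differ here.

```csharp
// Guessed shape of the fix: execute RunExecuteReader exactly once, either
// through the retry policy or directly.
SqlDataReader ds;
if (RetryPolicy != null)
{
    ds = RetryPolicy.ExecuteAction(() =>
        RunExecuteReader(0, RunBehavior.ReturnImmediately, returnStream: true));
}
else
{
    ds = RunExecuteReader(0, RunBehavior.ReturnImmediately, returnStream: true);
}
```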

cheenamalhotra changed the title from "Add retry logic to connection open and command execution" to "WIP: Add retry logic to connection open and command execution" on Dec 13, 2019
/// Determines whether the specified exception represents a transient failure that can be compensated by a retry.
/// </summary>
/// <returns>true if the specified exception is considered as transient; otherwise, false.</returns>
List<int> RetriableErrors
Reviewer

List<int> should not occur in a public interface. Likely, some immutable value should be returned. Also, the idea that errors are identified by a set of integers seems specific to SqlClient; other providers might identify errors through other types such as long, enum, string, or large ranges of integers.
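A sketch of the first part of this suggestion, assuming the interface otherwise keeps its current members:

```csharp
using System;
using System.Collections.Generic;

// Sketch only: expose an immutable view instead of a mutable List<int>.
public interface ITransientErrorDetectionStrategy
{
    bool IsTransient(Exception ex);

    // IReadOnlyCollection (or IReadOnlyList) keeps callers from mutating the
    // provider's internal list; whether errors should be identified by int at
    // all is the separate question raised above.
    IReadOnlyCollection<int> RetriableErrors { get; }
}
```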

/// Provides a generic version of the <see cref="SqlRetryPolicy"/> class.
/// </summary>
/// <typeparam name="T">The type that implements the <see cref="ITransientErrorDetectionStrategy"/> interface that is responsible for detecting transient conditions.</typeparam>
public class SqlRetryPolicy<T> : SqlRetryPolicy where T : ITransientErrorDetectionStrategy, new()
Reviewer

Why does T have to be new()? Is it not enough for callers to pass in a value of type T? The caller can instantiate that type, which would reduce the dependencies this class requires (dependency injection).
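A sketch of the constructor-injection alternative this comment describes; the base-class constructor shown is assumed, not the PR's actual signature:

```csharp
// Sketch: drop the new() constraint and take the strategy instance from the
// caller. The base(...) call is illustrative; the real base constructor in
// this PR may differ.
public class SqlRetryPolicy<T> : SqlRetryPolicy
    where T : ITransientErrorDetectionStrategy
{
    public SqlRetryPolicy(T errorDetectionStrategy, int retryCount)
        : base(errorDetectionStrategy, retryCount)
    {
    }
}
```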

}
}

// Perform an extra check in the delay interval. Should prevent from accidentally ending up with the value of -1 that will block a thread indefinitely.
Reviewer

Invalid delays such as negative values (e.g. minus 42 seconds) should throw an exception. These are logic bugs. They should not be silenced.
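A sketch of the stricter validation being asked for; the helper name and message are illustrative:

```csharp
using System;

static class DelayValidationSketch
{
    // Reject negative delays outright instead of silently clamping them;
    // a negative value indicates a logic bug in the calling code.
    public static TimeSpan EnsureNonNegative(TimeSpan delay)
    {
        if (delay < TimeSpan.Zero)
        {
            throw new ArgumentOutOfRangeException(
                nameof(delay), delay, "Retry delay must be non-negative.");
        }

        return delay;
    }
}
```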

/// <summary>
/// Represents the default amount of time used when calculating a random delta in the exponential delay between retries.
/// </summary>
public static readonly TimeSpan DefaultDeltaBackoff = TimeSpan.FromSeconds(20.0);
Reviewer

These should not be fields but properties. There is no need to expose fields, which are an implementation detail; a trivial property hides that detail at no performance cost.

Also, I wonder if these values should be exposed at all. They are very application-specific. The very concepts of backoff and "increment" are also specific to the concrete SqlRetryStrategy; some strategies would not understand those notions at all. It seems like a layering violation to put such values on the base class SqlRetryStrategy.
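The first point as a one-line sketch:

```csharp
// Get-only auto-property instead of a public readonly field: same value, same
// performance, but the backing storage stays an implementation detail.
public static TimeSpan DefaultDeltaBackoff { get; } = TimeSpan.FromSeconds(20.0);
```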

Author

Good point, will look into it.

/// <summary>
/// Corresponds to the "Reject Update / Insert" throttling mode, in which SQL statements such as INSERT, UPDATE, CREATE TABLE, and CREATE INDEX are rejected.
/// </summary>
RejectUpdateInsert = 1,
Reviewer

The name is incorrect. This option affects a lot more than just updates and inserts. Maybe use Write. How is this different from RejectAllWrites?

Author

This was the decoding of server-side throttling conditions; we'll check whether it is still relevant.

/// </summary>
public sealed class TransientErrorDetectionStrategy : ITransientErrorDetectionStrategy
{
private List<int> _retriableErrors = new List<int>
Reviewer

After the initial release, this list will become frozen forever for compatibility reasons. This is a very dangerous thing.

For example, snapshot isolation errors are missing here. Any mistake like that is going to be locked into the product forever.

I think it is a mistake to ship a concrete retry strategy. Any retry strategy is application-specific. It is not possible to create a strategy that is general enough and stands the test of time. In my opinion, this should not become part of the product.

Author

Based on many customer interactions, we need a basic concrete implementation to address all the scenarios where an existing application can't be modified. The initial list is the result of various conversations within the SQL organization and should be quite solid at trapping common connection-specific errors; app-specific errors can then be added or removed through configuration.

Reviewer

I agree. As the OP, this list is inherent to SQL Server (on-premises or Azure), and as such the product team knows best which basic scenarios are transient to the application, based on the decisions it has made to achieve high availability.

The ask was precisely not to have these live in three different external implementations, which inhibits lift-and-shift scenarios for existing simple applications without thorough code review and changes.
Also, third-party packages that use SqlClient cannot be trusted to behave correctly for a lift-and-shift to Azure. Having these options directly in SqlClient does enable that.

int err;
if (s.IndexOf('+') != -1)
{
if(int.TryParse(s.Replace("+",""), out err))
Reviewer

Parsing errors should not be ignored. The validation should be as strict as possible.
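A sketch of stricter parsing; the method name and error message are illustrative, not the PR's code:

```csharp
using System;
using System.Globalization;

static class ErrorListParsingSketch
{
    // Throw on any token that is not a plain integer instead of silently
    // skipping it; a malformed configuration value is a bug worth surfacing.
    public static int ParseErrorNumber(string token)
    {
        string trimmed = token.Trim().TrimStart('+');
        if (!int.TryParse(trimmed, NumberStyles.None, CultureInfo.InvariantCulture, out int errorNumber))
        {
            throw new ArgumentException($"'{token}' is not a valid error number.", nameof(token));
        }

        return errorNumber;
    }
}
```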

throw ADP.InvalidRetryStrategy();
}

if ((_retryCount < 0) || (_retryCount > 60))
Member

Ciao Silvano, a micro-optimization: (uint)_retryCount > 60 gives better code generation, with one less branch.
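The suggested micro-optimization, spelled out (the body of the check is whatever the original two-branch version already does):

```csharp
// Casting to uint makes negative values wrap to large positive ones, so one
// comparison replaces (_retryCount < 0) || (_retryCount > 60).
if ((uint)_retryCount > 60)
{
    // same error handling as the original check
}
```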

Reviewer

Why is there a limit of 60 in the first place? I can see it making sense to retry a lot more. Maybe it is a good idea to retry once per second for one hour. That's 3600 times.

Author

A failure that lasts something like an hour stops being "transient" and becomes a permanent one, so generally speaking even 60 retries is more than is effectively needed. Beyond that, a documented best practice for Azure SQL Database (https://docs.microsoft.com/en-us/azure/sql-database/sql-database-connectivity-issues#retry-logic-for-transient-errors) is to retry with at least a 5-second minimum interval; anything more aggressive is counterproductive. Usually, after a certain number of retries (say 3-5) it's much better to surface the underlying exception and let the application logic deal with it.

Reviewer

Is this feature only meant for Azure SQL?

I would caution against judging what's "best practice" so narrowly. Customer scenarios are very diverse.

If the only reason for that 60-retry limit is to force customers into a "best practice", then I think the limit causes more harm than good. Let customers choose what's right for them.

// Retry Logic
private readonly string _retryStrategy;
private readonly int _retryCount;
private readonly int _RetryInterval;
Member

nit: _retryInterval

@GSPP commented Dec 19, 2019

Thanks for your feedback @scoriani.
