WIP: Add retry logic to connection open and command execution #307

Closed
wants to merge 14 commits

Conversation

scoriani

  • Add basic retry logic capabilities to SqlClient core for now, derived from the Transient Fault Handling Application Block.
  • Add new connection string attributes and related validations to govern retry strategy and behavior.
  • Add a RetryPolicy property to SqlConnection and SqlCommand that exposes which retry strategy has been selected.
  • RetryPolicy also exposes an event that is triggered any time a retry operation occurs.
  • Add a diagnostic event that can be used to trace retries from a listener.
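A rough usage sketch of the surface these bullets describe. The connection string keywords match the constants added in this diff; the "Retrying" event name and its handler signature are assumptions, not confirmed API.

```csharp
// Hypothetical usage based on the PR description; the Retrying event name
// and handler signature are assumptions.
using System;
using Microsoft.Data.SqlClient;

class RetryUsageSketch
{
    static void Main()
    {
        var connectionString =
            "Server=tcp:myserver.example.com;Database=mydb;Integrated Security=true;" +
            "RetryStrategy=FixedInterval;RetryCount=3;RetryInterval=10";

        using (var connection = new SqlConnection(connectionString))
        {
            // RetryPolicy is exposed on SqlConnection (and SqlCommand) per the
            // description; the event subscription below is illustrative only.
            connection.RetryPolicy.Retrying += (sender, args) =>
                Console.WriteLine("Transient failure detected; retrying.");

            connection.Open();
        }
    }
}
```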

@dnfclas commented Nov 14, 2019

CLA assistant check
All CLA requirements met.

internal const string RetryStrategy = "None";
internal const int RetryCount = 3;

internal const int RetryInterval = 10;
Contributor

This seems to be a time interval. Suffix the variable names with the unit of time.

Contributor

Applicable to all internal members.

Author

Sounds good!

@@ -679,6 +696,19 @@ internal static partial class DbConnectionStringKeywords
internal const string WorkstationID = "Workstation ID";
internal const string ConnectRetryCount = "ConnectRetryCount";
internal const string ConnectRetryInterval = "ConnectRetryInterval";

// Retry Logic
internal const string RetryStrategy = "RetryStrategy";
Contributor

These are a lot of connection string options, and they don't seem descriptive enough for someone to look at a keyword and get an idea of where it might apply. Are these meant for connections only? Should they be prefixed with Connect?

Author

I agree there are many, but different retry strategies effectively require different attributes. For example, "FixedInterval" only needs a "RetryInterval" attribute, while "Incremental" needs a "RetryIncrement" as well, and "ExponentialBackoff" needs min, max, and delta values. I intentionally did not prefix them with Connect to avoid any confusion with the existing Connection Resiliency parameters.
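For illustration, the attribute combinations described here might look as follows; only RetryStrategy, RetryCount, RetryInterval, and RetryIncrement are named in this thread or diff, so the min/max/delta keyword spellings are assumptions, not a confirmed surface.

```csharp
// Assumed keyword spellings; the exponential backoff keywords in particular
// are illustrative only.
const string fixedInterval =
    "RetryStrategy=FixedInterval;RetryCount=5;RetryInterval=10";

const string incremental =
    "RetryStrategy=Incremental;RetryCount=5;RetryInterval=10;RetryIncrement=5";

const string exponentialBackoff =
    "RetryStrategy=ExponentialBackoff;RetryCount=5;" +
    "RetryMinInterval=1;RetryMaxInterval=30;RetryDeltaInterval=5";
```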

/// </summary>
/// <param name="ex">The exception object to be verified.</param>
/// <returns>true if the specified exception is considered as transient; otherwise, false.</returns>
bool IsTransient(Exception ex);
Contributor

Do we anticipate anything apart from SqlException being retryable?

Author

As of now, we're retrying on selected SqlException and TimeoutException types.
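A minimal sketch of that kind of check; the error numbers below are illustrative only (commonly cited Azure SQL transient errors), not the list shipped by this PR.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Data.SqlClient;

static class TransientCheckSketch
{
    // Illustrative subset of error numbers often treated as transient for
    // Azure SQL; the real list is defined elsewhere in this PR.
    private static readonly HashSet<int> s_retriableErrors =
        new HashSet<int> { 4060, 40197, 40501, 40613 };

    public static bool IsTransient(Exception ex)
    {
        if (ex is TimeoutException)
        {
            return true;
        }

        if (ex is SqlException sqlException)
        {
            foreach (SqlError error in sqlException.Errors)
            {
                if (s_retriableErrors.Contains(error.Number))
                {
                    return true;
                }
            }
        }

        return false;
    }
}
```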

/// Provides a generic version of the <see cref="RetryPolicy"/> class.
/// </summary>
/// <typeparam name="T">The type that implements the <see cref="ITransientErrorDetectionStrategy"/> interface that is responsible for detecting transient conditions.</typeparam>
public class RetryPolicy<T> : RetryPolicy where T : ITransientErrorDetectionStrategy, new()
Contributor

Can we rename this to SqlRetryPolicy, and apply a similar naming pattern to all the public classes, to be in line with the other classes this library offers?

@@ -38,6 +38,10 @@ internal static class SqlClientDiagnosticListenerExtensions
public const string SqlAfterRollbackTransaction = SqlClientPrefix + nameof(WriteTransactionRollbackAfter);
public const string SqlErrorRollbackTransaction = SqlClientPrefix + nameof(WriteTransactionRollbackError);

// Retry Logic
public const string SqlRetryOpenConnection = SqlClientPrefix + nameof(WriteConnectionOpenRetry);

Contributor

Nit: Extra new line.

internal RetryPolicy _retryPolicy = null;

/// <include file='../../../../../../../doc/snippets/Microsoft.Data.SqlClient/SqlCommand.xml' path='docs/members[@name="SqlCommand"]/RetryPolicy/*'/>
public RetryPolicy RetryPolicy
Contributor

Can be public RetryPolicy RetryPolicy => Connection.RetryPolicy;


if (retryCount > 0)
{
Task.Delay(delay).Wait();
Contributor

Wouldn't this potentially need a new thread for execution, requiring more threads from the pool and causing application scalability issues?

Author

Would you recommend Thread.Sleep instead? How would it compare performance-wise? Technically speaking, these events should be relatively rare, so I'm not sure how much difference this would make.

Contributor

Task.Delay() will spin up a task which may potentially run on a different thread and utilize another thread from the thread pool, and Wait() would simply block until the delay completes. If this method is only called synchronously, then Thread.Sleep is the way to go, which delays the same thread instead of spinning up another one.

If something goes wrong and causes a lot of retries, I can imagine a bunch of threads being taken from the thread pool just to wait on nothing, causing further delays in queuing up .NET tasks. This could potentially impact the scalability of the feature.
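For clarity, the two synchronous options being compared, as a sketch (`delay` stands in for whatever interval the strategy computed):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class DelaySketch
{
    public static void PauseBeforeRetry(TimeSpan delay)
    {
        // What the PR currently does: schedules a timer-backed task, then
        // blocks the calling thread until it completes.
        Task.Delay(delay).Wait();

        // Suggested alternative for a purely synchronous path: block the
        // current thread directly, with no task involved.
        // Thread.Sleep(delay);
    }
}
```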

Author

Makes sense, thanks!

Reviewer

While reading this, it occurred to me that integrating a circuit breaker pattern could be a follow-up ask for this feature.
These exceptions will often be encountered by multiple running threads, all trying to, e.g., reconnect to an (Azure) SQL database. Having each and every thread go through the same retry logic might even (in the case of resource starvation) aggravate the problem. Having a single circuit breaker per distinct connection string could be an interesting topic to research at that point.
Wdyt?

Reviewer

I think this wait should be interruptible through cancellation.

Unrelated to that: Task.Delay never uses a thread for anything. It starts a timer and completes the Task based on that. The Wait does use a thread, though.
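One way to make the synchronous wait cancellable, as suggested above; a sketch, not the PR's code:

```csharp
using System;
using System.Threading;

static class CancellableDelaySketch
{
    public static void PauseBeforeRetry(TimeSpan delay, CancellationToken cancellationToken)
    {
        // WaitOne returns true if the token's wait handle was signaled before
        // the delay elapsed, i.e. cancellation was requested mid-wait.
        if (cancellationToken.WaitHandle.WaitOne(delay))
        {
            cancellationToken.ThrowIfCancellationRequested();
        }
    }
}
```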

Author

That's a really interesting point, thanks for the suggestion! We're planning to make this aspect completely extensible; we may start by introducing a "default behavior" while leaving room for more complex scenarios like the one you mentioned. I'll add this to our list.

@@ -934,6 +947,12 @@ override public object ExecuteScalar()
statistics = SqlStatistics.StartTimer(Statistics);
SqlDataReader ds;
ds = RunExecuteReader(0, RunBehavior.ReturnImmediately, returnStream: true);
Contributor

I think you meant to replace this line with the if/else. Otherwise it gets executed twice.
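Presumably the intended shape is something like the following; this is a guess at the fix, not the PR's actual diff, and ExecuteAction is the name used by the Transient Fault Handling Application Block this work derives from, so it may differ here.

```csharp
// Guessed shape of the fix: execute RunExecuteReader exactly once, either
// through the retry policy or directly.
SqlDataReader ds;
if (RetryPolicy != null)
{
    ds = RetryPolicy.ExecuteAction(() =>
        RunExecuteReader(0, RunBehavior.ReturnImmediately, returnStream: true));
}
else
{
    ds = RunExecuteReader(0, RunBehavior.ReturnImmediately, returnStream: true);
}
```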

cheenamalhotra changed the title from "Add retry logic to connection open and command execution" to "WIP: Add retry logic to connection open and command execution" on Dec 13, 2019
/// Determines whether the specified exception represents a transient failure that can be compensated by a retry.
/// </summary>
/// <returns>true if the specified exception is considered as transient; otherwise, false.</returns>
List<int> RetriableErrors
Reviewer

List<int> should not occur in a public interface. Likely, some immutable value should be returned. Also, the idea that errors are identified by a set of integers seems specific to SqlClient; other providers might identify errors through other types such as long, enum, string, or large ranges of integers.
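A sketch of the first part of this suggestion, assuming the interface otherwise keeps its current members:

```csharp
using System;
using System.Collections.Generic;

// Sketch only: expose an immutable view instead of a mutable List<int>.
public interface ITransientErrorDetectionStrategy
{
    bool IsTransient(Exception ex);

    // IReadOnlyCollection (or IReadOnlyList) keeps callers from mutating the
    // provider's internal list; whether errors should be identified by int at
    // all is the separate question raised above.
    IReadOnlyCollection<int> RetriableErrors { get; }
}
```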

/// Provides a generic version of the <see cref="SqlRetryPolicy"/> class.
/// </summary>
/// <typeparam name="T">The type that implements the <see cref="ITransientErrorDetectionStrategy"/> interface that is responsible for detecting transient conditions.</typeparam>
public class SqlRetryPolicy<T> : SqlRetryPolicy where T : ITransientErrorDetectionStrategy, new()
Reviewer

Why does T have to be new()? Is it not enough for callers to pass in a value of type T? The caller can instantiate that type, which would reduce the dependencies this class requires (dependency injection).
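A sketch of the constructor-injection alternative this comment describes; the base-class constructor shown is assumed, not the PR's actual signature:

```csharp
// Sketch: drop the new() constraint and take the strategy instance from the
// caller. The base(...) call is illustrative; the real base constructor in
// this PR may differ.
public class SqlRetryPolicy<T> : SqlRetryPolicy
    where T : ITransientErrorDetectionStrategy
{
    public SqlRetryPolicy(T errorDetectionStrategy, int retryCount)
        : base(errorDetectionStrategy, retryCount)
    {
    }
}
```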

}
}

// Perform an extra check in the delay interval. Should prevent from accidentally ending up with the value of -1 that will block a thread indefinitely.
Reviewer

Invalid delays such as negative values (e.g. minus 42 seconds) should throw an exception. These are logic bugs. They should not be silenced.
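A sketch of the stricter validation being asked for; the helper name and message are illustrative:

```csharp
using System;

static class DelayValidationSketch
{
    // Reject negative delays outright instead of silently clamping them;
    // a negative value indicates a logic bug in the calling code.
    public static TimeSpan EnsureNonNegative(TimeSpan delay)
    {
        if (delay < TimeSpan.Zero)
        {
            throw new ArgumentOutOfRangeException(
                nameof(delay), delay, "Retry delay must be non-negative.");
        }

        return delay;
    }
}
```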

/// <summary>
/// Represents the default amount of time used when calculating a random delta in the exponential delay between retries.
/// </summary>
public static readonly TimeSpan DefaultDeltaBackoff = TimeSpan.FromSeconds(20.0);
Reviewer

These should not be fields but properties. There is no need to expose fields, which are an implementation detail; a trivial property hides that detail at no performance cost.

Also, I wonder if these values should be exposed at all. They are very application-specific. The very concepts of backoff and "increment" are also specific to the concrete SqlRetryStrategy; some strategies would not understand those notions at all. It seems like a layering violation to put such values on the base class SqlRetryStrategy.
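The first point as a one-line sketch:

```csharp
// Get-only auto-property instead of a public readonly field: same value, same
// performance, but the backing storage stays an implementation detail.
public static TimeSpan DefaultDeltaBackoff { get; } = TimeSpan.FromSeconds(20.0);
```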

Author

Good point, will look into it.

/// <summary>
/// Corresponds to the "Reject Update / Insert" throttling mode, in which SQL statements such as INSERT, UPDATE, CREATE TABLE, and CREATE INDEX are rejected.
/// </summary>
RejectUpdateInsert = 1,
Reviewer

The name is incorrect. This option affects a lot more than just updates and inserts. Maybe use Write. How is this different from RejectAllWrites?

Author

This was the decoding of server-side throttling conditions; we'll check whether it is still relevant.

/// </summary>
public sealed class TransientErrorDetectionStrategy : ITransientErrorDetectionStrategy
{
private List<int> _retriableErrors = new List<int>
Reviewer

After the initial release, this list will become frozen forever for compatibility reasons. This is a very dangerous thing.

For example, snapshot isolation errors are missing here. Any mistake like that is going to be locked into the product forever.

I think it is a mistake to ship a concrete retry strategy. Any retry strategy is application-specific. It is not possible to create a strategy that is general enough and stands the test of time. In my opinion, this should not become part of the product.

Author

Based on many customer interactions, we need a basic concrete implementation to address all the scenarios where an existing application can't be modified. The initial list is the result of various conversations within the SQL organization and should be quite solid at trapping common connection-specific errors; app-specific errors can then be added or removed through configuration.

Reviewer

I agree. As the OP, this list is inherent to SQL Server (on-premises or Azure), and as such the product team knows best which basic scenarios are transient to the application, based on the decisions it has made to achieve high availability.

The ask was precisely not to have these live in three different external implementations, which inhibits lift-and-shift scenarios for existing simple applications without thorough code review and changes.
Also, third-party packages that use SqlClient cannot be trusted to behave correctly for a lift-and-shift to Azure. Having these options directly in SqlClient does enable that.

int err;
if (s.IndexOf('+') != -1)
{
if(int.TryParse(s.Replace("+",""), out err))
Reviewer

Parsing errors should not be ignored. The validation should be as strict as possible.
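A sketch of stricter parsing; the method name and error message are illustrative, not the PR's code:

```csharp
using System;
using System.Globalization;

static class ErrorListParsingSketch
{
    // Throw on any token that is not a plain integer instead of silently
    // skipping it; a malformed configuration value is a bug worth surfacing.
    public static int ParseErrorNumber(string token)
    {
        string trimmed = token.Trim().TrimStart('+');
        if (!int.TryParse(trimmed, NumberStyles.None, CultureInfo.InvariantCulture, out int errorNumber))
        {
            throw new ArgumentException($"'{token}' is not a valid error number.", nameof(token));
        }

        return errorNumber;
    }
}
```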

throw ADP.InvalidRetryStrategy();
}

if ((_retryCount < 0) || (_retryCount > 60))
Member

Ciao Silvano, a micro-optimization: (uint)_retryCount > 60 gives better code generation, with one less branch.
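The suggested micro-optimization, spelled out (the body of the check is whatever the original two-branch version already does):

```csharp
// Casting to uint makes negative values wrap to large positive ones, so one
// comparison replaces (_retryCount < 0) || (_retryCount > 60).
if ((uint)_retryCount > 60)
{
    // same error handling as the original check
}
```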

Reviewer

Why is there a limit of 60 in the first place? I can see it making sense to retry a lot more. Maybe it is a good idea to retry once per second for one hour. That's 3600 times.

Author

A failure that lasts something like an hour stops being "transient" and becomes a permanent one, so generally speaking even 60 retries is more than is effectively needed. Beyond that, a documented best practice for Azure SQL Database (https://docs.microsoft.com/en-us/azure/sql-database/sql-database-connectivity-issues#retry-logic-for-transient-errors) is to retry with at least a 5-second minimum interval; anything more aggressive is counterproductive. Usually, after a certain number of retries (say 3-5) it's much better to surface the underlying exception and let the application logic deal with it.

Reviewer

Is this feature only meant for Azure SQL?

I would caution against judging what's "best practice" so narrowly. Customer scenarios are very diverse.

If the only reason for that 60-retry limit is to force customers into a "best practice", then I think the limit causes more harm than good. Let customers choose what's right for them.

// Retry Logic
private readonly string _retryStrategy;
private readonly int _retryCount;
private readonly int _RetryInterval;
Member

nit: _retryInterval

@GSPP commented Dec 19, 2019

Thanks for your feedback @scoriani.
