Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connection string property "Connection Timeout" is not respected. #2430

Closed
dmytro-gokun opened this issue Mar 26, 2024 · 17 comments
Closed

Connection string property "Connection Timeout" is not respected. #2430

dmytro-gokun opened this issue Mar 26, 2024 · 17 comments

Comments

@dmytro-gokun
Copy link

We have a high-load service which we want to run robustly when SQL server becomes (temporarily) unavailable. To achieve this, we set Connection Timeout (and Command Timeout as well, but that is not important for the purpose of this issue). We then expect connection attempt fail fast if the SQL server is not available. This way, our server will be able to do its job even if the SQL server is down for some time. But, somehow this does not work as expected. Following code emulates our multithreaded server attempting to connect to an (unavailable) SQL server from multiple threads a time:

using System.Diagnostics;
using Microsoft.Data.SqlClient;

const string connectionString = "Server=tcp:noserver.database.windows.net,1433;Database=nodb;User=nouser;Password=nopassword;Connection Timeout=3";
const int numberOfTasks = 1000;

var tasks = new Task[numberOfTasks];

for (var i = 0; i < numberOfTasks; ++i)
{
    tasks[i] = Task.Run(async () =>
    {
        var timer = Stopwatch.StartNew();

        try
        {
            await using var conn = new SqlConnection(connectionString);
            await conn.OpenAsync();
        }
        catch (Exception)
        {
            Console.WriteLine("Connection failed after {0} sec", timer.Elapsed.TotalSeconds);
        }
    });
}

Task.WaitAll(tasks);

When this code is run outputs something like:

Connection failed after 0.5961428 sec
Connection failed after 0.901156 sec
Connection failed after 1.2156149 sec
...
Connection failed after 30.7706116 sec
...
Connection failed after 56.6531985 sec
Connection failed after 56.9537902 sec
....

So, clearly Connection Timeout is not being respected. This causes our server to stall completely when SQL server is down. This is not the way we want it to work :). Any idea why we are observing this behavior?

@roji
Copy link
Member

roji commented Mar 26, 2024

@dmytro-gokun which exact versions of EF Core and SqlClient and SqlClient are you using? It seems like SqlClient 5.1.2 - which EF Core 8.0.2 took a dependency on - introduced new behavior which triggers this (see dotnet/efcore#33399 for more details).

@dmytro-gokun
Copy link
Author

@roji We are not really using EF Core. Microsoft.Data.SqlClient version 5.2.0 is referenced in this test project

see #33399 for more details

I'm not sure what this number means :). Could you give me a link?

@roji
Copy link
Member

roji commented Mar 26, 2024

@dmytro-gokun apologies, I mistakenly thought this issue was on the EF repo. To help narrow down the issue, can you please try downgrading to Microsoft.Data.SqlClient v5.1.1, and see if you're still seeing the slowness?

@dmytro-gokun
Copy link
Author

@roji Yes, I do. I've tried 5.1.1, 4.0.5, 3.1.5, 2.1.7. All show the same behavior. In fact, I was able to "fix" by introducing timeout in the calling code:

using System.Diagnostics;
using Microsoft.Data.SqlClient;

const string connectionString = "Server=tcp:noserver.database.windows.net,1433;Database=nodb;User=nouser;Password=nopassword;Connection Timeout=3";
const int numberOfTasks = 1000;

var tasks = new Task[numberOfTasks];

for (var i = 0; i < numberOfTasks; ++i)
{
    tasks[i] = Task.Run(async () =>
    {
        var ct = new CancellationTokenSource(TimeSpan.FromSeconds(3));
        var timer = Stopwatch.StartNew();

        try
        {
            await using var conn = new SqlConnection(connectionString);
            await conn.OpenAsync(ct.Token);
        }
        catch (Exception)
        {
            Console.WriteLine("Connection failed after {0} sec", timer.Elapsed.TotalSeconds);
        }
    });
}

Task.WaitAll(tasks);

This does not look like a good solution though, just a temporary one until SqlConnection is fixed

@roji
Copy link
Member

roji commented Mar 26, 2024

OK, I'll let the SqlClient folks take it from here...

@vonzshik
Copy link

It all pretty much comes down to the fact that SqlClient opens physical connections one at a time (and in a separate thread).

// it is better to StartNew too many times than not enough
if (_pendingOpensWaiting == 0)
{
Thread waitOpenThread = new Thread(WaitForPendingOpen);
waitOpenThread.IsBackground = true;
waitOpenThread.Start();
}

That thread synchronously goes through every single open request and tries to open a physical connection:

while (_pendingOpens.TryDequeue(out next))
{
if (next.Completion.Task.IsCompleted)
{
continue;
}
uint delay;
if (next.DueTime == Timeout.Infinite)
{
delay = unchecked((uint)Timeout.Infinite);
}
else
{
delay = (uint)Math.Max(ADP.TimerRemainingMilliseconds(next.DueTime), 0);
}
DbConnectionInternal connection = null;
bool timeout = false;
Exception caughtException = null;
try
{
bool allowCreate = true;
bool onlyOneCheckConnection = false;
ADP.SetCurrentTransaction(next.Completion.Task.AsyncState as System.Transactions.Transaction);
timeout = !TryGetConnection(next.Owner, delay, allowCreate, onlyOneCheckConnection, next.UserOptions, out connection);

The way opening a physical connection works is that there are multiple handles (one for getting a connection from a pool, one to open a physical connection, one for errors), so SqlClient just waits for any of them to fire:

waitResult = WaitHandle.WaitAny(_waitHandles.GetHandles(allowCreate), unchecked((int)waitForMultipleObjectsTimeout));
}
// From the WaitAny docs: "If more than one object became signaled during
// the call, this is the array index of the signaled object with the
// smallest index value of all the signaled objects." This is important
// so that the free object signal will be returned before a creation
// signal.
switch (waitResult)
{
case WaitHandle.WaitTimeout:
SqlClientEventSource.Log.TryPoolerTraceEvent("<prov.DbConnectionPool.GetConnection|RES|CPOOL> {0}, Wait timed out.", ObjectID);
Interlocked.Decrement(ref _waitCount);
connection = null;
return false;
case ERROR_HANDLE:
// Throw the error that PoolCreateRequest stashed.
SqlClientEventSource.Log.TryPoolerTraceEvent("<prov.DbConnectionPool.GetConnection|RES|CPOOL> {0}, Errors are set.", ObjectID);
Interlocked.Decrement(ref _waitCount);
throw TryCloneCachedException();
case CREATION_HANDLE:
SqlClientEventSource.Log.TryPoolerTraceEvent("<prov.DbConnectionPool.GetConnection|RES|CPOOL> {0}, Creating new connection.", ObjectID);
try
{
obj = UserCreateRequest(owningObject, userOptions);

At this point timeout (waitForMultipleObjectsTimeout in this case) is 0 so WaitHandle.WaitAny should immediately return a timeout error. But because connections are open sequentially, the handle responsible for opening a physical connection is always set and WaitHandle.WaitAny returns that handle. Essentially, what's happening is:

  1. Pool's thread gets a connection request from the queue
  2. Takes a semaphore to open a physical connection
  3. Tries (and fails) to open a physical connection
  4. Releases a semaphore
  5. Repeat

@dmytro-gokun
Copy link
Author

@vonzshik Great insights, thank you! Clearly looks like a bug to me.

@vonzshik
Copy link

Clearly looks like a bug to me.

Every time I say that something looks like a bug I'm usually replied with "it's by design" 😢
But anyway, one way to fix that is to modify TryGetConnection to immediately exit whenever a timeout is 0:

private bool TryGetConnection(DbConnection owningObject, uint waitForMultipleObjectsTimeout, bool allowCreate, bool onlyOneCheckConnection, DbConnectionOptions userOptions, out DbConnectionInternal connection)
{
	if (waitForMultipleObjectsTimeout == 0)
	{
		connection = null;
		return false;
	}
	/// other stuff unchanged

It's not going to help completely since eventually that thread is going to start getting requests with a near-zero timeout (or more like it's going to fluctuate, as in on the first request it tries to open a physical connection, then skips a few requests because their timeout is going to be 0, then get another with a near-zero timeout and tries to open, then again skips a few, etc), but all in all should be a bit better behavior.

@dmytro-gokun
Copy link
Author

dmytro-gokun commented Mar 26, 2024

@vonzshik Well, it could not be by design that a connection attempt stalls for minutes killing an entire multi-threaded app where it is supposed to fail safely in few seconds ..

@cheenamalhotra
Copy link
Member

@dmytro-gokun

Have you tried setting ConnectRetryCount=0 to disable transient fault handling?

@dmytro-gokun
Copy link
Author

@cheenamalhotra Yes, does not help. Neither does disabling pooling altogether

@dmytro-gokun
Copy link
Author

@cheenamalhotra Any update here?

@cheenamalhotra
Copy link
Member

cheenamalhotra commented Apr 8, 2024

I think you're experiencing a combination of known issues:

To achieve true parallelism with fail fast support, you'd need to:

  • Disable pooling to allow true parallel execution.
  • Use Open() API to override the SqlConnectionOverrides to "OpenWithoutRetry" instead to disable retries during connection failure.

We are going to bring this support in OpenAsync() soon, fyi PR: #2433, and with #601 getting fixed sometime in future, you will not have to replace API with Open().

@dmytro-gokun
Copy link
Author

@cheenamalhotra That's something. Actually, i've found that just replacing OpenAsync() with Open() w/o OpenWithoutRetry does the trick. Is not that strange?

We are going to bring this support in OpenAsync() soon

Will that require disabling pooling as well? Honestly, i do not like doing that in the high-load environment

@cheenamalhotra
Copy link
Member

cheenamalhotra commented Apr 8, 2024

The pooling issue requires connection pool changes and is not covered with the changes made in the PR #2433 as of now. It will enable failing fast, but if setting ConnectRetryCount = 0 doesn't help, the fail fast override will not make a significant difference by itself. So I'm afraid you'll still need to wait for #601 to be addressed, the comment #2430 (comment) captures it exactly.

@cheenamalhotra
Copy link
Member

Since this is a duplicate of #601 and depends on it, I'll proceed to close this one.

@cheenamalhotra cheenamalhotra closed this as not planned Won't fix, can't repro, duplicate, stale Apr 8, 2024
@vonzshik
Copy link

vonzshik commented Apr 8, 2024

Setting aside retries, the fact that SqlClient opens physical connections from a ctor (which makes true async impossible) and whatnot. Whenever waitOpenThread from DbConnectionPool gets a request for a physical connection, it never checks whether that request already timed out and a such there is no reason to do anything else with it. While it certainly doesn't fix the underlying problems, it should even slightly alleviate it, and beside that it does make sense by itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

No branches or pull requests

4 participants