Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.
/ corefx Public archive

SqlClient fix SqlEnvChangePool thread safety bug #36735

Merged
merged 4 commits into from
Apr 11, 2019

Conversation

Wraith2
Copy link
Contributor

@Wraith2 Wraith2 commented Apr 9, 2019

In PR #34573 a static pool of SqlEnvChange objects was added. This pool has been found to be unsafe causing a client crash.

Under high load while running the data access benchmark at https://github.com/aspnet/DataAccessPerformance an exception occurred which I tracked back to newBinValue being null on line 276.

if (newBinValue != null)
{
Array.Clear(newBinValue, 0, newBinValue.Length);
if (newBinRented)
{
ArrayPool<byte>.Shared.Return(newBinValue);
}
newBinValue = null;

The object is not accessed from multiple threads and it is not possible to invoke the Clear method on the same object twice unless it has been returned to two different callers which means the pool is not safe.

It has been replaced with a simpler version inspired by the ObjectPool in aspnet which doesn't count it just uses interlocks throughout. This version does not crash under load. The functional and manual tests all pass but they passed before. This is a bug fix and it is important to get in integrated quickly if possible.

/cc @afsanehr, @tarikulsabbir, @Gary-Zh , @david-engelh @saurabh500

@AfsanehR-zz
Copy link
Contributor

Is it possible to add a test case to this PR?

}

private static readonly SqlEnvChangeWrapper[] _items = new SqlEnvChangeWrapper[Environment.ProcessorCount * 2];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain the size of _items being set to ProcessorCount * 2 ?
Also add a comment on why such a decision is being made?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an arbitrary choice at the moment. In general env changes are handled quite quickly so the lifetime outside the pool is quite short and the number of items I've seen so far is small. As such a small pool uses a small amount of memory and benefits reasonably well on low frequency querying. For high frequency you're not going to see if for all the other work going on.

I'm open to any suggestions on a better choice. This seems reasonable from my perspective but my perspective does not encompass large scales through lack of experience.

retval = new SqlEnvChange();
}
return retval;
return new SqlEnvChange();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allocated SqlEnvChange will not be tracked by _items
This seems counter intuitive to the design.

When this item is released in the Release call below, it will not go back to the _items. If _items is not gong to be a container for all the SqlEnvChange objects, then why have it in the first place ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if I have missed anything

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SqlEnvChange is used in a small number of places. All allocation is done through this function. The pool is designed to be a buffer but not a block so if we have a free object hanging around then use it but if we're running fast and don't have a spare object I don't want to wait for someone to return one to the pool, just create a new one. Like allocation release is always done through the pool so if an item was new'ed rather than being sourced from the pool it will still be given back to the pool. If the pool is full the item will simply be discarded.

This behaviour means that at low frequency somewhere below the pool threshold all items are likely to be sourced from the pool which makes the operations lower memory than they otherwise would be. At high frequency the pool will be recycled quickly but some items are still going to be new'ed and dropped to GC, overall the operations will have lower GC pressure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this assumption works very well. SqlEnvChange is used in a small number of places. But then the places it is used can happen concurrently and with high through put.

image

For applications with small load, I don't care as much for this change, however the applications where the improvements is useful are web apps which will spin up a lot of connections at the same time from different threads, potentially from few connection pools. Those are the real world cases. In those cases, the number of env change token from the pool will be exhausted quickly enough.

I don't have a quantification on how big the Env Change pool should be, but I do know that ProcessCount * 2 doesn't get close enough to the concurrent connection open requests and transaction promotions that happen in real world applications.

Copy link
Contributor Author

@Wraith2 Wraith2 Apr 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. then either we need a larger pool or to drop it entirely, I'm fine with either.

In low memory cases having a very large pool would be a constant cost and possibly go mostly unused. In the case of bringing up an entire pool how many concurrent connections are likely to be in the env change code at once do you think? my instinct says it's likely to be high so a high pool would be best.

The list and array rental changes of the original RP are still beneficial to all cases. I suspect the pool is of dubious benefit without some perf numbers to back it up at this point. Would you agree @afsanehr ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am leaning towards dropping this pool till we can have a more solid reasoning behind the size, and have a one size fits all model.

I don't have any ideas to improve this right now. Let's try to keep this simple (though possible less performant in some cases) and remove the pool.

@@ -300,103 +300,40 @@ internal void Clear()

internal static class SqlEnvChangePool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment about the behavior of this class would be nice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, and a couple of others inline to explain why items are new'ed and possibly dropped to the GC. Let me know if that explains things better and what else might be useful.

Copy link

@jnm2 jnm2 Apr 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sentence with a comma should be split into two sentences. (http://www.sussex.ac.uk/informatics/punctuation/comma/joining)
Sentences should consistently have a capital letter and an end mark.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Apr 9, 2019

Is it possible to add a test case to this PR?

Yes but I'm not sure how. There's a stress tests project but I've never used it or seen instructions on how to do so. Clearly the manual and functional tests don't catch it because they each look for specific behaviour and this bug manifests purely under load. While I'm certain enough of it to propose this fix catching it in action directly involves some extremely precise timing.

The alternative to this is to not pool the Sq;EnvChange objects at all. I considered that but I'd like to keep the performance benefits for low load that the pool allows. If this PR isn't acceptable it would be the next option, if you'd prefer to remove the pooling entirely it'd be understandable, just let me know.

@saurabh500
Copy link
Contributor

I have been thinking about the failure to catch regression in this area.
I think we should get the dataaccess performance runs as a test into the Manual tests for SqlClient only. Just the tests running with a TPS > 0 means that the tests ran successfully. Any single exception during execution is a bug.

Right now I am not in a favor of rollback. Let's try to drive this issue to completion.

There is an urgency to get this fix in, and to improve the test coverage for SqlClient so that we can safely take more PRs in these areas. If for any reason we cannot build up the confidence in the fix, we can consider a rollback. Let's give this issue a 48 hour deadline failing which we can consider a rollback.

Adding DataAccess performance to tests should be done in a separate PR since that is the only way to catch this issue and bringing that test in is lot more code than fixing this issue. However in terms of porting the test in corefx, I know that it will be straight-forward. @Wraith2 Can I put this test port in your plate? I know that you are validating the fix with the Dataaccess perf benchmark.

@@ -318,21 +326,28 @@ internal static SqlEnvChange Allocate()
return item;
}
}
// if we didn't find an item in the pool we've either never craete any or we're under pressure
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

craete

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You wouldn't believe I'm an English monoglot based on my comments would you 😁

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, I would, believe me :D

@Wraith2
Copy link
Contributor Author

Wraith2 commented Apr 9, 2019

I think we should get the dataaccess performance runs as a test into the Manual tests for SqlClient only. Just the tests running with a TPS > 0 means that the tests ran successfully. Any single exception during execution is a bug.

I found the issue while I was 3 minutes into a profile run on the release build. It took a while to work out exactly what was going on because the error was reporting as being thrown at

throw result.Exception.InnerException;
which wasn't really the case. It took a few runs to work out what was going on. The information you'd get reported in this case wouldn't easily lead you do the real problem which is why I'm keen to fix this even if that means reverting the PR, it'd be a nasty problem for support.

The test has to run fast enough to get the same item returned twice from the pool, returned at the same time and for both callers to be executing inside the if block concurrently but slightly out of step so one works and the other then sees that result. That's some tricky timing. I couldn't catch it in debug builds because the code just ran too slowly with asserts present. I'm not against adding more tests but it wouldn't be deterministic and if it worked most of the time that would be a false security.

I think this is a case where we have to be absolutely certain of the building blocks used. The pool I provided was not good enough. I hadn't realised that there was a problem with it and because of the nature of concurrent behaviours it would be very difficult to see in code review. I should have flagged this as a high risk in the original PR.

The revised design is inspired by a pool in shipping aspnet codebase which gives me some confidence. I asked @jnm2 to look at it because he's good with concurrency and hopefully his and your feedback has prevented me introducing more errors. At this point my recommendation would be to take this PR if everyone is confident in the new pool but if anyone has any doubts we remove the pool and go to standard new/gc release semantics.

@saurabh500
Copy link
Contributor

I found the issue while I was 3 minutes into a profile run on the release build. It took a while to work out exactly what was going on because the error was reporting as being thrown at

This is good context for the serviceability of the component. Thanks.

When I look at the original PR in combination with this PR, the overall goal is to optimize to get away from GC being invoked.
I am leaning to believe that there are perf optimizations for some scenarios and for many other scenarios we are going to fall back on GC.

I am leaning a bit more on going back to the GC behavior. It may be better to rollback to older behavior and pick this up, when we have a better solution to the problem.

@saurabh500
Copy link
Contributor

saurabh500 commented Apr 9, 2019

@Wraith2 here's an idea.
Right now the SqlEnvChangePool is a global resource. Why not make this a per connection resource i.e. at the level of SqlInternalConnectionTds and that at least gets rid of multiple allocations that used to happen pre #34573. That is much easier to manage.

On a given connection the Pool will never be used concurrently since the packets come in serial order, so at a time for a Connection's EnvChangePool, you are either using the pool or not. That makes the lifecycle of this pool way easier to manage. There is an overhead of allocating the EnvChangePool per connection, but most of these connections are going to be very long lived because of pooling. So way lesser garbage to handle.

If you agree with the idea and think that it is worth pursuing, I think you should send a revert commit for the original PR #34573 and rework on top of that. This will bring master to a good state and then we get your Env pool changes in.

Let me know what you think.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Apr 9, 2019

For the moment the discussion we've had has convinced me that the pool is too complex for uncertain benefit at the moment. I've taken it out.

I like your idea and I think it's something I'll address in more depth later. This has also highlighted that my focus is on very low end performance and that I need to start looking at bigger scale somehow.

Copy link
Contributor

@saurabh500 saurabh500 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@saurabh500
Copy link
Contributor

I need to start looking at bigger scale somehow.

Ask me questions, or open an issue and tag it as a question and area-System.Data.SqlClient cc me, and other usual folks from Sql . If I can contribute any usage pattern I would. I think I can give you some ideas around how the APIs will be used if there are targetted questions, and an overall picture of the lifecycle of different class objects in the larger scheme of things in SqlClient. I agree that I don't know everything about SqlClient but I know a thing or two. For the questions I cannot answer, you and I will be on the same playing field where we both are trying to look through code to reverse engineer concepts :)

For me, @divega has been the go to person to ask questions to, wrt ADO.Net and why APIs were designed the way they were and how they are typically used. In case there are any questions about the public API and how they would be used, he can answer them. I can try to translate his inputs to the internals of SqlClient with whatever knowledge I have.

@saurabh500
Copy link
Contributor

Pending some sanity testing from @afsanehr and team.

@AfsanehR-zz
Copy link
Contributor

Tests passed.

@AfsanehR-zz AfsanehR-zz merged commit af562e8 into dotnet:master Apr 11, 2019
@Wraith2
Copy link
Contributor Author

Wraith2 commented Apr 11, 2019

Excellent, thanks. Sorry I broke it in the first place 😞

@divega
Copy link

divega commented Apr 14, 2019

@Wraith2 what exception do you get when you hit the bug? Just wondering because @davidfowl just sent me a stack trace from a failure in our CI that contains TryProcessEnvChange:

System.FormatException : Input string was not in a correct format.


at System.Number.ThrowOverflowOrFormatException(ParsingStatus status, TypeCode type)

at System.Data.SqlClient.TdsParser.TryProcessEnvChange(Int32 tokenLength, TdsParserStateObject stateObj, SqlEnvChange& sqlEnvChange)

at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)

at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)

at System.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject
stateObj, Boolean isDelegateControlRequest)

at System.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransactionYukon(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest)

at System.Data.SqlClient.SqlInternalConnection.BeginSqlTransaction(IsolationLevel iso, String transactionName, Boolean shouldReconnect)

at System.Data.SqlClient.SqlConnection.BeginTransaction(IsolationLevel iso, String transactionName)

at System.Data.SqlClient.SqlConnection.BeginDbTransaction(IsolationLevel isolationLevel)

at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.BeginTransactionWithNoPreconditions(IsolationLevel isolationLevel)

at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.BeginTransactionAsync(IsolationLevel isolationLevel, CancellationToken cancellationToken)

at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.BeginTransactionAsync(CancellationToken cancellationToken)

at Microsoft.EntityFrameworkCore.Update.Internal.BatchExecutor.ExecuteAsync(DbContext _, ValueTuple`2 parameters, CancellationToken cancellationToken)

at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)

at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChangesAsync(IList`1 entriesToSave, CancellationToken cancellationToken)

at Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.SaveChangesAsync(Boolean acceptAllChangesOnSuccess, CancellationToken cancellationToken)

at Microsoft.EntityFrameworkCore.DbContext.SaveChangesAsync(Boolean acceptAllChangesOnSuccess, CancellationToken cancellationToken)

at
Microsoft.AspNetCore.Identity.EntityFrameworkCore.UserStore`9.CreateAsync(TUser user, CancellationToken cancellationToken) in /_/src/Identity/EntityFrameworkCore/src/UserStore.cs:line 166

at
Microsoft.AspNetCore.Identity.UserManager`1.CreateAsync(TUser user) in /_/src/Identity/Extensions.Core/src/UserManager.cs:line 482

at
Microsoft.AspNetCore.Identity.Test.UserManagerSpecificationTestBase`2.CanChangeEmailWithDifferentTokenProvider() in /_/src/Identity/Specification.Tests/src/UserManagerSpecificationTests.cs:line 1736

@saurabh500
Copy link
Contributor

@divega Do you know what version of SqlClient is being used ?

@Wraith2
Copy link
Contributor Author

Wraith2 commented Apr 14, 2019

That isn't the particular error I got but it's likely it's the same underlying bug. An SqlEnvChange object could end up being used by two connections at the same time so a number of things could go wrong. In my case I got an array return error because of the exact test I was doing.

@saurabh500
Copy link
Contributor

@divega, this is definitely a bug in SQL Client. Even if the SqlClient APIs are used badly, this error should never be thrown. (Unless David's scenario somehow uncovered a bug in the way the server is sending data). I would like to rule out the server being a problem, else we would have received a lot more bug reports.

Let's start with a new SqlClient issue. If you can provide the version of SqlClient that caused this problem, I could check the IL to see if it has the EnvPool optimizations at all, or does it have the change with the EnvPool optimization or does it have the SqlEnvChange token optimizations with the envpool removed.

That should help pinpoint the change that could have caused this issue.

@Wraith2 Wraith2 deleted the sqlfix-envchangepool branch April 18, 2019 18:07
@karelz karelz added this to the 3.0 milestone May 22, 2019
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
SqlClient fix SqlEnvChangePool thread safety bug

Commit migrated from dotnet/corefx@af562e8
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants