-
Notifications
You must be signed in to change notification settings - Fork 4.9k
SqlClient fix SqlEnvChangePool thread safety bug #36735
Conversation
Is it possible to add a test case to this PR? |
src/System.Data.SqlClient/src/System/Data/SqlClient/TdsParserHelperClasses.cs
Outdated
Show resolved
Hide resolved
} | ||
|
||
private static readonly SqlEnvChangeWrapper[] _items = new SqlEnvChangeWrapper[Environment.ProcessorCount * 2]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain the size of _items being set to ProcessorCount * 2 ?
Also add a comment on why such a decision is being made?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's an arbitrary choice at the moment. In general env changes are handled quite quickly so the lifetime outside the pool is quite short and the number of items I've seen so far is small. As such a small pool uses a small amount of memory and benefits reasonably well on low frequency querying. For high frequency you're not going to see if for all the other work going on.
I'm open to any suggestions on a better choice. This seems reasonable from my perspective but my perspective does not encompass large scales through lack of experience.
src/System.Data.SqlClient/src/System/Data/SqlClient/TdsParserHelperClasses.cs
Outdated
Show resolved
Hide resolved
retval = new SqlEnvChange(); | ||
} | ||
return retval; | ||
return new SqlEnvChange(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allocated SqlEnvChange
will not be tracked by _items
This seems counter intuitive to the design.
When this item is released in the Release
call below, it will not go back to the _items. If _items
is not gong to be a container for all the SqlEnvChange objects, then why have it in the first place ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know if I have missed anything
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SqlEnvChange is used in a small number of places. All allocation is done through this function. The pool is designed to be a buffer but not a block so if we have a free object hanging around then use it but if we're running fast and don't have a spare object I don't want to wait for someone to return one to the pool, just create a new one. Like allocation release is always done through the pool so if an item was new'ed rather than being sourced from the pool it will still be given back to the pool. If the pool is full the item will simply be discarded.
This behaviour means that at low frequency somewhere below the pool threshold all items are likely to be sourced from the pool which makes the operations lower memory than they otherwise would be. At high frequency the pool will be recycled quickly but some items are still going to be new'ed and dropped to GC, overall the operations will have lower GC pressure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if this assumption works very well. SqlEnvChange is used in a small number of places. But then the places it is used can happen concurrently and with high through put.
For applications with small load, I don't care as much for this change, however the applications where the improvements is useful are web apps which will spin up a lot of connections at the same time from different threads, potentially from few connection pools. Those are the real world cases. In those cases, the number of env change token from the pool will be exhausted quickly enough.
I don't have a quantification on how big the Env Change pool should be, but I do know that ProcessCount * 2 doesn't get close enough to the concurrent connection open requests and transaction promotions that happen in real world applications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. then either we need a larger pool or to drop it entirely, I'm fine with either.
In low memory cases having a very large pool would be a constant cost and possibly go mostly unused. In the case of bringing up an entire pool how many concurrent connections are likely to be in the env change code at once do you think? my instinct says it's likely to be high so a high pool would be best.
The list and array rental changes of the original RP are still beneficial to all cases. I suspect the pool is of dubious benefit without some perf numbers to back it up at this point. Would you agree @afsanehr ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am leaning towards dropping this pool till we can have a more solid reasoning behind the size, and have a one size fits all model.
I don't have any ideas to improve this right now. Let's try to keep this simple (though possible less performant in some cases) and remove the pool.
@@ -300,103 +300,40 @@ internal void Clear() | |||
|
|||
internal static class SqlEnvChangePool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A comment about the behavior of this class would be nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added, and a couple of others inline to explain why items are new'ed and possibly dropped to the GC. Let me know if that explains things better and what else might be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sentence with a comma should be split into two sentences. (http://www.sussex.ac.uk/informatics/punctuation/comma/joining)
Sentences should consistently have a capital letter and an end mark.
Yes but I'm not sure how. There's a stress tests project but I've never used it or seen instructions on how to do so. Clearly the manual and functional tests don't catch it because they each look for specific behaviour and this bug manifests purely under load. While I'm certain enough of it to propose this fix catching it in action directly involves some extremely precise timing. The alternative to this is to not pool the Sq;EnvChange objects at all. I considered that but I'd like to keep the performance benefits for low load that the pool allows. If this PR isn't acceptable it would be the next option, if you'd prefer to remove the pooling entirely it'd be understandable, just let me know. |
src/System.Data.SqlClient/src/System/Data/SqlClient/TdsParserHelperClasses.cs
Outdated
Show resolved
Hide resolved
src/System.Data.SqlClient/src/System/Data/SqlClient/TdsParserHelperClasses.cs
Outdated
Show resolved
Hide resolved
I have been thinking about the failure to catch regression in this area. Right now I am not in a favor of rollback. Let's try to drive this issue to completion. There is an urgency to get this fix in, and to improve the test coverage for SqlClient so that we can safely take more PRs in these areas. If for any reason we cannot build up the confidence in the fix, we can consider a rollback. Let's give this issue a 48 hour deadline failing which we can consider a rollback. Adding DataAccess performance to tests should be done in a separate PR since that is the only way to catch this issue and bringing that test in is lot more code than fixing this issue. However in terms of porting the test in corefx, I know that it will be straight-forward. @Wraith2 Can I put this test port in your plate? I know that you are validating the fix with the Dataaccess perf benchmark. |
src/System.Data.SqlClient/src/System/Data/SqlClient/TdsParserHelperClasses.cs
Outdated
Show resolved
Hide resolved
@@ -318,21 +326,28 @@ internal static SqlEnvChange Allocate() | |||
return item; | |||
} | |||
} | |||
// if we didn't find an item in the pool we've either never craete any or we're under pressure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
craete
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You wouldn't believe I'm an English monoglot based on my comments would you 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes, I would, believe me :D
I found the issue while I was 3 minutes into a profile run on the release build. It took a while to work out exactly what was going on because the error was reporting as being thrown at
The test has to run fast enough to get the same item returned twice from the pool, returned at the same time and for both callers to be executing inside the if block concurrently but slightly out of step so one works and the other then sees that result. That's some tricky timing. I couldn't catch it in debug builds because the code just ran too slowly with asserts present. I'm not against adding more tests but it wouldn't be deterministic and if it worked most of the time that would be a false security. I think this is a case where we have to be absolutely certain of the building blocks used. The pool I provided was not good enough. I hadn't realised that there was a problem with it and because of the nature of concurrent behaviours it would be very difficult to see in code review. I should have flagged this as a high risk in the original PR. The revised design is inspired by a pool in shipping aspnet codebase which gives me some confidence. I asked @jnm2 to look at it because he's good with concurrency and hopefully his and your feedback has prevented me introducing more errors. At this point my recommendation would be to take this PR if everyone is confident in the new pool but if anyone has any doubts we remove the pool and go to standard new/gc release semantics. |
This is good context for the serviceability of the component. Thanks. When I look at the original PR in combination with this PR, the overall goal is to optimize to get away from GC being invoked. I am leaning a bit more on going back to the GC behavior. It may be better to rollback to older behavior and pick this up, when we have a better solution to the problem. |
@Wraith2 here's an idea. On a given connection the Pool will never be used concurrently since the packets come in serial order, so at a time for a Connection's EnvChangePool, you are either using the pool or not. That makes the lifecycle of this pool way easier to manage. There is an overhead of allocating the EnvChangePool per connection, but most of these connections are going to be very long lived because of pooling. So way lesser garbage to handle. If you agree with the idea and think that it is worth pursuing, I think you should send a revert commit for the original PR #34573 and rework on top of that. This will bring master to a good state and then we get your Env pool changes in. Let me know what you think. |
For the moment the discussion we've had has convinced me that the pool is too complex for uncertain benefit at the moment. I've taken it out. I like your idea and I think it's something I'll address in more depth later. This has also highlighted that my focus is on very low end performance and that I need to start looking at bigger scale somehow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Ask me questions, or open an issue and tag it as a For me, @divega has been the go to person to ask questions to, wrt ADO.Net and why APIs were designed the way they were and how they are typically used. In case there are any questions about the public API and how they would be used, he can answer them. I can try to translate his inputs to the internals of SqlClient with whatever knowledge I have. |
Pending some sanity testing from @afsanehr and team. |
Tests passed. |
Excellent, thanks. Sorry I broke it in the first place 😞 |
@Wraith2 what exception do you get when you hit the bug? Just wondering because @davidfowl just sent me a stack trace from a failure in our CI that contains TryProcessEnvChange:
|
@divega Do you know what version of SqlClient is being used ? |
That isn't the particular error I got but it's likely it's the same underlying bug. An SqlEnvChange object could end up being used by two connections at the same time so a number of things could go wrong. In my case I got an array return error because of the exact test I was doing. |
@divega, this is definitely a bug in SQL Client. Even if the SqlClient APIs are used badly, this error should never be thrown. (Unless David's scenario somehow uncovered a bug in the way the server is sending data). I would like to rule out the server being a problem, else we would have received a lot more bug reports. Let's start with a new SqlClient issue. If you can provide the version of SqlClient that caused this problem, I could check the IL to see if it has the EnvPool optimizations at all, or does it have the change with the EnvPool optimization or does it have the SqlEnvChange token optimizations with the envpool removed. That should help pinpoint the change that could have caused this issue. |
SqlClient fix SqlEnvChangePool thread safety bug Commit migrated from dotnet/corefx@af562e8
In PR #34573 a static pool of
SqlEnvChange
objects was added. This pool has been found to be unsafe causing a client crash.Under high load while running the data access benchmark at https://github.com/aspnet/DataAccessPerformance an exception occurred which I tracked back to newBinValue being null on line 276.
corefx/src/System.Data.SqlClient/src/System/Data/SqlClient/TdsParserHelperClasses.cs
Lines 271 to 279 in 7b320a0
The object is not accessed from multiple threads and it is not possible to invoke the Clear method on the same object twice unless it has been returned to two different callers which means the pool is not safe.
It has been replaced with a simpler version inspired by the ObjectPool in aspnet which doesn't count it just uses interlocks throughout. This version does not crash under load. The functional and manual tests all pass but they passed before. This is a bug fix and it is important to get in integrated quickly if possible.
/cc @afsanehr, @tarikulsabbir, @Gary-Zh , @david-engelh @saurabh500