-
Notifications
You must be signed in to change notification settings - Fork 4.9k
SqlClient memory performance changes #32811
Conversation
…nvalid handle move some more async continuations o separate functions
@Wraith2 Are you still working on this PR? There is a bunch of commented out code, which can be deleted. I notice the [WIP] in the title, but wanted to double check. |
src/System.Data.SqlClient/src/System/Data/SqlClient/TdsParser.cs
Outdated
Show resolved
Hide resolved
The build failure is caused by the new |
Re The project file has |
Thanks, done. There are a small number of other places in the project that could also use those |
@Wraith2 Exception
Failing Tests
Stack Trace
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requested changes for failing tests on Linux and Mac
I mentioned in my original post that I had run the manual tests that I could, there are ones I can't run because I don't have the databases and setup needed. If there's some information about how to set all of that up then i'll give it a try. Logically I can only see one way for that particular error to happen in that place, if I'm right it's just a case of the input buffer being longer than the value we want to consume and me being careless about my span length. I've pushed a tiny change which would correct this, can you pull and retry? |
@Wraith2 For Linux test execution you could either install and setup the SQL Server just the way you did on Windows, or better approach, you could simply point the Linux test environment variable to point to the already setup, test Sql server. The steps to setup and build on Ubuntu are at https://github.com/dotnet/corefx/blob/master/Documentation/building/unix-instructions.md |
The part I'm missing is Linux experience, I have none. Going from zero to corefx and sql server dev/testing is a big task. I couldn't run most of the manual tests anyway because there no setup documented for them. |
@Wraith2 Did your Manual Tests execution on Windows, execute the failing tests above? |
I doubt this test was executed, the only ones I can run are the basic ones that only need an sql server instance and northwind. These probably weren't in that list. As @keeratsingh has said these don't fail on windows though so I couldn't have caught it anyway. As you can see from the code changes in that area the problem is occurring when there is too much information to continue the write synchronously, since it was passed in as a span it's allocating space and trying to copy into it before passing that into the continuation creation method. |
I've spent the weekend getting an Ubuntu vm setup with sql2017 and the parts needed to compile corefx. I can now compile it and attempt to run tests. I say attempt because even the master tests won't work, the first manual test core dumps with an unrelated assertion fault. A tooling warning from adcade seems to block the sqlperf branch tests running and the build.sh script won't recognise -- as a valid argument so I can't force the test with /p:ForceRunTests=true . Combine this with the StressRunner not working at all on windows and the tests for this module seem to very difficult to setup correctly. The change to WriteByteArray is only needed by the change for float and double at the moment and if needed that set of changes could be dropped. I'd rather not, I'd much rather have a repeatable cross platform set of tests that let me pinpoint the issue but after my last change took 3 months to integrate I don't want the same to happen here. |
Just an update that I am still seeing the failures on Mac, similar to Linux. |
Expected. But I have nowhere to go with this unless someone willing to help more than telling me it's broken. |
// | ||
// Takes a byte array and writes it to the buffer. | ||
// | ||
internal Task WriteByteArray(byte[] b, int len, int offsetBuffer, bool canAccumulate = true, TaskCompletionSource<object> completion = null) | ||
internal Task WriteByteArray(ReadOnlySpan<byte> b, int len, int offsetBuffer, bool canAccumulate = true, TaskCompletionSource<object> completion = null, byte[] array = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The flow of this function is dependent on array
. It is a bit confusing, what is the source of data for this method.
If the byte[] array is defined then it is used to create a Span, else b
is used.
Will it make sense to create an overload of this function, one with ReadOnlySpan b and one which takes byte[] array ? From the looks of it, if b
is passed in and array
is passed in, then the data in array takes precedence, which means that the data in b
will be ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you pass in a span and no byte array then it will attempt to use only the span.
If you pass in an array then the array will be used as the source of the span.
if the span cannot be written into the current packet then the remaining contents of the span are "promoted" to a heap allocated array so they can be part of the async closure that will used to callback into the method to continue.
Once you're got a full array the method will use that in preference to anything else so you can't be in a position where you accidentally pass a span and an array, array wins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Wraith2 This is what I want to prevent in the method. Having to read through code to understand what is the data source 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that's clear enough I can put it into the code at the top of the function, would that work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, make a comment on top of the function explaining this. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a coment to the header of the function explaining how it's used. I've also changed it from internal to private because i'd already provided wrappers for both array and span only, this way you have to be in the same file as the confusing parametered version to call it and at that point you chould read the comment. I also decided to rename it to WriteBytes because it's more accurate. Externally you have to call WriteByteArray or WriteByteSpan.
src/System.Data.SqlClient/src/System/Data/SqlClient/TdsParserStateObject.cs
Outdated
Show resolved
Hide resolved
…ake private to prevent confused external callers
@Wraith2 is this ready for re-review? |
Yes, has been for a while, I thought that's what @saurabh500 was doing... |
I think the code looks good to me. However I want to hold off the merge, till the EntityFrameworkCore tests have been run with SqlClient with these changes. |
Ok, that makes sense. Is the EF test suite available for me to test against? When this PR gets tested it might be a good idea to try #33155 as well, both contribute performance increases and open the way for further work I've identified but am blocked on. |
Yes EF tests are available. It is an open source project as well. https://gist.github.com/saurabh500/7aea92c5bd0f2ad3d12f1dd751e6aec8 The over all idea is to clone and build https://github.com/aspnet/EntityFrameworkCore |
The EFCore tests run successfully. repo verision:
my version:
Your instructions were good but slightly off at the end i think because efcore have changed how it works. Once you update the package version in build\dependencies.props you then have to provide a way to restore that package, not just the dll. I edited build\sources.props to add my corefx bin\packages\Debug directory to the list. Before doing that i ensured that the build was broken by the change to the package version alone so i know it picked up my package from the right place and had to use it. Mine takes longer because i had to use the Debug build nuget packages. -allconfigurations doesn't seem to build the release packages and i can't find how to force build packages, the tooling docs only reference a batch file that no longer exists. There shouldn't be any behavioural difference between the debug and release versions though, if anything more checks are done with the assertions in debug so it may be a better test. |
Nudge. |
@dotnet-bot test UWP CoreCLR x64 Debug Build |
@Wraith2 Nudge acknowledged. |
@dotnet-bot test UWP CoreCLR x64 Debug Build |
@Wraith2 Running EFCore tests on Linux, I see a crash in the test EFCore.SqlServer.FunctionalTests
As Entity frameworks tests enable MARS randomly and for some tests MARS is always enabled, these failures are happening with MARS enabled code path. |
EF tests enable MARS randomly? that would mean they weren't repeatable and would defeat their purpose wouldn't it? That doesn't make sense... Interesting but not related as far as I can tell. The error is coming from corefx/src/System.Data.SqlClient/src/System/Data/SqlClient/TdsParserStateObject.cs Line 2707 in 869208d
My behavioural changes are on write paths, so not involved in this read operation (and If these non-CI tests need to be run for both state object implementations it would make things easier if there were a way to switch between the two without having to recompile or fiddle with app context switches. Can the implementation switch be re-enabled and put on an environment variable? |
Ok, I see what you're seeing now. I'm still certain it isn't directly caused by a change I've made. The non-deterministic behaviour makes me think it's a timing problem. This combined with the quite random order that tests are run makes the error appear in different places each time, even multiple times in the same test when they're run in parallel. I've managed verify that the TCS isn't null, that means that packet._data is null and that can only be the case when there's a problem receiving data in
In order to test I have to do a build -allconfigurations because it's the only way to force a package build. I've tried the Debugger.Break() trick and that won't work, the code is present in the binary but doesn't function for no reason I can determine. So at the moment I can iterate very slowly and my only method of inspection is assertion. This is possibly the worst debugging environment possible, even messageboxes are friendlier. Attempting to isolate the problem under these conditions is totally impractical. |
To remove the MARS randomization in EF tests, you could change the tests here #32811 (comment) to make sure that MARS is always set to true. MARS - MultipleActiveResultSets cc @danmosemsft @karelz, in case there are any pointers for debugging on Linux? |
Debugging on Linux isn't needed. The managed/unmanaged switch can be hardcoded to force use of the managed path on windows, this is what I've done and replicated the issues reported. It would be useful if this switch was available at runtime even if only in debug builds but I can get around it. In order to replicate the issue seen at the moment you have to have build sqlclient, publish the nuget, consume that nuget from efcore and run the tests until they fail at some point, which they will but not reliably and not in the same test. When they do fail there is no debug output, attaching a debugger won't get you much information because you don't know when the problem will occur and when it does occur since you're using a package you don't get code or il you're down at assembly level. It may be possible to fix some of these issues but it'll still be slow. I don't believe that the managed implementation is working correctly. As I said above none of the changes I've made have affected this path through the code. The problem we're seeing looks like the tcp socket is disconnecting and I've been nowhere near that and any timing changes should not affect how this functions. As you're aware that's a large complex and async set of code if there are problems inside it then I'm not going to attempt to isolate and quantify them with the absolute minimum level of debugging support currently available. My day job is maintenance programming and I wanted to work on core to get away from inheriting problems and having to fix them without help and have a community to work with and improve something. |
@Wraith2 We understand it's difficult to debug this issue given that you have to generate a package and use the same on EFCore tests. What we would suggest is to break down the PR into three different sections as you mentioned above. |
I can split apart the changes but they will have the same effect when you combine them. If i'm right that the error isn't actually coming from my code changes then there's no reason we won't just end up in the same state again, and it won't be because the last change is the cause it'll be that it's the one that changed the timing enough to make it apparent. Have you investigated and do you agree with my conclusions? or have you found information that leads you to think differently? If you want me to close this and rework in parts can you review the other two open PR's, they're entirely independant of this one but weren't being looked at waiting for this one. #33580 and #33155 |
I was able to run the EFCore tests on Linux successfully without your changes
👍 Will start looking in to these PRs too. I was just wondering if you had already run EFCore tests with those changes as well and if not can you please run them and let us know the results? Thanks! |
Good news! I have a minimal reproduction of the problem you found with the ReadAsyncCallback. Steps to repro are pretty simple. Checkout current master, change UseManagedSNI to true, build and run the manual tests, watch it fail:
Tracking back you can find the inner exception from here which will show you a SocketException containing an IOException identifying that the socket connection was closed. This causes null data in the packet and that bubbles up to where you see it. I opened an issue for this with some more details, https://github.com/dotnet/corefx/issues/33930 |
While doing some profiling I realised that SqlClient related classes were causing a lot of memory churn so I looked into why. I've made some performance changes which affect all query paths to some extent but at most effective when using a single command object with a prepared query, in that case the memory usage is almost 10x lower.
Points to note:
_SqlRPC
object which change how the system and user parameters are allocated and stored. Originally the system parameters were allocated each execution, I've altered this so the correct number are allocated as needed and re-used cleanly if the query type changes (between prepare, unprepared, proc and exec). User parameters are no longer copied in but referenced by index into the parameter collection. This will need careful review.WriteByteArray
path and used that to allow the newBitConverter.TryWriteBytes
overload with stackalloced buffers to remove memory allocations for writing of 16 and 32 bit reals. This path contains an async continuation when the buffer needs to be expanded which now allocates and copies to a byte[] only when needed.Things that can still be improved with more extensive knowledge:
TdsStateParserNative.ReadSyncOverAsync
when calling into sni.dll. If the return type could be changed to IntPtr it would improve things.SqlCommand.ValidateCommand
unconditionally calls intoValidateAsyncCommand
which allocates an async state object.ValidateCommand
has a bool async arg but it isn't used which I find suspicious that someone would miss, I haven't changed this in case it's deliberate. Please review.All the SqlClient tests pass locally. All the manual tests I'm able to run pass. All my testing of using it with various queries works. So I think it's stable and working, given the complexity of the library and the wide usage it will need some careful prodding from reviewers.
I can't get benchmarkdotnet to co-operate with the project despite help from people in gitter/corefx's help. I do have profiling output which shows the level of improvement. I have written the benchmarks and was using them on the original problem code I used the same setup/run/teardown loop from the test to profile with. The tests consist of generating 1000 numbers and then inserting them into a table using a prepared/unprepared/sproc query, and looping that.
I'm not an async expert so I haven't done much work on the async paths. They still work, they're no slower and they do benefit from the changes I've done. The profiling shows that async machinery dominates the allocations running the same test rig using async versions of functions.
Before changes, 100x1000 float32 prepared insert:
![sqlclient-reference](https://user-images.githubusercontent.com/13322696/46918678-bb73b300-cfcc-11e8-8f77-897d769ff47f.PNG)
After changes:
![sqlclient-development](https://user-images.githubusercontent.com/13322696/46918687-d2b2a080-cfcc-11e8-8955-b7122e48e8d6.PNG)
FAO: @keeratsingh, @afsanehr, @David-Engel