Reading large fields asynchronously is extremely slow #245
If the payload of the column crosses multiple packets, the data has to be buffered until the end of the data is found. That's going to cause repeated re-allocations: in the case where the full length of the data isn't known, it'll keep reallocating and copying into the new buffer. Bizarrely, the best way to read it may be to use sequential access mode with a stream, because that may only keep the current packet. I'll have a tinker with it, but if I'm right about how it works it would be a substantial change to alter the behaviour, if it's possible at all.
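As a rough illustration (this is not the driver's actual code), the grow-and-copy pattern for the unknown-length case looks something like this, assuming an 8K starting buffer:

```csharp
using System;

// Minimal illustration (not SqlClient's actual code) of the grow-and-copy
// pattern described above: when the total length is unknown, every overflow
// allocates a larger array and copies the old contents into it.
class GrowBuffer
{
    private byte[] _buffer = new byte[8192];
    private int _used;

    public void Append(byte[] packet, int count)
    {
        if (_used + count > _buffer.Length)
        {
            // Reallocate and copy; this repeats on every overflow, which is
            // the re-allocation cost described above.
            var bigger = new byte[Math.Max(_buffer.Length * 2, _used + count)];
            Array.Copy(_buffer, bigger, _used);
            _buffer = bigger;
        }
        Array.Copy(packet, 0, _buffer, _used, count);
        _used += count;
    }
}
```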
That makes sense for non-sequential, but isn't the whole point of sequential access to stream large values? In other words, it's expected for non-sequential access to be slow (no complaints), but not for SequentialAccess. Note that this is a regression from this benchmarked behavior of the .NET Framework driver (note that I didn't confirm/rerun this), where sequential access is fast. In addition, this should all have nothing to do with sync vs. async... I'm guessing that under the hood there's a completely different implementation for these two paths, and the async one has the issue we're seeing.
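For illustration, streaming such a value with sequential access might look like this; a minimal sketch where the table and column names ("Docs", "Body") are made up:

```csharp
using System;
using System.Data;
using System.IO;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

class StreamLargeField
{
    // Streams a large text column instead of buffering it whole.
    static async Task<string> ReadBodyAsync(string connectionString, int id)
    {
        using var conn = new SqlConnection(connectionString);
        await conn.OpenAsync();

        using var cmd = new SqlCommand("SELECT Body FROM Docs WHERE Id = @id", conn);
        cmd.Parameters.AddWithValue("@id", id);

        // SequentialAccess tells the reader it may discard data as it is
        // consumed, so only the current packet needs to stay in memory.
        using var reader = await cmd.ExecuteReaderAsync(CommandBehavior.SequentialAccess);
        if (!await reader.ReadAsync())
            throw new InvalidOperationException("Row not found.");

        // GetTextReader streams the value in chunks rather than materializing it.
        using TextReader text = reader.GetTextReader(0);
        return await text.ReadToEndAsync();
    }
}
```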
I can't speak to intent. I can tell you how the code behaves when that option is set, though. The most relevant behaviour is that it specifically maintains only the currently read column. I'm setting up for some benching and I'll see what I can find and whether there's any room for improvement.
Well, there's definitely a problem with how async reading of PLP data is done. It's not a simple one though; it's deeply rooted in how async works. I'll explain what I've found so far because this is as good a place as any to put the information for later reference.

A standard column value is of fixed length, which is easy to deal with. Non-fixed-width data has to be prefixed with its length and then possibly reconstructed from multiple chunks. There's also an unknown-length variant where the user just has to keep requesting data until they reach an end-of-data packet. We're dealing with the middle option: data that spans multiple packets and is formatted as partially length-prefixed (PLP) data with a known length.

Reading this in sync mode is pretty straightforward: you just keep getting packets and pulling their contents into the byte buffer at the current offset until you reach the end. Then you call the text encoding that corresponds to the column's encoding to convert your bytes into a string. Almost all of the memory overhead in sync mode is this string conversion. The only interruptions that are possible are internally generated.

In async mode things get complicated. Because any async method has to be interruptible by cancellation tokens and external events, you have to handle state storage until a resolution to the read has been reached. This mechanism is snapshots. All async operations create a snapshot of the current reader state as their first action. Data is then requested, and when received it is appended to the snapshot packet list and the snapshot is replayed to see if it can reach a resolution. When you start reading the data from the first packet in the snapshot, the data is identified as being part of a larger array of known length. An array is allocated for the full length and an attempt is made to read all the data; this fails because we don't have all the data (most of the time). Once it fails, the current packet is consumable but not consumed, so we can't discard it; it's left in the snapshot and we wait for more data.

Hopefully you've just seen the problem. For every packet that arrives, all packets in the unresolved chain are re-processed in the snapshot replay, and that reprocessing allocates an array large enough for all the output every time. So divide the size of the data (5 MiB) by roughly 8K chunks for the default packet size, and you'll have that number minus one dropped 5 MiB arrays. The profile traces are quite impressive.

I've already got one PR in progress that optimizes the mechanics of snapshots. It won't help here; this is a logical problem. What I'd like to understand is why it was chosen to implement a full replay each time a packet arrives rather than trying to keep some additional state in the snapshot. As I see it, there is no branch capability in the replay: the only possible results are {Success, Failure, Continue}, and if that's the case and backtracking can't happen, it would be possible to augment snapshots with continuation state. To know this you'll need to find someone who was around at the time or implemented this functionality, because they'll know what the limiting factor was here; there could well be a hidden requirement I can't divine from the code.
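To make the cost concrete, here is a schematic model (not the real SqlClient code) of the replay behaviour just described:

```csharp
using System;
using System.Collections.Generic;

// Schematic model (not the actual SqlClient code) of replaying a snapshot
// for a known-length PLP value split across packets.
static class SnapshotReplayModel
{
    // Called once per arriving packet, replaying every buffered packet.
    public static byte[] TryRead(List<byte[]> snapshotPackets, int totalLength)
    {
        // A full-length array is allocated on *every* replay attempt...
        var result = new byte[totalLength];
        int offset = 0;
        foreach (byte[] packet in snapshotPackets)
        {
            int toCopy = Math.Min(packet.Length, totalLength - offset);
            Array.Copy(packet, 0, result, offset, toCopy);
            offset += toCopy;
        }
        // ...and discarded whenever the data is still incomplete.
        return offset >= totalLength ? result : null;
    }
}
// 5 MiB of data in ~8 KB packets means roughly 640 replays, i.e. ~639
// discarded 5 MiB arrays before the read finally resolves.
```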
Thanks for the detailed info @Wraith2, it's indeed quite interesting and raises some questions. Two immediate comments that I have:
I suspect the first iteration of the async code will have used sync-over-async, in which case you consume an additional thread but get identical (fairly good) performance to the sync version. I'd have to find an original version from that time and dive into the IL to verify that, and I'm not sure it helps to do so; I'd just accept that perf used to be better and get on with improving it for the future. All the variable-length fields (varchar, varbinary, image, ntext, etc.) seem to go down the same route to get the bytes, and then if needed they're transformed once the entire contents are known. That places an upper limit on the size of those fields, which is worked around with stream capability like GetTextReader, as you mentioned, which bypasses buffering the whole thing; that's the non-prefixed variable length I mentioned.
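For the binary variants, the equivalent chunked path is GetBytes under sequential access; a minimal sketch, assuming the blob column is at ordinal 0:

```csharp
using System.IO;
using Microsoft.Data.SqlClient;

static class BlobReader
{
    // Chunked read of a varbinary(max)/image column (ordinal 0 assumed) on a
    // reader opened with CommandBehavior.SequentialAccess, so the full value
    // is never buffered in memory at once.
    public static void CopyToStream(SqlDataReader reader, Stream destination)
    {
        var chunk = new byte[8192];
        long fieldOffset = 0;
        long read;
        while ((read = reader.GetBytes(0, fieldOffset, chunk, 0, chunk.Length)) > 0)
        {
            destination.Write(chunk, 0, (int)read);
            fieldOffset += read;
        }
    }
}
```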
Yes. But the snapshot is present for good reasons, because some things like attn and envchange need to be processed on the real current state even when received in a snapshot context. There's plenty of async closure stuff going on in the library, and that might even be part of the problem here. If I squint a bit I can sort of see how this can be corrected to be a linear operation. It's going to get a bit messy because the snapshot functionality is integrated with a couple of different layers, but there's no doubt it's worth doing. I would like some confirmation of the no-backtracking possibility when processing a PLP chunk stream, though.
There's a pretty simple way to change the behaviour. Before, restating for comparison so the numbers are from the same machine:
After:
It's not a very nice solution because it breaks encapsulation by making the PLP data read aware of the snapshot and using it to store the target buffer between attempts to complete the execution. 3x faster, 400 times less GC. It's still almost 10x slower than the sync version, but there's no way to fix that within the current design: at the moment it performs a complete attempt to fill the buffer on every packet received and stores all received packets in a queue until the command succeeds or is cancelled.

The good solution is to change the entire way that async behaves and make it store continuation progress in the intermediate buffer before returning. This would decrease memory usage by removing the packet queue and increase speed by removing the CPU wasted in continual replays. It's a major task that needs official agreement and planning. Perf will also be improved by #198
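A rough sketch of what that continuation state could look like (purely illustrative; none of these types exist in the driver):

```csharp
using System;

// Purely illustrative: the {Success, Failure, Continue} resolution model
// with progress kept between packets instead of a full replay each time.
enum ReadResult { Success, Failure, Continue }

class PlpContinuationState
{
    private readonly byte[] _target;   // allocated once, at the known full length
    private int _offset;               // progress preserved across packets

    public PlpContinuationState(int totalLength) => _target = new byte[totalLength];

    // Consume one newly arrived packet; no earlier packet is ever revisited,
    // so the packet queue and the repeated full-length allocations go away.
    public ReadResult Consume(byte[] packet)
    {
        int toCopy = Math.Min(packet.Length, _target.Length - _offset);
        Array.Copy(packet, 0, _target, _offset, toCopy);
        _offset += toCopy;
        return _offset == _target.Length ? ReadResult.Success : ReadResult.Continue;
    }

    public byte[] Result => _target;
}
```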
Ack: PRs for perf improvements are on hold for review for a while, until we get the driver stabilized for breaking issues.
@Wraith2 can you share the project you used to benchmark this issue? I had a chat with @roji and we are seeing worse benchmarks than were originally reported.
We are wondering how the original results were obtained. @roji is going to backtrack to the version of SqlClient that he may have run the original benchmarks on. @Wraith2, in the meanwhile, can you provide your repro project? In case this is a regression in the version of SqlClient that we have been testing with, we need to remove the regression first. The repo we have used is https://github.com/roji/SqlClientAsyncBenchmark
This was run on Windows as well as Linux. The benchmarks show 3.6 GB of allocations for async :(
It isn't the version of SqlClient; it seems to be a difference between BenchmarkDotNet 0.11.5 and the recently-released 0.12.0. I'll investigate a bit more deeply to understand which one is reporting correct results.
Here we go, from the release notes for BDN 0.12.0:
I'll reach out to the BDN folks, but I'd tend to trust the new results by default unless we find out otherwise...
I noted in my PR #285 that BDN wasn't showing the correct memory figures and used dotMemory results to show the difference made by my changes. The new numbers you give look much more reasonable.
Makes sense now, thanks @Wraith2!
@Wraith2 Sorry for posting on an old issue but my rabbit hole led here. The last benchmark you mentioned here states that async runs ~10x slower than sync. Is this still the case? Has this been addressed in another issue in this repo or, if not, is any work planned?
Yes, this is still the case. It can't be addressed without deep changes to the library. At the moment I'm working on consolidating the netcore and netfx codebases so that we don't have to work on two different libraries at once, which will hopefully free up development time overall. As for plans: I'm an external contributor and know only what is posted publicly on this repo. I plan to work on async, but it's going to be a slow project once I get started and I won't be starting anytime soon. What the owning team intends, I have no idea. Stability and new features seem to be the current goals.
Understood. Thanks for your response and contributions to such a critical library.
I still see the same problem on 3.1.1, and I can't understand: the issue is closed, but when was it fixed?
This individual issue was sped up with a cache. The larger issue is still open under #593.
This is still a major issue. I have been troubleshooting spikes in CPU and slow SQL statements only to end up at this thread. GetClaimsAsync in Microsoft.AspNetCore.Identity calls SQL against a table that uses nvarchar(max) for claims and values. Changing the columns to nvarchar(255) seems to resolve the issue. This issue is having a knock-on effect on other libraries :(
Reading a large (e.g. 5MB) string with SqlDataReader asynchronously is 10 times slower than doing it synchronously (see BDN benchmark below); at the same time, it allocates much less memory. The same behavior reproduces both with CommandBehavior.Default and CommandBehavior.SequentialAccess, although I suspect there may be some interaction with the command behavior under the hood. Memory-wise, the expectation would be for SequentialAccess to use much less memory compared to the default command behavior, regardless of sync/async.
The tests were run with .NET Core 3.0 on Ubuntu 19.04 against SQL Server running on localhost.
Related links:
Benchmark code:
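(The benchmark code itself didn't survive in this capture of the thread; below is a minimal BenchmarkDotNet sketch of the scenario described above. The connection string, table and column names are placeholders, and the original benchmark likely differed in its details.)

```csharp
using System.Data;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using Microsoft.Data.SqlClient;

[MemoryDiagnoser]
public class LargeFieldBenchmark
{
    // Placeholder setup: a table [data] holding one ~5MB nvarchar(max) value in [text].
    private const string ConnectionString =
        "Server=localhost;Database=test;Integrated Security=true";

    [Params(CommandBehavior.Default, CommandBehavior.SequentialAccess)]
    public CommandBehavior Behavior { get; set; }

    [Benchmark(Baseline = true)]
    public string Sync()
    {
        using var conn = new SqlConnection(ConnectionString);
        conn.Open();
        using var cmd = new SqlCommand("SELECT TOP 1 [text] FROM [data]", conn);
        using var reader = cmd.ExecuteReader(Behavior);
        reader.Read();
        return reader.GetString(0);
    }

    [Benchmark]
    public async Task<string> Async()
    {
        using var conn = new SqlConnection(ConnectionString);
        await conn.OpenAsync();
        using var cmd = new SqlCommand("SELECT TOP 1 [text] FROM [data]", conn);
        using var reader = await cmd.ExecuteReaderAsync(Behavior);
        await reader.ReadAsync();
        return reader.GetString(0);
    }

    public static void Main() => BenchmarkRunner.Run<LargeFieldBenchmark>();
}
```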
Results: