
Reading large fields asynchronously is extremely slow #245

Closed
roji opened this issue Oct 4, 2019 · 20 comments · Fixed by #285
Labels
📈 Performance Issues that are targeted to performance improvements.
Comments

@roji
Member

roji commented Oct 4, 2019

Reading a large (e.g. 5MB) string asynchronously with SqlDataReader is 10 times slower than doing it synchronously (see BDN benchmark below); at the same time, it allocates much less memory. The same behavior reproduces with both CommandBehavior.Default and CommandBehavior.SequentialAccess, although I suspect there may be some interaction with the command behavior under the hood. Memory-wise, the expectation would be for SequentialAccess to use much less memory than the default command behavior - regardless of sync/async.

The tests were run with .NET Core 3.0 on Ubuntu 19.04 against SQL Server running on localhost.

Related links:

Benchmark code:

[MemoryDiagnoser]
public class Benchmarks
{
    const string ConnectionString = "...";

    [Params(CommandBehavior.Default, CommandBehavior.SequentialAccess)]
    public CommandBehavior Behavior { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        using var conn = new SqlConnection(ConnectionString);
        conn.Open();

        using (var cmd = new SqlCommand("IF NOT EXISTS (SELECT * FROM sysobjects WHERE name='TextTable' AND xtype='U') CREATE TABLE [TextTable] ([Text] VARCHAR(MAX))", conn))
            cmd.ExecuteNonQuery();

        using (var cmd = new SqlCommand("INSERT INTO [TextTable] ([Text]) VALUES (@p)", conn))
        {
            cmd.Parameters.AddWithValue("p", new string('x', 1024 * 1024 * 5));
            cmd.ExecuteNonQuery();
        }
    }

    [Benchmark]
    public async ValueTask<int> Async()
    {
        using var conn = new SqlConnection(ConnectionString);
        using var cmd = new SqlCommand("SELECT [Text] FROM [TextTable]", conn);
        await conn.OpenAsync();

        await using var reader = await cmd.ExecuteReaderAsync(Behavior);
        await reader.ReadAsync();
        return (await reader.GetFieldValueAsync<string>(0)).Length;
    }

    [Benchmark]
    public async ValueTask<int> Sync()
    {
        using var conn = new SqlConnection(ConnectionString);
        using var cmd = new SqlCommand("SELECT [Text] FROM [TextTable]", conn);
        conn.Open();

        using var reader = cmd.ExecuteReader(Behavior);
        reader.Read();
        return reader.GetFieldValue<string>(0).Length;
    }
}

Results:

BenchmarkDotNet=v0.11.5, OS=ubuntu 19.04
Intel Core i7-6700HQ CPU 2.60GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100
  [Host]     : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), 64bit RyuJIT
  DefaultJob : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), 64bit RyuJIT

| Method | Behavior | Mean | Error | StdDev | Median | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |--------- |-----:|------:|-------:|-------:|------:|------:|------:|----------:|
| Async | Default | 839.58 ms | 43.2838 ms | 127.6233 ms | 930.22 ms | 98000.0000 | 97000.0000 | 96000.0000 | 11.31 KB |
| Sync | Default | 61.07 ms | 1.0988 ms | 1.0278 ms | 60.80 ms | 333.3333 | 222.2222 | 222.2222 | 10600.7 KB |
| Async | SequentialAccess | 813.80 ms | 44.3059 ms | 130.6370 ms | 767.23 ms | 127000.0000 | 126000.0000 | 125000.0000 | 11.31 KB |
| Sync | SequentialAccess | 69.54 ms | 0.6815 ms | 0.6375 ms | 69.45 ms | 375.0000 | 250.0000 | 250.0000 | 10672.5 KB |
@Wraith2
Contributor

Wraith2 commented Oct 4, 2019

If the payload of the column crosses multiple packets, the data has to be buffered until the end of the data is found. That's going to cause repeated re-allocations. In the case where the full length of the data isn't known, it'll keep reallocating and copying into a new buffer. Bizarrely, the best way to read it may be to use sequential access mode with a stream, because that may keep only the current packet.

I'll have a tinker with it, but if I'm right about how it works, it would be a substantial change to alter the behaviour - if it's possible at all.
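
For reference, the "sequential access mode with a stream" approach mentioned above looks roughly like this with the public SqlDataReader API (a sketch reusing the table and column names from the benchmark; processing of each chunk is left as a placeholder):

```csharp
// Sketch: stream a large string column instead of materializing it at once.
using var conn = new SqlConnection(ConnectionString);
conn.Open();
using var cmd = new SqlCommand("SELECT [Text] FROM [TextTable]", conn);
using var reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess);
reader.Read();

using var text = reader.GetTextReader(0);   // decodes incrementally
var chunk = new char[8192];
int read;
while ((read = text.Read(chunk, 0, chunk.Length)) > 0)
{
    // process chunk[0..read) without ever holding the full 5 MB string
}
```

GetStream(ordinal) offers the same shape for binary columns.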

@roji
Member Author

roji commented Oct 5, 2019

If the payload of the column crosses multiple packets the data has to be buffered until the end of the data is found

That makes sense for non-sequential access, but isn't the whole point of sequential access to stream large values? In other words, it's expected for non-sequential access to be slow (no complaints), but not for SequentialAccess. Note that this is a regression from the benchmarked behavior of the .NET Framework driver (which I didn't confirm/rerun), where sequential access is fast.

In addition, this should all have nothing to do with sync vs. async... I'm guessing that under the hood there's a completely different implementation for these two paths, and the async one has the issue we're seeing.

@Wraith2
Contributor

Wraith2 commented Oct 5, 2019

That makes sense for non-sequential, but isn't the whole point of sequential access to stream large values? In other words,

I can't speak to intent. I can tell you how the code behaves when that option is set, though. The most relevant behaviour is that it maintains only the currently read column's SqlBuffer, clearing out previous ones as reads are done, so you can't step back. That does align with the purpose of reading large values. I don't remember the de-packetization chunk handling code being particularly specialized.

I'm setting up for some benching and I'll see what I can find and if there's any room for improvement.

@Wraith2
Contributor

Wraith2 commented Oct 6, 2019

Well, there's definitely a problem with how async reading of PLP data is done. It's not a simple one, though; it's deeply rooted in how async works. I'll explain what I've found so far, because this is as good a place as any to put the information for later reference.

A standard column value is of fixed length, which is easy to deal with. Non-fixed-width data has to be prefixed with its length and then possibly reconstructed from multiple chunks. There's also an unknown-length variant where the user just has to keep requesting data until they reach an end-of-data packet. We're dealing with the middle option: this data spans multiple packets and is formatted as partially length-prefixed (PLP) data with a known length.

Reading this in sync mode is pretty straightforward: you keep getting packets and pulling their contents into the output byte buffer at the current offset until you reach the end. Then you call the text encoding that corresponds to the column to convert your bytes into a string. Almost all of the memory overhead in sync mode is this string conversion. The only interruptions that are possible are internally generated.

In async mode things get complicated. Because any async method has to be interruptible by cancellation tokens and external events, you have to handle state storage until a resolution to the read has been reached. This mechanism is snapshots. All async operations create a snapshot of the current reader state as their first action. Data is then requested, and when received it is appended to the snapshot packet list and the snapshot is replayed to see if it can reach a resolution.

When you start reading the data from the first packet in the snapshot, the data is identified as being part of a larger array of known length. An array is allocated for the full length and an attempt is made to read all the data; this fails because we don't have all the data (most of the time). Once it fails, the current packet is consumable but not consumed, so we can't discard it - it's left in the snapshot and we wait for more data.

Hopefully you've just seen the problem. For every packet that arrives, all packets in the unresolved chain are re-processed in the snapshot replay, and that reprocessing allocates an array large enough for all the output every time. So divide the 5 MiB of data by the roughly 8K chunks of the default packet size, and you'll have that number minus one dropped 5 MiB arrays. The profile traces are quite impressive.

I've already got one PR in progress that optimizes the mechanics of snapshots, but it won't help here - this is a logical problem. What I'd like to understand is why a full replay on each arriving packet was chosen over keeping some additional state in the snapshot. As I see it, there is no branch capability in the replay: the only possible results are {Success, Failure, Continue}, and if that's the case and backtracking can't happen, it would be possible to augment snapshots with continuation state. To know this you'd need to find someone who was around at the time or implemented this functionality, because they'll know what the limiting factor was; there could well be a hidden requirement I can't divine from the code.
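
A back-of-envelope model of the replay cost described above (my arithmetic, not SqlClient code; the constants are the 5 MiB value and default packet size from this thread):

```csharp
// Hypothetical model of the quadratic replay cost - not SqlClient code.
const long dataSize = 5L * 1024 * 1024;   // 5 MiB column value
const long packetSize = 8 * 1024;         // ~8 KB default packet payload

long packets = (dataSize + packetSize - 1) / packetSize;  // 640 packets

// Each arriving packet triggers a full snapshot replay, and every replay
// allocates a fresh buffer big enough for the whole value; all but the
// last are thrown away.
long droppedBytes = (packets - 1) * dataSize;             // ~3.1 GiB

Console.WriteLine($"{packets} packets, " +
    $"{droppedBytes / (1024.0 * 1024 * 1024):F1} GiB of dropped buffers");
```

That ~3 GiB of dead allocation for a single 5 MiB read is in the same ballpark as the multi-gigabyte async numbers reported later in this thread.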

@roji
Member Author

roji commented Oct 7, 2019

Thanks for the detailed info @Wraith2, it's indeed quite interesting and raises some questions.

A few immediate comments:

  • As I wrote above, the Core version seems to be a regression compared to the Framework version - at least at some point, according to this SO, async sequential access was apparently fast. It may be a good idea to run the benchmark above on .NET Framework to confirm, and then to see what's being done differently in the Framework version.
  • At least for byte arrays (blobs), things really should be quite simple. Assuming values are length-prefixed (which seems to be the case according to your description), a byte array of the suitable size should simply be allocated and filled gradually, with no extra memory overhead or needless copying. I understand the current architecture is very different; I'm just stating the obvious about how things should work.
  • Strings, on the other hand, are considerably more complicated, as decoding has to take place. In Npgsql, if the string is really large, an extra byte array is allocated for the (byte) length of the string being read. Like with byte arrays, it's simply filled gradually and decoded once at the end. This is probably not ideal (at minimum the temporary byte array should be pooled), but I doubt this API could function much better for long strings (where GetTextReader is better suited). Here's the source code.
  • Am I understanding you correctly that reader state is manually snapshotted/managed for async operations? That would seem to indicate that the compiler's async/await feature isn't used in SqlClient? The whole point of that feature is to "snapshot" the local variables in the async function and restore them later when the callback completes... I'm just curious.
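
The gradual-fill approach from the second and third bullets could be sketched roughly like this (hypothetical C#, not SqlClient's or Npgsql's actual internals; ReadPacketAsync and the packet shape are invented names for illustration):

```csharp
// Hypothetical sketch of incremental PLP reading: allocate once, fill as
// packets arrive, decode once at the end.
async Task<string> ReadLargeStringAsync(int byteLength, Encoding encoding)
{
    var buffer = new byte[byteLength];   // single allocation up front
    int offset = 0;
    while (offset < byteLength)
    {
        // Await the next packet payload; the continuation state is just
        // 'buffer' and 'offset', so nothing already copied is re-processed.
        byte[] packet = await ReadPacketAsync();
        int toCopy = Math.Min(packet.Length, byteLength - offset);
        Array.Copy(packet, 0, buffer, offset, toCopy);
        offset += toCopy;
    }
    return encoding.GetString(buffer);   // decode exactly once
}
```

The point is that the work per packet is constant, instead of replaying and reallocating the whole value each time.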

@Wraith2
Contributor

Wraith2 commented Oct 7, 2019

I suspect the first iteration of async support will have used sync-over-async, in which case you consume an additional thread but get identical (fairly good) performance to the sync version. I'd have to find an original version from that time and dive into the IL to verify that, and I'm not sure it helps to do so; I'd just accept that perf used to be better and get on with improving it for the future.

All the variable-length fields - varchar, varbinary, image, ntext etc. - seem to go down the same route to get the bytes, and then if needed they're transformed once the entire contents are known. That places an upper limit on the size of those fields, which is worked around with stream capability like GetTextReader, as you mentioned, bypassing buffering the whole thing; that's the non-prefixed variable length I mentioned.

Am I understanding you correctly that reader state is manually snapshotted/managed for async operations?

Yes. But the snapshot is present for good reasons, because some things like attn and envchange need to be processed on the real current state even when received in a snapshot context. There's plenty of async closure stuff going on in the library, and that might even be part of the problem here.

If I squint a bit I can sort of see how this can be corrected to be a linear operation. It's going to get a bit messy, because the snapshot functionality is integrated with a couple of different layers, but there's no doubt it's worth doing. I would like some confirmation of the no-backtracking possibility when processing a PLP chunk stream, though.

@cheenamalhotra cheenamalhotra added the 📈 Performance Issues that are targeted to performance improvements. label Oct 7, 2019
@Wraith2
Contributor

Wraith2 commented Oct 13, 2019

There's a pretty simple way to change the behaviour.

Before (restated for comparison, so the numbers are from the same machine):

| Method | Behavior | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |--------- |-----:|------:|-------:|------:|------:|------:|----------:|
| Async | Default | 448.84 ms | 8.9453 ms | 8.7855 ms | 309000.0000 | 309000.0000 | 309000.0000 | 3.98 KB |
| Sync | Default | 17.46 ms | 0.1025 ms | 0.0959 ms | 406.2500 | 406.2500 | 406.2500 | 10244.82 KB |
| Async | SequentialAccess | 458.23 ms | 9.0577 ms | 10.4309 ms | 310000.0000 | 310000.0000 | 310000.0000 | 3.98 KB |
| Sync | SequentialAccess | 17.38 ms | 0.0638 ms | 0.0596 ms | 406.2500 | 406.2500 | 406.2500 | 10244.82 KB |

After:

| Method | Behavior | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |--------- |-----:|------:|-------:|------:|------:|------:|----------:|
| Async | Default | 130.98 ms | 1.2547 ms | 1.1737 ms | 750.0000 | 500.0000 | 250.0000 | 3.95 KB |
| Sync | Default | 17.38 ms | 0.1107 ms | 0.1036 ms | 406.2500 | 406.2500 | 406.2500 | 10244.82 KB |
| Async | SequentialAccess | 131.20 ms | 0.7910 ms | 0.7399 ms | 750.0000 | 500.0000 | 250.0000 | 3.95 KB |
| Sync | SequentialAccess | 17.35 ms | 0.0806 ms | 0.0754 ms | 406.2500 | 406.2500 | 406.2500 | 10244.82 KB |

It's not a very nice solution, because it breaks encapsulation by making the PLP data read aware of the snapshot and using it to store the target buffer between attempts to complete the execution. 3x faster, 400 times less GC. It's still almost 10x slower than the sync version, but there's no way to fix that.

At the moment it performs a complete attempt to fill the buffer on every packet received, and stores all received packets in a queue until the command succeeds or is cancelled. The good solution is to change the entire way async behaves and make it store continuation progress in the intermediate buffer before returning. This would decrease memory usage by removing the packet queue and increase speed by removing CPU wasted on continual replays. It's a major task that needs agreement and planning officially.

Perf will also be improved by #198

@cheenamalhotra
Member

Ack: PRs for perf improvement are on hold for review for a while, till we get the driver stabilized for breaking issues.

@saurabh500
Contributor

@Wraith2 can you share your project that you used to benchmark this issue?

I had a chat with @roji and we are seeing worse benchmarks than were originally reported.

| Method |         Behavior |       Mean |    Error |    StdDev |     Median |       Gen 0 |       Gen 1 |       Gen 2 |  Allocated |
|------- |----------------- |-----------:|---------:|----------:|-----------:|------------:|------------:|------------:|-----------:|
|  Async |          Default |   961.5 ms | 35.20 ms | 103.80 ms | 1,007.5 ms | 105000.0000 | 104000.0000 | 104000.0000 | 3661.01 MB |
|   Sync |          Default |   893.9 ms |  3.09 ms |   2.89 ms |   893.8 ms |           - |           - |           - |      15 MB |
|  Async | SequentialAccess | 1,888.8 ms | 42.94 ms | 126.60 ms | 1,929.5 ms | 173000.0000 | 172000.0000 | 172000.0000 | 3525.79 MB |
|   Sync | SequentialAccess | 1,878.2 ms | 37.43 ms |  71.21 ms | 1,865.5 ms |           - |           - |           - |      15 MB |

We are wondering how the original results were obtained. @roji is going to backtrack to the version of SqlClient that he may have run the original benchmarks on.

@Wraith2, in the meanwhile, can you provide your repro project? In case this is a regression in the version of SqlClient we have been testing with, we need to remove the regression first.

The repos we have used are https://github.com/roji/SqlClientAsyncBenchmark
and
https://github.com/saurabh500/sqlclientrepros/tree/master/asyncperf

@saurabh500
Contributor

saurabh500 commented Nov 6, 2019

This was run on Windows as well as Linux. The benchmarks show 3.6 gigs of allocation for async :(
I have tried running against local and remote SQL Server with similar results.

@roji
Member Author

roji commented Nov 6, 2019

It isn't the version of SqlClient - it seems to be a difference between BenchmarkDotNet 0.11.5 and the recently-released 0.12.0. I'll investigate a bit more deeply to understand which one is reporting correct results.

@roji
Member Author

roji commented Nov 6, 2019

Here we go, from the release notes for BDN 0.12.0:

Improved MemoryDiagnoser
Now MemoryDiagnoser includes memory allocated by all threads that were live during benchmark execution: a new GC API was exposed in .NET Core 3.0 preview6+. It allows to get the number of allocated bytes for all threads. #1155, #1153, #723

I'll reach out to the BDN folks, but I'd tend to trust the new results by default unless we find out otherwise...

@Wraith2
Contributor

Wraith2 commented Nov 6, 2019

I noted in my PR #285 that BDN wasn't showing the correct memory figures, and used dotMemory results to show the difference made by my changes. The new numbers you give look much more reasonable.

@roji
Member Author

roji commented Nov 6, 2019

Makes sense now, thanks @Wraith2!

@hallidev

hallidev commented Dec 5, 2020

@Wraith2 Sorry for posting on an old issue but my rabbit hole led here. The last benchmark you mentioned here states that async runs ~10x slower than sync. Is this still the case? Has this been addressed in another issue in this repo or if not, is any work planned?

@Wraith2
Contributor

Wraith2 commented Dec 5, 2020

Yes, this is still the case. It can't be addressed without deep changes to the library. At the moment I'm working on consolidating the netcore and netfx codebases so that we don't have to work on two different libraries at once, which will hopefully free up development time overall.

As for plans: I'm an external contributor and know only what is posted on this repo publicly. I plan to work on async, but it's going to be a slow project once I get started, and I won't be starting anytime soon. What the owning team intends, I have no idea. Stability and new features seem to be the current goals.

@hallidev

hallidev commented Dec 5, 2020

Understood. Thanks for your response and contributions to such a critical library.

@enkelmedia

I still see the same problem on 3.1.1. I can't understand - the issue is closed, but when was it fixed?

@Wraith2
Contributor

Wraith2 commented Nov 3, 2021

This individual issue was sped up with a cache. The larger issue is still open under #593

@KonnectSalon

This is still a major issue.

I have been troubleshooting spikes in CPU and slow SQL statements only to come to this thread.

GetClaimsAsync in Microsoft.AspNetCore.Identity calls SQL that has a table of nvarchar(max) for claims and values.

Changing the table to nvarchar(255) seems to resolve the issue.

This issue is having a knock-on effect on other libraries :(
