
Reading large fields asynchronously is extremely slow #245

Closed
roji opened this issue Oct 4, 2019 · 20 comments · Fixed by #285
Labels
📈 Performance Issues that are targeted to performance improvements.
Comments

@roji
Member

roji commented Oct 4, 2019

Reading a large (e.g. 5MB) string asynchronously with SqlDataReader is 10 times slower than doing it synchronously (see BDN benchmark below); at the same time, it allocates much less memory. The same behavior reproduces with both CommandBehavior.Default and CommandBehavior.SequentialAccess, although I suspect there may be some interaction with the command behavior under the hood. Memory-wise, the expectation would be for SequentialAccess to use much less memory than the default command behavior - regardless of sync/async.

The tests were run with .NET Core 3.0 on Ubuntu 19.04 against SQL Server running on localhost.

Related links:

Benchmark code:

[MemoryDiagnoser]
public class Benchmarks
{
    const string ConnectionString = "...";

    [Params(CommandBehavior.Default, CommandBehavior.SequentialAccess)]
    public CommandBehavior Behavior { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        using var conn = new SqlConnection(ConnectionString);
        conn.Open();

        using (var cmd = new SqlCommand("IF NOT EXISTS (SELECT * FROM sysobjects WHERE name='TextTable' AND xtype='U') CREATE TABLE [TextTable] ([Text] VARCHAR(MAX))", conn))
            cmd.ExecuteNonQuery();

        using (var cmd = new SqlCommand("INSERT INTO [TextTable] ([Text]) VALUES (@p)", conn))
        {
            cmd.Parameters.AddWithValue("p", new string('x', 1024 * 1024 * 5));
            cmd.ExecuteNonQuery();
        }
    }

    [Benchmark]
    public async ValueTask<int> Async()
    {
        using var conn = new SqlConnection(ConnectionString);
        using var cmd = new SqlCommand("SELECT [Text] FROM [TextTable]", conn);
        await conn.OpenAsync();

        await using var reader = await cmd.ExecuteReaderAsync(Behavior);
        await reader.ReadAsync();
        return (await reader.GetFieldValueAsync<string>(0)).Length;
    }

    [Benchmark]
    public async ValueTask<int> Sync()
    {
        using var conn = new SqlConnection(ConnectionString);
        using var cmd = new SqlCommand("SELECT [Text] FROM [TextTable]", conn);
        conn.Open();

        using var reader = cmd.ExecuteReader(Behavior);
        reader.Read();
        return reader.GetFieldValue<string>(0).Length;
    }
}

Results:

BenchmarkDotNet=v0.11.5, OS=ubuntu 19.04
Intel Core i7-6700HQ CPU 2.60GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100
  [Host]     : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), 64bit RyuJIT
  DefaultJob : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), 64bit RyuJIT

| Method | Behavior | Mean | Error | StdDev | Median | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |--------- |-----:|------:|-------:|-------:|------:|------:|------:|----------:|
| Async | Default | 839.58 ms | 43.2838 ms | 127.6233 ms | 930.22 ms | 98000.0000 | 97000.0000 | 96000.0000 | 11.31 KB |
| Sync | Default | 61.07 ms | 1.0988 ms | 1.0278 ms | 60.80 ms | 333.3333 | 222.2222 | 222.2222 | 10600.7 KB |
| Async | SequentialAccess | 813.80 ms | 44.3059 ms | 130.6370 ms | 767.23 ms | 127000.0000 | 126000.0000 | 125000.0000 | 11.31 KB |
| Sync | SequentialAccess | 69.54 ms | 0.6815 ms | 0.6375 ms | 69.45 ms | 375.0000 | 250.0000 | 250.0000 | 10672.5 KB |
@Wraith2
Contributor

Wraith2 commented Oct 4, 2019

If the payload of the column crosses multiple packets, the data has to be buffered until the end of the data is found. That's going to cause repeated re-allocations. In the case where the full length of the data isn't known, it'll keep reallocating and copying into a new buffer. Bizarrely, the best way to read it may be to use sequential access mode with a stream, because that may keep only the current packet.

I'll have a tinker with it, but if I'm right about how it works, it would be a substantial change to alter the behaviour - if it's possible at all.
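
For reference, the "sequential access mode with a stream" approach mentioned above looks roughly like this with the public SqlDataReader API (a sketch reusing the table and column names from the benchmark; processing of each chunk is left as a placeholder):

```csharp
// Sketch: stream a large string column instead of materializing it at once.
using var conn = new SqlConnection(ConnectionString);
conn.Open();
using var cmd = new SqlCommand("SELECT [Text] FROM [TextTable]", conn);
using var reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess);
reader.Read();

using var text = reader.GetTextReader(0);   // decodes incrementally
var chunk = new char[8192];
int read;
while ((read = text.Read(chunk, 0, chunk.Length)) > 0)
{
    // process chunk[0..read) without ever holding the full 5 MB string
}
```

GetStream(ordinal) offers the same shape for binary columns.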

@roji
Member Author

roji commented Oct 5, 2019

If the payload of the column crosses multiple packets the data has to be buffered until the end of the data is found

That makes sense for non-sequential access, but isn't the whole point of sequential access to stream large values? In other words, it's expected for non-sequential access to be slow (no complaints), but not for SequentialAccess. Note that this is a regression from the benchmarked behavior of the .NET Framework driver (which I didn't confirm/rerun), where sequential access is fast.

In addition, this should all have nothing to do with sync vs. async... I'm guessing that under the hood there's a completely different implementation for these two paths, and the async one has the issue we're seeing.

@Wraith2
Contributor

Wraith2 commented Oct 5, 2019

That makes sense for non-sequential, but isn't the whole point of sequential access to stream large values? In other words,

I can't speak to intent. I can tell you how the code behaves when that option is set, though. The most relevant behaviour is that it maintains only the currently read column's SqlBuffer, clearing out previous ones as reads are done, so you can't step back. That does align with the purpose of reading large values. I don't remember the de-packetization chunk handling code being particularly specialized.

I'm setting up for some benching and I'll see what I can find and if there's any room for improvement.

@Wraith2
Contributor

Wraith2 commented Oct 6, 2019

Well, there's definitely a problem with how async reading of PLP data is done. It's not a simple one, though; it's deeply rooted in how async works. I'll explain what I've found so far, because this is as good a place as any to put the information for later reference.

A standard column value is of fixed length, which is easy to deal with. Non-fixed-width data has to be prefixed with its length and then possibly reconstructed from multiple chunks. There's also an unknown-length variant where the user just has to keep requesting data until they reach an end-of-data packet. We're dealing with the middle option: this data spans multiple packets and is formatted as partially length-prefixed (PLP) data with a known length.

Reading this in sync mode is pretty straightforward: you keep getting packets and pulling their contents into the output byte buffer at the current offset until you reach the end. Then you call the text encoding that corresponds to the column to convert your bytes into a string. Almost all of the memory overhead in sync mode is this string conversion. The only interruptions that are possible are internally generated.

In async mode things get complicated. Because any async method has to be interruptible by cancellation tokens and external events, you have to handle state storage until a resolution to the read has been reached. This mechanism is snapshots. All async operations create a snapshot of the current reader state as their first action. Data is then requested, and when received it is appended to the snapshot packet list and the snapshot is replayed to see if it can reach a resolution.

When you start reading the data from the first packet in the snapshot, the data is identified as being part of a larger array of known length. An array is allocated for the full length and an attempt is made to read all the data; this fails because we don't have all the data (most of the time). Once it fails, the current packet is consumable but not consumed, so we can't discard it - it's left in the snapshot and we wait for more data.

Hopefully you've just seen the problem. For every packet that arrives, all packets in the unresolved chain are re-processed in the snapshot replay, and that reprocessing allocates an array large enough for all the output every time. So divide the 5 MiB of data by the roughly 8K chunks of the default packet size, and you'll have that number minus one dropped 5 MiB arrays. The profile traces are quite impressive.

I've already got one PR in progress that optimizes the mechanics of snapshots, but it won't help here - this is a logical problem. What I'd like to understand is why a full replay on each arriving packet was chosen over keeping some additional state in the snapshot. As I see it, there is no branch capability in the replay: the only possible results are {Success, Failure, Continue}, and if that's the case and backtracking can't happen, it would be possible to augment snapshots with continuation state. To know this you'd need to find someone who was around at the time or implemented this functionality, because they'll know what the limiting factor was; there could well be a hidden requirement I can't divine from the code.
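
A back-of-envelope model of the replay cost described above (my arithmetic, not SqlClient code; the constants are the 5 MiB value and default packet size from this thread):

```csharp
// Hypothetical model of the quadratic replay cost - not SqlClient code.
const long dataSize = 5L * 1024 * 1024;   // 5 MiB column value
const long packetSize = 8 * 1024;         // ~8 KB default packet payload

long packets = (dataSize + packetSize - 1) / packetSize;  // 640 packets

// Each arriving packet triggers a full snapshot replay, and every replay
// allocates a fresh buffer big enough for the whole value; all but the
// last are thrown away.
long droppedBytes = (packets - 1) * dataSize;             // ~3.1 GiB

Console.WriteLine($"{packets} packets, " +
    $"{droppedBytes / (1024.0 * 1024 * 1024):F1} GiB of dropped buffers");
```

That ~3 GiB of dead allocation for a single 5 MiB read is in the same ballpark as the multi-gigabyte async numbers reported later in this thread.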

@roji
Member Author

roji commented Oct 7, 2019

Thanks for the detailed info @Wraith2, it's indeed quite interesting and raises some questions.

A few immediate comments:

  • As I wrote above, the Core version seems to be a regression compared to the Framework version - at least at some point, according to this SO, async sequential access was apparently fast. It may be a good idea to run the benchmark above on .NET Framework to confirm, and then to see what's being done differently in the Framework version.
  • At least for byte arrays (blobs), things really should be quite simple. Assuming values are length-prefixed (which seems to be the case according to your description), a byte array of the suitable size should simply be allocated and filled gradually, with no extra memory overhead or needless copying. I understand the current architecture is very different; I'm just stating the obvious about how things should work.
  • Strings, on the other hand, are considerably more complicated, as decoding has to take place. In Npgsql, if the string is really large, an extra byte array is allocated for the (byte) length of the string being read. Like with byte arrays, it's simply filled gradually and decoded once at the end. This is probably not ideal (at minimum the temporary byte array should be pooled), but I doubt this API could function much better for long strings (where GetTextReader is better suited). Here's the source code.
  • Am I understanding you correctly that reader state is manually snapshotted/managed for async operations? That would seem to indicate that the compiler's async/await feature isn't used in SqlClient? The whole point of that feature is to "snapshot" the local variables in the async function and restore them later when the callback completes... I'm just curious.
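
The gradual-fill approach from the second and third bullets could be sketched roughly like this (hypothetical C#, not SqlClient's or Npgsql's actual internals; ReadPacketAsync and the packet shape are invented names for illustration):

```csharp
// Hypothetical sketch of incremental PLP reading: allocate once, fill as
// packets arrive, decode once at the end.
async Task<string> ReadLargeStringAsync(int byteLength, Encoding encoding)
{
    var buffer = new byte[byteLength];   // single allocation up front
    int offset = 0;
    while (offset < byteLength)
    {
        // Await the next packet payload; the continuation state is just
        // 'buffer' and 'offset', so nothing already copied is re-processed.
        byte[] packet = await ReadPacketAsync();
        int toCopy = Math.Min(packet.Length, byteLength - offset);
        Array.Copy(packet, 0, buffer, offset, toCopy);
        offset += toCopy;
    }
    return encoding.GetString(buffer);   // decode exactly once
}
```

The point is that the work per packet is constant, instead of replaying and reallocating the whole value each time.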

@Wraith2
Contributor

Wraith2 commented Oct 7, 2019

I suspect the first iteration of async support will have used sync-over-async, in which case you consume an additional thread but get identical (fairly good) performance to the sync version. I'd have to find an original version from that time and dive into the IL to verify that, and I'm not sure it helps to do so; I'd just accept that perf used to be better and get on with improving it for the future.

All the variable-length fields - varchar, varbinary, image, ntext etc. - seem to go down the same route to get the bytes, and then if needed they're transformed once the entire contents are known. That places an upper limit on the size of those fields, which is worked around with stream capability like GetTextReader, as you mentioned, bypassing buffering the whole thing; that's the non-prefixed variable length I mentioned.

Am I understanding you correctly that reader state is manually snapshotted/managed for async operations?

Yes. But the snapshot is present for good reasons, because some things like attn and envchange need to be processed on the real current state even when received in a snapshot context. There's plenty of async closure stuff going on in the library, and that might even be part of the problem here.

If I squint a bit I can sort of see how this can be corrected to be a linear operation. It's going to get a bit messy, because the snapshot functionality is integrated with a couple of different layers, but there's no doubt it's worth doing. I would like some confirmation of the no-backtracking possibility when processing a PLP chunk stream, though.

@cheenamalhotra cheenamalhotra added the 📈 Performance Issues that are targeted to performance improvements. label Oct 7, 2019
@Wraith2
Contributor

Wraith2 commented Oct 13, 2019

There's a pretty simple way to change the behaviour.

Before (restated for comparison, so the numbers are from the same machine):

| Method | Behavior | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |--------- |-----:|------:|-------:|------:|------:|------:|----------:|
| Async | Default | 448.84 ms | 8.9453 ms | 8.7855 ms | 309000.0000 | 309000.0000 | 309000.0000 | 3.98 KB |
| Sync | Default | 17.46 ms | 0.1025 ms | 0.0959 ms | 406.2500 | 406.2500 | 406.2500 | 10244.82 KB |
| Async | SequentialAccess | 458.23 ms | 9.0577 ms | 10.4309 ms | 310000.0000 | 310000.0000 | 310000.0000 | 3.98 KB |
| Sync | SequentialAccess | 17.38 ms | 0.0638 ms | 0.0596 ms | 406.2500 | 406.2500 | 406.2500 | 10244.82 KB |

After:

| Method | Behavior | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |--------- |-----:|------:|-------:|------:|------:|------:|----------:|
| Async | Default | 130.98 ms | 1.2547 ms | 1.1737 ms | 750.0000 | 500.0000 | 250.0000 | 3.95 KB |
| Sync | Default | 17.38 ms | 0.1107 ms | 0.1036 ms | 406.2500 | 406.2500 | 406.2500 | 10244.82 KB |
| Async | SequentialAccess | 131.20 ms | 0.7910 ms | 0.7399 ms | 750.0000 | 500.0000 | 250.0000 | 3.95 KB |
| Sync | SequentialAccess | 17.35 ms | 0.0806 ms | 0.0754 ms | 406.2500 | 406.2500 | 406.2500 | 10244.82 KB |

It's not a very nice solution, because it breaks encapsulation by making the PLP data read aware of the snapshot and using it to store the target buffer between attempts to complete the execution. 3x faster, 400 times less GC. It's still almost 10x slower than the sync version, but there's no way to fix that.

At the moment it performs a complete attempt to fill the buffer on every packet received, and stores all received packets in a queue until the command succeeds or is cancelled. The good solution is to change the entire way async behaves and make it store continuation progress in the intermediate buffer before returning. This would decrease memory usage by removing the packet queue and increase speed by removing CPU wasted on continual replays. It's a major task that needs agreement and planning officially.

Perf will also be improved by #198

@cheenamalhotra
Member

Ack: PRs for perf improvement are on hold for review for a while, till we get the driver stabilized for breaking issues.

@saurabh500
Contributor

@Wraith2 can you share your project that you used to benchmark this issue?

I had a chat with @roji and we are seeing worse benchmarks than were originally reported.

| Method |         Behavior |       Mean |    Error |    StdDev |     Median |       Gen 0 |       Gen 1 |       Gen 2 |  Allocated |
|------- |----------------- |-----------:|---------:|----------:|-----------:|------------:|------------:|------------:|-----------:|
|  Async |          Default |   961.5 ms | 35.20 ms | 103.80 ms | 1,007.5 ms | 105000.0000 | 104000.0000 | 104000.0000 | 3661.01 MB |
|   Sync |          Default |   893.9 ms |  3.09 ms |   2.89 ms |   893.8 ms |           - |           - |           - |      15 MB |
|  Async | SequentialAccess | 1,888.8 ms | 42.94 ms | 126.60 ms | 1,929.5 ms | 173000.0000 | 172000.0000 | 172000.0000 | 3525.79 MB |
|   Sync | SequentialAccess | 1,878.2 ms | 37.43 ms |  71.21 ms | 1,865.5 ms |           - |           - |           - |      15 MB |

We are wondering how the original results were obtained. @roji is going to backtrack to the version of SqlClient that he may have run the original benchmarks on.

@Wraith2, in the meanwhile, can you provide your repro project? In case this is a regression in the version of SqlClient we have been testing with, we need to remove the regression first.

The repos we have used are https://github.com/roji/SqlClientAsyncBenchmark
and
https://github.com/saurabh500/sqlclientrepros/tree/master/asyncperf

@saurabh500
Contributor

saurabh500 commented Nov 6, 2019

This was run on Windows as well as Linux. The benchmarks show 3.6 gigs of allocation for async :(
I have tried running against local and remote SQL Server with similar results.

@roji
Member Author

roji commented Nov 6, 2019

It isn't the version of SqlClient - it seems to be a difference between BenchmarkDotNet 0.11.5 and the recently-released 0.12.0. I'll investigate a bit more deeply to understand which one is reporting correct results.

@roji
Member Author

roji commented Nov 6, 2019

Here we go, from the release notes for BDN 0.12.0:

Improved MemoryDiagnoser
Now MemoryDiagnoser includes memory allocated by all threads that were live during benchmark execution: a new GC API was exposed in .NET Core 3.0 preview6+. It allows to get the number of allocated bytes for all threads. #1155, #1153, #723

I'll reach out to the BDN folks, but I'd tend to trust the new results by default unless we find out otherwise...

@Wraith2
Contributor

Wraith2 commented Nov 6, 2019

I noted in my PR #285 that BDN wasn't showing the correct memory figures, and used dotMemory results to show the difference made by my changes. The new numbers you give look much more reasonable.

@roji
Member Author

roji commented Nov 6, 2019

Makes sense now, thanks @Wraith2!

@hallidev

hallidev commented Dec 5, 2020

@Wraith2 Sorry for posting on an old issue but my rabbit hole led here. The last benchmark you mentioned here states that async runs ~10x slower than sync. Is this still the case? Has this been addressed in another issue in this repo or if not, is any work planned?

@Wraith2
Contributor

Wraith2 commented Dec 5, 2020

Yes, this is still the case. It can't be addressed without deep changes to the library. At the moment I'm working on consolidating the netcore and netfx codebases so that we don't have to work on two different libraries at once, which will hopefully free up development time overall.

As for plans: I'm an external contributor and know only what is posted on this repo publicly. I plan to work on async, but it's going to be a slow project once I get started, and I won't be starting anytime soon. What the owning team intends, I have no idea. Stability and new features seem to be the current goals.

@hallidev

hallidev commented Dec 5, 2020

Understood. Thanks for your response and contributions to such a critical library.

@enkelmedia

I still see the same problem on 3.1.1. I can't understand - the issue is closed, but when was it fixed?

@Wraith2
Contributor

Wraith2 commented Nov 3, 2021

This individual issue was sped up with a cache. The larger issue is still open under #593

@KonnectSalon

This is still a major issue.

I have been troubleshooting spikes in CPU and slow SQL statements only to come to this thread.

GetClaimsAsync in Microsoft.AspNetCore.Identity calls SQL that has a table of nvarchar(max) for claims and values.

Changing the table to nvarchar(255) seems to resolve the issue.

This issue is having a knock-on effect on other libraries :(
