Update SslOverTdsStream #541
Conversation
```csharp
{
    ValueTask valueTask = _stream.WriteAsync(buffer, cancellationToken);
    if (!valueTask.IsCompletedSuccessfully)
    {
        await valueTask.AsTask().ConfigureAwait(false);
    }
}
```
When using a ValueTask you must be careful to get either the result or the task only once. To enforce this I isolate them in an additional scope, so that if anyone tries to use them again they will get a compile error because the variable is out of scope.
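(For illustration, a minimal self-contained sketch of that scoping idea; the `ScopedWriteExample` wrapper and its field are hypothetical stand-ins, not the actual SslOverTdsStream code.)

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

internal sealed class ScopedWriteExample
{
    private readonly Stream _stream;
    public ScopedWriteExample(Stream stream) => _stream = stream;

    public async Task WriteAsync(ReadOnlyMemory<byte> buffer, CancellationToken cancellationToken)
    {
        {
            // The ValueTask is confined to this block and consumed exactly once.
            ValueTask valueTask = _stream.WriteAsync(buffer, cancellationToken);
            if (!valueTask.IsCompletedSuccessfully)
            {
                await valueTask.AsTask().ConfigureAwait(false);
            }
        }
        // valueTask.AsTask();  // would not compile: valueTask is out of scope here
    }
}
```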
Why is .AsTask() needed here? Since the result of the ValueTask is never used more than once, shouldn't it just be await valueTask.ConfigureAwait(false); instead? The same applies to the other places that use await xxx.AsTask().ConfigureAwait(false).
@yyjdelete see the discussion below in #541 (comment).
I haven't tested whether it affects perf, but AsTask() may need to create a new Task from the IValueTaskSource if the ValueTask wraps an IValueTaskSource instead of a Task (NetworkStream already uses IValueTaskSource on netcoreapp where possible).
I don't think there's any need to have AsTask. It's perfectly fine to directly await a ValueTask.
Should I await the ValueTask or call AsTask() for the awaiter? Not sure I really care. Sharplab shows that it makes a single IL instruction of difference (which admittedly is a callvirt), and since I'm not going to be able to sensibly measure the perf of this in situ, I'm not going to worry about it unless a profile shows that I need to.
As noted in #541 (comment) I used the pattern I've seen used in corefx and documented in https://devblogs.microsoft.com/dotnet/understanding-the-whys-whats-and-whens-of-valuetask/ because I'm not trying to make perf gains here, but there's a pattern that can be used casually to make perf good by default, so I used it. @roji noted that this only perhaps saves 7ns, which is fine with me if the library owners are ok with it; if not, they'll request the simplified version.
So if you want to micro-optimize the hell out of this for fun then go ahead, but I'm not sure it's helpful to do it in this PR. If you're looking for stuff to improve managed performance then it'd be good if you would look over one of my other PRs, which has been open since January with no movement: #389. It contains interesting packet recycling mechanics and an entirely measurable 7% overall read perf increase. For general performance, the removal of a lot of allocations on async paths using SqlDataReader is in #528, which could use some careful review, and if you're feeling like a really deep dive into the library then I opened #533 to track the cause of the 2nd and 4th most allocated objects in all async reads.
```csharp
int readCount;
{
    ValueTask<int> remainderReadValueTask = _stream.ReadAsync(buffer.Slice(0, wantedCount), cancellationToken);
```
Am curious... What is gained by the split logic for IsCompletedSuccessfully? Why not just always await?
It saves a call into the state machine, which can avoid a lot of work. It's a common pattern on hot paths, and since this is internal and will be required for all network I/O I thought I might as well include it. You can see it in https://devblogs.microsoft.com/dotnet/understanding-the-whys-whats-and-whens-of-valuetask/
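(For readers unfamiliar with it, a minimal generic sketch of that fast-path pattern; the helper name and signature here are hypothetical, not taken from the PR.)

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

internal static class FastPathRead
{
    // If the operation already completed synchronously, take the result directly;
    // only fall back to awaiting when the ValueTask is genuinely still pending.
    public static async ValueTask<int> ReadWithFastPathAsync(
        Stream stream, Memory<byte> buffer, CancellationToken cancellationToken)
    {
        ValueTask<int> valueTask = stream.ReadAsync(buffer, cancellationToken);
        return valueTask.IsCompletedSuccessfully
            ? valueTask.Result
            : await valueTask.ConfigureAwait(false);
    }
}
```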
I'm familiar with the pattern, but only when the async path has some additional logic that you want to skip in the sync path. For example, in the blog post you point to, the async path contains a RegisterCancellation invocation which should indeed be skipped if the operation completes synchronously - but here there's no such thing. Note also how Stephen explicitly recommends developers use this pattern "hopefully only after measuring carefully and finding it provides meaningful benefit".
Out of curiosity, I went ahead and measured the difference and got this:
BenchmarkDotNet=v0.12.0, OS=ubuntu 19.10
Intel Xeon W-2133 CPU 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.2.20176.6
[Host] : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
DefaultJob : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
| Method      | Mean     | Error    | StdDev   | Median   |
|-------------|----------|----------|----------|----------|
| SplitLogic  | 48.28 ns | 0.976 ns | 1.631 ns | 47.33 ns |
| JustAwaitIt | 55.31 ns | 0.047 ns | 0.042 ns | 55.31 ns |
Benchmark code:

```csharp
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    [Benchmark]
    public async ValueTask<int> SplitLogic()
    {
        var valueTask = SomeSyncReturningAsyncMethod();
        return valueTask.IsCompletedSuccessfully
            ? valueTask.Result
            : await SomeSyncReturningAsyncMethod();
    }

    [Benchmark]
    public async ValueTask<int> JustAwaitIt()
        => await SomeSyncReturningAsyncMethod();

    ValueTask<int> SomeSyncReturningAsyncMethod() => new ValueTask<int>(8);

    static void Main(string[] args)
        => BenchmarkRunner.Run<Program>();
}
```
I personally wouldn't complicate the code with this pattern for a 7ns improvement - not in this project anyway, where it's almost sure to be lost in overall perf - but of course that's up to you and the SqlClient people to decide on. In any case, when working on any kind of micro-optimization like this one, it's really a good idea to have a BenchmarkDotNet suite to verify that there's meaningful improvement.
Thanks for the numbers. It is a tiny difference in timings. I view this part of the codebase as both performance critical and low level, because it blocks any better performance higher in the pipeline, so as you've seen I chose to go with the more complex and less maintainable long form. I've no problem changing it if someone strongly desires it.
This part of the codebase does indeed seem quite critical for perf and what you're doing seems important (am lacking context here and am just looking at the PR code). My point is only that specifically for this ValueTask pattern, I'd only introduce this kind of code when a functional benchmark shows it has a meaningful impact. My benchmark above isn't that - I'd be extremely surprised if this could be shown to have any impact on actual SqlClient code executing a command (and code complexity does have a price). Up to you guys.
Here are a few more comments. I've looked a bit more at the PR and wanted to share some thoughts.
The code being replaced doesn't really do sync-over-async - unless I'm mistaken, GetAwaiter().GetResult() is only getting called when the async parameter is false, which causes only sync I/O operations down the stack. In other words, although the methods are async in the compiler sense, they always complete synchronously (when async=false). I'm mentioning this since actual sync-over-async (where methods complete asynchronously) is a really dangerous thing to do, leading to various deadlock and starvation issues - but that's not the case here.
The pattern of passing an async flag down the stack is quite common, and is used to prevent exactly the code duplication which this PR introduces by splitting the code up. Now, technically it's true that synchronous code pays some overhead with this "unified sync/async" pattern, since a state machine is needlessly used. However, the perf of async methods which complete synchronously has received a lot of attention in recent times, and the actual perf impact is probably negligible. FWIW the entire Npgsql driver is written in this way: async methods accepting an async flag, with GetAwaiter().GetResult() used for the sync wrapper APIs, and performance is really quite good. I benchmarked this at some point, and while 2-3 years ago duplicating the code base for sync/async made a lot of sense, that usually isn't the case today.
Bottom line, I'd avoid doing this kind of micro-optimization - which causes a lot of code duplication - without thorough benchmarking and hard numbers to show that it improves perf in a meaningful way.
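(For context, a minimal sketch of the "async flag" shape being described; the names here, such as ReadInternalAsync, are hypothetical stand-ins and not the actual SqlClient or Npgsql code.)

```csharp
using System.IO;
using System.Threading;
using System.Threading.Tasks;

internal static class UnifiedIo
{
    // One implementation backs both the sync and async public surface. When
    // async is false only synchronous stream calls are made, so the returned
    // task is always already completed by the time the caller sees it.
    internal static async Task<int> ReadInternalAsync(
        Stream stream, byte[] buffer, int offset, int count,
        bool async, CancellationToken cancellationToken)
    {
        return async
            ? await stream.ReadAsync(buffer, offset, count, cancellationToken).ConfigureAwait(false)
            : stream.Read(buffer, offset, count);
    }

    // Sync wrapper: because async=false guarantees synchronous completion,
    // GetAwaiter().GetResult() only unwraps the result and never blocks.
    internal static int Read(Stream stream, byte[] buffer, int offset, int count)
        => ReadInternalAsync(stream, buffer, offset, count, async: false, CancellationToken.None)
            .GetAwaiter().GetResult();
}
```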
```csharp
{
    public override int Read(byte[] buffer, int offset, int count)
    {
        if (_encapsulate)
```
Nit: I'd invert this condition to exit the function early if _encapsulate is false, reducing the indentation of the entire body for the interesting case.
Encapsulation is only going to be enabled for a single packet in the lifetime of a connection. This is another case where I'd be quite happy to do silly things to get the absolute fastest check and call for the common case, which is passthrough. Since I hadn't measured that particular perf I wrote it the same way as the original. Do you think there's any particular perf merit one way or the other? Ideally this is a case where it'd be nice to have something like a tail call happen if we end up just passing through to the _stream implementation, but that'd be a JIT thing.
Oh no, I wasn't referring to any perf-related point here - only a code styling suggestion. Basically you have the main method body in an if, and then a tiny else that just does return. In these cases I tend to invert the condition, to get "rid" of the else block early.
So instead of:

```csharp
if (_encapsulate) {
    // Long block
} else {
    return await _stream.ReadAsync(buffer, offset, count, cancellationToken);
}
```

I tend to write:

```csharp
if (!_encapsulate)
    return await _stream.ReadAsync(buffer, offset, count, cancellationToken);

// Long block
```

This makes the code a bit easier to follow and reduces an indentation level for the long block. Just a style suggestion.
```csharp
    remaining = remaining.Slice(dataLength);
}
if (packetBuffer != null)
```
I'd place this in a finally clause, to also return the buffer in case of exceptions. IIRC this isn't an actual memory leak (unreturned buffers do get reclaimed by the GC), but this would be more correct (and allow buffers to get reused even in case of an exception).
This is just me being cautious. I tend to be quite security conscious, which leads me to handle exceptions as catastrophic. In this case, if an error has occurred I don't want to try to rescue the buffer back into the pool, because I don't know whether the error occurred because someone was messing with the buffer. If we hit an exception, just throw everything away as dirty.
If there's a particular preference for a try-finally block just say and I'll change it.
I'm not sure why an exception would be more related to a security concern than a non-exceptional flow - in both cases we've read some SQL Server TDS data into the buffer, either it's OK to return it as-is to the pool, or it isn't... Note that ArrayPool.Return has an option for clearing the buffer (off by default) if that's something you want to do. But I'd think more broadly about what kind of security policy/guarantees you want to provide.
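(For what it's worth, a small sketch of the finally-plus-clearArray shape being suggested; the helper and names are hypothetical, not the PR code.)

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

internal static class PooledReadExample
{
    public static async Task<int> ReadIntoPooledBufferAsync(
        Stream stream, int packetSize, CancellationToken cancellationToken)
    {
        byte[] packetBuffer = ArrayPool<byte>.Shared.Rent(packetSize);
        try
        {
            return await stream.ReadAsync(packetBuffer, 0, packetSize, cancellationToken)
                .ConfigureAwait(false);
        }
        finally
        {
            // Return the buffer even if the read throws; clearArray wipes the
            // contents before the array goes back into the pool.
            ArrayPool<byte>.Shared.Return(packetBuffer, clearArray: true);
        }
    }
}
```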
I thought it required a second thread to wake it. If you end up with a wait you're using a SemaphoreSlim inside the task which needs to be signalled, and that signal can only come from another thread. You usually have other threads so this isn't typically a problem, but we've had several bug reports here recently of people opening 100 connections simultaneously, and doing that overwhelms the threadpool unless you've presized it. So perf is good when you have a waker available, but when you're thread constrained this approach is slightly better.

Similarly, the state machine alloc on the sync path isn't a big problem but it isn't needed; splitting the implementations removes it at relatively low cost. The better part is that the span based overloads have been added, which will allow upstream callers to be written with ValueTasks and fewer reallocations.

I acknowledge your point about duplication. This is both more code and more implementations of the same algorithm. However, it's not a piece of code that should need attention very often. If I were doing this higher up at the TdsParser or SqlConnection level I'd have preferred the shared implementation the same way you would, and not tried to change it without strong evidence. Down here at the SNI level I feel more comfortable with the duplication and absolute best performance. If you look at the header setup function there's a netcore JIT optimization trick already in there which will eliminate some bounds checks; the performance this gains will be minuscule, but it's not harmful even if it is less readable (definitely less logical), so that tiny bit of extra performance got included. I treat the SNI level more like a hardware driver, and since any perf lost at that level can't be regained in a complexity/readability trade-off I'm more likely to err on the side of complexity. We'll have to see what the MS team think.

You're right, and we just have slightly different perspectives on it.
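(To make the split concrete, a rough sketch of the shape being described, assuming the Memory-based Stream overloads available on netcoreapp; this PassthroughStream is a simplified, hypothetical stand-in and not the actual SslOverTdsStream.)

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// The sync path is plain synchronous code (no async state machine), while the
// async path exposes the Memory-based ValueTask overload for upstream callers.
internal sealed class PassthroughStream : Stream
{
    private readonly Stream _inner;
    public PassthroughStream(Stream inner) => _inner = inner;

    public override int Read(byte[] buffer, int offset, int count)
        => _inner.Read(buffer, offset, count);          // pure sync, no state machine

    public override ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken cancellationToken = default)
        => _inner.ReadAsync(buffer, cancellationToken); // ValueTask-based async path

    // Remaining Stream members are simple passthrough/unsupported boilerplate.
    public override bool CanRead => _inner.CanRead;
    public override bool CanSeek => false;
    public override bool CanWrite => _inner.CanWrite;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() => _inner.Flush();
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => _inner.Write(buffer, offset, count);
}
```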
If an async method completes synchronously, i.e. never yielding, then no threads are involved and the method behaves pretty much as if it hadn't been async at all (at least from the point of view of threading etc.). So as long as you guarantee that your async method doesn't actually perform an async operation anywhere (when the async flag is false), it should be completely safe.
If someone opens 100 connections on the threadpool synchronously, then yeah, they're going to get starvation effects as the threadpool scales up. But that has nothing really to do with the change you're proposing - a synchronous operation is a synchronous operation regardless of whether it's invoked from a regular method or from an async method which accepted an async=false flag. In other words, the problem is with the user code itself, which is overloading the threadpool by blocking TP threads synchronously.
If the async method completes synchronously, there is no heap allocation - the state machine is on the stack and not on the heap (see this blog post). I do understand your desire to squeeze the absolute best performance out of this important piece of code. However, code complexity and duplication have a real cost, even if you think this component isn't touched very frequently. I feel particularly strongly that any such micro-optimization should be accompanied by benchmarking, otherwise there's a real risk that you're just introducing code complexity and duplication while getting pretty much nothing in return. My intuition tells me that splitting these methods in two just to avoid the state machine isn't worth it.
Also, if the goal here is to improve SqlClient perf, then it's extremely important to start by profiling some common scenarios to see where the actual bottlenecks are, and attack those. Optimizing components without being guided by actual profiling isn't a good idea...
Yes. I know. Try explaining that to the users and see how far it gets you.
Yes. I've been doing that for a while now. I don't pick random things to change. Everything I've changed is from a real scenario that I've performance traced. In this case it's from the DataAccessPerformance benchmark that I was pointed to by @saurabh500 when I started working on perf for this library. Other scenarios have been provided by users reporting bad perf or by my own use of the library. I don't understand why this relatively minor and obscure change seems to be such a problem.
Unless my mental model is wrong, this PR doesn't make it any less likely to block, in any path (if I'm wrong I'd be happy to learn something new).
It's fully possible that I'm missing something, and there's some way this PR helps with any starvation issue - I'd just like to understand how.
Well, I'll say again that I'm just an external observer here, voicing my opinion. Neither you nor the SqlClient team have to listen :) However, my problem here is that the claimed benefits haven't been backed up by any measurements.
Don't get me wrong - I'd be very happy to be proven wrong and see that this does improve perf in some meaningful way, or that there's some sort of TP starvation scenario it helps with (nothing better than learning something new). But at this point this is just a PR with no measurements attached to it...
One last comment/question... I'm lacking context, but am I right in understanding that most of the relevant code here is only used during SSL handshake (i.e. while _encapsulate is true)? If so, is this trying to optimize only a setup phase which will only occur briefly at connection start time?
Yes, normally only the first packet on any new connection - and that's real internal pooled connection objects, not user-created objects - so it happens very infrequently. The main intention was to remove the sync-over-async calls and the unneeded state machine allocations on the sync paths.
So if you're expecting measurable perf increases from this PR I wouldn't expect any, and didn't intend any. What I wanted was to reduce threading contention, since we've already got users hitting that scenario, and to provide a way to use the task machinery more efficiently elsewhere later.
Thanks for the added details and clarifying the perf expectations.
Re starvation issues (which I think you're referring to as threading contention), I again don't see what this PR does to help. There's nothing wrong with GetAwaiter().GetResult() - as long as you know that the call completed synchronously (because you passed async=false). All this does is retrieve the return value, and if an exception occurred, raise it appropriately. This construct definitely has a bad reputation - and rightly so - because people erroneously use it without knowing that the call completed synchronously; this is the famous sync-over-async anti-pattern, which leads to so many issues (I just happened to see this today). There's simply nothing we can do in a library to help with that kind of user error. If you're not convinced by the above - take a quick look inside GetAwaiter().GetResult(); it's quite short and understandable, and assuming the ValueTask is already complete (which it must be if async=false), the value is simply returned immediately. I'd be happy to go into it with you if anything seems unclear.
Implementing the Span and Memory overloads in general is a good idea - and is also a good example of an optimization which (usually) doesn't reduce code quality (e.g. via duplication). However, if this stream implementation really is used only during handshake - so extremely infrequently - then any perf value here (e.g. ValueTask instead of Task) would be practically nil, and therefore probably not worth it IMHO.
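(A tiny sketch of that point about GetAwaiter().GetResult() on an already-completed operation; a MemoryStream is used purely to guarantee synchronous completion, and none of this is the driver's code.)

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

internal static class CompletedValueTaskExample
{
    public static int ReadSynchronously(MemoryStream stream, byte[] buffer)
    {
        // A MemoryStream read always completes synchronously, mirroring the
        // async=false case discussed above.
        ValueTask<int> read = stream.ReadAsync(buffer.AsMemory(), CancellationToken.None);

        // Guard the invariant: GetResult() on a completed ValueTask only unwraps
        // the value (or rethrows a stored exception); no thread is ever blocked.
        Debug.Assert(read.IsCompleted, "expected synchronous completion");
        return read.GetAwaiter().GetResult();
    }
}
```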
Encapsulation=true is only used for a single packet. The stream has to stay in place throughout the lifetime of the connection, so all packets after that first packet will still call into this class. If this could be swapped out it'd be much less important and would never show up in traces; that'd be great.
In any case, thanks for taking the time to listen and discuss all this.
Oh, I guess I misunderstood and it really is used only for a single packet during connection setup. In that case I agree with you that swapping it out would be the ideal way to go.
Nope, at least not intentionally. As mentioned above this has two purposes; while doing that I also addressed the possible issue of 0-returning read calls introducing infinite loops in pre-login encapsulation mode.
Hi @Wraith2, thanks!
Hi @Wraith2, I also gathered code coverage data for these new methods by running the Manual Tests, and it doesn't seem like these async methods are covered by the Manual Tests. However, I did see them covered in the Functional Tests. It would be better to see them covered in driver operations too.
They're covered in the functional tests because I added tests for them in #497, since I knew there's no way you'd let me rewrite this without coverage 😁. The library itself doesn't use them yet, but I hope to use them in future async work, allowing cleaner operation.
Ok, that definitely lowers the risk of these changes.
…tal buffers and change to early exit code flow
Feedback addressed and rebased to current.
@Wraith2, can you add an EventSource logger to the new class? The more data we capture, the easier it is to debug the code in future.
I can, but if you didn't have those events before then I don't think they're going to help you much in future. If you really want them, I'd prefer to limit them to the encapsulated paths and the switch from encapsulation to passthrough.
Fixed. I'm too used to the single-point-of-entry-and-exit style and missed returning from the early checks. @JRahnama, can you let me know if you still want tracing and if it's ok to add it only for encapsulation mode?
I think it would be useful to have it. Yes please.
…calls in SslOverTdsStream
Ok, added, along with a scope container to clean up the ugly try-finally blocks into a neat using. Take a look.
The SslOverTdsStream class is used in the managed networking implementation to handle the login-time handover from the first unencrypted packet to a fully SSL-secured stream. The current implementation uses sync-over-async:
SqlClient/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SNI/SslOverTdsStream.cs
Lines 54 to 55 in 6e09b54
SqlClient/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SNI/SslOverTdsStream.cs
Lines 63 to 64 in 6e09b54
and calling the current async ReadInternal and WriteInternal methods from sync code paths incurs state machine allocation overhead where it isn't needed. Removing these things will help reliability and perf a little for both the sync and async worlds, at the cost of slightly higher code complexity because of the reimplementation of the multiplexer logic.
This PR includes some logic that replaces work done in #335, so /cc @benadams for a quick review. Additional work on ValueTask upstream will be done in future PRs.