Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add partial packet detection and fixup #2714

Open
wants to merge 27 commits into
base: main
Choose a base branch
from

Conversation

Wraith2
Copy link
Contributor

@Wraith2 Wraith2 commented Jul 24, 2024

Split out from #2608 per discussion detailed in #2608 (comment)

Adds packet multiplexer and covering tests.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Jul 25, 2024

I've added comments to the Packet class as requested. The CI was green apart from some ubuntu legs which timed out, many other ubuntu legs succeeded so I don't see any direct inference on from that.

Ready for review @David-Engel @saurabh500 @cheenamalhotra

@Wraith2 Wraith2 marked this pull request as ready for review July 25, 2024 18:20
@saurabh500 saurabh500 self-requested a review July 29, 2024 19:29
@saurabh500
Copy link
Contributor

@Wraith2 We are reviewing this and hope to get faster traction towards EOW.
Wanted to give an update, instead of maintaining radio silence.

cc @VladimirReshetnikov

@cheenamalhotra
Copy link
Member

cheenamalhotra commented Aug 6, 2024

Pasting test failure for reference:

    Failed Microsoft.Data.SqlClient.ManualTesting.Tests.AsyncCancelledConnectionsTest.CancelAsyncConnections [2 m 38 s]
EXEC : error Message:  [/mnt/vss/_work/1/s/build.proj]
     Assert.Empty() Failure: Collection was not empty
  Collection: ["Microsoft.Data.SqlClient.SqlException (0x80131904)"···]
    Stack Trace:
       at Microsoft.Data.SqlClient.ManualTesting.Tests.AsyncCancelledConnectionsTest.RunCancelAsyncConnections(SqlConnectionStringBuilder connectionStringBuilder) in /_/src/Microsoft.Data.SqlClient/tests/ManualTests/SQL/AsyncTest/AsyncCancelledConnectionsTest.cs:line 66
     at Microsoft.Data.SqlClient.ManualTesting.Tests.AsyncCancelledConnectionsTest.CancelAsyncConnections() in /_/src/Microsoft.Data.SqlClient/tests/ManualTests/SQL/AsyncTest/AsyncCancelledConnectionsTest.cs:line 31
     at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
     at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr)

    Standard Output Messages:
   00:00:05.8665447 True Started:8 Done:0 InFlight:8 RowsRead:39 ResultRead:3 PoisonedEnded:1 nonPoisonedExceptions:0 PoisonedCleanupExceptions:0 Count:0 Found:0
   00:00:10.8624529 True Started:12 Done:0 InFlight:12 RowsRead:832 ResultRead:64 PoisonedEnded:6 nonPoisonedExceptions:6 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:00:15.8646242 True Started:17 Done:0 InFlight:17 RowsRead:2327 ResultRead:179 PoisonedEnded:11 nonPoisonedExceptions:9 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:00:20.8677772 True Started:42 Done:6 InFlight:36 RowsRead:4810 ResultRead:370 PoisonedEnded:18 nonPoisonedExceptions:14 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:00:25.8731904 True Started:71 Done:12 InFlight:59 RowsRead:9126 ResultRead:702 PoisonedEnded:30 nonPoisonedExceptions:29 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:00:30.8714979 True Started:77 Done:14 InFlight:63 RowsRead:12207 ResultRead:939 PoisonedEnded:38 nonPoisonedExceptions:36 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:00:35.0004685 True Started:86 Done:25 InFlight:61 RowsRead:17173 ResultRead:1321 PoisonedEnded:49 nonPoisonedExceptions:43 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:00:39.9987443 True Started:97 Done:64 InFlight:33 RowsRead:31798 ResultRead:2446 PoisonedEnded:64 nonPoisonedExceptions:62 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:00:44.9985663 True Started:97 Done:64 InFlight:33 RowsRead:31798 ResultRead:2446 PoisonedEnded:64 nonPoisonedExceptions:62 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:00:49.9982022 True Started:97 Done:64 InFlight:33 RowsRead:31798 ResultRead:2446 PoisonedEnded:64 nonPoisonedExceptions:62 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:00:54.9982968 True Started:97 Done:64 InFlight:33 RowsRead:31798 ResultRead:2446 PoisonedEnded:64 nonPoisonedExceptions:62 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:00:59.9996354 True Started:97 Done:64 InFlight:33 RowsRead:31798 ResultRead:2446 PoisonedEnded:64 nonPoisonedExceptions:62 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:01:04.9991460 True Started:97 Done:64 InFlight:33 RowsRead:31798 ResultRead:2446 PoisonedEnded:64 nonPoisonedExceptions:62 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:01:09.9983868 True Started:97 Done:64 InFlight:33 RowsRead:31798 ResultRead:2446 PoisonedEnded:64 nonPoisonedExceptions:62 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:01:14.9975925 True Started:97 Done:64 InFlight:33 RowsRead:31798 ResultRead:2446 PoisonedEnded:64 nonPoisonedExceptions:62 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:01:19.9977701 True Started:97 Done:64 InFlight:33 RowsRead:31798 ResultRead:2446 PoisonedEnded:64 nonPoisonedExceptions:62 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:01:25.0122289 True Started:97 Done:64 InFlight:33 RowsRead:31798 ResultRead:2446 PoisonedEnded:64 nonPoisonedExceptions:62 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:01:30.0025709 True Started:97 Done:64 InFlight:33 RowsRead:31798 ResultRead:2446 PoisonedEnded:64 nonPoisonedExceptions:62 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:01:35.0024237 True Started:98 Done:65 InFlight:33 RowsRead:32344 ResultRead:2488 PoisonedEnded:65 nonPoisonedExceptions:62 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:01:40.0025057 True Started:100 Done:98 InFlight:2 RowsRead:50297 ResultRead:3869 PoisonedEnded:98 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:01:45.0002633 True Started:100 Done:98 InFlight:2 RowsRead:50297 ResultRead:3869 PoisonedEnded:98 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:01:49.9986071 True Started:100 Done:98 InFlight:2 RowsRead:50297 ResultRead:3869 PoisonedEnded:98 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:01:54.9998736 True Started:100 Done:98 InFlight:2 RowsRead:50297 ResultRead:3869 PoisonedEnded:98 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:01:59.9957745 True Started:100 Done:98 InFlight:2 RowsRead:50297 ResultRead:3869 PoisonedEnded:98 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:02:04.9985369 True Started:100 Done:98 InFlight:2 RowsRead:50297 ResultRead:3869 PoisonedEnded:98 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:02:09.9982641 True Started:100 Done:98 InFlight:2 RowsRead:50297 ResultRead:3869 PoisonedEnded:98 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:02:14.9983408 True Started:100 Done:98 InFlight:2 RowsRead:50297 ResultRead:3869 PoisonedEnded:98 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:02:19.9988637 True Started:100 Done:98 InFlight:2 RowsRead:50297 ResultRead:3869 PoisonedEnded:98 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:02:25.0003251 True Started:100 Done:98 InFlight:2 RowsRead:50297 ResultRead:3869 PoisonedEnded:98 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:02:29.9988943 True Started:100 Done:98 InFlight:2 RowsRead:50297 ResultRead:3869 PoisonedEnded:98 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:02:35.0000752 True Started:100 Done:99 InFlight:1 RowsRead:50830 ResultRead:3910 PoisonedEnded:99 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   00:02:38.0752114 True Started:100 Done:100 InFlight:0 RowsRead:51376 ResultRead:3952 PoisonedEnded:100 nonPoisonedExceptions:63 PoisonedCleanupExceptions:0 Count:1 Found:0
   Microsoft.Data.SqlClient.SqlException (0x80131904): A severe error occurred on the current command.  The results, if any, should be discarded.
      at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlE

This test should be looked at carefully.

It failed on Ubuntu with .NET 6 and 8 , and also hung up on Windows when ran with Managed SNI, link to logs 1 link to logs 2.
In this use case, multiple parallel async read operations are being performed, which means connection isolation should be intact while cancellation occurs in between, but it doesn't seem to be happening.

@Wraith2 can you confirm if this is something you're able to repro in Windows with Managed SNI? Please make sure config file is configured to enable Managed SNI on Windows.

{
// Do nothing with callback if closed or broken and error not 0 - callback can occur
// after connection has been closed. PROBLEM IN NETLIB - DESIGN FLAW.
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a Debug Assert here and check if this is taking any hit?

Copy link
Member

@cheenamalhotra cheenamalhotra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test needs to be fixed, before reviewing any further.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Aug 6, 2024

Isn't this the set of tests that @David-Engel pointed out in #2608 (comment) ? If so we discussed it at length on the teams call. I don't believe that those tests are reliable.

Setup a breakpoint or Debug.WriteLine where an exception is added to the state object and run the test. You should find that an exception is always added to the state object but that the test will usually succeed. That should not be possible, an exception if added should be thrown. The test is missing failures and if that's the case then the test is unreliable.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Aug 6, 2024

When you work past the terrible code in SNITCPHandle and make the test run for long enough it settles into a steady state where it can't reach the end. There is no indication why yet.

00:05:45.3696502 True Started:97 Done:88 InFlight:9 RowsRead:241216 ResultRead:3769 PoisonedEnded:88 nonPoisonedExceptions:0 PoisonedCleanupExceptions:0 Count:0 Found:0
00:05:50.3638134 True Started:97 Done:88 InFlight:9 RowsRead:241216 ResultRead:3769 PoisonedEnded:88 nonPoisonedExceptions:0 PoisonedCleanupExceptions:0 Count:0 Found:0

those 9 in flight items just don't seem to complete but i don't know why.
This is going to need your help from MS side to identify what's going on here.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Aug 6, 2024

After a few more hours investigation I know what the problem is but I have no clue what change has caused it.

In SqlDataReader when an async method is called we use a context object to contain some state and pass that context object to all the async methods that are used to implement the async read machinery. Part of this state is the TaskCompletionSource.
When running the test CancelAsyncConnections many connections are opened and then SqlCommand.Cancel is called after a brief timed wait. If the timing of the cancel operation is exact then an async operation can be in progress and between packets at the time when the cancellation is executed.
This causes the thread awaiting the async operation to wait indefinitely for a task that will never be completed. This is what causes the stuck threads. The threads as stuck so the test can never complete and it then times out.

What I don't understand is how cancel is supposed to work. I'm unable to run the tests in native sni mode because the native sni can't be initialized (can't find the sni dll). So I can't compare the managed to unmanaged implementations here. I don't believe that I have made any change that should affect cancellation. I have verified that there are no partial packets in the state objects when the async tasks get stuck.

I don't understand how async cancellation is supposed to work at all.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Aug 10, 2024

Can someone with CI access rerun the failed legs? the failures are random or CI resources not being available as far as i can tell.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Aug 10, 2024

The current failures are interesting. They're in the test that was failing before but they new ones are only detected because i made the test more accurate.

  [xUnit.net 00:06:40.39]     Microsoft.Data.SqlClient.ManualTesting.Tests.AsyncCancelledConnectionsTest.CancelAsyncConnections [FAIL]
  [xUnit.net 00:06:40.39]       Assert.Empty() Failure: Collection was not empty
  [xUnit.net 00:06:40.39]       Collection: ["Microsoft.Data.SqlClient.SqlException (0x80131904)"···]
  [xUnit.net 00:06:40.39]       Stack Trace:
  [xUnit.net 00:06:40.39]         /_/src/Microsoft.Data.SqlClient/tests/ManualTests/SQL/AsyncTest/AsyncCancelledConnectionsTest.cs(71,0): at Microsoft.Data.SqlClient.ManualTesting.Tests.AsyncCancelledConnectionsTest.RunCancelAsyncConnections(SqlConnectionStringBuilder connectionStringBuilder)
  [xUnit.net 00:06:40.39]         /_/src/Microsoft.Data.SqlClient/tests/ManualTests/SQL/AsyncTest/AsyncCancelledConnectionsTest.cs(32,0): at Microsoft.Data.SqlClient.ManualTesting.Tests.AsyncCancelledConnectionsTest.CancelAsyncConnections()
  [xUnit.net 00:06:40.39]            at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
  [xUnit.net 00:06:40.39]            at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr)
  [xUnit.net 00:06:40.39]       Output:
  [xUnit.net 00:06:40.39]         00:00:05.4318805 True Started:21 Done:0 InFlight:21 RowsRead:117 ResultRead:9 PoisonedEnded:4 nonPoisonedExceptions:2 PoisonedCleanupExceptions:0 Count:1 Found:0
  [xUnit.net 00:06:40.39]         00:00:10.4374767 True Started:25 Done:0 InFlight:25 RowsRead:1469 ResultRead:113 PoisonedEnded:11 nonPoisonedExceptions:9 PoisonedCleanupExceptions:0 Count:1 Found:0
  [xUnit.net 00:06:40.39]         00:00:15.4529038 True Started:31 Done:1 InFlight:30 RowsRead:4732 ResultRead:364 PoisonedEnded:14 nonPoisonedExceptions:13 PoisonedCleanupExceptions:0 Count:1 Found:0
  [xUnit.net 00:06:40.39]         00:00:20.4568918 True Started:67 Done:12 InFlight:55 RowsRead:7852 ResultRead:604 PoisonedEnded:28 nonPoisonedExceptions:21 PoisonedCleanupExceptions:0 Count:1 Found:0
  [xUnit.net 00:06:40.39]         00:00:24.9990795 True Started:91 Done:32 InFlight:59 RowsRead:19955 ResultRead:1535 PoisonedEnded:43 nonPoisonedExceptions:35 PoisonedCleanupExceptions:0 Count:1 Found:0
  [xUnit.net 00:06:40.39]         00:00:28.1341854 True Started:100 Done:100 InFlight:0 RowsRead:52273 ResultRead:4021 PoisonedEnded:100 nonPoisonedExceptions:44 PoisonedCleanupExceptions:0 Count:1 Found:0
  [xUnit.net 00:06:40.39]         Microsoft.Data.SqlClient.SqlException (0x80131904): A severe error occurred on the current command.  The results, if any, should be discarded.
  [xUnit.net 00:06:40.39]            at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlE
    Failed Microsoft.Data.SqlClient.ManualTesting.Tests.AsyncCancelledConnectionsTest.CancelAsyncConnections [28 s]
EXEC : error Message:  [/mnt/vss/_work/1/s/build.proj]
     Assert.Empty() Failure: Collection was not empty

The previous version of the test accepted any exception when it was expecting a cancellation exception. It was passing on netfx with my previous changes because timeout exceptions were being thrown. I judged that accepting a timeout when we were supposed to be testing whether cancellation had occurred was not correct.

If we retained the previous version of the test then everything would have passed cleanly. In the current situation since the test completed correctly without hanging the result is equivalent to what we would have experienced in all test runs in the past, all started threads that we expected to be cancelled exited with an exception.

[edit]
I ran that single test locally in Net6 managed SNI mode using the vs "Run Until Failure" option. This runs the test up to 1000 times sequentially and stops if it fails. It completed 1000 runs successfully.

@David-Engel
Copy link
Contributor

@Wraith2 You might be banging your head against an unrelated issue in the driver. IIRC, the test was only introduced to ensure we don't regress "The MARS TDS header contained errors." issue. (The test code came from the repro.)

If you isolate your test changes and run them against main code, does it still fail? Yes, the correct exception is probably "Operation cancelled by user." where the exception is being caught. But if it's unrelated to your other changes, I would leave that part of the test as it was and file a new issue with repro code. As it is, it's unclear if and how this behavior is impacting users and I wouldn't hold up your perf improvements for it.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Aug 12, 2024

There was definitely a real problem. The results differed between main and my branch. I've solved that issue now and the current state is that we're seeing a real failure because I've made the test more sensitive. I think it's probably safe to lower the sensitivity of the test again now because the new test that I've added covers the specific scenario in the multiplexer that I had missed and everything else is pass. I'll try that and see how the CI likes it.

I think the current state on this branch is that it is as stable as live. We need to have confidence that this set of changes is correct before we can merge it. It's high risk and high complexity code. Even understanding it very deeply it has taken me a week to actively debug a very important behaviour change that I missed.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Aug 12, 2024

Can someone re-run the failed legs? the only failing test is something to do with event counters which I've been no-where near.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Aug 12, 2024

The failing test is EventCounter_ReclaimedConnectionsCounter_Functional. It's doing something with GC specific to net6. It's failing sporadically on net6 managed sni runs but not deterministically. I can't make it fail locally to trace what might be happening.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Aug 19, 2024

Any thoughts?

@David-Engel
Copy link
Contributor

David-Engel commented Aug 26, 2024

I'm not seeing the failures you mentioned in EventCounter_ReclaimedConnectionsCounter_Functional [in the CI results]. I mainly see fairly consistent failures of CancelAsyncConnections on Linux. It seems to pass on Windows managed SNI, so there might be something that is Linux/Ubuntu network specific. Can you run/debug the test against a local WSL or Docker instance?

@Wraith2
Copy link
Contributor Author

Wraith2 commented Aug 26, 2024

If i click through the failure i get to this page https://sqlclientdrivers.visualstudio.com/public/_build/results?buildId=95784&view=ms.vss-test-web.build-test-results-tab
image

The cancel tests are passing now, those failed in the previous runs but not the current ones.

@David-Engel
Copy link
Contributor

I think there is something wrong with the Tests tab. I don't usually reference it. I scroll down the summary tab to see what jobs had failures:

image

Then drill into the job and the task that failed to see the log:

image

@Wraith2
Copy link
Contributor Author

Wraith2 commented Aug 27, 2024

If it's AsyncCancelledConnectionsTest again then there isn't anything further I can do. That test is multithreaded and timing dependent. I've traced the individual packets through the entire call stack. I've run it for 1000 iterations successfully after fixing a reproducible error in it. If someone can isolate a reproducible problem from it then i'll investigate.

@Wraith2 Wraith2 requested a review from cheenamalhotra August 28, 2024 00:29
@David-Engel
Copy link
Contributor

I chatted with @saurabh500 and I just want to add that this is definitely something we all want to see get merged. It'll just take someone finding time (could take a few days dedicated time) to get their head wrapped around the new code and be able to help repro/debug to find the issue.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Aug 28, 2024

I'm happy to make myself available to talk through the code with anyone that needs it.

@saurabh500
Copy link
Contributor

@Wraith2 and @David-Engel I was looking at the lifecycle of the snapshots and something that stood out in NetCore vs NetFx is that SqlDataReader for NetCore is storing the cached snapshot with the SqlInternalConnectionTds which is a shared resource among all the SqlDataReader(s) running on a MARS connection.

private void PrepareAsyncInvocation(bool useSnapshot)
{
    // if there is already a snapshot, then the previous async command
    // completed with exception or cancellation.  We need to continue
    // with the old snapshot.
    if (useSnapshot)
    {
        Debug.Assert(!_stateObj._asyncReadWithoutSnapshot, "Can't prepare async invocation with snapshot if doing async without snapshots");

        if (_snapshot == null)
        {
            if (_connection?.InnerConnection is SqlInternalConnection sqlInternalConnection)
            {
                _snapshot = Interlocked.Exchange(ref sqlInternalConnection.CachedDataReaderSnapshot, null) ?? new Snapshot();
            }
            else
            {
                _snapshot = new Snapshot();
            }

This means that we are saving the reader snapshot on the shared resource, which can be overwritten by any other reader.
Also a reader can receive another reader's snapshot.

@Wraith2 have you had a chance to pursue this line of investigation for hanging test?

I wonder if the timing is causing the wrong cached snapshot to be provided to a SqlDataReader, causing data corruption and likely causing a hang.

@saurabh500
Copy link
Contributor

SqlInternalConnection.cs


#if NET6_0_OR_GREATER
        internal SqlCommand.ExecuteReaderAsyncCallContext CachedCommandExecuteReaderAsyncContext;
        internal SqlCommand.ExecuteNonQueryAsyncCallContext CachedCommandExecuteNonQueryAsyncContext;
        internal SqlCommand.ExecuteXmlReaderAsyncCallContext CachedCommandExecuteXmlReaderAsyncContext;

        internal SqlDataReader.Snapshot CachedDataReaderSnapshot;
        internal SqlDataReader.IsDBNullAsyncCallContext CachedDataReaderIsDBNullContext;
        internal SqlDataReader.ReadAsyncCallContext CachedDataReaderReadAsyncContext;
#endif

@saurabh500
Copy link
Contributor

saurabh500 commented Sep 4, 2024

@Wraith2 I see that you had made the changes in the first place. Can you try another PR where you remove the storage of these contexts and snapshots on SqlInternalConnection and with the multiplexing change, try to see if this solves the problem.

Also, I am Happy to be told that my theory is wrong, but I would like to understand how in MARS cases, the shared Cached contexts on InternalConnection is a safe design choice.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Dec 13, 2024

Just checking where this is up to. I see we've got a several new reviewers. If you want a teams call to talk through what I've done just let me know.

Copy link
Contributor

@David-Engel David-Engel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we've branched 6.0, this should be mergeable. We definitely want to get this in sooner than later for 7.0 preview 1. I'm good with the changes with the addition of the license header to one file and addressing conflicts. We will just need a second after that.

@David-Engel David-Engel dismissed cheenamalhotra’s stale review December 18, 2024 23:28

Changes have been addressed

@David-Engel David-Engel requested a review from a team December 19, 2024 00:02
@mdaigle
Copy link
Contributor

mdaigle commented Dec 20, 2024

Hi @Wraith2, thanks for the detailed writeups and design notes. I really appreciate your thorough approach to this. I'm working through this PR to add my review and I'd love to connect on teams soon to check my understanding. My microsoft alias is the same as my github alias.

Also, apologies if I missed some details on this somewhere, but you had a point in the original PR that partial packets were possible due to an "underlying incorrect assumption in the networking handling code". Can you share any additional details on this? It would help my understanding of how this change interacts with the networking code. Thanks!

@Wraith2
Copy link
Contributor Author

Wraith2 commented Dec 20, 2024

Email sent.

After spending more time in debugging for this PR I think that the problem was an oversight caused by not understanding the original code completely.
The original sync code that is the core of the library must be able to handle partial packets because if it couldn't then sync code would never work on linux because the occurrence of partial packets on linux is commonplace. This means that the sync code path doesn't need to check that the input buffer contains a complete packet because it doesn't care, it can handle incomplete packets.

The async code was added a long time after the original sync code. The async code causes long delays when executing the replay mechanism used to support multi-packet payloads. Those delays mean that it's virtually impossible to observe partial packets while async. The async code takes entire input buffers and adds them to the packet list that it uses. It assumes that those input buffers are complete. This was a reasonable assumption given the observable data. Unfortunately I think it was wrong.

If you run the code fast enough in async multi-packet situations you can encounter a network read where packet-size of data was requested but less than packet-size of data was available in the os buffer even though the complete packet will be packet-size in length. The read will return partial-first-size of data and the async code path does not check that partial-first-size against the size of the full packet defined in the packet header. This packet will be added to the async buffer list.
The next data will not start with a tds header because the next data received will be the remainder of the partial packet. The async playback mechanism does not handle this case and will eventually attempt to replay the packet and interpret the first 8 bytes of the packet as a tds header, this will result in incorrect things happening and (hopefully) a crash.

I believe the current code is incapable of running fast enough to encounter the problem. When you speed up async multi-packet reads it occurs often enough to spoil test runs but it's not entirely deterministic and the more debugging logic you add to try and catch it the slower execution becomes and the less it happens.
When it happens you get a packet buffer list that contains data like the below. This is one of the very few times I managed to catch it under the debugger. The first 8 bytes must contain the packet header and you can see that the final packet does not.
image

@mdaigle
Copy link
Contributor

mdaigle commented Dec 20, 2024

Thank you, that makes sense. I wanted to make sure there wasn't some additional low-level networking behavior we were working around. Email received; I'll reach out shortly.

Copy link
Contributor

@samsharma2700 samsharma2700 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review iteration 1

throw SQL.ParsingError(ParsingErrorState.ProcessSniPacketFailed);
}
}
return SNIPacketGetData(packet, _inBuff, ref dataSize);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it will be possible to avoid this function hop GetSniPacket->SNIPacketGetData?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly. It's part of the separation between the managed and unmanaged SNI implementations and looks like it's currently just a function to function indirection to the override. If it's really important I can collapse it but it's unlikely to cause any problems and it keeps the existing structure in place for those familiar with the layering making it easier to understand.

@@ -393,60 +401,9 @@ private void ReadSniError(TdsParserStateObject stateObj, uint error)
AssertValidState();
}

public void ProcessSniPacket(PacketHandle packet, uint error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we still keep the older function? I am still trying to understand all the checks that were removed with the new function call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous version of ProcessSniPacket ? Why would we? we can't swap out the implementations without also adding the ability to disable all partial packet checking. The core idea of the PR was that the change was limited to a small location, primarily ProcessSniPacket which is the communication point between two layers. Unfortunately the rather messy bug fixes that have been needed in other places make this more complex but ProcessSniPacket is still the primary location where new functionality is present.

Copy link
Contributor

@mdaigle mdaigle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The need for multiplexing is clear given the current code structure, and the multiplexing logic itself is well contained. The tests give me confidence that we're multiplexing correctly.

I have less confidence in the overall orchestration of network calls and that the multiplexing logic is invoked correctly. There are too many different locations and situations in which the logic is invoked, I can't really reason about it. At the end of the day, we may just have to trust our existing code coverage.

Code Structure

The big thing that makes this challenging to reason about is that we're looping in two places (one for network reads, one for packet multiplexing), 4 or 5 layers apart and coordinating between them with state. The loops are also inverted from what you would expect (network read outside, multiplex inside). There's too much indirection at the moment. I really want us to move toward tighter loops at the same layer.

There's a lot of tricky stuff glossed over here, but conceptually I'm thinking something like:

SNIPacket GetPacket() {
    EnsurePacket()
    SNIPacket packet = ReadPacket()
    ResetBuffer()
    return packet
}

EnsurePacket()
{
    if (buffer.length < HEADER_SIZE) {
        EnsureBytes(HEADER_SIZE)
    }

    int data_size = buffer.readInt(...)

    if (buffer.length < HEADER_SIZE + data_size) {
        EnsureBytes(data_size)
    }
}

EnsureBytes(int count, async)
{
    while (buffer.length < count) {
        NetworkRead(PACKET_SIZE)
    }
}

Taking some inspiration from npgsql: https://github.com/npgsql/npgsql/blob/main/src/Npgsql/Internal/NpgsqlConnector.cs#L1349

This setup inverts the current structure. In this way it's easier to see where network reads occur and orchestrate them properly to create a complete packet. You can get a complete packet out at the end and reference that rather than having future operations directly access the underlying buffer. It also removes the need for a lot of global state because the loops manage all of the state locally.

I want to explore any ways we can start moving towards this in this PR when we chat, but I won't block on it.

Performance numbers

It would be exciting to compile some performance numbers to show the impact of this change. Do you have any tests in mind that you used while implementing? I'll also think about anything I can add on.

}
}

private static bool SetupRemainderPacket(Packet packet)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really like to see the DataLength being set automatically inside the Packet class. As much as possible, the Packet class should enforce that it is internally consistent any time it is constructed or modified. This will help us make sure that we don't accidentally omit a setup call if we modify the multiplexing code in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The setter is used in 3 places and each is a different situation. In one case we know the value ahead of time, in the other two we don't know it and we're setting it using the function to fetch it from the header and in theory we can simply call the function for one of those but it'll do extra work that we don't need.

In general the Packet class is a dumb container of related data rather than being something that contains interesting logic for behaviours. Accessors has sanity checks but little more. Think of internals more like C than C#. Of course I can move behaviours into it but it increases cognitive load for little payback.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Dec 31, 2024

The need for multiplexing is clear given the current code structure, and the multiplexing logic itself is well contained. The tests give me confidence that we're multiplexing correctly.

Excellent. The tests along with all the new issues that occurred with the linux CI runs are the best indicator that we're doing to get until it arrives with end users.

I have less confidence in the overall orchestration of network calls and that the multiplexing logic is invoked correctly. There are too many different locations and situations in which the logic is invoked, I can't really reason about it. At the end of the day, we may just have to trust our existing code coverage.

The multiplexing was invoked incorrectly. More precisely I'd missed a couple of places that pending reads could terminate and those have now been added. The effort required to untangle the internal layering is very high, the risk is very high and there are no immediate benefits to consumers. We've simplified and transformed some parts of the codebase over the years this repository has existed but the core parts are hard to change safely.

In general large refactorings are unlikely to be helpful in this current codebase when the time could be spent on SqlClientX bringing it to feature parity.

Code Structure

The big thing that makes this challenging to reason about is that we're looping in two places (one for network reads, one for packet multiplexing), 4 or 5 layers apart and coordinating between them with state. The loops are also inverted from what you would expect (network read outside, multiplex inside). There's too much indirection at the moment. I really want us to move toward tighter loops at the same layer.

There's also the smux layer. The multiplexer sits above that so I don't have to deal with untangling the smux header and the tds header.
Using less indirection by dropping the unmanaged sni implementation entirely is a long term goal. It will allow us to collapse some layers of indirection. Unfortunately we can't do that until the performance of the managed sni is roughly equivalent to the managed version. The last time I saw numbers mentioned it was ~10 throughput behind. It also has problems that we can't solve like #422 which tracks back to the runtime.

Performance numbers

It would be exciting to compile some performance numbers to show the impact of this change. Do you have any tests in mind that you used while implementing? I'll also think about anything I can add on.

I haven't run any performance numbers for this PR in isolation. In general I split my work into two categories; Perf and Fixes. For perf work numbers are the main driving force so I'll explain what I'm measuring and why. For fixes correctness is the goal and unless I have reason to believe that it will negatively affect performance and even then I'll make changes that slow things down because I consider correctness more important than speed. I consider this (and the previous part-1) a Fix pr for correctness.

For this PR there will be some level of performance drop. I expect it to be small because on the happy path there should be few to no new allocations with some constant cpu required to evaluate the new path. On unhappy paths where we require reconstructions we will observe the query that encounters the fragmentation slow down when it hits the fragmented buffers and it will then continue seeing fragments until the query ends. The next query on the same connection will start with unfragmented packets again and will see the typical speed.

If your workload is fixed length type heavy you should see little to no change. The small additional packet receive time will be hidden by the work required to get the values out of the packet. If your workload contains large multi-buffer values than this will slow you down more noticeably but part-3 of this series of PR's will linearize multi-buffer fetches so you're probably doing to see an overall speed increase.

Comment on lines 294 to 296
int remainderLength = partialPacket.CurrentLength - partialPacket.RequiredLength;

partialPacket.CurrentLength -= remainderLength;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the partialPacket.CurrentLength be changed somewhere else asynchronously, because otherwise this is equivalent of writing partialPacket.CurrentLength = partialPacket.RequiredLength. Just trying to understand the reasoning here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can't be changed elsewhere. I wrote it this way because it was logical to me, it documents what I'm doing not just what the result will be.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a valid feedback, can you consider rewriting it to partialPacket.CurrentLength = partialPacket.RequiredLength or do you see any concerns?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just a style difference as far as I can see. I feel that my version follows the flow of calculating and using whereas the alternative simply jumps to a resultant fact. It doesn't make enough difference for me to fight about it and with 2:1 votes I've changed it as suggested.

You may also note that I followed up the discussion about making partial packet disposal exception a debug only behaviour in 79f9fbc but forgot to mention it.

Copy link
Contributor

@mdaigle mdaigle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to move forward and see how this performs in a preview release! Looking forward to the next stage 😄

@cheenamalhotra
Copy link
Member

We discussed offline, and our concerns are about things going wrong in production for end users that could require us to patch the driver in future. Because this is an intricate change in the packet processing logic, we're considering enabling this feature by default, but be able to turn it off on demand using an App Context switch.

The App Context switch would mean, we will preview this feature for couple of releases, till we have stabilized it and are confident the design is thorough and safe for permanent replacement, after which we can even remove the app context switch to remove old design.

@Wraith2 One of our team members will get started on this, and collaborate with you to make that change in this PR, so we can have the App Context switch come in before the next PR you're working on.

We're fully supportive of the changes, but also need to ensure we have a fallback plan for customers if anything goes wrong.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Jan 19, 2025

Ok. I've made a start by adding the required appcontext switch, copying the current ProcessSni implementation into the muliplexer file as ProcessSniCompat and updating new tests to be conditional on the appcontext switch. The existing ProcessSni call will check the appcontext switch and call the old one if the compat switch is set.

I'm not sure how you need to update the CI legs to account for testing both full sets to make sure we get coverage of both paths.

/// <summary>
/// returns a boolean value indicating if the <see cref="DataLength"/> value has been set.
/// </summary>
public bool HasDataLength => _dataLength >= 0;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider != UnknownDataLength. That's the point of having a named constant, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not quite the same thing. The constant is an arbitrary value which is known to produce a false result when this property is evaluated. The property defines ranges, the constant simply represents a single value inside that range that we use to have a specific named meaning. If you really want me to change it then I will but not using it is meaningful to me.

{
if (error != 0)
{
if ((_parser.State == TdsParserState.Closed) || (_parser.State == TdsParserState.Broken))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider _parser.State is TdsParserState.Closed or TdsParserState.Broken

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This falls into the category of maintaining consistent style. The change wouldn't improve functionality and would leave us with files in the solution using different syntax to do the same thing which can cause you not to be able to find them or have to mentally determine that both syntaxes mean the same thing.

If we're going to make this change I'd rather do it in a dedicated style update PR. I don't object to the new syntax I just think we should be consistent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants