-
Notifications
You must be signed in to change notification settings - Fork 281
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add partial packet detection and fixup #2714
base: main
Are you sure you want to change the base?
Conversation
I've added comments to the Packet class as requested. The CI was green apart from some ubuntu legs which timed out, many other ubuntu legs succeeded so I don't see any direct inference on from that. Ready for review @David-Engel @saurabh500 @cheenamalhotra |
@Wraith2 We are reviewing this and hope to get faster traction towards EOW. |
Pasting test failure for reference:
This test should be looked at carefully. It failed on Ubuntu with .NET 6 and 8 , and also hung up on Windows when ran with Managed SNI, link to logs 1 link to logs 2. @Wraith2 can you confirm if this is something you're able to repro in Windows with Managed SNI? Please make sure config file is configured to enable Managed SNI on Windows. |
{ | ||
// Do nothing with callback if closed or broken and error not 0 - callback can occur | ||
// after connection has been closed. PROBLEM IN NETLIB - DESIGN FLAW. | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a Debug Assert here and check if this is taking any hit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test needs to be fixed, before reviewing any further.
Isn't this the set of tests that @David-Engel pointed out in #2608 (comment) ? If so we discussed it at length on the teams call. I don't believe that those tests are reliable. Setup a breakpoint or Debug.WriteLine where an exception is added to the state object and run the test. You should find that an exception is always added to the state object but that the test will usually succeed. That should not be possible, an exception if added should be thrown. The test is missing failures and if that's the case then the test is unreliable. |
When you work past the terrible code in SNITCPHandle and make the test run for long enough it settles into a steady state where it can't reach the end. There is no indication why yet.
those 9 in flight items just don't seem to complete but i don't know why. |
After a few more hours investigation I know what the problem is but I have no clue what change has caused it. In SqlDataReader when an async method is called we use a context object to contain some state and pass that context object to all the async methods that are used to implement the async read machinery. Part of this state is the TaskCompletionSource. What I don't understand is how cancel is supposed to work. I'm unable to run the tests in native sni mode because the native sni can't be initialized (can't find the sni dll). So I can't compare the managed to unmanaged implementations here. I don't believe that I have made any change that should affect cancellation. I have verified that there are no partial packets in the state objects when the async tasks get stuck. I don't understand how async cancellation is supposed to work at all. |
Add debugging and tighten the exception detection in async cancel tests.
88fbb93
to
8b57818
Compare
Can someone with CI access rerun the failed legs? the failures are random or CI resources not being available as far as i can tell. |
The current failures are interesting. They're in the test that was failing before but they new ones are only detected because i made the test more accurate.
The previous version of the test accepted any exception when it was expecting a cancellation exception. It was passing on netfx with my previous changes because timeout exceptions were being thrown. I judged that accepting a timeout when we were supposed to be testing whether cancellation had occurred was not correct. If we retained the previous version of the test then everything would have passed cleanly. In the current situation since the test completed correctly without hanging the result is equivalent to what we would have experienced in all test runs in the past, all started threads that we expected to be cancelled exited with an exception. [edit] |
@Wraith2 You might be banging your head against an unrelated issue in the driver. IIRC, the test was only introduced to ensure we don't regress "The MARS TDS header contained errors." issue. (The test code came from the repro.) If you isolate your test changes and run them against main code, does it still fail? Yes, the correct exception is probably "Operation cancelled by user." where the exception is being caught. But if it's unrelated to your other changes, I would leave that part of the test as it was and file a new issue with repro code. As it is, it's unclear if and how this behavior is impacting users and I wouldn't hold up your perf improvements for it. |
There was definitely a real problem. The results differed between main and my branch. I've solved that issue now and the current state is that we're seeing a real failure because I've made the test more sensitive. I think it's probably safe to lower the sensitivity of the test again now because the new test that I've added covers the specific scenario in the multiplexer that I had missed and everything else is pass. I'll try that and see how the CI likes it. I think the current state on this branch is that it is as stable as live. We need to have confidence that this set of changes is correct before we can merge it. It's high risk and high complexity code. Even understanding it very deeply it has taken me a week to actively debug a very important behaviour change that I missed. |
Can someone re-run the failed legs? the only failing test is something to do with event counters which I've been no-where near. |
The failing test is EventCounter_ReclaimedConnectionsCounter_Functional. It's doing something with GC specific to net6. It's failing sporadically on net6 managed sni runs but not deterministically. I can't make it fail locally to trace what might be happening. |
Any thoughts? |
I'm not seeing the failures you mentioned in EventCounter_ReclaimedConnectionsCounter_Functional [in the CI results]. I mainly see fairly consistent failures of CancelAsyncConnections on Linux. It seems to pass on Windows managed SNI, so there might be something that is Linux/Ubuntu network specific. Can you run/debug the test against a local WSL or Docker instance? |
If i click through the failure i get to this page https://sqlclientdrivers.visualstudio.com/public/_build/results?buildId=95784&view=ms.vss-test-web.build-test-results-tab The cancel tests are passing now, those failed in the previous runs but not the current ones. |
If it's AsyncCancelledConnectionsTest again then there isn't anything further I can do. That test is multithreaded and timing dependent. I've traced the individual packets through the entire call stack. I've run it for 1000 iterations successfully after fixing a reproducible error in it. If someone can isolate a reproducible problem from it then i'll investigate. |
I chatted with @saurabh500 and I just want to add that this is definitely something we all want to see get merged. It'll just take someone finding time (could take a few days dedicated time) to get their head wrapped around the new code and be able to help repro/debug to find the issue. |
I'm happy to make myself available to talk through the code with anyone that needs it. |
@Wraith2 and @David-Engel I was looking at the lifecycle of the snapshots and something that stood out in NetCore vs NetFx is that SqlDataReader for NetCore is storing the cached snapshot with the SqlInternalConnectionTds which is a shared resource among all the SqlDataReader(s) running on a MARS connection.
This means that we are saving the reader snapshot on the shared resource, which can be overwritten by any other reader. @Wraith2 have you had a chance to pursue this line of investigation for hanging test? I wonder if the timing is causing the wrong cached snapshot to be provided to a SqlDataReader, causing data corruption and likely causing a hang. |
SqlInternalConnection.cs
|
@Wraith2 I see that you had made the changes in the first place. Can you try another PR where you remove the storage of these contexts and snapshots on SqlInternalConnection and with the multiplexing change, try to see if this solves the problem. Also, I am Happy to be told that my theory is wrong, but I would like to understand how in MARS cases, the shared Cached contexts on InternalConnection is a safe design choice. |
The snapshot caching has been present for a long time now and it's cleared and never moves between owners. The snapshot lifetime is dealt with by the async prepare and related functions which are highly conserved between both versions of the driver. The cached async context objects are kept on the internal connection because that's the long-lived objects that is kept in the pool. Their ownership is opportunistic so if one is present on the internal connection it is always safe to use (see the Clear method on the base object for lifetime management) and if one isn't present then a new one is created. We could remove those caches for a small memory cost on repeated async operations but there's been no indication that they're in any way involved in a problem.
They're safe because interlocked is used to change their ownership. In Mars cases if you have multipe active readers that try to use the cached object only one can use it and the other will silently create a new object and use it. Only one can be put back in the cache and the other is dropped to the GC. Mars is not a common case and does not need to be as efficient as the non-mars case. |
I see. So its necessarily an object for anyone to opportunistically grab/reset and then reuse? Is this correct ? |
The async contexts, yes. All of them should be used through interlocks. |
@Wraith2 Have you looked at this failure ?
|
I did, it doesn't repro locally. On the topic of reproducing problems. I've build an ubuntu 22.04 vm and installed rider to allow me to build this project. The tests require runtime version 2.1 for some reason and that isn't available (and hasn't been for a long time) on any supported ubuntu build. How are you running development environments on linux to investigate things? Is there some magic way to get netcore2.1 onto ubuntu that i can't find? |
I personally use the Ubuntu distribution with the .Net runtime and mostly use command line to launch the tests, and leverage console logs and traces. |
I'm trying the linux thing again. The buildguide should be up to date but it's not helping. For example, the commands list just running Have you had any luck getting a linux system build and working? has anyone else? |
wraith@UbuntuDev2204:~/RiderProjects/SqlClient$ dotnet test "src\Microsoft.Data.SqlClient\tests\ManualTests\Microsoft.Data.SqlClient.ManualTesting.Tests.csproj" -p:Platform="AnyCPU" -p:Configuration="Release" -p:TestTargetOS="Unixnetcoreapp" --no-build -v n --filter "FullyQualifiedName=Microsoft.Data.SqlClient.ManualTesting.Tests.AsyncCancelledConnectionsTest.CancelAsyncConnections" Build started 13/09/2024 22:11:16. Test run for /home/wraith/RiderProjects/SqlClient/artifacts/Project/bin/Unix/Release.AnyCPU.ManualTests/netcoreapp2.1/ManualTests.dll (.NETCoreApp,Version=v2.1) Microsoft (R) Test Execution Command Line Tool Version 17.3.3 (x64) Copyright (c) Microsoft Corporation. All rights reserved. Starting test execution, please wait... A total of 1 test files matched the specified pattern. Passed! - Failed: 0, Passed: 1, Skipped: 0, Total: 1, Duration: < 1 ms - /home/wraith/RiderProjects/SqlClient/artifacts/Project/bin/Unix/Release.AnyCPU.ManualTests/netcoreapp2.1/ManualTests.dll (netcoreapp2.1) Build succeeded. 0 Warning(s) 0 Error(s) Time Elapsed 00:00:41.22 |
You can find the command from the CI logs (e.g. link to log) actually, in this case the command run on Unix agents is:
I only needed to install dotnet SDKs (8.0 and 6.0) and PowerShell on Ubuntu 22 in order to run this command successfully. Further steps to build and run steps are also available in logs likewise. Note: These steps don't apply to WSL. |
Getting those items installed on the right version of Ubuntu didn't prove to be simple. It would also have been good to know that it has to be done on Ubuntu 2204 not 2404 because of package availability. Even with those things installed the test does not fail on my machine, as you can see from #2714 (comment) , and i cannot use remote debug over ssh to try and step through i can only run locally. So all in all a month mostly wasted. |
I finally got the error to repro. Haven't debugged anything, yet. But here's my setup: Installed Ubuntu 22.04 under WSL (install Ubuntu 22.04 from Microsoft Store). Install PowerShell on Ubuntu: https://learn.microsoft.com/en-us/powershell/scripting/install/install-ubuntu?view=powershell-7.4#installation-via-package-repository-the-package-repository Set MS packages repo preferred over Ubuntu's to avoid .NET errors wrt package conflicts. Create
Install .NET 6.0 and 8.0 SDKs:
Clone SqlClient and checkout branch:
Define
Restore NuGets:
Build
Hopefully I didn't miss any steps. I am leaving out a bunch of commands that failed to resolve the various issues I encountered just to get tests to run. My hope is this will get you to a point where you can actually debug the issue.
|
That works flawlessly up until the build part. wraith@Ubuntu:~/SqlClient$ dotnet msbuild
MSBuild version 17.11.4+37eb419ad for .NET
Determining projects to restore...
All projects are up-to-date for restore.
Determining projects to restore...
All projects are up-to-date for restore.
Determining projects to restore...
All projects are up-to-date for restore.
Determining projects to restore...
All projects are up-to-date for restore.
Microsoft.Cci.Extensions -> /home/wraith/SqlClient/tools/GenAPI/Microsoft.Cci.Extensions/bin/Release/netstandard2.0/Microsoft.Cci.Extensions.dll
Microsoft.DotNet.GenAPI -> /home/wraith/SqlClient/tools/GenAPI/Microsoft.DotNet.GenAPI/bin/Release/net6.0/Microsoft.DotNet.GenAPI.dll
Microsoft.DotNet.GenAPI -> /home/wraith/SqlClient/tools/GenAPI/Microsoft.DotNet.GenAPI/bin/Release/net8.0/Microsoft.DotNet.GenAPI.dll
Microsoft.DotNet.GenAPI -> /home/wraith/SqlClient/tools/GenAPI/Microsoft.DotNet.GenAPI/bin/Release/net472/Microsoft.DotNet.GenAPI.exe
Build succeeded.
0 Warning(s)
0 Error(s)
Time Elapsed 00:00:01.13
Microsoft.SqlServer.Server -> /home/wraith/SqlClient/artifacts/Project/bin/Unix/Debug.AnyCPU/Microsoft.SqlServer.Server/netstandard2.0/Microsoft.SqlServer.Server.dll
Microsoft.SqlServer.Server -> /home/wraith/SqlClient/artifacts/Project/bin/Unix/Debug.AnyCPU/Microsoft.SqlServer.Server/net46/Microsoft.SqlServer.Server.dll
Microsoft.Data.SqlClient -> /home/wraith/SqlClient/artifacts/Project/bin/Unix/Debug/Microsoft.Data.SqlClient/ref/net6.0/Microsoft.Data.SqlClient.dll
Microsoft.Data.SqlClient -> /home/wraith/SqlClient/artifacts/Project/bin/Unix/Debug/Microsoft.Data.SqlClient/ref/net8.0/Microsoft.Data.SqlClient.dll
Microsoft.Data.SqlClient -> /home/wraith/SqlClient/artifacts/Project/bin/Unix/Debug.AnyCPU/Microsoft.Data.SqlClient/netcore/net8.0/Microsoft.Data.SqlClient.dll
Microsoft.Data.SqlClient -> /home/wraith/SqlClient/artifacts/Project/bin/Unix/Debug.AnyCPU/Microsoft.Data.SqlClient/netcore/net6.0/Microsoft.Data.SqlClient.dll
Microsoft.Data.SqlClient -> /home/wraith/SqlClient/artifacts/Project/bin/AnyOS/Debug/Microsoft.Data.SqlClient/ref/net6.0/Microsoft.Data.SqlClient.dll
Microsoft.Data.SqlClient -> /home/wraith/SqlClient/artifacts/Project/bin/AnyOS/Debug/Microsoft.Data.SqlClient/ref/net8.0/Microsoft.Data.SqlClient.dll
/usr/bin/sh: 2: /tmp/MSBuildTempwraith/tmp6c4f199d3704486ab7f21105cfc06597.exec.cmd: /home/wraith/SqlClient/src/../artifacts/Project/tools/net8.0\Microsoft.DotNet.GenAPI.dll: not found
/home/wraith/SqlClient/tools/targets/NotSupported.targets(48,5): error MSB3073: The command " " /home/wraith/SqlClient/src/../artifacts/Project/tools/net8.0\Microsoft.DotNet.GenAPI.dll" "/home/wraith/SqlClient/artifacts/Project/bin/AnyOS/Debug/Microsoft.Data.SqlClient/ref/net8.0/Microsoft.Data.SqlClient.dll" -l:"/home/wraith/.nuget/packages/azure.core/1.38.0/lib/net6.0/;/home/wraith/.nuget/packages/azure.identity/1.11.4/lib/netstandard2.0/;/home/wraith/.nuget/packages/microsoft.bcl.asyncinterfaces/1.1.1/ref/netstandard2.1/;/usr/share/dotnet/packs/Microsoft.NETCore.App.Ref/8.0.8/ref/net8.0/;/home/wraith/.nuget/packages/microsoft.extensions.caching.abstractions/8.0.0/lib/net8.0/;/home/wraith/.nuget/packages/microsoft.extensions.caching.memory/8.0.0/lib/net8.0/;/home/wraith/.nuget/packages/microsoft.extensions.dependencyinjection.abstractions/8.0.0/lib/net8.0/;/home/wraith/.nuget/packages/microsoft.extensions.logging.abstractions/8.0.0/lib/net8.0/;/home/wraith/.nuget/packages/microsoft.extensions.options/8.0.0/lib/net8.0/;/home/wraith/.nuget/packages/microsoft.extensions.primitives/8.0.0/lib/net8.0/;/home/wraith/.nuget/packages/microsoft.identity.client/4.61.3/lib/net6.0/;/home/wraith/.nuget/packages/microsoft.identity.client.extensions.msal/4.61.3/lib/net6.0/;/home/wraith/.nuget/packages/microsoft.identitymodel.abstractions/7.5.0/lib/net8.0/;/home/wraith/.nuget/packages/microsoft.identitymodel.jsonwebtokens/7.5.0/lib/net8.0/;/home/wraith/.nuget/packages/microsoft.identitymodel.logging/7.5.0/lib/net8.0/;/home/wraith/.nuget/packages/microsoft.identitymodel.protocols/7.5.0/lib/net8.0/;/home/wraith/.nuget/packages/microsoft.identitymodel.protocols.openidconnect/7.5.0/lib/net8.0/;/home/wraith/.nuget/packages/microsoft.identitymodel.tokens/7.5.0/lib/net8.0/;/home/wraith/.nuget/packages/microsoft.sqlserver.server/1.0.0/lib/netstandard2.0/;/home/wraith/.nuget/packages/system.clientmodel/1.0.0/lib/net6.0/;/home/wraith/.nuget/packages/system.configuration.configurationmanager/8.0.0/lib/net8.0/;/home/wraith/.nuget/packages/system.diagnostics.eventlog/8.0.0/lib/net8.0/;/home/wraith/.nuget/packages/system.identitymodel.tokens.jwt/7.5.0/lib/net8.0/;/home/wraith/.nuget/packages/system.memory.data/1.0.2/lib/netstandard2.0/;/home/wraith/.nuget/packages/system.security.cryptography.protecteddata/8.0.0/lib/net8.0/" -o:"/home/wraith/SqlClient/src/../artifacts/Project/obj/Debug.AnyCPU/Microsoft.Data.SqlClient/netcore//net8.0/Microsoft.Data.SqlClient.notsupported.cs" -t:"Microsoft.Data.SqlClient is not supported on this platform."" exited with code 127. [/home/wraith/SqlClient/src/Microsoft.Data.SqlClient/netcore/src/Microsoft.Data.SqlClient.csproj::TargetFramework=net8.0]
/usr/bin/sh: 2: /tmp/MSBuildTempwraith/tmp6cd4c5148eef4de6983e6d1ce31ff54f.exec.cmd: /home/wraith/SqlClient/src/../artifacts/Project/tools/net6.0\Microsoft.DotNet.GenAPI.dll: not found
/home/wraith/SqlClient/tools/targets/NotSupported.targets(48,5): error MSB3073: The command " " /home/wraith/SqlClient/src/../artifacts/Project/tools/net6.0\Microsoft.DotNet.GenAPI.dll" "/home/wraith/SqlClient/artifacts/Project/bin/AnyOS/Debug/Microsoft.Data.SqlClient/ref/net6.0/Microsoft.Data.SqlClient.dll" -l:"/home/wraith/.nuget/packages/azure.core/1.38.0/lib/net6.0/;/home/wraith/.nuget/packages/azure.identity/1.11.4/lib/netstandard2.0/;/home/wraith/.nuget/packages/microsoft.bcl.asyncinterfaces/1.1.1/ref/netstandard2.1/;/home/wraith/.nuget/packages/microsoft.netcore.app.ref/6.0.33/ref/net6.0/;/home/wraith/.nuget/packages/microsoft.extensions.caching.abstractions/6.0.0/lib/netstandard2.0/;/home/wraith/.nuget/packages/microsoft.extensions.caching.memory/6.0.1/lib/netstandard2.0/;/home/wraith/.nuget/packages/microsoft.extensions.dependencyinjection.abstractions/6.0.0/lib/net6.0/;/home/wraith/.nuget/packages/microsoft.extensions.logging.abstractions/6.0.0/lib/net6.0/;/home/wraith/.nuget/packages/microsoft.extensions.options/6.0.0/lib/netstandard2.1/;/home/wraith/.nuget/packages/microsoft.extensions.primitives/6.0.0/lib/net6.0/;/home/wraith/.nuget/packages/microsoft.identity.client/4.61.3/lib/net6.0/;/home/wraith/.nuget/packages/microsoft.identity.client.extensions.msal/4.61.3/lib/net6.0/;/home/wraith/.nuget/packages/microsoft.identitymodel.abstractions/7.5.0/lib/net6.0/;/home/wraith/.nuget/packages/microsoft.identitymodel.jsonwebtokens/7.5.0/lib/net6.0/;/home/wraith/.nuget/packages/microsoft.identitymodel.logging/7.5.0/lib/net6.0/;/home/wraith/.nuget/packages/microsoft.identitymodel.protocols/7.5.0/lib/net6.0/;/home/wraith/.nuget/packages/microsoft.identitymodel.protocols.openidconnect/7.5.0/lib/net6.0/;/home/wraith/.nuget/packages/microsoft.identitymodel.tokens/7.5.0/lib/net6.0/;/home/wraith/.nuget/packages/microsoft.sqlserver.server/1.0.0/lib/netstandard2.0/;/home/wraith/.nuget/packages/microsoft.win32.systemevents/6.0.0/lib/net6.0/;/home/wraith/.nuget/packages/system.clientmodel/1.0.0/lib/net6.0/;/home/wraith/.nuget/packages/system.configuration.configurationmanager/6.0.1/lib/net6.0/;/home/wraith/.nuget/packages/system.drawing.common/6.0.0/lib/net6.0/;/home/wraith/.nuget/packages/system.identitymodel.tokens.jwt/7.5.0/lib/net6.0/;/home/wraith/.nuget/packages/system.memory.data/1.0.2/lib/netstandard2.0/;/home/wraith/.nuget/packages/system.security.cryptography.protecteddata/6.0.0/lib/net6.0/;/home/wraith/.nuget/packages/system.security.permissions/6.0.0/lib/net6.0/;/home/wraith/.nuget/packages/system.windows.extensions/6.0.0/lib/net6.0/" -o:"/home/wraith/SqlClient/src/../artifacts/Project/obj/Debug.AnyCPU/Microsoft.Data.SqlClient/netcore//net6.0/Microsoft.Data.SqlClient.notsupported.cs" -t:"Microsoft.Data.SqlClient is not supported on this platform."" exited with code 127. [/home/wraith/SqlClient/src/Microsoft.Data.SqlClient/netcore/src/Microsoft.Data.SqlClient.csproj::TargetFramework=net6.0]
wraith@Ubuntu:~/SqlClient$ |
Split out from #2608 per discussion detailed in #2608 (comment)
Adds packet multiplexer and covering tests.