Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.
/ corefx Public archive

SqlClient add TestPacketNumberWraparound fix and test #34709

Merged
merged 8 commits into from
Mar 26, 2019

Conversation

Wraith2
Copy link
Contributor

@Wraith2 Wraith2 commented Jan 20, 2019

Fixes https://github.com/dotnet/corefx/issues/13439

Using a large amount of parameter input it was possible to cause an infinite loop in the packet construction logic as it tripped over a check designed to be attempted only on the first packet of a send. This was caused by the packet number variable being a byte as defined by the TDS protocol and wrapping around.

I have added a second counter which does not wrap and making sure that it is incremented in line with the packet number. The new counter is then used to identify the first packet instead of the number so it is now astonishingly difficult to cause it to wrap.

Standard and manual tests pass in native mode. It's at the Tds layer so it should be identical in managed.
The new test requires an sql connection to accept the packet stream so it is a manual test.
It's setup using the logic supplied in the original issue by @brandonagr . If the test fails the manual test suite may hang or leave a thread connected indefinitely until the process dies, there is no way I can avoid this because of the nature of the bug.

/cc the usual people, @afsanehr @keeratsingh @saurabh500

change TdsParserStateObject to add _outputPacketCount and use it
@Wraith2 Wraith2 changed the title add TestPacketNumberWraparound test SqlClient add TestPacketNumberWraparound fix and test Jan 20, 2019
@AfsanehR-zz
Copy link
Contributor

@Wraith2 Changes looks good. Could you please update your branch with the latest changes from master so we can do a final build of tests before merging?

@AfsanehR-zz AfsanehR-zz added this to the 3.0 milestone Feb 27, 2019
@Wraith2
Copy link
Contributor Author

Wraith2 commented Feb 27, 2019

Done

@@ -63,8 +63,8 @@ internal abstract class TdsParserStateObject
// Packet state variables
internal byte _outputMessageType = 0; // tds header type
internal byte _messageStatus; // tds header status
internal byte _outputPacketNumber = 1; // number of packets sent to server
// in message - start at 1 per ramas
internal byte _outputPacketNumber = 1; // number of packets sent to server in message - start at 1 per ramas
Copy link
Contributor

@AfsanehR-zz AfsanehR-zz Feb 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Wraith2 What does ramas stand for/means?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I just saw this already existed in the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed it was the name of the guy on the sql server team that the author asked for help 😁

@AfsanehR-zz
Copy link
Contributor

AfsanehR-zz commented Feb 28, 2019

@Wraith2 Without your fix, I don't see the infinite loop happening in the test case you provided, rather the assertion fails with returning a lower number on the enumerator.Count. Anything I am missing here?

@Wraith2
Copy link
Contributor Author

Wraith2 commented Feb 28, 2019

The loop is in a timeout that is forced onto another thread by the task creation options. If the test data is all successfully used then it will take a second two and the counts will match. If the timerout happens (At which point another thread is stuck) then the counts won't match and the assert fires. So the assert means it fails as you'd expect but I made sure it didn't block the entire test run, you'll just find there's a thread somewhere that doesn't die until the process exits.

@karelz
Copy link
Member

karelz commented Mar 18, 2019

@afsanehr I see you approved the change. It is passing all tests. Any reason for NOT merging it? Thanks!

@AfsanehR-zz
Copy link
Contributor

@Wraith2 noticed two small Nit issues on the PR. Once that's done, we are ok for merging. 👍
Apologies for the confusion for the approval, I think I approved it confirming the tests passed.

@AfsanehR-zz
Copy link
Contributor

AfsanehR-zz commented Mar 19, 2019

@Wraith2 Could you please update this PR with master branch again?

@Wraith2
Copy link
Contributor Author

Wraith2 commented Mar 19, 2019

Done.

@AfsanehR-zz
Copy link
Contributor

@Wraith2 Thanks a lot. I am going to hold off merging now. The tests pass on Windows but the TestPacketNumberWraparound test fails on Linux machine. I need to investigate this more.

@AfsanehR-zz AfsanehR-zz self-requested a review March 19, 2019 21:31
@Wraith2
Copy link
Contributor Author

Wraith2 commented Mar 20, 2019

Interesting, the behaviourr is above the SNI layer so platform shouldn't make a difference. I'll see if I can replicate and debug.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Mar 20, 2019

Functional and manual tests run through successfully for me locally. Can you share the failure details you see?

@AfsanehR-zz
Copy link
Contributor

@Wraith2 I'll run this on a local Linux machine. This is happening in CI.

System.Data.SqlClient.ManualTesting.Tests.TvpTest.TestPacketNumberWraparound [FAIL]
Assert.True() Failure
Expected: True
Actual:   False
Stack Trace:
/home/dotnet/vsoagent1/_work/1/s/src/System.Data.SqlClient/tests/ManualTests/SQL/ParameterTest/TvpTest.cs(73,0): at System.Data.SqlClient.ManualTesting.Tests.TvpTest.TestPacketNumberWraparound()

@Wraith2
Copy link
Contributor Author

Wraith2 commented Mar 22, 2019

Hmmm. The easy solution is that it's taking a lot longer on that box, can you extend the timer and see if that helps? If it doesn't I'm at bit of a loss since as you know I don't do my test on Linux I force managed on windows.

),
Task.Delay(TimeSpan.FromSeconds(5))
);
Assert.True(enumerator.MaxCount == enumerator.Count);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: Failure on Linux,
It may help to output the enumerator.Count on the console when the failure happens so that we know where really is this failing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've upped the time from 5 seconds to 1 minute and added console debug info of state and time on failures. Lets see how that works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Wraith2 The increase in delay worked. CI passes now.

@AfsanehR-zz AfsanehR-zz merged commit d4e6617 into dotnet:master Mar 26, 2019
@Wraith2 Wraith2 deleted the sqlfix-13439 branch March 26, 2019 19:23
@pphari2001
Copy link

I'm running a .NET (4.6.1.) application which inserts binary data and observing the same issue. It is getting locked up in the System.Data.SqlClient.TdsParserStateObject.WritePacket method while executing the System.Data.SqlClient.SQLCommand.ExecuteNonQuery. See below the version of System.Data.dll used in the system.

C:\Windows\Microsoft.Net\assembly\GAC_64\System.Data\v4.0_4.0.0.0__b77a5c561934e089\System.Data.dll 4.7.3630.0

The strange behavior is that it is happening only in a build which we've released in the last Feb. We have rebuilt recently the same code base and run the application and there is no issue. We have compared all binaries/dlls of both build and it is exactly the same.

I noticed that there is a fix for this issue and would like to know which .NET version or specific System.Data.Dll has the fix? I'm using process explorer (System internals) to check the runtime usage of System.Data.dll and it shows the .NET 4.7 version. What are the possibilities, the issue shows only on a specific build? The same data we have tried with at least 4 different builds and couldn't reproduce the issue. Appreciate any help to identify this issue.

image

@Wraith2
Copy link
Contributor Author

Wraith2 commented Aug 9, 2020

The fix should be in Microsoft.Data.SqlClient which is a forked version of this driver which doesn't rely on a new framework release on netfx or netcore for releases. https://github.com/dotnet/SqlClient I advise using that if possible.

@pphari2001
Copy link

Thanks. What version of the Microsoft.Data.SqlClient or System.Data.SqlClient includes this fix?
Also, it is strange that the issue is not happening when I rebuild the same codebase? Any idea what could be the reason? Both builds are running on a same system and using the same System.Data.SqlClient.dll. The target .NET framework is in the project file is 4.6.1, but it is running 4.7.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Aug 9, 2020

Unless you have some reason to use an older version you should always use the latest version of Microsoft.Data.SqlClient. I'm not going to lookup which version exactly the fix is in but it should be there because it was forked from this repository which as you can see has the fix.

I don't know if the fix was backported to the desktop framework, that's all closed source. You could use a disassembler on the system assembly to look for the changes I've made in this PR If you really want to know that. Otherwise I suggest just updating.

@pphari2001
Copy link

Thanks for the update. If I rebuild the application with the same code base without any changes and run with the System.Data dll the problem is not happening (with the same data). Is there any chance when we rebuild a .NET application the compiler generates binaries differently with the optimization? If I need to use the Microsoft.Data.SqlClient library, I'll have to rebuild it with the new reference. But as soon as I rebuild the application the problem goes away. We need to isolate the problem with the binary since we have released this to the on-market systems which are widely used.

Below is the WritePacket method from the decompiled code:

internal Task WritePacket(byte flushMode, bool canAccumulate = false)
{
if (this._parser.State == TdsParserState.Closed || this._parser.State == TdsParserState.Broken)
throw ADP.ClosedConnectionError();
if (this._parser.IsYukonOrNewer && !this._bulkCopyOpperationInProgress && (this._outBytesUsed == this._outputHeaderLen + BitConverter.ToInt32(this._outBuff, this._outputHeaderLen) && this._outputPacketNumber == (byte) 1) || this._outBytesUsed == this._outputHeaderLen && this._outputPacketNumber == (byte) 1)
return (Task) null;
byte outputPacketNumber = this._outputPacketNumber;
bool flag = this._cancelled && this._parser._asyncWrite;
byte num;
if (flag)
{
num = (byte) 3;
this._outputPacketNumber = (byte) 1;
}
else if ((byte) 1 == flushMode)
{
num = (byte) 1;
this._outputPacketNumber = (byte) 1;
}
else if (flushMode == (byte) 0)
{
num = (byte) 4;
++this._outputPacketNumber;
}
else
num = (byte) 1;
this._outBuff[0] = this._outputMessageType;
this._outBuff[1] = num;
this._outBuff[2] = (byte) (this._outBytesUsed >> 8);
this._outBuff[3] = (byte) (this._outBytesUsed & (int) byte.MaxValue);
this._outBuff[4] = (byte) 0;
this._outBuff[5] = (byte) 0;
this._outBuff[6] = outputPacketNumber;
this._outBuff[7] = (byte) 0;
this._parser.CheckResetConnection(this);
Task task = this.WriteSni(canAccumulate);
if (flag)
task = AsyncHelper.CreateContinuationTask(task, new Action(this.CancelWritePacket), this._parser.Connection, (Action) null);
return task;
}

@David-Engel
Copy link

@pphari2001 This fix was never ported to System.Data.SqlClient in .NET Framework. I don't know if the exact problem was reproduced there, but this code change does not exist there. I don't recall if this change ever made it into a release of System.Data.SqlClient in .NET Core, either. But it did make it into Microsoft.Data.SqlClient during a 1.0 preview so it is certainly in 1.1.3 and the latest 2.0 releases.

picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
SqlClient add TestPacketNumberWraparound fix and test

Commit migrated from dotnet/corefx@d4e6617
@SurajGupta
Copy link

SurajGupta commented Nov 5, 2022

It is absolutely ridiculous that this fix wasn't applied to older versions of the .NET Framework. Myself and 2 senior engineers spent the last 2 days trying to figure out why an ADO.net call was not returning but also not throwing on a line of code that has been run millions of times. It didn't occur to us that there was a known infinite loop bug in the framework that Microsoft casually decided not to fix. We threw every creative idea we had at the problem and spent the entire time banging our heads against the wall. Giant waste of time.

@brandonagr - many thanks for reporting this issue and then creating a simple repro. If myself or anyone on my team finds you at a bar one day, drinks are on us. =)

wlscaudill added a commit to NaosProject/Naos.SqlServer that referenced this pull request Nov 5, 2022
…ecause this issue was encountered (dotnet/corefx#34709), which Microsoft has chosen to not fix after 6 years (at time of commit) and this was apparently a workaround to get out of the unlucky byte alignment.
@Wraith2
Copy link
Contributor Author

Wraith2 commented Nov 5, 2022

It's interesting that at no point in that time spent debugging you saw the advice that you should use the Microsoft.Data.SqlClient package which would have fixed the issue for you.

@SurajGupta
Copy link

SurajGupta commented Nov 5, 2022

What advice? ADO.net has been around forever and in my experience is very reliable. How many applications around the world would you estimate run a version of ADO.net with this bug? And might taking a new package reference in otherwise stable code, particularly for something as low-level as database communication, come with it's own risks? We certainly don't add dependencies casually. And it's absolutely not an obvious first guess of where to look. The nature of this particular issue introduces a major red herring. We are talking about code that talks to SQL Server, which itself has shipped in many versions with many updates. Stored procedures need to be rebuilt? Deadlock? Is the database growing but not fast enough? I can ask 10 more good questions. If someone gave me no context at all but said they were having trouble with ADO.net in .NET Framework 4.x but that they could connect to the database and were executing a vanilla command (e.g. no wrapped transactions, no tracing, just the basic ExecuteNonQuery), I would tell them they have a SQL Server issue or a buggy statement or buggy parameter construction and I think 999 out of 1000 times I might be right.

What I find so frustrating about your comment is how pervasive this attitude of "just jump on the new stuff already" is within MS as it relates to .NET. It's an easy position to take when your whole company isn't based on application code written on top of .NET Framework. I think I see an ivory tower. Anyways, EOL for 4.6.2, which is what we are on, is early 2027. MS has already broken it's promise to continue shipping .NET Framework when it was easing people's concerns about Core. My expectation is that MS will honor their maintenance commitment and push out this fix to the Framework. It's the professional thing to do.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Nov 5, 2022

Your frustration is understandable but your response here is misguided. If you believe that you have a maintenance contract which has been broken then you need to take that up with whoever provides that contract.

You hit a very rare bug. One so rare that your report is only the second time I've seen it. Sadly this happens as you well know. Yes, the SqlClient driver has been around since 1.0 of the framework but that doesn't mean that it's flawless just that it's old and most obvious bugs have been fixed.

You mention EOL for netfx being 2027. That's end of support but what is the definition for what issues will be supported? This specific issue is not a security problem which is the highest level. It's a very rare issue that doesn't cause incorrect data it just plain doesn't work which is the best kind of crash. Annoying? certainly. High priority for back porting and require everyone in the world to get an update that will cause ngen of the entire framework again? Not in my opinion and it looks like not in the opinion of the people who make these decisions at Microsoft.

System.Data.SqlClient can only ship with the framework. That means netfx and netcore. For netfx that's whenever there's a major release which is less often then netcore. Netcore is still only yearly unless a bug is severe enough to be backported. The Microsoft.Data.SqlClient package can ship on a more frequent schedule, and does. The new package contains drivers for both netfx and netcore and where possible bug fixes are applied equally to both. Performance and feature updates are also applied frequently. The separate package isn't there to force you to upgrade it's there to give people a better path to get new features and fixes.

The advice I give is always to upgrade to the Microsoft package because there's a good chance that your bug is either fixed or if it isn't then it can be fixed and shipped more quickly than if it had been in the System version. If this advice isn't getting out to people or you're ignoring it because you think you know better than that's a you problem not a me problem.

I don't work for Microsoft and never have. I do this in my spare time for free. My opinions are my own.

@SurajGupta
Copy link

SurajGupta commented Nov 7, 2022

I greatly appreciate your efforts in making the fix and doubly so as this was unpaid time. You are a hero! I also appreciate a debate in good faith of improving software which is near and dear to me and I think to you as well.

Just this morning I found the following in Windows Update: "Cumulative Update Preview for .NET Framework 3.5, 4.8 and 4.8.1". It seems like there was ample opportunity to roll this fix into any of the updates since 2019 (it was reported in 2016). I get that there are shades of gray with security being at one end of that spectrum. However, an infinite loop is software 101. In a "my first coding class" students will learn fairly quickly that you need to eventually exit a while loop. Back in the day, being stuck in an infinite loop was readily apparent - your single physical/logical CPU would spike and your machine would grind to a halt or crash. This is not so today with multi core machines. You are saying that this issue is very rare, but given the years of opportunities to bundle an infinite loop fix, my compass points more to apathy then it does to the pain of ngen. It's not a priority. I feel you MS; and thanks for constantly rebooting my machine to fix a bunch of Windows bloat-ware that you have become so fond of shipping. =)

But I also disagree with github-tickets-as-proof-of-rareness argument. I think reporting bugs is an exception and that most individuals just find a workaround and move on. I'm guilty of this likely over a hundred times in my career. Also, corporate policy may disallow employees from engaging in open source platforms because of IP or other concerns. Given the reach/scale of ADO.net and the nature of its execution (how many inserts with byte arrays do all apps across planet earth perform in a year? billions? trillions? more?), I think many more individuals have hit this issue. Clearly its not tens of thousands otherwise it would have surfaced earlier; I'm not trying to be dramatic. However, folks shouldn't make assumptions when we are talking about foundational code with broad deployment.

I appreciate your pushback. My team knows me well; I'm the first to say "I was wrong." On this, I maintain my position; it was unprofessional of MS not to ship this fix. We are looking into taking the new nuget package and all of this has been a good reminder that we need to find the time and resources to get on .net 5/6/...

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants