Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MsQuicStream may throw QuicOperationAbortedException when connection is closed by peer #69481

Closed
rzikm opened this issue May 18, 2022 · 13 comments
Assignees
Milestone

Comments

@rzikm
Copy link
Member

rzikm commented May 18, 2022

Observed when stressing functional tests:

[xUnit.net 00:00:03.59]     System.Net.Http.Functional.Tests.Http3ResponseStreamZeroByteReadTest_MsQuic.ZeroByteRead_BlocksUntilDataIsAvailable(async: True) [FAIL]
  Failed System.Net.Http.Functional.Tests.Http3ResponseStreamZeroByteReadTest_MsQuic.ZeroByteRead_BlocksUntilDataIsAvailable(async: True) [1 s]
  Error Message:
   System.Net.Quic.QuicOperationAbortedException : Operation aborted.
  Stack Trace:
     at System.Net.Quic.Implementations.MsQuic.MsQuicStream.SendReadOnlyMemoryAsync(ReadOnlyMemory`1 buffer, QUIC_SEND_FLAGS flags) in /home/rzikm/dotnet-runtime/src/libraries/System.Net.Quic/src/System/Net/Quic/Implementations/MsQuic/MsQuicStream.cs:line 1443
   at System.Net.Quic.Implementations.MsQuic.MsQuicStream.WriteAsync(ReadOnlyMemory`1 buffer, Boolean endStream, CancellationToken cancellationToken) in /home/rzikm/dotnet-runtime/src/libraries/System.Net.Quic/src/System/Net/Quic/Implementations/MsQuic/MsQuicStream.cs:line 309
   at System.Net.Test.Common.Http3LoopbackStream.SendFrameHeaderAsync(Int64 frameType, Int32 payloadLength) in /home/rzikm/dotnet-runtime/src/libraries/Common/tests/System/Net/Http/Http3LoopbackStream.cs:line 162
   at System.Net.Test.Common.Http3LoopbackStream.SendFrameAsync(Int64 frameType, ReadOnlyMemory`1 framePayload) in /home/rzikm/dotnet-runtime/src/libraries/Common/tests/System/Net/Http/Http3LoopbackStream.cs:line 167
   at System.Net.Test.Common.Http3LoopbackStream.SendGoAwayFrameAsync(Int64 firstInvalidStreamId) in /home/rzikm/dotnet-runtime/src/libraries/Common/tests/System/Net/Http/Http3LoopbackStream.cs:line 150
   at System.Net.Test.Common.Http3LoopbackConnection.ShutdownAsync(Boolean failCurrentRequest) in /home/rzikm/dotnet-runtime/src/libraries/Common/tests/System/Net/Http/Http3LoopbackConnection.cs:line 252
   at System.Net.Test.Common.Http3LoopbackServer.AcceptConnectionAsync(Func`2 funcAsync) in /home/rzikm/dotnet-runtime/src/libraries/Common/tests/System/Net/Http/Http3LoopbackServer.cs:line 73
   at System.Net.Http.Functional.Tests.ResponseStreamZeroByteReadTestBase.<>c__DisplayClass1_0.<<ZeroByteRead_BlocksUntilDataIsAvailable>b__1>d.MoveNext() in /home/rzikm/dotnet-runtime/src/libraries/System.Net.Http/tests/FunctionalTests/ResponseStreamZeroByteReadTests.cs:line 285
--- End of stack trace from previous location ---
   at System.Threading.Tasks.TaskTimeoutExtensions.GetRealException(Task task) in /home/rzikm/dotnet-runtime/src/libraries/Common/tests/System/Threading/Tasks/TaskTimeoutExtensions.cs:line 120
--- End of stack trace from previous location ---
   at System.Threading.Tasks.TaskTimeoutExtensions.WhenAllOrAnyFailed(Task[] tasks) in /home/rzikm/dotnet-runtime/src/libraries/Common/tests/System/Threading/Tasks/TaskTimeoutExtensions.cs:line 90
   at System.Net.Test.Common.LoopbackServerFactory.<>c__DisplayClass5_0.<<CreateClientAndServerAsync>b__0>d.MoveNext() in /home/rzikm/dotnet-runtime/src/libraries/Common/tests/System/Net/Http/GenericLoopbackServer.cs:line 38
--- End of stack trace from previous location ---
   at System.Net.Test.Common.Http3LoopbackServerFactory.CreateServerAsync(Func`3 funcAsync, Int32 millisecondsTimeout, GenericLoopbackOptions options) in /home/rzikm/dotnet-runtime/src/libraries/Common/tests/System/Net/Http/Http3LoopbackServer.cs:line 104
   at System.Net.Http.Functional.Tests.ResponseStreamZeroByteReadTestBase.ZeroByteRead_BlocksUntilDataIsAvailable(Boolean async) in /home/rzikm/dotnet-runtime/src/libraries/System.Net.Http/tests/FunctionalTests/ResponseStreamZeroByteReadTests.cs:line 244
--- End of stack trace from previous location ---

It seems there is a data race that makes us throw QuicOperationAbortedException instead of QuicConnectionAbortedException. And thus indicate, that the connection was closed locally (while it was actually closed remotely with an error code).

The data race seems to come from the way MsQuic is implemented. It may be that we rely on behavior that MsQuic does not actually guarantee.

  • MsQuic thread starts processing incoming CONNECTION_CLOSE frame, sets Connection->State.ClosedRemotely = TRUE, but does not indicate the event to the application yet
  • Application thread attempts to send data on a stream, MsQuicStreamSend checks Connection->State.ClosedRemotely and returns QUIC_STATUS_ABORTED
  • Application thread on the caller checks if we received an error code in QUIC_CONNECTION_EVENT_SHUTDOWN_INITIATED_BY_PEER or QUIC_CONNECTION_EVENT_SHUTDOWN_INITIATED_BY_TRANSPORT. Since none of these events were fired (yet), we interpret the status as "connection was closed locally" -> System.Net.Quic.QuicOperationAbortedException : Operation aborted.
  • MsQuic thread continues and indicates the actual error code to the application.
@ghost ghost added the untriaged New issue has not been triaged by the area owner label May 18, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Net.Quic and removed untriaged New issue has not been triaged by the area owner labels May 18, 2022
@ghost
Copy link

ghost commented May 18, 2022

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Observed when stressing functional tests:

[xUnit.net 00:00:03.59]     System.Net.Http.Functional.Tests.Http3ResponseStreamZeroByteReadTest_MsQuic.ZeroByteRead_BlocksUntilDataIsAvailable(async: True) [FAIL]
  Failed System.Net.Http.Functional.Tests.Http3ResponseStreamZeroByteReadTest_MsQuic.ZeroByteRead_BlocksUntilDataIsAvailable(async: True) [1 s]
  Error Message:
   System.Net.Quic.QuicOperationAbortedException : Operation aborted.
  Stack Trace:
     at System.Net.Quic.Implementations.MsQuic.MsQuicStream.SendReadOnlyMemoryAsync(ReadOnlyMemory`1 buffer, QUIC_SEND_FLAGS flags) in /home/rzikm/dotnet-runtime/src/libraries/System.Net.Quic/src/System/Net/Quic/Implementations/MsQuic/MsQuicStream.cs:line 1443
   at System.Net.Quic.Implementations.MsQuic.MsQuicStream.WriteAsync(ReadOnlyMemory`1 buffer, Boolean endStream, CancellationToken cancellationToken) in /home/rzikm/dotnet-runtime/src/libraries/System.Net.Quic/src/System/Net/Quic/Implementations/MsQuic/MsQuicStream.cs:line 309
   at System.Net.Test.Common.Http3LoopbackStream.SendFrameHeaderAsync(Int64 frameType, Int32 payloadLength) in /home/rzikm/dotnet-runtime/src/libraries/Common/tests/System/Net/Http/Http3LoopbackStream.cs:line 162
   at System.Net.Test.Common.Http3LoopbackStream.SendFrameAsync(Int64 frameType, ReadOnlyMemory`1 framePayload) in /home/rzikm/dotnet-runtime/src/libraries/Common/tests/System/Net/Http/Http3LoopbackStream.cs:line 167
   at System.Net.Test.Common.Http3LoopbackStream.SendGoAwayFrameAsync(Int64 firstInvalidStreamId) in /home/rzikm/dotnet-runtime/src/libraries/Common/tests/System/Net/Http/Http3LoopbackStream.cs:line 150
   at System.Net.Test.Common.Http3LoopbackConnection.ShutdownAsync(Boolean failCurrentRequest) in /home/rzikm/dotnet-runtime/src/libraries/Common/tests/System/Net/Http/Http3LoopbackConnection.cs:line 252
   at System.Net.Test.Common.Http3LoopbackServer.AcceptConnectionAsync(Func`2 funcAsync) in /home/rzikm/dotnet-runtime/src/libraries/Common/tests/System/Net/Http/Http3LoopbackServer.cs:line 73
   at System.Net.Http.Functional.Tests.ResponseStreamZeroByteReadTestBase.<>c__DisplayClass1_0.<<ZeroByteRead_BlocksUntilDataIsAvailable>b__1>d.MoveNext() in /home/rzikm/dotnet-runtime/src/libraries/System.Net.Http/tests/FunctionalTests/ResponseStreamZeroByteReadTests.cs:line 285
--- End of stack trace from previous location ---
   at System.Threading.Tasks.TaskTimeoutExtensions.GetRealException(Task task) in /home/rzikm/dotnet-runtime/src/libraries/Common/tests/System/Threading/Tasks/TaskTimeoutExtensions.cs:line 120
--- End of stack trace from previous location ---
   at System.Threading.Tasks.TaskTimeoutExtensions.WhenAllOrAnyFailed(Task[] tasks) in /home/rzikm/dotnet-runtime/src/libraries/Common/tests/System/Threading/Tasks/TaskTimeoutExtensions.cs:line 90
   at System.Net.Test.Common.LoopbackServerFactory.<>c__DisplayClass5_0.<<CreateClientAndServerAsync>b__0>d.MoveNext() in /home/rzikm/dotnet-runtime/src/libraries/Common/tests/System/Net/Http/GenericLoopbackServer.cs:line 38
--- End of stack trace from previous location ---
   at System.Net.Test.Common.Http3LoopbackServerFactory.CreateServerAsync(Func`3 funcAsync, Int32 millisecondsTimeout, GenericLoopbackOptions options) in /home/rzikm/dotnet-runtime/src/libraries/Common/tests/System/Net/Http/Http3LoopbackServer.cs:line 104
   at System.Net.Http.Functional.Tests.ResponseStreamZeroByteReadTestBase.ZeroByteRead_BlocksUntilDataIsAvailable(Boolean async) in /home/rzikm/dotnet-runtime/src/libraries/System.Net.Http/tests/FunctionalTests/ResponseStreamZeroByteReadTests.cs:line 244
--- End of stack trace from previous location ---

It seems there is a data race that makes us throw QuicOperationAbortedException instead of QuicConnectionAbortedException. And thus indicate, that the connection was closed locally (while it was actually closed remotely with an error code).

The data race seems to come from the way MsQuic is implemented. It may be that we rely on behavior that MsQuic does not actually guarantee.

  • MsQuic thread starts processing incoming CONNECTION_CLOSE frame, sets Connection->State.ClosedRemotely = TRUE, but does not indicate the event to the application yet
  • Application thread attempts to send data on a stream, MsQuicStreamSend checks Connection->State.ClosedRemotely and returns QUIC_STATUS_ABORTED
  • Application thread on the caller checks if we received an error code in QUIC_CONNECTION_EVENT_SHUTDOWN_INITIATED_BY_PEER or QUIC_CONNECTION_EVENT_SHUTDOWN_INITIATED_BY_TRANSPORT. Since none of these events were fired (yet), we assume the connection was closed locally -> System.Net.Quic.QuicOperationAbortedException : Operation aborted.
  • MsQuic thread continues and indicates the actual error code to the application.
Author: rzikm
Assignees: -
Labels:

area-System.Net.Quic

Milestone: -

@rzikm
Copy link
Member Author

rzikm commented May 18, 2022

@nibanks Does/should MsQuic guarantee that if MsQuicStreamSend returns QUIC_STATUS_ABORTED due to the connection being closed by the peer, the relevant connection event has already been signaled?

Edit: same goes for the STOP_SENDING frame, there is a similar race there when peer aborts the read side of the stream.

@nibanks
Copy link

nibanks commented May 18, 2022

No, I don't believe so.

@ManickaP ManickaP added this to the 7.0.0 milestone May 18, 2022
@rzikm
Copy link
Member Author

rzikm commented May 18, 2022

Is this something you could consider fixing? I think the only way to fix this from our side is to wait for the event (either by spinning the thread or using the async programming model). While doable, it is not going to be pretty.

@nibanks
Copy link

nibanks commented May 18, 2022

Is this something you could consider fixing?

Please open an issue on MsQuic with all the details and we can take a look.

@CarnaViire
Copy link
Member

There's also an option of sending connection's abort code alongside the ABORTED result in the stream event @nibanks

@nibanks
Copy link

nibanks commented May 18, 2022

So is the problem here that if you get an aborted error from calling StreamSend, you want to differentiate "connection aborted" and "stream aborted"?

@CarnaViire
Copy link
Member

We mostly intended to differentiate user-side connection abort from peer-side connection abort. The check was "if there's an error code already set to the connection, then the connection must be aborted by the peer", but the race here ended up not setting the connection's error code in time. So even if we know from somewhere it's peer side abort, we don't have anywhere to get the error code from.

With "connection aborted" vs "stream aborted" it is also interesting, I think we might have a bug where we believe ABORTED always means connection abort?...

@nibanks
Copy link

nibanks commented May 18, 2022

Aborted can definitely mean either Connection or Stream aborted.

@rzikm
Copy link
Member Author

rzikm commented May 18, 2022

There's also an option of sending connection's abort code alongside the ABORTED result in the stream event

The order of the events in MsQuic is such that the connection-wide error is signaled first, so by the time stream shutdown notification arrives, we already have the connection error code. The ABORTED result I am speaking of is the direct return value from MsQuicStreamSend, not from a stream event.

So is the problem here that if you get an aborted error from calling StreamSend, you want to differentiate "connection aborted" and "stream aborted"?

Yes, and we want to include the error code in the exception. We can already differentiate this (although, as @CarnaViire noted, we currently don't do that due to an oversight during implementation). It is only the race between Connection->State.ClosedRemotely = TRUE and indicating the event with the error code that is the problem (and similarly for aborting the stream). I will create the MsQuic issue with details.

@rzikm
Copy link
Member Author

rzikm commented May 18, 2022

Filed microsoft/msquic#2727

@rzikm rzikm added the tracking-external-issue The issue is caused by external problem (e.g. OS) - nothing we can do to fix it directly label May 24, 2022
@karelz karelz removed the tracking-external-issue The issue is caused by external problem (e.g. OS) - nothing we can do to fix it directly label Jun 14, 2022
@karelz
Copy link
Member

karelz commented Jun 14, 2022

Triage: This will be solved by moving to new API design.

@karelz karelz assigned ManickaP and unassigned rzikm Jun 14, 2022
@ManickaP
Copy link
Member

Closing as covered by #69675 as we'll expose WritesCompleted which we'll leverage in case we get an error from msquic Send operation.

@ghost ghost locked as resolved and limited conversation to collaborators Jul 14, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants