
[HTTP Connection Pool] Lack of timeout on SSL connection establishment caused high number of pending connections in pool #110598

Open
hongrui-zhang opened this issue Dec 11, 2024 · 13 comments · Fixed by #110744
Labels: area-System.Net.Http, bug, in-pr (there is an active PR which will close this issue when it is merged)

@hongrui-zhang

Description

Our service uses HttpClient to send requests to downstream services, and we observed the following:

  1. [expected] During a network outage on the infrastructure that hosts the destination of the requests, a lot of requests failed with timeouts/failures.
  2. [unexpected] After the network outage was resolved, the sender still experienced those timeouts/failures. This only resolved after the machine hosting the request sender was restarted.

We took a memory dump and, based on what we found, formed a hypothesis that explains the above. We would like the .NET team to check whether the hypothesis is reasonable.

Observations from the dump

  1. The HttpConnectionPool that serves the destination has 88 associated connections, all of them pending, which implies that connection establishment is hanging.
  2. Counting AsyncTaskMethodBuilder instances for the various methods on the heap suggests that SSL connection establishment is the culprit, not the TCP connection:
Count Total Size Class Name
95 19,760 System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.ValueTuple<System.IO.Stream, System.Net.TransportContext, System.Net.IPEndPoint>>+AsyncStateMachineBox<System.Net.Http.HttpConnectionPool+<ConnectAsync>d__103>
7 1,344 System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.IO.Stream>+AsyncStateMachineBox<System.Net.Http.HttpConnectionPool+<ConnectToTcpHostAsync>d__104>
95 14,440 System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.Net.Security.SslStream>+AsyncStateMachineBox<System.Net.Http.ConnectHelper+<EstablishSslConnectionAsync>d__2>
95 19,760 System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.Threading.Tasks.VoidTaskResult>+AsyncStateMachineBox<System.Net.Security.SslStream+<ForceAuthenticationAsync>d__150<System.Net.Security.AsyncReadWriteAdapter>>
95 14,440 System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.Int32>+AsyncStateMachineBox<System.Net.Security.SslStream+<ReceiveHandshakeFrameAsync>d__151<System.Net.Security.AsyncReadWriteAdapter>>

Analysis
Upon checking the code in HttpConnectionPool, it seems that, under the default settings, there is no cancellation for ConnectToTcpHostAsync and EstablishSslConnectionAsync (a cancellation token is passed, but the ConnectTimeout behind it is InfiniteTimeSpan). That is somewhat acceptable for the TCP connection, since the OS enforces its own connect timeout, but I am not aware of any OS-level timeout for the SSL handshake. With neither an OS-level nor an application-level timeout, SSL connection establishment can hang indefinitely.

Hypothesis

  1. During the network outage, the HttpConnectionPool started to get contaminated with connections that hang in the SSL connection phase.
  2. With PooledConnectionLifetime set in our application, healthy connections died off as their lifetime expired, so there were fewer and fewer healthy connections in the pool. Pending connections do not seem to honor PooledConnectionLifetime.
  3. Even after the network outage was resolved, the pending connections were still hanging, counting towards _pendingHttp11ConnectionCount in the connection pool. A high _pendingHttp11ConnectionCount makes it harder to start new connections (the connection pool only starts a new connection if the request queue length is larger than _pendingHttp11ConnectionCount).
  4. The connection pool ended up with no working connections (as the dump showed) and a lot of pending connections, which explains the failures and timeouts we saw.

Asks to .NET team

  1. Is the above hypothesis reasonable? (e.g., is there indeed no OS-level timeout for SSL connection establishment, so that it could theoretically hang indefinitely?)
  2. We are planning to set ConnectTimeout to some concrete value (e.g., 30 seconds); see the sketch after this list. Are there any concerns/things to think about regarding that?
  3. What is the reason for defaulting this value to an infinite time span? What considerations did the .NET team have?
  4. I cannot share the dump per our security protocol, but I am happy to run diagnostics on it if necessary.
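
For reference, a minimal sketch of the configuration we are considering (the 30-second value and the client setup are our own illustration, not guidance from the .NET team):

var handler = new SocketsHttpHandler
{
    // Bound connection establishment (TCP connect + TLS handshake) so a stuck
    // attempt is cancelled instead of pending in the pool indefinitely.
    ConnectTimeout = TimeSpan.FromSeconds(30),
};
var client = new HttpClient(handler);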

Reproduction Steps

No manual repro. As described above, our service sees timeouts/failures when sending HTTP requests, even after the network outage on the infrastructure that hosts the destination of the requests is resolved.

Expected behavior

HttpClient should be able to send requests to the destination after the network outage impacting the destination is resolved, without the sender application/machine being restarted.

Actual behavior

HttpClient reports failures and timeouts when sending requests to the destination, even after the network outage impacting the destination is resolved. The issue is only fixed by restarting the application/machine.

Regression?

n/a

Known Workarounds

No response

Configuration

n/a

Other information

n/a

@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Dec 11, 2024
@MihaZupan (Member) commented Dec 11, 2024

Your analysis looks correct.

Are you running on .NET 6? We are aware of this issue, and .NET 7 (and newer) includes a fix (#71785) that will cancel such stuck connection attempts even if you haven't set a ConnectTimeout.
On .NET 6, setting a ConnectTimeout would similarly resolve the issue.

We've considered changing the default to a non-infinite value in the past, but backed the change out due to concerns about breaking existing scenarios that expect a specific exception type (#66297 (comment)). It's also less important now given the aforementioned change.

@hongrui-zhang (Author)

Hello @MihaZupan, thanks for your response!

We are actually using .NET 8, so it sounds like it should be fixed. Reading through the fix you referenced in your response, I think I understand how the cancellation token is cancelled when the originating request is served/cancelled, but would you mind pointing out how _pendingHttp11ConnectionCount is decremented when that happens? I can't seem to find it.

@MihaZupan (Member)

Yes, this shouldn't be happening on .NET 8.
Are you using ConnectCallback to replace/wrap the network stream by any chance?

would you mind pointing out how _pendingHttp11ConnectionCount is decremented when that happens?

If a connection attempt fails, it'll call into HandleHttp11ConnectionFailure, which does the decrement here.

@wfurt (Member) commented Dec 11, 2024

Note that the TCP behavior applies to TLS as well, since SslStream depends on the underlying NetworkStream and Socket. On Windows, the OS connect timeout is somewhat long. Depending on your use case, setting some explicit timeout seems reasonable.

@hongrui-zhang (Author)

We are not using ConnectCallback.

Any chance that the cancellationToken is not fully honored by EstablishSslConnectionAsync and the methods it calls? I've read a few posts about network stream calls hanging even with the cancellationToken cancelled, for example this Stack Overflow post.

@hongrui-zhang (Author)

@MihaZupan @wfurt, any insights on the question above? We are concerned about whether the cancellationToken is fully honored in EstablishSslConnectionAsync and all the method calls it generates. If it's not fully honored (e.g., if the token is only honored opportunistically, checked only at the beginning of a method), then there is no guarantee that the task will throw/finish when the token is cancelled. In that case, both the fix referenced by @MihaZupan above and the recommendation to set ConnectTimeout would not help, since both rely on the cancellationToken.

@MihaZupan (Member)

Cancellation should be honored at any point during the call; I'm not aware of cases where it wouldn't be in current versions. @antonfirsov @liveans does that ring any bells?

@wfurt (Member) commented Dec 12, 2024

I agree with @MihaZupan. That article is 12 years old and the behavior has improved a lot since then. The problematic cases I've seen recently are related to a firewall or load balancer cutting communication in the middle. The TCP stack never receives a FIN or RST, so it sees the connection as established and Read "hangs" for a long time. One way to improve detection of such conditions is enabling TCP keep-alive.
The difficult case I'm aware of is when the server (or a fronting load balancer) fails to participate in the TLS handshake but keeps the TCP connection open. In that case, SslStream will also read indefinitely, as it defers IO to the underlying stream and has no timeout mechanism of its own.

And BTW, what platform are you running on? You may want to collect some packet captures as well as enable internal tracing.
And there could still be a bug in the connection pool logic, as it is pretty complicated.

@antonfirsov (Member) commented Dec 13, 2024

Cancellation should be honored at any point during the call, I'm not aware of cases where it wouldn't be in current versions.

I'm not aware of anything besides DNS on Linux, but the discussion seems to point to TLS.

@hongrui-zhang (Author) commented Dec 13, 2024

@wfurt we are on Windows Server.

I am working with @MihaZupan offline to securely transfer the dump file for the .NET team to analyze further. In the meantime, our team is considering moving to HttpClientFactory, which, from our understanding, periodically tears down handlers and their underlying connection pools.

Out of curiosity, @wfurt, how does one set TCP keep-alive from HttpClient/SocketsHttpHandler? I was not able to find such an option exposed at the HttpClient/SocketsHttpHandler level.

@MihaZupan (Member)

Yes, HttpClientFactory will rotate handlers for you before .NET 9.0. Starting with 9.0, it just relies on SocketsHttpHandler.PooledConnectionLifetime instead.
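
For illustration, a minimal sketch of the PooledConnectionLifetime approach (the "downstream" client name and 5-minute lifetime are hypothetical; this assumes the Microsoft.Extensions.Http package and an IServiceCollection named services):

services.AddHttpClient("downstream")
    .ConfigurePrimaryHttpMessageHandler(() => new SocketsHttpHandler
    {
        // Recycle pooled connections periodically so stale/dead connections
        // are replaced without rotating the whole handler.
        PooledConnectionLifetime = TimeSpan.FromMinutes(5),
    });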

You can set TCP keep-alive options in the ConnectCallback:

var handler = new SocketsHttpHandler
{
    ConnectCallback = async (context, cancellationToken) =>
    {
        var socket = new Socket(SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
        try
        {
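            // Enable TCP keep-alive probes. TcpKeepAliveTime and
            // TcpKeepAliveInterval are in seconds: start probing after 60s of
            // inactivity, then probe every second.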
            socket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive, true);
            socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveTime, 60);
            socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveInterval, 1);

            await socket.ConnectAsync(context.DnsEndPoint, cancellationToken);
            return new NetworkStream(socket, ownsSocket: true);
        }
        catch
        {
            socket.Dispose();
            throw;
        }
    }
};

@MihaZupan MihaZupan removed the untriaged New issue has not been triaged by the area owner label Dec 13, 2024
@MihaZupan MihaZupan added this to the 10.0.0 milestone Dec 13, 2024
@MihaZupan MihaZupan self-assigned this Dec 16, 2024
@MihaZupan MihaZupan added the bug label Dec 16, 2024
@MihaZupan (Member)

There's a race condition here: if the initiating request completes right away, before we're able to set the CTS in InjectNewHttp11ConnectionAsync, we'll never set the timeout.

await Task.CompletedTask.ConfigureAwait(ConfigureAwaitOptions.ForceYielding);
HttpConnectionWaiter<HttpConnection> waiter = queueItem.Waiter;
HttpConnection? connection = null;
Exception? connectionException = null;
CancellationTokenSource cts = GetConnectTimeoutCancellationTokenSource();
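// Race (per the note above): if the initiating request completes before the
// assignment below runs, the connect timeout is never set for this attempt.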
waiter.ConnectionCancellationTokenSource = cts;

It's easy to reproduce if you add a delay between the Yield and assigning the cts, while failing the request.

@MihaZupan MihaZupan reopened this Dec 18, 2024