
Http2StreamException ENHANCE_YOUR_CALM during bidirectional streaming stress test after ca. 200 server calls #2010

Closed
osexpert opened this issue Jan 5, 2023 · 21 comments
Labels
bug Something isn't working

Comments

@osexpert

osexpert commented Jan 5, 2023

Minimal code to repro here: https://github.com/osexpert/GrpcDotNetProblem
Cannot always reproduce in a debug run via Visual Studio. Can most often reproduce when running the debug build from the command line. Fails the same way on .NET Core 3.0, .NET 5.0 and .NET 6.0.

Removing the logging by calling:
builder.Logging.ClearProviders();
can sometimes be needed, because otherwise the problem may not happen (the logging can slow things down enough that it still accidentally works).

The error you should get:
Message from server: Via server : hello 200
Unhandled exception. Grpc.Core.RpcException: Status(StatusCode="ResourceExhausted", Detail="Error starting gRPC call. HttpRequestException: An error occurred while sending the request. IOException: The request was aborted. Http2StreamException: The HTTP/2 server reset the stream. HTTP/2 error code 'ENHANCE_YOUR_CALM' (0xb).", DebugException="System.Net.Http.HttpRequestException: An error occurred while sending the request.
---> System.IO.IOException: The request was aborted.
---> System.Net.Http.Http2StreamException: The HTTP/2 server reset the stream. HTTP/2 error code 'ENHANCE_YOUR_CALM' (0xb).
--- End of inner exception stack trace ---
at System.Net.Http.Http2Connection.ThrowRequestAborted(Exception innerException)
at System.Net.Http.Http2Connection.Http2Stream.CheckResponseBodyState()
at System.Net.Http.Http2Connection.Http2Stream.TryEnsureHeaders()
at System.Net.Http.Http2Connection.Http2Stream.ReadResponseHeadersAsync(CancellationToken cancellationToken)
at System.Net.Http.Http2Connection.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
--- End of inner exception stack trace ---
at System.Net.Http.Http2Connection.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at Grpc.Net.Client.Balancer.Internal.BalancerHttpHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
at Grpc.Net.Client.Internal.GrpcCall`2.RunCall(HttpRequestMessage request, Nullable`1 timeout)")
at Grpc.Net.Client.Internal.HttpContentClientStreamReader`2.MoveNextCore(CancellationToken cancellationToken)
at GrpcClient.Program.Main(String[] args) in E:\git\GrpcDotNetProblem\GrpcClient\Program.cs:line 24
at GrpcClient.Program.<Main>(String[] args)

When it happens in a debug run via Visual Studio I can see this in the Parallel Stacks viewer:
1 Process, 200 Async Logical Stack, Scheduled, Http2OutputProducer.ProcessDataWrites()

I am guessing it is related to bidirectional streaming somehow.

I also see this in the log after every method call returns, but I am not sure if it is related to the ENHANCE_YOUR_CALM problem:
info: Microsoft.AspNetCore.Server.Kestrel[32]
Connection id "{someId}", Request id "{someId}": the application completed without reading the entire request body.

The native gRPC libraries do not have this problem; they can run forever.
I cannot switch from the native gRPC libraries to grpc-dotnet the way it currently works.

@osexpert osexpert added the bug Something isn't working label Jan 5, 2023
@osexpert
Author

osexpert commented Jan 5, 2023

From what I can deduce, it seems grpc-dotnet depends on the client calling RequestStream.CompleteAsync() before the server call returns? It is pretty weird to rely on the client to tell the server to free its resources before the server call returns. If the server call returns without being told by the client that it should free its resources, it does not free them (at least not immediately). The server knows when the call returns (better than anyone?), and it could just free the resources immediately then, without relying on the client calling RequestStream.CompleteAsync(), I would think?

As it works now, it may be possible to use this behavior to craft a DoS attack?

@JamesNK
Member

JamesNK commented Jan 6, 2023

I'm pretty sure this is an expected error.

What you're doing is creating a lot of gRPC calls that aren't gracefully completed. Read more here: https://learn.microsoft.com/en-us/aspnet/core/grpc/client?view=aspnetcore-7.0#bi-directional-streaming-call

Because the server is ending the request before the client, it has to keep some state about the request active for a time. This is required because it is a connection error for an HTTP/2 connection to send data to an unknown stream ID. Kestrel keeps up to 200-ish streams active in the background, eventually cleaning them up after a grace period to allow incoming data that might still be in flight to be drained.

If the server exceeds that limit, which is happening in your stress test, the server decides the connection is being abused and closes it with ENHANCE_YOUR_CALM. That signals to the client that it is doing something wrong and should change how it makes HTTP requests.
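
For reference, a minimal sketch of a gracefully completed bi-directional call as described in the linked article, assuming a bidi Echo method on the Greeter service from the repro (the service, method and message names here are assumptions; adjust them to the actual proto):

using System;
using Grpc.Core;
using Grpc.Net.Client;

using var channel = GrpcChannel.ForAddress("http://localhost:5000");
var client = new Greeter.GreeterClient(channel);

using var call = client.Echo();

await call.RequestStream.WriteAsync(new EchoRequest { Message = "hello" });

// Tell the server no more messages are coming. Without this, the server method
// exits first and Kestrel has to reset the stream instead of closing it gracefully.
await call.RequestStream.CompleteAsync();

// Drain the response stream until the server ends it.
await foreach (var response in call.ResponseStream.ReadAllAsync())
{
    Console.WriteLine(response.Message);
}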

@osexpert
Author

osexpert commented Jan 6, 2023

Yes, if it was not for the native server working correctly, I would have bought that story.
Edit: but why can't the server simply close all streams when the call returns? The current behavior seems more complicated; e.g. I would have had to invest a lot more work to make it work like it does now than to have it work the most obvious way.
Edit 2: in the linked article: "A bi-directional call completes gracefully when the server has finished reading the request stream and the client has finished reading the response stream". In my example, I do this. Edit 3: but I do not do this: "Notifies the server it has finished sending messages with RequestStream.CompleteAsync()".
Even so, when aiming to replace the native server, the replacement should be at least as good as what it replaces, and that is not the case here.

@JamesNK
Member

JamesNK commented Jan 6, 2023

but why can't the server simply close all streams when the call returns?

The server must remember the stream until the client indicates it is no longer sending data, because it is an HTTP/2 connection protocol error to receive data for an unknown or closed stream ID.

In my example, I do this.

You don't. The server method completes gracefully when it exits after the request stream's MoveNext returns false. Not only is this the cause of this error, but you'll also get worse performance: the state of gracefully completed streams is reused, while reset streams are thrown away.

If you turn on logging, I expect you'll see lots of log messages saying "the application completed without reading the entire request body." That means the request wasn't completed gracefully.

Even so, when aiming to replace the native server, the replacement should be at least as good as what it replaces. And this is not the case here.

I don't know how the native server is implemented, but a server can't keep tracking an unlimited number of ungracefully closed (i.e. reset) streams. If it did, that would be an attack vector that a malicious client could use to use up all the memory on the server. Or maybe the native server doesn't track them and is ok with receiving DATA frames for unknown or closed stream IDs.
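
A sketch of the server side of the same graceful pattern, again assuming the Greeter/Echo names from the repro: the handler keeps reading until the request stream ends (i.e. until the client has called RequestStream.CompleteAsync()) and only then returns.

using System.Threading.Tasks;
using Grpc.Core;

public class GreeterService : Greeter.GreeterBase
{
    public override async Task Echo(
        IAsyncStreamReader<EchoRequest> requestStream,
        IServerStreamWriter<EchoReply> responseStream,
        ServerCallContext context)
    {
        // ReadAllAsync completes when MoveNext returns false, which only happens
        // after the client has completed its request stream.
        await foreach (var request in requestStream.ReadAllAsync())
        {
            await responseStream.WriteAsync(new EchoReply { Message = "Via server : " + request.Message });
        }
        // Returning here, after the request stream has ended, lets the stream
        // close gracefully instead of being reset.
    }
}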

@osexpert
Author

osexpert commented Jan 6, 2023

But why should the server track a stream after the server method has returned? In case the client suddenly makes a call/write to it? But why? The server method has returned and there is no one there to read from the stream anyway, so why keep it open? And how long after the server method has returned does the server expect a client to write to this stream that no one will be reading from anyway? 1 second? 1 hour? I don't get it. If a client happened to write to a stream after the stream is closed, this would of course fail, but that would in no way exhaust the server's resources like it does now. It would simply be a reply from the server: sorry, that stream does not exist.

@osexpert
Author

osexpert commented Jan 6, 2023

But since this is a radical difference between the native server and grpc-dotnet, maybe it is possible to contact someone at grpc.io to ensure both servers work the same (e.g. either have native gRPC fail the same way or make grpc-dotnet not fail).
Edit: it would also be interesting to test gRPC servers written in other languages and check how they handle it.

@osexpert
Author

osexpert commented Jan 6, 2023

"then that's an attack vector that a malicious client could use to use up all the memory on the server."
I tested a native client/server with 100k echo calls where in variant 1 the server just returns, and in variant 2 the server replies to the client that it should call RequestStream.CompleteAsync(). In both cases the memory use is the same and the time used is the same. And the memory use is stable on both client and server.

So for the native implementation, calling RequestStream.CompleteAsync() is purely optional. Maybe this call only exists so that the client can end the stream before the server returns, as a micro-optimization, or something.

So yes, I see that it is possible to work around this problem by having the server tell the client to call RequestStream.CompleteAsync(). Even so, I think this is too big a difference between native and grpc-dotnet to just ignore.

@JamesNK
Member

JamesNK commented Jan 6, 2023

You didn't consider the possibility that the native gRPC server doesn't track reset streams and silently ignores frames sent to non-open streams. I'm not familiar with the native gRPC's HTTP/2 layer, but it seems to me that it is ignoring DATA frames when it can't match them to a stream.

Kestrel's priorities are security and adherence to HTTP specifications. Those priorities are causing you to get an error, but it's in a stress test that isn't gracefully completing streams. Kestrel isn't going to be changed to reduce security or ignore the HTTP/2 spec for this situation.

@osexpert
Author

osexpert commented Jan 6, 2023

But maybe there could be a setting (opt-in), or something that can be done via ServerCallContext.GetHttpContext(), to make it close the stream immediately so it can behave the same?

@osexpert
Author

osexpert commented Jan 7, 2023

I tried to read the HTTP/2 spec.
As I understand it, either end can send an END_STREAM flag. The stream then becomes half-closed while waiting for the other end to send an END_STREAM flag, after which the stream is closed. While it is half-closed it needs to be tracked until it times out (if the END_STREAM acknowledgement never comes), and this is where the stream gets stuck in this case?

But when the server call returns, it should send an END_STREAM flag to the client and initiate the close.
The client's ResponseStream.MoveNext() would then return false? The client then calls RequestStream.CompleteAsync() (or indirectly via disposing the call), the client sends an END_STREAM flag to the server, and the stream is closed.

I don't see why this should not work with a server-initiated close. The reason the stream ends up in this 200-streams list is that it is waiting for the END_STREAM flag from the client? Maybe there is some logic missing in the client's ResponseStream.MoveNext() call, so it is not returning false? I don't know :-)

"Because the server is ending the request before the client, it has to keep some state about the request active for a time. This is required because it is a connection error for an HTTP/2 connection to send data to an unknown stream ID. Kestrel keeps up to 200-ish streams active in the background, eventually cleaning them up after a grace period to allow incoming data that might still be in flight to be drained."
As I understand it, this is the state where the stream is half-closed and waiting for an END_STREAM reply? Yes, that makes sense. And it would make sense that the 200-streams list would fill up if the client died before acknowledging the END_STREAM the server sent. But that is not the case here; the client is alive and well and not crashing.

I will try to test some more with the native client <-> dotnet server and vice versa to see if this changes anything. Until now I have only tested native against native and dotnet against dotnet.

Found a related issue: https://stackoverflow.com/questions/67806047/best-way-to-return-bidirectional-grpc-method-net-5-c-sharp

@osexpert
Author

osexpert commented Jan 7, 2023

Native client -> dotnet server gives a different error at almost the same place:
Message from server: Via server : hello 220
Unhandled exception. Grpc.Core.RpcException: Status(StatusCode="ResourceExhausted", Detail="Received RST_STREAM with error code 11", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1673098686.020000000","description":"Error received from peer ipv6:[::1]:5000","file":"......\src\core\lib\surface\call.cc","file_line":953,"grpc_message":"Received RST_STREAM with error code 11","grpc_status":8}")
at Grpc.Core.Internal.ClientResponseStream`2.MoveNext(CancellationToken token)
at GrpcNativeClient.Program.Main(String[] args) in E:\git\GrpcDotNetProblem\GrpcNativeClient\Program.cs:line 24
at GrpcNativeClient.Program.<Main>(String[] args)

@JamesNK
Member

JamesNK commented Jan 7, 2023

When the server completes a response without completely reading the request it sends RST_STREAM to the client, not END_STREAM. It doesn't enter a half-closed state, it goes immediately to close. The client state also goes immediately to close and doesn't send any more frames to the server.

@osexpert
Author

osexpert commented Jan 7, 2023

I tried a ninja hack, for fun. In the server handler I added this at the end:

var ctx = context.GetHttpContext();
var s = ctx.Features.Get<IHttp2StreamIdFeature>();
s.GetType().GetMethod("OnEndStreamReceived", System.Reflection.BindingFlags.Public | System.Reflection.BindingFlags.Instance).Invoke(s, null);

This works well with the native client. Fast, no handled exceptions in the client (such as cancellations). Tested up to 300k echoes, no issues.
With the dotnet client it does not work so well; it generates a System.Threading.Tasks.TaskCanceledException per call at this point: while (await streaming.ResponseStream.MoveNext()). It also seems to hang/stop after a while at different points. It can be after 180k echoes or after 24k echoes. When I attach a debugger to the client/server, nothing is really happening; it seems like it just "burnt out". The server does not accept new gRPC connections at this point (if I start a new client). It is just "dead". This is also not very comforting.
Edit: my mistake! The server does accept new gRPC connections!

@osexpert
Author

osexpert commented Jan 8, 2023

I updated the GitHub example, added a native client, and added what I think is supposed to be the correct workaround for this problem (the server writes to the client that the client should "hang up", i.e. call RequestStream.CompleteAsync(), before the server returns).

This seems to work for the dotnet client, but the native client still fails with this after approx. 239 echoes:
Grpc.Core.RpcException: 'Status(StatusCode="ResourceExhausted", Detail="Received RST_STREAM with error code 11", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1673214051.963000000","description":"Error received from peer ipv6:[::1]:5000","file":"......\src\core\lib\surface\call.cc","file_line":953,"grpc_message":"Received RST_STREAM with error code 11","grpc_status":8}")'

So I still feel this is a bug. At least there seems to be no way to make the native client work with the grpc-dotnet server without the server failing (without using ninja hacks). Writing to the client from the server and telling the client to "hang up" seems to work for the dotnet client, but this feels weird and not logical either.
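
For completeness, a sketch of the client side of that workaround, assuming the server sends a sentinel message (here "quit", as in the repro) when it is done, after which the client "hangs up" and drains the rest of the response stream (the Greeter/Echo names are assumptions):

using System;
using System.Threading.Tasks;
using Grpc.Core;

static async Task EchoWithHangup(Greeter.GreeterClient client)
{
    using var call = client.Echo();   // assumed bidi method from the repro
    await call.RequestStream.WriteAsync(new EchoRequest { Message = "hello" });

    while (await call.ResponseStream.MoveNext())
    {
        if (call.ResponseStream.Current.Message == "quit")   // sentinel sent by the server
        {
            // "Hang up": stop sending so the server's read loop can end
            // and the stream can close gracefully.
            await call.RequestStream.CompleteAsync();
        }
        else
        {
            Console.WriteLine(call.ResponseStream.Current.Message);
        }
    }
}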

@osexpert
Author

osexpert commented Jan 17, 2023

Updated with Python client and server examples.

Python client -> dotnet server gives a similar problem. In this case the client does break out in response to "quit", but it still fails (similar to native client -> dotnet server, which also fails even if the client "hangs up" first):

Sending First message1902
Traceback (most recent call last):
  File "E:\git\GrpcDotNetProblem\PythonClient\PythonClient.py", line 75, in <module>
    run()
  File "E:\git\GrpcDotNetProblem\PythonClient\PythonClient.py", line 67, in run
    echo_test_hangup(stub, count)
  File "E:\git\GrpcDotNetProblem\PythonClient\PythonClient.py", line 51, in echo_test_hangup
    for response in responses:
  File "{private}\Roaming\Python\Python39\site-packages\grpc\_channel.py", line 426, in __next__
    return self._next()
  File "{private}\AppData\Roaming\Python\Python39\site-packages\grpc\_channel.py", line 826, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.RESOURCE_EXHAUSTED
        details = "Received RST_STREAM with error code 11"
        debug_error_string = "UNKNOWN:Error received from peer {private?} {created_time:"2023-01-16T23:36:29.063046895+00:00", grpc_status:8, grpc_message:"Received RST_STREAM with error code 11"}"
>

Python example loosely based on RouteChat in https://github.com/grpc/grpc/tree/v1.50.0/examples/python/route_guide

@JamesNK
Member

JamesNK commented Feb 1, 2023

I don't see anything actionable here. Closing because of lack of activity.

@JamesNK JamesNK closed this as completed Feb 1, 2023
@osexpert
Author

osexpert commented Feb 16, 2023

So by your logic it is probably all the other gRPC implementations that are wrong. Even if that is true, which seems strange, interoperability should have been the main goal IMO (it does not help to be "right" if it does not work).

@Libra321

Libra321 commented Jul 7, 2023

Hello! Was it resolved in the end? I had the same problem here.
Here is the error code:

Status(StatusCode="ResourceExhausted", Detail="Error starting gRPC call. HttpRequestException: An error occurred while sending the request. IOException: The request was aborted. Http2StreamException: The HTTP/2 server reset the stream. HTTP/2 error code 'ENHANCE_YOUR_CALM' (0xb).", DebugException="System.Net.Http.HttpRequestException: An error occurred while sending the request.
---> System.IO.IOException: The request was aborted.
---> System.Net.Http.Http2StreamException: The HTTP/2 server reset the stream. HTTP/2 error code 'ENHANCE_YOUR_CALM' (0xb).
--- End of inner exception stack trace ---
at System.Net.Http.Http2Connection.ThrowRequestAborted(Exception innerException)
at System.Net.Http.Http2Connection.Http2Stream.CheckResponseBodyState()
at System.Net.Http.Http2Connection.Http2Stream.TryEnsureHeaders()
at System.Net.Http.Http2Connection.Http2Stream.ReadResponseHeadersAsync(CancellationToken cancellationToken)
at System.Net.Http.Http2Connection.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
--- End of inner exception stack trace ---
at System.Net.Http.Http2Connection.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at Grpc.Net.Client.Internal.GrpcCall`2.RunCall(HttpRequestMessage request, Nullable`1 timeout)")

@leidegre

leidegre commented Jan 8, 2024

@JamesNK sorry for bothering you with this but I ran into this error when trying to abort a client streaming call, like so

public override async Task<Response> ServerMethod(IAsyncStreamReader<Request> requestStream, ServerCallContext context)
{
    await foreach (var message in requestStream.ReadAllAsync())
    {
        if (ShouldAbortCall())
        {
            return new Response(); // abort!
        }
    }
    return new Response();
}

I can't really keep reading the request stream when I want to abort the call. So I'm thinking here that I will have to change this into a bidirectional streaming call so that I can coordinate with the client, because there's no way to return a response and at the same time keep reading and waiting for the client to end the request stream.

If it helps, this is a file upload. The server can sometimes quickly check that it already has the file and thus abort the file upload since it is redundant. In this case I thought it would be sufficient to have a client streaming call but given what's been said here the expectation is that the client is supposed to complete the request stream first.
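
One possible shape of that bidirectional version, sketched here with hypothetical message types (UploadChunk, UploadStatus) and helpers (AlreadyHaveFile, StoreChunk): the server asks the client to stop, then keeps draining until the client completes the request stream, so the call still finishes gracefully.

public override async Task Upload(
    IAsyncStreamReader<UploadChunk> requestStream,
    IServerStreamWriter<UploadStatus> responseStream,
    ServerCallContext context)
{
    var stopRequested = false;

    await foreach (var chunk in requestStream.ReadAllAsync())
    {
        if (stopRequested)
        {
            // Keep draining; the client should call RequestStream.CompleteAsync()
            // shortly after seeing StopSending = true.
            continue;
        }

        if (AlreadyHaveFile(chunk))          // hypothetical duplicate check
        {
            stopRequested = true;
            await responseStream.WriteAsync(new UploadStatus { StopSending = true });
        }
        else
        {
            StoreChunk(chunk);               // hypothetical storage
        }
    }

    // Request stream ended gracefully; send the final status.
    await responseStream.WriteAsync(new UploadStatus { Done = true });
}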

@JamesNK
Member

JamesNK commented Jan 8, 2024

So I'm thinking here that I will have to change this into a bidirectional streaming call so that I can coordinate with the client because there's no way to return a response and at the same time keeping reading and waiting for the client to end the request stream.

Yes. A disadvantage of server streaming calls is they can only be stopped from the client by aborting.

@osexpert
Author

osexpert commented Apr 11, 2024

Status update.
Using the repro https://github.com/osexpert/GrpcDotNetProblem
Now upgraded to latest nuget packages.
Using Python 3.9 x64, grpcio 1.62.1, protobuf 5.26.1

Still fails:
Run GrpcService, then GrpcNativeClient: still fails after 200+ calls with:
Grpc.Core.RpcException: 'Status(StatusCode="ResourceExhausted", Detail="Received RST_STREAM with error code 11", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1712864484.457000000","description":"Error received from peer ipv6:[::1]:5000","file":"......\src\core\lib\surface\call.cc","file_line":953,"grpc_message":"Received RST_STREAM with error code 11","grpc_status":8}")'
[Exception] Grpc.Core.dll!Grpc.Core.Internal.ClientResponseStream<TRequest, TResponse>.MoveNext(System.Threading.CancellationToken token) C#
[Exception] GrpcNativeClient.dll!GrpcNativeClient.Program.EchoBidirHangupTest(GrpcService.Greeter.GreeterClient client) Line 27 C#
[Exception] GrpcNativeClient.dll!GrpcNativeClient.Program.Main(string[] args) Line 14 C#

No longer failing:
Run GrpcService, then PythonClient. Seems to work now. I had assumed Python used the native C/C++ client DLLs under the hood as the native wrapper and would always fail (or not fail) the same way. But now Python works and the native client does not. Weird.
Update: I am speculating, but I see that e.g. enabling logging in GrpcService will make the GrpcNativeClient -> GrpcService test work. I assume the logging slows down GrpcService and gives it time to clean up the leaking resources. I had a feeling the Python test was slower now than before, so maybe a performance problem in the Python test case now makes it so slow that GrpcService gets time to clean up the leaking resources and does not fail.
Update 2: Yes, I was correct. I was running via VS originally. Now, when running directly from the command line, it fails:
Sending First message1127
Traceback (most recent call last):
  File "...\GrpcDotNetProblem\PythonClient\PythonClient.py", line 75, in <module>
    run()
  File "...\GrpcDotNetProblem\PythonClient\PythonClient.py", line 67, in run
    echo_test_hangup(stub, count)
  File "...\GrpcDotNetProblem\PythonClient\PythonClient.py", line 51, in echo_test_hangup
    for response in responses:
  File "...\AppData\Roaming\Python\Python39\site-packages\grpc\_channel.py", line 542, in __next__
    return self._next()
  File "...\AppData\Roaming\Python\Python39\site-packages\grpc\_channel.py", line 968, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.RESOURCE_EXHAUSTED
        details = "Received RST_STREAM with error code 11"
        debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"Received RST_STREAM with error code 11", grpc_status:8, created_time:"2024-04-12T12:35:35.911228+00:00"}"
>

All other combinations work fine. So I guess it is status quo.
