Unhandled exception (Operation is not valid due to the current state of the object in RabbitMQ.Stream.Client/Client.cs:line 976) #384

mbaillargeon-ubi · 2024-06-07T18:21:43Z

Describe the bug

Hi,
We are experiencing crashes on our servers (Windows / .NET 6), which use RabbitMQ.Stream.Client internally. The crash occurs when many clients connect at the same time. Here is the call stack from the Windows event log:

CoreCLR Version: 6.0.3124.26714
.NET Version: 6.0.31
Description: The process was terminated due to an unhandled exception.
Exception Info: System.AggregateException: One or more errors occurred. (Operation is not valid due to the current state of the object.)
 ---> System.InvalidOperationException: Operation is not valid due to the current state of the object.
   at System.Threading.Tasks.Sources.ManualResetValueTaskSourceCore`1.SignalCompletion()
   at System.Threading.Tasks.Sources.ManualResetValueTaskSourceCore`1.SetException(Exception error)
   at RabbitMQ.Stream.Client.ManualResetValueTaskSource`1.SetException(Exception error) in /_/RabbitMQ.Stream.Client/Client.cs:line 976
   at RabbitMQ.Stream.Client.Client.<>c__56`2.<Request>b__56_0(Object valueTaskSource) in /_/RabbitMQ.Stream.Client/Client.cs:line 500
   at System.Threading.CancellationTokenSource.Invoke(Delegate d, Object state, CancellationTokenSource source)
   at System.Threading.CancellationTokenSource.CallbackNode.<>c.<ExecuteCallback>b__9_0(Object s)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.CancellationTokenSource.CallbackNode.ExecuteCallback()
   at System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean throwOnFirstException)
   --- End of inner exception stack trace ---
   at System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean throwOnFirstException)
   at System.Threading.CancellationTokenSource.TimerCallback(Object state)
   at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
   at System.Threading.TimerQueue.FireNextTimers()

This crash is with version 1.8.6, but we first saw it with 1.7.4. Updating the package did not help; it still crashes.

From what I could debug, this is what happens. In the following code (from Client.cs), the timeout callback that calls SetException is invoked by a system timer (as the crash call stack shows). When it runs, the task is sometimes already completed (presumably the response arrived just before the timeout fired). SetException internally calls SetException on a ManualResetValueTaskSourceCore, which throws an InvalidOperationException if the source has already completed. Since that exception is not caught on the timer thread, it crashes our process.

await using (cts.Token.Register(
    valueTaskSource =>
    {
        ((ManualResetValueTaskSource<TOut>)valueTaskSource).SetException(
            new TimeoutException());
    }, tcs).ConfigureAwait(false))
{
    var valueTask = new ValueTask<TOut>(tcs, tcs.Version);
    var result = await valueTask.ConfigureAwait(false);
    PooledTaskSource<TOut>.Return(tcs);
    return result;
}
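To see the failure in isolation, here is a minimal sketch: a standalone .NET 6+ console program using top-level statements, not library code. It completes a ManualResetValueTaskSourceCore<T> with SetResult, as the response path would, and then calls SetException, as the late timeout callback does. The second completion throws the same InvalidOperationException that appears in the call stack.

```csharp
using System;
using System.Threading.Tasks.Sources;

// Standalone illustration, not RabbitMQ.Stream.Client code.
var core = new ManualResetValueTaskSourceCore<int>();

// 1. The RPC response arrives and completes the source first.
core.SetResult(42);

// 2. The timeout timer fires a moment later and tries to fault the same source.
bool secondCompletionThrew = false;
try
{
    core.SetException(new TimeoutException());
}
catch (InvalidOperationException ex)
{
    // ManualResetValueTaskSourceCore<T> allows only one completion per Reset().
    secondCompletionThrew = true;
    Console.WriteLine(ex.Message);
}

Console.WriteLine($"second completion threw: {secondCompletionThrew}");
```

In the library the second call happens inside a CancellationToken callback on a timer thread, so there is no user code on the stack to catch it, which is why the process terminates instead.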

I tried the following change in ManualResetValueTaskSource, which checks the status before calling SetException, and I can no longer reproduce the crash. If you want, I can open a pull request with that code.

public void SetException(Exception error)
{
    // Only fault the source if nothing (e.g. the RPC response) has completed it yet.
    if (_logic.GetStatus(_logic.Version) == ValueTaskSourceStatus.Pending)
    {
        _logic.SetException(error);
    }
}
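The status check narrows the race but, as far as I can tell, does not fully close it: GetStatus and SetException are two separate steps, so a response that completes the source between them could still make SetException throw. A minimal sketch of an alternative, assuming the response and timeout paths could both be routed through hypothetical TrySetResult/TrySetException helpers: an int flag flipped with Interlocked.CompareExchange lets exactly one caller complete the ManualResetValueTaskSourceCore<T>, and the loser simply returns false. This is not what the library or #385 does; it is a standalone .NET 6+ program that races the two paths 1,000 times and counts iterations where zero or two callers won.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Sources;

// Hypothetical sketch, not RabbitMQ.Stream.Client code.
var core = new ManualResetValueTaskSourceCore<string>();
int completed = 0; // 0 = pending, 1 = a caller has already claimed completion

// Response path: only completes the source if it wins the flag.
bool TrySetResult(string value)
{
    if (Interlocked.CompareExchange(ref completed, 1, 0) != 0) return false;
    core.SetResult(value);
    return true;
}

// Timeout path: same guard, so it never touches an already-completed source.
bool TrySetException(Exception error)
{
    if (Interlocked.CompareExchange(ref completed, 1, 0) != 0) return false;
    core.SetException(error);
    return true;
}

int badIterations = 0;
for (int i = 0; i < 1_000; i++)
{
    // Reuse the source as a pool would: reset the core and the flag together.
    core.Reset();
    Volatile.Write(ref completed, 0);

    // Race the "server response" against the "timeout timer" on the thread pool.
    bool[] wins = await Task.WhenAll(
        Task.Run(() => TrySetResult("response")),
        Task.Run(() => TrySetException(new TimeoutException())));

    // Exactly one caller must win; an unhandled throw would surface at the await.
    if (wins[0] == wins[1]) badIterations++;
}

Console.WriteLine($"iterations with zero or two winners: {badIterations}");
```

The trade-off is one extra field per pooled source, and the flag has to be reset together with the core whenever the source goes back to the pool.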

Reproduction steps

Create multiple clients at the same time, each seeking in the stream.
...

Expected behavior

The exception should either not occur or be handled inside the library, since it is thrown on a different thread and users of the library cannot catch it.

Additional context

To give some context, we have 2 load-balanced servers. Whenever a client application connects to one of our servers, we create a client that streams the RabbitMQ messages exchanged on a stream. When we shut down one of the servers (to perform an update, for instance), all its clients switch to the other server, which creates a client for each of them and streams back from the last received message. The crash occurs when we shut down a server that has lots of clients (about 100), so about 100 clients connect to the other server, and to the stream, at the same time.


Gsantomaggio · 2024-06-08T12:39:28Z

Hi @mbaillargeon-ubi,
The clients receive the timeout because the server is handling a lot of connections/streams at the same time.
You should also check the server logs to see whether there are any timeouts or errors.

In this case, you should add a random delay before connecting each client, so that the clients do not all connect at the same time.
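A minimal sketch of that advice, as a standalone .NET 6+ program. ConnectClientAsync is a hypothetical stand-in for whatever code creates the stream client and consumer; it only simulates a 20 ms connect. Each reconnect waits a random jitter (0 to 499 ms here, an arbitrary choice) and then passes through a SemaphoreSlim that caps how many connection attempts run at once, which is the "limit the number of clients connecting at the same time" idea discussed in this thread. The jitter range and the cap of 10 are illustrative values, not recommendations from the library.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

const int clientCount = 100;
const int maxConcurrentConnects = 10;
var gate = new SemaphoreSlim(maxConcurrentConnects);
int inFlight = 0;
int peakInFlight = 0;

// Hypothetical stand-in for creating a stream client/consumer; not a library API.
async Task ConnectClientAsync()
{
    int now = Interlocked.Increment(ref inFlight);
    int seen;
    // Record the highest number of concurrent connects (demo bookkeeping only).
    while (now > (seen = Volatile.Read(ref peakInFlight)) &&
           Interlocked.CompareExchange(ref peakInFlight, now, seen) != seen)
    {
    }
    await Task.Delay(20); // simulate the connect/subscribe round trips
    Interlocked.Decrement(ref inFlight);
}

async Task ReconnectWithJitterAsync()
{
    await Task.Delay(Random.Shared.Next(0, 500)); // random 0-499 ms jitter
    await gate.WaitAsync();                        // at most maxConcurrentConnects at once
    try
    {
        await ConnectClientAsync();
    }
    finally
    {
        gate.Release();
    }
}

var reconnects = new Task[clientCount];
for (int i = 0; i < clientCount; i++)
{
    reconnects[i] = ReconnectWithJitterAsync();
}
await Task.WhenAll(reconnects);

Console.WriteLine($"peak concurrent connects: {peakInFlight} (cap {maxConcurrentConnects})");
```

The jitter alone spreads the burst out in time; the semaphore adds a hard upper bound, so even an unlucky jitter distribution cannot put more than the cap's worth of pending connection requests on the server at once.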

Gsantomaggio · 2024-06-09T07:45:01Z

Btw, a PR with your change is welcome 😀!


Gsantomaggio · 2024-06-10T10:21:48Z

@mbaillargeon-ubi Can you please check #385 ?

mbaillargeon-ubi · 2024-06-10T11:15:12Z

Hi @Gsantomaggio. Thanks for #385, I was going to create a pull request today :). I like that you also exposed a way to configure the timeout. It looks good to me. I will wait for an official release with this fix to try it out. Until then, I will try your suggestion to limit the number of clients connecting back to the stream at the same time, to see whether it reduces the timeouts or the crash itself.


Gsantomaggio · 2024-06-10T12:44:37Z

@mbaillargeon-ubi FYI: https://www.nuget.org/packages/RabbitMQ.Stream.Client/1.8.7
Please let me know

mbaillargeon-ubi · 2024-06-10T18:29:13Z

I tested with package 1.8.7 and could not reproduce the crash. I will go ahead and update our production servers to this version. Thanks for the quick release!

mbaillargeon-ubi added the bug Something isn't working label Jun 7, 2024

Gsantomaggio added a commit that referenced this issue Jun 10, 2024

Add RCP timeout

076c4bf

configuration. Check the logic status before raise set the event Fixes: #384 Signed-off-by: Gabriele Santomaggio <g.santomaggio@gmail.com>

Gsantomaggio mentioned this issue Jun 10, 2024

Add RCP timeout #385

Merged

Gsantomaggio closed this as completed in #385 Jun 10, 2024

Gsantomaggio added a commit that referenced this issue Jun 10, 2024

Check the logic status before raising the event (#385)

14b6e2d

* Fixes: #384 Check the logic status before raising the event * Add RPC timeout configuration. Signed-off-by: Gabriele Santomaggio <g.santomaggio@gmail.com>

Gsantomaggio self-assigned this Jun 10, 2024
