Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unhandled exception (Operation is not valid due to the current state of the object in RabbitMQ.Stream.Client/Client.cs:line 976) #384

Closed
mbaillargeon-ubi opened this issue Jun 7, 2024 · 6 comments · Fixed by #385
Assignees
Labels
bug Something isn't working

Comments

@mbaillargeon-ubi
Copy link

Describe the bug

Hi,
We are experiencing crashes on our servers (windows / net6) which internally use RabbitMQ.Stream.Client. It occurs when multiple clients are connecting at the same time. Here is the callstack we get in the windows event logs

CoreCLR Version: 6.0.3124.26714
.NET Version: 6.0.31
Description: The process was terminated due to an unhandled exception.
Exception Info: System.AggregateException: One or more errors occurred. (Operation is not valid due to the current state of the object.)
 ---> System.InvalidOperationException: Operation is not valid due to the current state of the object.
   at System.Threading.Tasks.Sources.ManualResetValueTaskSourceCore`1.SignalCompletion()
   at System.Threading.Tasks.Sources.ManualResetValueTaskSourceCore`1.SetException(Exception error)
   at RabbitMQ.Stream.Client.ManualResetValueTaskSource`1.SetException(Exception error) in /_/RabbitMQ.Stream.Client/Client.cs:line 976
   at RabbitMQ.Stream.Client.Client.<>c__56`2.<Request>b__56_0(Object valueTaskSource) in /_/RabbitMQ.Stream.Client/Client.cs:line 500
   at System.Threading.CancellationTokenSource.Invoke(Delegate d, Object state, CancellationTokenSource source)
   at System.Threading.CancellationTokenSource.CallbackNode.<>c.<ExecuteCallback>b__9_0(Object s)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.CancellationTokenSource.CallbackNode.ExecuteCallback()
   at System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean throwOnFirstException)
   --- End of inner exception stack trace ---
   at System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean throwOnFirstException)
   at System.Threading.CancellationTokenSource.TimerCallback(Object state)
   at System.Threading.TimerQueueTimer.Fire(Boolean isThreadPool)
   at System.Threading.TimerQueue.FireNextTimers()

This crash is with version 1.8.6 but we initially had it with 1.7.4. I tried updating the package, but no luck it still crashes.

From what I could debug, the following is occurring. In the following code (from client.cs), the callback that gets called on timeout to do the SetException is called by a system timer (as we can see in the callstack of the crash). When it gets called, the task seems to sometime be in the completed state (I guess the task just completed). The SetException call is internally calling SetException on a ManualResetValueTaskSourceCore which throws an InvalidOperationException if the task is completed. Since the exception is not catched it makes our process crash.

await using (cts.Token.Register(
                             valueTaskSource =>
                             {
                                ((ManualResetValueTaskSource<TOut>)valueTaskSource).SetException(
                                    new TimeoutException());
                             }, tcs).ConfigureAwait(false))
            {
                var valueTask = new ValueTask<TOut>(tcs, tcs.Version);
                var result = await valueTask.ConfigureAwait(false);
                PooledTaskSource<TOut>.Return(tcs);
                return result;
            }

I tried the following code in ManualResetValueTaskSource to validate the status before doing the call to SetException and I no longer reproduce the crash. If you want I could do a pull request with that code.

public void SetException(Exception error)
{
   if (_logic.GetStatus(_logic.Version) == ValueTaskSourceStatus.Pending)
     _logic.SetException(error);
 }

Reproduction steps

  1. Have multiple clients being created at the same time and seeking in the stream
    ...

Expected behavior

The exception should either not occur or be handled since it occurs in a different thread and the users of the library cannot catch

Additional context

To give some context we have 2 servers load balanced. Whenever a client application connects to our server we create a client to stream rabbitmq messages exchanged on a stream. Whenever we shutdown one of the servers (to perform an update for instance), all the clients switch to the other server which will create a client and stream back to the last received message from the stream. This crash occurs when we shutdown one of our server that has lots of clients (about 100). So we get about 100 clients connecting to the other server at the same time and connecting to the stream.

@mbaillargeon-ubi mbaillargeon-ubi added the bug Something isn't working label Jun 7, 2024
@Gsantomaggio
Copy link
Member

Hi @mbaillargeon-ubi,
The clients receive the timeout because the server is handling a lot of connections/streams at the same time.
You should also check the server logs to see if there is some timeout or error.

In this case, you should add some random sleep before connecting all the clients. It will prevent all the clients from connecting at the same time.

@Gsantomaggio
Copy link
Member

Btw pr with you change is welcome 😀!

Gsantomaggio added a commit that referenced this issue Jun 10, 2024
configuration. Check the logic status before raise set the event
Fixes: #384

Signed-off-by: Gabriele Santomaggio <g.santomaggio@gmail.com>
@Gsantomaggio
Copy link
Member

@mbaillargeon-ubi Can you please check #385 ?

@mbaillargeon-ubi
Copy link
Author

Hi @Gsantomaggio . Thanks for #385, I was going to create a pull request today :) . I like that you also exposed a way to configure the timeout. It looks good to me. I will be waiting for an official release with this fix to try it out. Until then, I will try your suggestion to limit the number of clients connecting back to the stream at the same time to see if it reduces the occurences of timeouts or of that actual crash.

Gsantomaggio added a commit that referenced this issue Jun 10, 2024
* Fixes: #384
  Check the logic status before raising the event
* Add RPC timeout configuration. 

Signed-off-by: Gabriele Santomaggio <g.santomaggio@gmail.com>
@Gsantomaggio
Copy link
Member

@Gsantomaggio Gsantomaggio self-assigned this Jun 10, 2024
@mbaillargeon-ubi
Copy link
Author

I tested with the package 1.8.7 and I could not repro the crash. I will go ahead and update our production servers with this version. Thanks for the quick release

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants