Router is blocked for several minutes after message bursts #159
After some hours of debugging this, I now have another suspicion. I saw that registrations are still present within the router state even if the connections are no longer alive, which I think happens if and only if I run into this blocking. Edit: The first router instance unblocked itself after around 50 hours of hanging in the dealer action loop.
Do you know if this is a new behavior introduced by the recent update of gorilla websockets (#157)? I will start working on this now.
This is unrelated to the vendor updates; we experienced the problem in a production setup running an older version (#155). One problem I see is that the router blocks while waiting for clients to become available, which is a serious problem, since incoming messages queue up indefinitely (pushed to actionChan in the broker and dealer) and, while queueing up, lose their order, which eventually leads to registrations or subscriptions being established for an already closed connection. I tried to debug it here, adding more logging, and tried to fix it, but that didn't work either.
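Based on the description above, here is a rough sketch of the pattern that can cause the pile-up; the names (broker, actionChan, session, publish) are illustrative, not verbatim nexus code.

```go
package sketch

// Every incoming message becomes a closure pushed onto the broker's action
// channel, and a single loop executes them one at a time. If one action blocks
// (for example, writing to a full client output channel), every closure queued
// behind it waits, so the whole router appears to hang.
type session struct {
	out chan interface{} // per-client outbound message channel
}

type broker struct {
	actionChan chan func()
}

func (b *broker) run() {
	for action := range b.actionChan {
		action() // one blocked action stalls the entire loop
	}
}

func (b *broker) publish(s *session, msg interface{}) {
	b.actionChan <- func() {
		s.out <- msg // blocks here when the slow client's channel is full
	}
}
```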
I will have this fixed ASAP.
The main cause of the problem is a full output channel for a slow client. This causes a temporary block while waiting to write to the channel, and that block happens inside the critical section of the broker. When there are a number of such clients, this becomes particularly bad. The original intent was to give the client output goroutine time to read the outgoing messages from the channel and write them to the socket. This fails badly when the socket is blocked and the message on the output channel cannot be removed. What needs to change is for the router to NEVER block in a critical section of the broker or dealer for any reason. This means that if a message cannot be written to a client output channel, that message should be dropped immediately. Alternatively, the message could be written to a queueing mechanism that never blocks, keeps messages in a queue that can keep growing, and feeds messages to the client output queue. There are pros and cons to each approach:
- Infinite queue size
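A minimal sketch of the "drop immediately" option: a non-blocking send via select/default, so the broker or dealer never waits inside its critical section. trySend and the channel type are illustrative, not actual nexus identifiers.

```go
package sketch

// trySend attempts to hand a message to the client's output channel without
// blocking; if the channel is full, the message is dropped and the router
// keeps processing other sessions.
func trySend(out chan<- interface{}, msg interface{}) bool {
	select {
	case out <- msg:
		return true
	default:
		return false // channel full: drop the message immediately
	}
}
```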
Originally, dropping messages seemed like a reasonable approach, but that was before things like progressive results were implemented. Queuing messages seems like the more correct approach, but it needs a safety mechanism to guard against slow clients building up excessively large queues of pending messages. Maybe dropping the client if the average queue size exceeds some limit for some amount of time. I have the queuing implemented and tested, and am trying various safety strategies.
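A sketch of the queuing alternative with a safety limit, under the assumptions above; outQueue, limit, and kill are hypothetical names. A per-client goroutine drains an unbounded slice-backed backlog into the output channel, and the client is dropped once the backlog exceeds the limit.

```go
package sketch

import "sync"

type outQueue struct {
	mu      sync.Mutex
	backlog []interface{}
	wake    chan struct{} // signals the drain goroutine that work arrived
	limit   int           // safety limit on pending messages
	kill    func()        // disconnects a client that falls too far behind
}

func newOutQueue(limit int, kill func()) *outQueue {
	return &outQueue{wake: make(chan struct{}, 1), limit: limit, kill: kill}
}

// enqueue never blocks the router; it only appends and signals the drainer.
func (q *outQueue) enqueue(msg interface{}) {
	q.mu.Lock()
	q.backlog = append(q.backlog, msg)
	tooBig := q.limit > 0 && len(q.backlog) > q.limit
	q.mu.Unlock()
	select {
	case q.wake <- struct{}{}:
	default:
	}
	if tooBig {
		q.kill()
	}
}

// drain feeds queued messages to the client's output channel; only this
// per-client goroutine ever blocks on a slow client, never the router.
func (q *outQueue) drain(out chan<- interface{}) {
	for range q.wake {
		for {
			q.mu.Lock()
			if len(q.backlog) == 0 {
				q.mu.Unlock()
				break
			}
			msg := q.backlog[0]
			q.backlog = q.backlog[1:]
			q.mu.Unlock()
			out <- msg
		}
	}
}
```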
I also thought about having an infinite-sized queue. Another approach, which might relieve memory pressure, would be to actively send errors to pending requests while there are messages in the queue, which is IMHO compliant with the WAMP spec (something like …). However, this would require splitting the messages into a …
Hitting this issue in a project and would like to try the fix if it's available in any branch.
@brick0, @martin31821 Some discussion of solutions... The key to fixing this is to avoid ever making the router (broker or dealer) block, as that stops everything. Even small periods of blocking can cumulatively result in a major bottleneck. The router used to just drop messages that would block, but a small wait was added to allow a client to catch up after a message burst. This is what is causing the problem when clients repeatedly block. There are a few ways that I see to approach this:
I think the configurable behavior is necessary, since this depends on how WAMP is used. Consider a publisher that publishes events at high frequency. If not all clients can keep up and consume events fast enough, in some uses it may be best to discard events. In other uses it may be best to disconnect the slow client after the limit of queued messages is reached. In other cases the correct thing may be to block the publisher until the subscribers have consumed more events. Any additional thoughts before I PR this?
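A hypothetical sketch of how the configurable behavior could be expressed; none of these names exist in nexus, they only illustrate the three options discussed above (discard, disconnect, or block the publisher).

```go
package sketch

// QueueFullPolicy selects what the router does when a subscriber's queue of
// pending messages reaches its configured limit.
type QueueFullPolicy int

const (
	DropMessage    QueueFullPolicy = iota // discard the event for this subscriber
	DropClient                            // disconnect the slow subscriber
	BlockPublisher                        // apply back-pressure to the publishing session
)

// RealmQueueConfig would live in the realm (or broker/dealer) configuration.
type RealmQueueConfig struct {
	MaxQueued int             // limit on queued outbound messages per client
	OnFull    QueueFullPolicy // behavior when MaxQueued is reached
}
```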
Maybe it's best to discuss this with the other WAMP implementors on Gitter? I think having back-pressure requires a bit of work on the spec too, so their opinion on this topic is interesting as well. I'd vote against dropping any messages, as this may cause a client implementation to block for an undefined amount of time (possibly over the lifetime of a connection), so flow control seems reasonable to me, but I'm unsure how to implement this in a client in a nice way. For PubSub a solution might be to specify a publish option …. A configurable send queue size limit sounds good to me, although I would keep it as small as possible in production deployments.
@martin31821 To clarify, all of the proposed implementation is in the router. The RPC flow control is a special case of back-pressure that happens regardless of the configured queue behavior, and would prevent queuing large amounts of progressive results even if the queue size were large. The goroutine handling incoming messages for the sending session blocks if the receiving session still has pending progressive RPC messages. In your case it would be up to your client to drop messages as it saw fit. If there is no special case for RPC, then this can be simplified to:
The key point here is that if there is any blocking, it is only the sending session blocking on the receiving session. The router and all other sessions continue to operate. Maybe session meta information could flag sessions that are blocked and sessions that are causing blocking. And yes, a Gitter discussion would probably yield some insight.
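A hypothetical illustration (not nexus code) of the back-pressure special case for progressive results: the goroutine handling the sending session's messages blocks on a small buffered channel until the receiving session has consumed earlier results, so only that one sender is delayed while the router and all other sessions keep running.

```go
package sketch

type progressFlow struct {
	inFlight chan struct{} // capacity = allowed pending progressive results
}

func newProgressFlow(max int) *progressFlow {
	return &progressFlow{inFlight: make(chan struct{}, max)}
}

// acquire is called by the goroutine handling the sending session's messages
// before forwarding another progressive result; it blocks only that goroutine.
func (f *progressFlow) acquire() { f.inFlight <- struct{}{} }

// release is called once the receiving session has consumed a result,
// freeing a slot for the next one.
func (f *progressFlow) release() { <-f.inFlight }
```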
Ah, I see the implementation approach for RPC and progressive results/calls now. That's good IMHO.
@gammazero In a pub/sub setup where there are some transient slow receivers, they should not impact the whole system by sending back-pressure to the producer (sending session).
@brick0 Very useful feedback. It makes sense for the configurable behavior to be in the realm config. It also indicates that behavior may need to be different for the broker and dealer.
This fixes issue #159. When a client leaves the realm or is disconnected by the router, do not attempt to send more messages to the client. The web and raw socket servers' outbound per-client channel size is configurable. Progressive results use back-pressure to limit the influx of results when results are being produced faster than they are being consumed.
@martin31821 Now that the PR is finally merged, if you have any time to check that it fixes the reported problems, I would be very thankful.
I'm going to take a look over the weekend.
@gammazero I can confirm that it fixes the issue I originally mentioned!
Fixed by #167
I'm running into problems with the sendTimeout when clients are blocked. I'll try to describe my problem:
The problem I identified is that the 'session died' handler is processed synchronously with the pending requests for all sessions, but I'm not sure what the best way to improve the situation is.
I also have a repro here
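For reference, a rough illustration (hypothetical, inferred only from the description above) of a timeout-guarded send: if each pending request for a blocked client waits up to sendTimeout in the same loop that will eventually run the 'session died' handler, those waits add up and delay the cleanup.

```go
package sketch

import "time"

// sendWithTimeout waits at most sendTimeout for room on the output channel.
// When many such sends for a blocked client are processed one after another,
// the action queued behind them (e.g. session cleanup) is delayed by roughly
// the sum of all those timeouts.
func sendWithTimeout(out chan<- interface{}, msg interface{}, sendTimeout time.Duration) bool {
	t := time.NewTimer(sendTimeout)
	defer t.Stop()
	select {
	case out <- msg:
		return true
	case <-t.C:
		return false // gave up after sendTimeout
	}
}
```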