
Router is blocked for several minutes after message bursts #159

Closed
martin31821 opened this issue Nov 14, 2018 · 17 comments


@martin31821
Contributor

I'm running into problems with the sendTimeout when clients are blocked. Let me describe the problem:

  • I have around 300 clients; some of them have slow network connections and are subscribed to 'hot' topics (topics that are published to frequently).
  • The router correctly blocks and then drops some messages, according to https://github.com/gammazero/nexus/blob/master/transport/websocketpeer.go#L165
  • Eventually, some clients get dropped (because the ping/pong mechanism fails).
  • The messages destined for the dropped clients are still processed after the client is gone, causing the router to block for several minutes.
  • This in turn blocks the router from adding and processing new sessions.

The problem I identified is that the 'session died' handler is processed synchronously with the pending requests for all sessions, but I'm not sure what the best way to improve the situation is.

I also have a repro here

@martin31821
Contributor Author

martin31821 commented Nov 14, 2018

After some hours of debugging this, I now have another suspicion. I saw that registrations are still present in the router state even though the connections are no longer alive, which I think happens if and only if I run into "peer is not responding to pings, closing websocket", which is not passed through correctly to the upper layers.

Edit: The first router instance unblocked itself after around 50 hours of hanging in the dealer action loop.

@gammazero
Owner

Do you know if this is a new behavior introduced by the recent update of gorilla websockets (#157)?

I will start working on this now.

@martin31821
Contributor Author

This is unrelated to the vendor updates; we experienced the problem in a production setup running an older version (#155).

One problem I see is that the router blocks while waiting for clients to become available, which is a big problem: incoming messages queue up indefinitely (pushed to actionChan in the broker and dealer), and while queueing up they lose their order, which eventually leads to registrations or subscriptions being established for an already closed connection.

I tried to debug it here, adding more logging, and I tried to fix it, but that didn't work either.

@gammazero
Owner

I will have this fixed ASAP.

@gammazero
Owner

The main cause of the problem is that the output channel for a slow client fills up. This causes a temporary block waiting to write to the channel, which blocks inside the critical section of the broker. When there are a number of such clients, this becomes particularly bad. The original intent was to give the client's output goroutine time to read the outgoing messages from the channel and write them to the socket. This fails badly when the socket is blocked and the messages on the output channel cannot be removed.

What needs to change is for the router to NEVER block in a critical section of the broker or dealer, for any reason. This means that if a message cannot be written to a client's output channel, that message should be dropped immediately. Alternatively, the message could be written to a queueing mechanism that never blocks, holds messages in a queue that can keep growing, and feeds them to the client's output channel.
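
For illustration, the drop-immediately option amounts to a non-blocking channel send. This is just a minimal sketch with made-up names, not the nexus code:

```go
package main

import "fmt"

// trySend never blocks: if the client's output channel is full, the message
// is dropped on the spot instead of stalling the broker's critical section.
func trySend(out chan<- string, msg string) bool {
	select {
	case out <- msg:
		return true
	default:
		return false // channel full: drop the message
	}
}

func main() {
	out := make(chan string, 1)
	fmt.Println(trySend(out, "a")) // true
	fmt.Println(trySend(out, "b")) // false: buffer already full, message dropped
}
```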

There are pros and cons to each approach:
Limited size queue

  • Pro: limits the number of messages queued for any one client
  • Con: There is no single correct queue size. A large queue may be needed for large message bursts (progressive results), which is otherwise wasteful.
  • Con: Dropping messages can cause problems with missing important messages, such as results delivered progressively. Clients may not know that some messages were dropped.

Infinite queue size

  • Pro: Handles any size message burst without dropping messages.
  • Con: One or more slow clients may exhaust memory if they are sent messages faster than they can read them.

Originally, dropping messages seemed like a reasonable approach, but that was before things like progressive results were implemented. Queuing messages seems like the more correct approach, but needs a safety mechanism to guard against slow clients building up excessively large queues of pending messages. Maybe dropping the client if the average queue size exceeds some limit for some amount of time.

I have the queuing implemented and tested, and am trying various safety strategies.
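
Roughly, the queuing alternative could look like the sketch below: an unbounded per-client backlog drained by a feeder goroutine, with a safety limit on the backlog size. All names here are hypothetical, not the actual implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// outQueue is a grow-as-needed per-client queue. enqueue never blocks; a
// feeder goroutine drains the backlog into the client's bounded out channel.
type outQueue struct {
	mu      sync.Mutex
	backlog [][]byte
	notify  chan struct{} // signals the feeder that new messages arrived
	max     int           // safety limit; exceeding it means the client is too slow
}

func newOutQueue(max int) *outQueue {
	return &outQueue{notify: make(chan struct{}, 1), max: max}
}

// enqueue returns false when the backlog exceeds the safety limit,
// signalling that the client should be disconnected.
func (q *outQueue) enqueue(msg []byte) bool {
	q.mu.Lock()
	q.backlog = append(q.backlog, msg)
	n := len(q.backlog)
	q.mu.Unlock()
	select {
	case q.notify <- struct{}{}:
	default:
	}
	return n <= q.max
}

// feed drains the backlog into out; it blocks only in this goroutine,
// never in the router's critical section.
func (q *outQueue) feed(out chan<- []byte) {
	for range q.notify {
		for {
			q.mu.Lock()
			if len(q.backlog) == 0 {
				q.mu.Unlock()
				break
			}
			msg := q.backlog[0]
			q.backlog = q.backlog[1:]
			q.mu.Unlock()
			out <- msg
		}
	}
}

func main() {
	q := newOutQueue(1000)
	out := make(chan []byte, 16)
	go q.feed(out)
	if !q.enqueue([]byte("hello")) {
		fmt.Println("client too slow, disconnect")
	}
	fmt.Println(string(<-out))
}
```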

@martin31821
Contributor Author

I also thought about having an infinitely sized queue. Another approach, which might reduce memory pressure, would be to actively send errors for pending requests when there are messages in the queue, which is IMHO compliant with the WAMP spec (something like "The router can't process your request at this time").

However, this would require splitting messages into 'droppable' (e.g. PUBLISH) and 'must be processed' (e.g. YIELD) categories.
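
A tiny sketch of what such a split might look like; the message kinds below are just illustrative constants, not the wamp package's actual types:

```go
package main

import "fmt"

type msgKind int

const (
	kindPublish msgKind = iota // event delivery: safe to drop for slow clients
	kindEvent
	kindYield // RPC result fragments: must reach the caller
	kindResult
)

// droppable reports whether a message of this kind may be discarded for a
// slow receiver without corrupting protocol state.
func droppable(k msgKind) bool {
	switch k {
	case kindPublish, kindEvent:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(droppable(kindEvent))  // true
	fmt.Println(droppable(kindResult)) // false
}
```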

@brick0

brick0 commented Jan 21, 2019

Hitting this issue in a project and would like to try the fix if it's available in any branch.
I'm interested in a solution that may drop messages to slow clients, but never blocks the router.

@gammazero
Owner

@brick0, @martin31821 Some discussion of solutions...

The key to fixing this is to avoid ever making the router (broker or dealer) block, as that stops everything. Even small periods of blocking can cumulatively result in a major bottleneck. The router used to just drop messages that would block, but a small wait was added to allow a client to catch up after a message burst. This is what is causing the problem when clients repeatedly block.

There are a few ways that I see to approach this:

  1. Just drop messages if a receiving client blocks (send queue full). This keeps things working, but can lose messages if a burst of messages is sent. This is particularly bad with RPC, and basically completely broken for progressive requests/results. Also, this may not be acceptable at all if a critical message gets dropped and neither the sending nor the receiving client has any way to know.

  2. Router uses large or growable send queues for each client, and disconnects the client (rather than dropping individual messages) after the queue reaches some limit.
    This can still break in the case of progressive RPC results. Imagine streaming a large body of data to a client that requested it, where the callee produces content (progressive results) much faster than the caller can consume it. The router either eats up all memory queuing results, or disconnects a caller that cannot receive any more messages.

  3. Back-pressure on sending client. In this approach the router calculates the routing (generates message + recipient(s)) and then lets the sending client's goroutine actually do all the sending. This works in the RPC case, but for pub/sub this allows a single slow client, or one busy receiving progressive RPC results, to impact delivery of events to all other clients.

  4. Hybrid approach (what I am implementing/testing)

  • Flow control for progressive RPC, so that sender is blocked until recipient can receive more messages, without queuing progressive messages in router.
  • Configurable router send queue size limit.
  • Configurable behavior when client's send queue is at limit:
    a) Drop messages
    b) Disconnect client
    c) Block sender

I think the configurable behavior is necessary, since this depends on the use of WAMP. Consider a publisher that publishes events at high frequency. If not all clients can keep up and consume events fast enough, in some uses it may be best to discard events. In other uses it may be best to disconnect the slow client after the limit of queued messages is reached. In other cases the correct thing may be to block the publisher until the subscribers have consumed more events.
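
As a rough sketch, the configuration for the hybrid approach might look something like this (all type and field names are hypothetical, not the actual nexus config):

```go
package main

import "time"

// QueueFullBehavior selects what the router does when a client's send
// queue is at its limit.
type QueueFullBehavior int

const (
	DropMessage      QueueFullBehavior = iota // a) drop messages
	DisconnectClient                          // b) disconnect the slow client
	BlockSender                               // c) back-pressure on the sender
)

// RouterQueueConfig is a hypothetical per-router (or per-realm) setting.
type RouterQueueConfig struct {
	SendQueueSize int               // outbound channel size per client
	OnQueueFull   QueueFullBehavior // behavior when the queue is full
	BlockTimeout  time.Duration     // optional cap on how long a sender may block
}

func main() {
	_ = RouterQueueConfig{
		SendQueueSize: 64,
		OnQueueFull:   DropMessage,
		BlockTimeout:  5 * time.Second,
	}
}
```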

Any additional thoughts before I PR this?

@martin31821
Contributor Author

Maybe it's best to discuss this with the other WAMP implementors on Gitter? I think having back-pressure requires a bit of work on the spec too, so their opinion on this topic would be interesting as well.

I'd vote against dropping any messages, as this may cause a client implementation to block for an undefined amount of time (possibly for the lifetime of a connection). Flow control seems reasonable to me, but I'm unsure how to implement it in a client in a nice way.

For PubSub, a solution might be to specify a publish option 'droppable': true to indicate that the router may drop this publication for specific clients. In our use case we have a live video stream running over WAMP to achieve the lowest possible bandwidth, where dropping messages would result in slightly degraded image quality but not total state corruption.

A configurable send queue size limit sounds good to me, although I would keep it as small as possible in production deployments.

@gammazero
Owner

gammazero commented Jan 30, 2019

@martin31821 To clarify, all of the proposed implementation is in the router. The RPC flow control is a special case of back-pressure that happens regardless of the configured queue behavior, and would prevent queuing large amounts of progressive results even if the queue size were large. The goroutine handling incoming messages for the sending session blocks if the receiving session still has pending progressive RPC messages. In your case it would be up to your client to drop messages as it saw fit.

If there is no special case for RPC, then this can be simplified to:

  • Configurable session outbound queue size in router [small default size]
  • Block sending session (back-pressure) when recipient queue full
  • When sender session is blocked, do one of the following:
    a) Continue blocking indefinitely [default behavior]
    b) Drop message to blocking recipient session(s) after timeout

The key point here is that if there is any blocking, it is only the sending session blocking on the receiving session. The router and all other sessions continue to operate.
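
A minimal sketch of that behavior, blocking only the goroutine for the sending session, with an optional timeout for dropping (illustrative only, not the actual implementation):

```go
package main

import (
	"fmt"
	"time"
)

// sendWithBackpressure blocks only the goroutine handling the *sending*
// session. With timeout <= 0 it blocks indefinitely (default behavior a);
// otherwise it gives up and drops the message after timeout (behavior b).
func sendWithBackpressure(out chan<- []byte, msg []byte, timeout time.Duration) bool {
	if timeout <= 0 {
		out <- msg // block until the recipient's queue has room
		return true
	}
	t := time.NewTimer(timeout)
	defer t.Stop()
	select {
	case out <- msg:
		return true
	case <-t.C:
		return false // recipient still full: drop for this recipient only
	}
}

func main() {
	out := make(chan []byte, 1)
	out <- []byte("already queued")
	ok := sendWithBackpressure(out, []byte("next"), 100*time.Millisecond)
	fmt.Println("delivered:", ok) // false: recipient queue stayed full
}
```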

Maybe session meta information could flag sessions that are blocked and sessions that are causing blocking.

And yes, a Gitter discussion would probably yield some insight.

@martin31821
Contributor Author

Ah, I see the implementation approach for RPC and progressive calls/results now. That's good IMHO.

@brick0

brick0 commented Jan 30, 2019

@gammazero In a pub/sub setup where there are some transiently slow receivers, those receivers should not impact the whole system by applying back-pressure to the producer (the sending session).
I think having a configurable queue size per receiving session (in our case we'd probably configure a large queue) and dropping messages to that queue just for that receiver works best for pub/sub.
Each behavior is useful in different scenarios; I'm not sure whether these behaviors could be configured per realm, so that the router would behave differently per realm.
In our case we track loss and recover slow receivers separately, but the sending session should not have to block.

@gammazero
Owner

@brick0 Very useful feedback. It makes sense for the configurable behavior to be in the realm config. It also indicates that the behavior may need to be different for the broker and the dealer.

gammazero added a commit that referenced this issue Mar 14, 2019
This fixes issue #159.  When a client leaves the realm or is disconnected by the router, do not attempt to send more messages to the client.

The websocket and raw socket servers' outbound per-client channel size is configurable.

Progressive results use backpressure to limit influx of results when results are being produced faster than they are being consumed.
@gammazero
Owner

@martin31821 Now that the PR is finally merged, if you have any time to check that it fixes the reported problems, I would be very thankful.

@martin31821
Contributor Author

I'm going to take a look over the weekend.

@martin31821
Contributor Author

@gammazero I can confirm that it fixes the issue I originally mentioned!

@gammazero
Owner

Fixed by #167
