
Router is blocked for several minutes after message bursts #159

Closed
martin31821 opened this issue Nov 14, 2018 · 17 comments


@martin31821
Contributor

I'm running into problems with the sendTimeout when clients are blocked. Let me describe the problem:

  • I have around 300 clients; some of them have slow network connections and are subscribed to 'hot' topics (topics that are published to frequently).
  • The router correctly blocks and then drops some messages, according to https://github.com/gammazero/nexus/blob/master/transport/websocketpeer.go#L165
  • Eventually, some clients get dropped (because the ping/pong mechanism fails).
  • The messages destined for the dropped clients are still processed after the client is gone, causing the router to block for several minutes.
  • This in turn blocks the router from adding and processing new sessions.

The problem I identified is that the 'session died' handler is processed synchronously with the pending requests for all sessions, but I'm not sure what the best way to improve the situation is.

I also have a repro here

@martin31821
Contributor Author

martin31821 commented Nov 14, 2018

After some hours of debugging this, I now have another suspicion. I saw that registrations are still present in the router state even though the connections are no longer alive, which I think happens if and only if I run into "peer is not responding to pings, closing websocket", which is not passed through correctly to the upper layers.

Edit: The first router instance unblocked itself after around 50 hours of hanging in the dealer action loop.

@gammazero
Owner

Do you know if this is a new behavior introduced by the recent update of gorilla websockets (#157)?

I will start working on this now.

@martin31821
Contributor Author

This is unrelated to the vendor updates; we experienced the problem in a production setup running an older version (#155).

One problem I see is that the router blocks while waiting for clients to become available, which is a big problem: incoming messages queue up indefinitely (pushed to actionChan in the broker and dealer), and while queueing up they lose their order, which eventually leads to registrations or subscriptions being established for an already closed connection.

I tried to debug it here, adding more logging, and I tried to fix it, but that didn't work either.

@gammazero
Owner

I will have this fixed ASAP.

@gammazero
Owner

The main cause of the problem is that the output channel for a slow client fills up. This causes a temporary block waiting to write to the channel, which blocks inside the critical section of the broker. When there are a number of such clients, this becomes particularly bad. The original intent was to give the client's output goroutine time to read the outgoing messages from the channel and write them to the socket. This fails badly when the socket is blocked and the messages on the output channel cannot be removed.

What needs to change is for the router to NEVER block in a critical section of the broker or dealer, for any reason. This means that if a message cannot be written to a client's output channel, that message should be dropped immediately. Alternatively, the message could be written to a queueing mechanism that never blocks, holds messages in a queue that can keep growing, and feeds them to the client's output channel.
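
For illustration, the drop-immediately option amounts to a non-blocking channel send. This is just a minimal sketch with made-up names, not the nexus code:

```go
package main

import "fmt"

// trySend never blocks: if the client's output channel is full, the message
// is dropped on the spot instead of stalling the broker's critical section.
func trySend(out chan<- string, msg string) bool {
	select {
	case out <- msg:
		return true
	default:
		return false // channel full: drop the message
	}
}

func main() {
	out := make(chan string, 1)
	fmt.Println(trySend(out, "a")) // true
	fmt.Println(trySend(out, "b")) // false: buffer already full, message dropped
}
```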

There are pros and cons to each approach:
Limited size queue

  • Pro: limits the number of messages queued for any one client
  • Con: There is no single correct queue size. A large queue may be needed for large message bursts (progressive results), which is otherwise wasteful.
  • Con: Dropping messages can cause problems with missing important messages, such as results delivered progressively. Clients may not know that some messages were dropped.

Infinite queue size

  • Pro: Handles any size message burst without dropping messages.
  • Con: One or more slow clients may exhaust memory if they are sent messages faster than they can read them.

Originally, dropping messages seemed like a reasonable approach, but that was before things like progressive results were implemented. Queuing messages seems like the more correct approach, but needs a safety mechanism to guard against slow clients building up excessively large queues of pending messages. Maybe dropping the client if the average queue size exceeds some limit for some amount of time.

I have the queuing implemented and tested, and am trying various safety strategies.
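
Roughly, the queuing alternative could look like the sketch below: an unbounded per-client backlog drained by a feeder goroutine, with a safety limit on the backlog size. All names here are hypothetical, not the actual implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// outQueue is a grow-as-needed per-client queue. enqueue never blocks; a
// feeder goroutine drains the backlog into the client's bounded out channel.
type outQueue struct {
	mu      sync.Mutex
	backlog [][]byte
	notify  chan struct{} // signals the feeder that new messages arrived
	max     int           // safety limit; exceeding it means the client is too slow
}

func newOutQueue(max int) *outQueue {
	return &outQueue{notify: make(chan struct{}, 1), max: max}
}

// enqueue returns false when the backlog exceeds the safety limit,
// signalling that the client should be disconnected.
func (q *outQueue) enqueue(msg []byte) bool {
	q.mu.Lock()
	q.backlog = append(q.backlog, msg)
	n := len(q.backlog)
	q.mu.Unlock()
	select {
	case q.notify <- struct{}{}:
	default:
	}
	return n <= q.max
}

// feed drains the backlog into out; it blocks only in this goroutine,
// never in the router's critical section.
func (q *outQueue) feed(out chan<- []byte) {
	for range q.notify {
		for {
			q.mu.Lock()
			if len(q.backlog) == 0 {
				q.mu.Unlock()
				break
			}
			msg := q.backlog[0]
			q.backlog = q.backlog[1:]
			q.mu.Unlock()
			out <- msg
		}
	}
}

func main() {
	q := newOutQueue(1000)
	out := make(chan []byte, 16)
	go q.feed(out)
	if !q.enqueue([]byte("hello")) {
		fmt.Println("client too slow, disconnect")
	}
	fmt.Println(string(<-out))
}
```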

@martin31821
Contributor Author

I also thought about having an infinitely sized queue. Another approach, which might reduce memory pressure, would be to actively send errors for pending requests when there are messages in the queue, which is IMHO compliant with the WAMP spec (something like "The router can't process your request at this time").

However, this would require splitting messages into 'droppable' (e.g. PUBLISH) and 'must be processed' (e.g. YIELD) categories.
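
A tiny sketch of what such a split might look like; the message kinds below are just illustrative constants, not the wamp package's actual types:

```go
package main

import "fmt"

type msgKind int

const (
	kindPublish msgKind = iota // event delivery: safe to drop for slow clients
	kindEvent
	kindYield // RPC result fragments: must reach the caller
	kindResult
)

// droppable reports whether a message of this kind may be discarded for a
// slow receiver without corrupting protocol state.
func droppable(k msgKind) bool {
	switch k {
	case kindPublish, kindEvent:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(droppable(kindEvent))  // true
	fmt.Println(droppable(kindResult)) // false
}
```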

@brick0

brick0 commented Jan 21, 2019

Hitting this issue in a project and would like to try the fix if it's available in any branch.
I'm interested in a solution that may drop messages to slow clients, but never blocks the router.

@gammazero
Owner

@brick0, @martin31821 Some discussion of solutions...

The key to fixing this is to avoid ever making the router (broker or dealer) block, as that stops everything. Even small periods of blocking can cumulatively result in a major bottleneck. The router used to just drop messages that would block, but a small wait was added to allow a client to catch up after a message burst. This is what is causing the problem when clients repeatedly block.

There are a few ways that I see to approach this:

  1. Just drop messages if a receiving client blocks (send queue full). This keeps things working, but can lose messages if a burst of messages is sent. This is particularly bad with RPC, and basically completely broken for progressive requests/results. Also, this may not be acceptable at all if a critical message gets dropped and neither the sending nor the receiving client has any way to know.

  2. Router uses large or growable send queues for each client, and disconnects the client (rather than dropping individual messages) after the queue reaches some limit.
    This can still break in the case of progressive RPC results. Imagine streaming a large body of data to a client that requested it, where the callee produces content (progressive results) much faster than the caller can consume it. The router either eats up all memory queuing results, or disconnects a caller that cannot receive any more messages.

  3. Back-pressure on sending client. In this approach the router calculates the routing (generates message + recipient(s)) and then lets the sending client's goroutine actually do all the sending. This works in the RPC case, but for pub/sub this allows a single slow client, or one busy receiving progressive RPC results, to impact delivery of events to all other clients.

  4. Hybrid approach (what I am implementing/testing)

  • Flow control for progressive RPC, so that sender is blocked until recipient can receive more messages, without queuing progressive messages in router.
  • Configurable router send queue size limit.
  • Configurable behavior when client's send queue is at limit:
    a) Drop messages
    b) Disconnect client
    c) Block sender

I think the configurable behavior is necessary, since this depends on the use of WAMP. Consider a publisher that publishes events at high frequency. If not all clients can keep up and consume events fast enough, in some uses it may be best to discard events. In other uses it may be best to disconnect the slow client after the limit of queued messages is reached. In other cases the correct thing may be to block the publisher until the subscribers have consumed more events.
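
As a rough sketch, the configuration for the hybrid approach might look something like this (all type and field names are hypothetical, not the actual nexus config):

```go
package main

import "time"

// QueueFullBehavior selects what the router does when a client's send
// queue is at its limit.
type QueueFullBehavior int

const (
	DropMessage      QueueFullBehavior = iota // a) drop messages
	DisconnectClient                          // b) disconnect the slow client
	BlockSender                               // c) back-pressure on the sender
)

// RouterQueueConfig is a hypothetical per-router (or per-realm) setting.
type RouterQueueConfig struct {
	SendQueueSize int               // outbound channel size per client
	OnQueueFull   QueueFullBehavior // behavior when the queue is full
	BlockTimeout  time.Duration     // optional cap on how long a sender may block
}

func main() {
	_ = RouterQueueConfig{
		SendQueueSize: 64,
		OnQueueFull:   DropMessage,
		BlockTimeout:  5 * time.Second,
	}
}
```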

Any additional thoughts before I PR this?

@martin31821
Contributor Author

Maybe it's best to discuss this with the other WAMP implementors on Gitter? I think having back-pressure requires a bit of work on the spec too, so their opinion on this topic would be interesting as well.

I'd vote against dropping any messages, as this may cause a client implementation to block for an undefined amount of time (possibly for the lifetime of a connection). Flow control seems reasonable to me, but I'm unsure how to implement it in a client in a nice way.

For PubSub, a solution might be to specify a publish option 'droppable': true to indicate that the router may drop this publication for specific clients. In our use case we have a live video stream running over WAMP to achieve the lowest possible bandwidth, where dropping messages would result in slightly degraded image quality but not total state corruption.

A configurable send queue size limit sounds good to me, although I would keep it as small as possible in production deployments.

@gammazero
Owner

gammazero commented Jan 30, 2019

@martin31821 To clarify, all of the proposed implementation is in the router. The RPC flow control is a special case of back-pressure that happens regardless of the configured queue behavior, and would prevent queuing large amounts of progressive results even if the queue size were large. The goroutine handling incoming messages for the sending session blocks if the receiving session still has pending progressive RPC messages. In your case it would be up to your client to drop messages as it saw fit.

If there is no special case for RPC, then this can be simplified to:

  • Configurable session outbound queue size in router [small default size]
  • Block sending session (back-pressure) when recipient queue full
  • When sender session is blocked, do one of the following:
    a) Continue blocking indefinitely [default behavior]
    b) Drop message to blocking recipient session(s) after timeout

The key point here is that if there is any blocking, it is only the sending session blocking on the receiving session. The router and all other sessions continue to operate.
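
A minimal sketch of that behavior, blocking only the goroutine for the sending session, with an optional timeout for dropping (illustrative only, not the actual implementation):

```go
package main

import (
	"fmt"
	"time"
)

// sendWithBackpressure blocks only the goroutine handling the *sending*
// session. With timeout <= 0 it blocks indefinitely (default behavior a);
// otherwise it gives up and drops the message after timeout (behavior b).
func sendWithBackpressure(out chan<- []byte, msg []byte, timeout time.Duration) bool {
	if timeout <= 0 {
		out <- msg // block until the recipient's queue has room
		return true
	}
	t := time.NewTimer(timeout)
	defer t.Stop()
	select {
	case out <- msg:
		return true
	case <-t.C:
		return false // recipient still full: drop for this recipient only
	}
}

func main() {
	out := make(chan []byte, 1)
	out <- []byte("already queued")
	ok := sendWithBackpressure(out, []byte("next"), 100*time.Millisecond)
	fmt.Println("delivered:", ok) // false: recipient queue stayed full
}
```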

Maybe session meta information could flag sessions that are blocked and sessions that are causing blocking.

And yes, a Gitter discussion would probably yield some insight.

@martin31821
Contributor Author

Ah, I see the implementation approach for RPC and progressive calls/results now. That's good IMHO.

@brick0

brick0 commented Jan 30, 2019

@gammazero In a pub/sub setup where there are some transiently slow receivers, those receivers should not impact the whole system by applying back-pressure to the producer (the sending session).
I think having a configurable queue size per receiving session (in our case we'd probably configure a large queue) and dropping messages to that queue just for that receiver works best for pub/sub.
Each behavior is useful in different scenarios; I'm not sure whether these behaviors could be configured per realm, so that the router would behave differently per realm.
In our case we track loss and recover slow receivers separately, but the sending session should not have to block.

@gammazero
Owner

@brick0 Very useful feedback. It makes sense for the configurable behavior to be in the realm config. It also indicates that the behavior may need to be different for the broker and the dealer.

gammazero added a commit that referenced this issue Mar 14, 2019
This fixes issue #159.  When a client leaves the realm or is disconnected by the router, do not attempt to send more messages to the client.

The websocket and raw socket servers' outbound per-client channel size is configurable.

Progressive results use backpressure to limit influx of results when results are being produced faster than they are being consumed.
@gammazero
Owner

@martin31821 Now that the PR is finally merged, if you have any time to check that it fixes the reported problems, I would be very thankful.

@martin31821
Contributor Author

I'm going to take a look over the weekend.

@martin31821
Contributor Author

@gammazero I can confirm that it fixes the issue I originally mentioned!

@gammazero
Owner

Fixed by #167
