
Only store history when there are already subscribers #50

Closed · banks opened this issue Dec 21, 2015 · 9 comments
@banks (Member) commented Dec 21, 2015

This is an RFC for a new (optional) optimisation that is crucial for us.

I'm going to work on implementing this today on a fork and will hopefully have a PR, but wanted to document the rationale and design first.

Problem

Right now, if history/recovery is enabled, Centrifugo will store up to history_size messages for every channel published to in the last history_lifetime seconds.

This makes designs with a high degree of fan-out impractically expensive.

In my case I have about 5k messages a second being published (after fan-out), where some messages might be published into tens of thousands of channels - one for each "watcher". Even with no subscribers active and modest history settings, that means I need tens or hundreds of GB of Redis memory which is all doing nothing - never read by anyone.

In the vast majority of cases, a user who is watched by tens of thousands of people will likely only have a very small fraction of them currently online and subscribed to Centrifugo, and most users, who have few watchers, will have zero watchers online most of the time. I expect this optimisation to reduce Redis memory requirements (and operations per second) by several orders of magnitude, with almost no change to app behaviour (see the Edge Case section).

Solution Overview

I propose an optimisation that ensures we simply don't store history when it is not needed by "active" subscribers. "Active" means: online now, or online within the last history_lifetime seconds (and so possibly about to reconnect, expecting recovery).

The logic below is almost transparent to the application - it behaves identically to the current logic, but only uses Redis memory for data that is actually delivered.

At a high level it's simple:

  • Check whether engine.publish() actually delivered to anyone
  • Check whether history already exists (i.e. something was listening within the last history_lifetime)
  • Add the message to the history key if and only if at least one of the above checks indicates an active subscriber for that channel, as sketched below
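
In code, the storage decision collapses to a single condition. Here is a minimal Go sketch of it - the function and parameter names are illustrative, not Centrifugo's actual API:

```go
// shouldStoreHistory captures the rule above.
func shouldStoreHistory(delivered, historyExists bool) bool {
	// delivered:     engine.publish() reached at least one active subscriber
	// historyExists: the history key is still present, i.e. something was
	//                listening within the last history_lifetime seconds
	return delivered || historyExists
}
```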

Implementation

This should be relatively simple. If I were designing from scratch I'd be tempted not to have separate publish and addHistory methods in the engine interface, as they are fairly closely linked. Indeed, with that change it would be possible to implement this entire request in the publish method, potentially even as a Lua script running entirely inside Redis.
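
For illustration only, the "entirely inside Redis" version could be a single Lua script invoked from Go. The sketch below assumes the redigo client; the key layout, argument order and script are made up for the example, not Centrifugo's actual code:

```go
package engine

import "github.com/garyburd/redigo/redis"

// publishAndStoreScript publishes a message and appends it to the channel's
// history list only when someone received it or history already exists.
var publishAndStoreScript = redis.NewScript(2, `
local delivered = redis.call("PUBLISH", KEYS[1], ARGV[1])
if delivered > 0 or redis.call("EXISTS", KEYS[2]) == 1 then
  -- delivered now, or something was listening recently: keep the message
  redis.call("LPUSH", KEYS[2], ARGV[1])
  redis.call("LTRIM", KEYS[2], 0, tonumber(ARGV[2]) - 1)
  redis.call("EXPIRE", KEYS[2], ARGV[3])
end
return delivered
`)

// publishWithHistory runs the whole operation in one round trip and returns
// how many subscribers received the message.
func publishWithHistory(conn redis.Conn, channel, historyKey string, msg []byte, size, lifetimeSec int) (int, error) {
	return redis.Int(publishAndStoreScript.Do(conn, channel, historyKey, msg, size, lifetimeSec))
}
```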

But in the interest of minimising changes to the core Centrifugo code, I propose we implement it with these minimal changes:

  • Change the Engine.publish signature to publish(chID ChannelID, message []byte) (bool, error), where the bool return value is true when we delivered to at least one active subscriber and false otherwise.
  • Change addHistory's options struct to include a new option, OnlySaveIfActive bool. If this is set to true then addHistory should only add the new event if the history key already exists. (A sketch of both changes follows this list.)
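
A sketch of what those two changes might look like in Go - an approximation of the engine interface for illustration, not the exact Centrifugo definitions:

```go
type ChannelID string

// addHistoryOpts approximates the existing history options struct with the
// proposed flag added.
type addHistoryOpts struct {
	Size             int  // history_size
	Lifetime         int  // history_lifetime, in seconds
	OnlySaveIfActive bool // only append if the history key already exists
}

type Engine interface {
	// publish reports whether the message was delivered to at least one
	// active subscriber.
	publish(chID ChannelID, message []byte) (bool, error)
	addHistory(chID ChannelID, message []byte, opts addHistoryOpts) error
}
```

The core code would call publish first and pass OnlySaveIfActive: !delivered to addHistory, which reproduces the "delivered OR history already exists" rule from the overview.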

Engine Compatibility

  • The changes to publish are efficient to implement in the memory engine, since it already has state about subscribers.
  • They are also efficient in the Redis engine, since PUBLISH returns the number of active subscriptions the message was delivered to (http://redis.io/commands/publish) - see the sketch after this list.
  • Even if future engines can't know this efficiently - e.g. they use a third-party service that doesn't return that info, or they queue publishes asynchronously - they can simply return true for every call, and you end up with exactly the same behaviour as now: suboptimal but totally correct.
  • The semantics of the current client code are that we only recover on disconnection, so messages delivered before a client first connects are never read by the client anyway. Thus it's safe to simply not store them at all if there is no client connected now and none has been for the last history_lifetime. Clients end up with identical deliverability guarantees - indeed they couldn't tell the difference with or without this.
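
As a rough illustration of the Redis point, a publish implementation could use the PUBLISH reply directly. This sketch assumes the redigo client and an illustrative engine struct; only the use of the reply count is the point:

```go
package engine

import "github.com/garyburd/redigo/redis"

// redisEngine and its pool field are illustrative stand-ins.
type redisEngine struct {
	pool *redis.Pool
}

// publish sends the message and reports whether Redis delivered it to at
// least one subscriber: PUBLISH replies with the number of clients that
// received the message.
func (e *redisEngine) publish(chID string, message []byte) (bool, error) {
	conn := e.pool.Get()
	defer conn.Close()
	receivers, err := redis.Int(conn.Do("PUBLISH", chID, message))
	if err != nil {
		return false, err
	}
	return receivers > 0, nil
}
```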

Edge Case (resolved with #148)

I've said this is *almost* completely transparent to the client, which is true. There is one case in which it is not, which I regard as an edge case and acceptable given Centrifugo's "best effort" general design.

Given the proposed optimisation, consider the following:

  • Client connects and subscribes to foo
  • No messages are published before the client loses its connection (the phone goes into a tunnel, for example)
  • While the client is offline, message A is published into foo
  • Since there are no active subscribers (the dropped connection's pub/sub state has been cleaned up) AND no existing history, this message is dropped, not saved
  • Client reconnects within history_lifetime with the recover option, but doesn't get A because it was never saved

Personally I don't consider this a big problem. However, given that it is a behaviour change, we could consider making this optimisation configurable in case others find it unacceptable.

Self-subscriptions

The one other case that doesn't affect me but should be pointed out: if you are using the Websocket API to allow publishing (i.e. publish: true in config), then these messages will always be saved due to the client API design - a client can't publish to a channel unless they have a subscription to it, which means publish() will always see at least one active subscription.

This is no worse than the current behaviour and is probably expected/desirable anyway, so I don't think it's an argument against the optimisation.

@roytan883 commented

About the Edge Case:

Could you treat a channel as active when at least one user has been online (subscribed) at some point within the history_lifetime TTL?

Consider the following:

  • Client connects and subscribes to foo (where history_lifetime is 1 hour)
  • No messages are published before the client loses its connection (the phone goes into a tunnel, for example)
  • While the client is offline, message A is published into foo
  • There are no active subscribers (the dropped connection's pub/sub state has been cleaned up) AND no existing history, but foo has had a client subscribed within the history_lifetime TTL (1 hour), so foo is considered active and the message is saved in memory or Redis for channel foo
  • Client reconnects within history_lifetime with the recover option, and gets A

I think this largely solves the edge case. If a channel has had no subscription within the TTL it is inactive, otherwise it is active. A channel with no subscriptions for a long time is always inactive and costs no memory.

Implementing this is also simple: a single lastClientSubTime field compared against the TTL can do the job.
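
A rough sketch of that check, with hypothetical names (this is just the idea, not Centrifugo code):

```go
package engine

import "time"

// channelActivity is a hypothetical per-channel record; the field would be
// updated on every subscribe (and, per the later comments, unsubscribe).
type channelActivity struct {
	lastClientSubTime time.Time
}

// activeWithinLifetime reports whether any subscription activity happened
// within the history_lifetime TTL.
func (a channelActivity) activeWithinLifetime(lifetime time.Duration) bool {
	return time.Since(a.lastClientSubTime) <= lifetime
}
```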

@FZambia (Member) commented Dec 11, 2016

@roytan883 that looks like a pretty reasonable and useful suggestion.

I doubt, though, that it's possible to implement in the way you've described - two reasons against:

  • lastClientSubTime is not the correct value to rely on in this case, as a client connection can live for days and the subscription could have been created a week ago, for example. We could look at the unsubscribe time instead - i.e. something like lastClientActivityTime. But...
  • ...lastClientSubTime (or lastClientActivityTime from the first point) is not known on every Centrifugo node. A new message can come to any node, and that node may have no information about lastClientSubTime for this subscription. I am not sure we want to keep this in Redis...

I have another idea - I can't decide yet whether it's pretty or ugly :) Or whether it will even work :)

Anyway: when the last user in a channel disconnects from a node, we usually unsubscribe that node from the channel in Redis. What if we don't unsubscribe the node until the history TTL expires? This way we keep the subscription on the Redis channel and the message will be saved into history. This behaviour must be optional, and it's not simple to implement as far as I can see.
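
For what it's worth, a rough sketch of such a delayed unsubscribe (purely illustrative names, no locking or integration with the real engine shown):

```go
package engine

import "time"

// delayedUnsubscriber postpones the actual Redis UNSUBSCRIBE for a channel
// until the history TTL has passed, cancelling the timer if someone
// re-subscribes in the meantime.
type delayedUnsubscriber struct {
	timers      map[string]*time.Timer
	unsubscribe func(channel string) // performs the real Redis UNSUBSCRIBE
}

func (d *delayedUnsubscriber) lastSubscriberLeft(channel string, historyTTL time.Duration) {
	d.timers[channel] = time.AfterFunc(historyTTL, func() {
		d.unsubscribe(channel) // nobody came back within the TTL
	})
}

func (d *delayedUnsubscriber) subscriberJoined(channel string) {
	if t, ok := d.timers[channel]; ok {
		t.Stop() // someone came back: keep the Redis subscription alive
		delete(d.timers, channel)
	}
}
```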

@roytan883 commented Dec 12, 2016

@FZambia lastClientSubTime is not a replacement for your current history_drop_inactive logic. It is just additional logic that should be added on top of the history_drop_inactive logic.

For this scenario:

A client connection can live for days and the subscription could have been created a week ago, for example.

lastClientSubTime will treat this channel as inactive. But your rule:

check if engine.publish() actually delivered to anyone

should work and mark this channel as active, because the channel has a subscriber.

So a channel should be considered active if any one of these three conditions holds:

  • check if any subscription or unsubscription happened within the history_lifetime TTL (from now - TTL to now)
  • check if engine.publish() actually delivered to anyone
  • check if history already exists (i.e. something was listening in the last history_lifetime)

Also I think the first rule can optimize history_drop_inactive performance in the real world. Just save lastClientSubTime in memory or Redis for each channel. For most active channels, where users go online and offline frequently, lastClientSubTime will usually fall within the TTL. So in most scenarios there is no need to do the engine.publish() and "history already exists" inspections, which may be much slower than a TTL check.

lastClientSubTime is maybe the wrong name; perhaps lastClientSubOrUnsubTime would be better. Both a subscription and an unsubscription within the TTL should be treated as activity events.

@FZambia (Member) commented Dec 14, 2016

Sorry, I am not sure I understood you correctly - especially the part about any of the 3 points (I only understand point 1). We already check whether engine.publish() actually delivered to anyone, for example.

Also I think the first rule can optimize the history_drop_inactive performance in the real world.

history_drop_inactive already has no performance overhead.

I understand what you want here and how we can solve it (my suggestion above and your suggestion from point 1). Maybe you could describe more precisely what you mean in the other parts of your previous comment.

@banks (Member, Author) commented Jan 31, 2017

I think @roytan883 means: if every channel has a Redis key containing a "last activity" timestamp, then there is no need to use the current "try to publish and see if anyone gets it, then try to save only if history exists" technique.

But I'm not sure it's an optimisation. Currently we can do it all in one round trip with a Redis script anyway, so the cost is extremely low. The only real saving would be that if you checked the timestamp first you could possibly avoid serialising the message entirely - but then the active case would require two round trips, one to check and one to actually publish/save, which is almost certainly much worse performing than now.

I think we should forget about this suggestion being an optimisation - either it isn't one, or the gain is likely insignificant - and just consider it on the basis of less surprising behaviour.

One alternative that occurs to me: we could publish "null" messages to a channel on client subscribe or unsubscribe. They would consume almost no memory and would get the desired behaviour without changing the publish/history logic or adding new keys to Redis.

@FZambia how easy would it be to introduce a null message that is simply ignored by the engine code that handles subscribed messages, but is enough to keep a history channel alive for the activity TTL? It seems easy and less hacky than the alternatives - especially if we only publish them when the channel IS inactive at the time (that might be harder, though).

@FZambia (Member) commented Jan 31, 2017

@banks it's an interesting idea - I had not thought about this from the angle that we can just create the history key with a TTL or prolong the existing one's TTL.

Let me clarify - you mean "null" messages that would be published in Redis but not actually added into history, and not published into the channel, right? The goal is just to touch the history TTL.

Maybe we should do this as a separate routine, without mixing the publish logic with such a history touch. It's an implementation detail of course, but it could be semantically cleaner and it's not exposed to the public API anyway. We also need to think about the mass resubscription case after reconnecting to Redis, and how this will affect existing metrics.
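
As a very rough sketch of such a separate touch routine against Redis (assuming the redigo client; the marker-key approach and key name are illustrative and not necessarily what #148 implements):

```go
package engine

import "github.com/garyburd/redigo/redis"

// touchHistoryActivity marks a channel as recently active for history
// purposes without delivering anything to subscribers, using a tiny marker
// key that expires after history_lifetime.
func touchHistoryActivity(conn redis.Conn, channel string, lifetimeSec int) error {
	_, err := conn.Do("SETEX", "centrifugo.touch."+channel, lifetimeSec, 1)
	return err
}
```

The "only save if active" check would then treat a channel as active if either the history key or this marker key exists.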

@banks (Member, Author) commented Jan 31, 2017 via email

@FZambia (Member) commented Jan 31, 2017

Got it, thanks! I will try to make a proof of concept - maybe some details will arise in the process.

@FZambia (Member) commented Feb 5, 2017

@roytan883 @banks I ended up with pull request #148 - please have a look!
