Standalone REDIRECT, Pub/Sub and failover #1780
Comments
Thanks for summarizing the tricky aspects of Pub/Sub. I also mostly agree with option 3. Oddly enough, people have been managing up to version 8 of the product without complaining about the disjoint behavior!
I am guessing here, but this has probably been addressed in relevant standalone scenarios. I am most familiar with Sentinel deployments. The Sentinel protocol ensures that a client will be disconnected on a role change, and on reconnection the protocol ensures that the client reconnects to a primary (if it wants to be connected to a primary, which it will when using pub/sub). The FAILOVER command is a rather recent addition and does not work with Sentinel (shameless plug: I would like to change that, see #1292). As the goal of the new REDIRECT mode is to ensure that the client will notice itself when it is time to connect to another node (without relying on an external entity like Sentinel), we run into this problem right now, I think.
@soloestoy PTAL 👀
The consideration of these aspects was also integrated into the initial design of As we have previously mentioned the PUBSUB issue in Redis, we prefer to address them separately in discussions. This is because in Cluster mode, PUBSUB-related commands do not trigger I would also like to reference issue #307 by @hpatro. We previously attempted to transition sharded pubsub to a standalone mode, utilizing replication streams instead of cluster-wide broadcasts, and to mark SPUBLISH as a write command. This seems quite similar to the third point mentioned above, and its implementation is also straightforward, making it a good option. BTW, this has been a long-standing issue, not a new problem introduced by REDIRECT. I believe the title should be revised to "How to Ensure PUBSUB Accessibility After Switchover" for better precision.
@soloestoy It wasn't my intention to create the impression that REDIRECT somehow introduced this problem (notice that I did not file this as a bug). But it is easy to get the impression that REDIRECT in its current form is equivalent to MOVED in cluster and thus addresses all use cases. (At least I had this impression until recently, and I did not find the pub/sub problem described explicitly anywhere.) I hope you also subscribe to my statement that "the goal of the new REDIRECT mode is to ensure that the client will notice itself when it is time to connect to another node". Thus, from my point of view, REDIRECT is not the problem; it will hopefully be the crucial part of the solution to reach this goal. (As said above, my main interest is in Sentinel deployments. I would like to get to the point where clients do not need to be aware of Sentinel and just use redirects to connect to the current primary. And Sentinel would not need to kill those clients after failover anymore.)
I could not agree more 😀
I don't know if I want to broaden it to that level. Switchover in the form of the FAILOVER command does not care at all about either read/write or pub/sub. Sentinel ensures that accessibility, so one could even argue that this is a solved problem. The focus of this issue for me is that there should be no external entity necessary in this case. Shall I change it to "How to enhance REDIRECT to ensure PUBSUB Accessibility After Switchover"?
There is a fundamental difference w.r.t. Pub/Sub between standalone and cluster
mode:
- In cluster mode, pub/sub operations are role-agnostic: Messages propagate seamlessly across primaries/replicas.
- In standalone mode this is not the case; publishes are replicated, though.
Furthermore, PUBLISH is not a writing command. This means that one can publish
on a read-only replica** and the message will only be seen on that replica.
(This is why sentinel issues a "CLIENT KILL TYPE PUBSUB" in addition to a
"CLIENT KILL TYPE NORMAL" in order to force a reconnect when changing the role
of a node.)
**: With one exception: EVAL_RO does not allow publishing. (Interestingly, in this
context PUBLISH is treated as a writing command.)

Currently, REDIRECT and the FAILOVER command neither impact the publishing
commands nor do they impact connections in subscribed mode. The following
scenarios are possible:
- A client that is connected to the primary and e.g. only issuing PUBLISH
  commands will be on a replica after a FAILOVER. Now, PUBLISH will only publish
  locally to this replica, i.e. subscribers connecting to the new primary won't
  receive these messages anymore.
- A client that is connected to the primary and is in subscription mode won't
  notice a role switch either. However, since published messages are
  replicated, it will receive messages that were published on the primary (and
  also those published on the replica).
  Still, there is a user-visible change: A client on another node will not be reported as
  a client the message was sent to (in the reply of a PUBLISH command).
This means that in contrast to a cluster failover, there is a chance that a
standalone failover creates two disjoint pub/sub domains. And, currently,
a client in REDIRECT mode will not be notified about role changes
if the connection is used for pub/sub only. This is in contrast to the "smooth switchover"
idea of REDIRECT, IMHO.
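
To make the two scenarios above concrete, here is a minimal toy model in Python. It is only a sketch of the behavior described in this issue, not server code: the `Node` class and the `failover` helper are hypothetical names, and the model assumes that publishes are delivered to local subscribers and then replicated downstream only, and that the PUBLISH reply counts local subscribers only.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a standalone server node (not a real API)."""

    name: str
    role: str  # "primary" or "replica"
    replicas: list[Node] = field(default_factory=list)
    subscribers: dict[str, list[list[str]]] = field(default_factory=dict)

    def subscribe(self, channel: str) -> list[str]:
        """Register a subscriber connected to this node and return its inbox."""
        inbox: list[str] = []
        self.subscribers.setdefault(channel, []).append(inbox)
        return inbox

    def publish(self, channel: str, message: str) -> int:
        """Deliver locally, then replicate downstream (never upstream).

        The return value mimics the PUBLISH reply: local receivers only.
        """
        local = self.subscribers.get(channel, [])
        for inbox in local:
            inbox.append(message)
        for replica in self.replicas:
            replica.publish(channel, message)
        return len(local)


def failover(old_primary: Node, new_primary: Node) -> None:
    """Swap roles; existing pub/sub connections are left untouched."""
    old_primary.replicas.remove(new_primary)
    old_primary.role, new_primary.role = "replica", "primary"
    new_primary.replicas.append(old_primary)


a = Node("A", "primary")
b = Node("B", "replica")
a.replicas.append(b)

sub_on_a = a.subscribe("news")  # subscriber connected to the primary A
a.publish("news", "m1")         # publisher connected to the primary A

failover(a, b)                  # A is now a replica, B the primary

sub_on_b = b.subscribe("news")  # new subscriber connects to the new primary
reply_m2 = a.publish("news", "m2")  # old publisher connection is still on A
reply_m3 = b.publish("news", "m3")  # publish on the new primary, replicated to A
```

In this model the subscriber that stayed on A sees every message, the subscriber on B never sees `m2`, and the reply to publishing `m3` on B counts only the subscriber on B even though the message also reaches the one on A.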
Solution options/proposals: (for simplicity these proposals don't make a distinction
between "regular", "pattern", and "sharded" variants. The proposals apply to all variants.)
1. Make pub/sub fully agnostic to role (like in cluster mode).
   While ideal, replicating cluster-mode Pub/Sub in standalone is impractical as it would require significant
   architectural changes.
2. Don't change the current behavior and document this limitation of REDIRECT mode.
   This leaves users vulnerable to subtle message-loss edge cases.
3. Issue REDIRECTs and kill client connections if necessary:
   - Modify REDIRECT logic to treat PUBLISH as a "write-like" command, redirecting
     it to the primary regardless of the client’s mode (even in READONLY).
     This avoids inconsistencies where publishing on a replica could isolate messages.
     Rationale for always redirecting instead of in READWRITE only: READONLY means that the client
     is willing to read replicated data. It does not mean that the client is willing to accept domain splits.
     Note that there is a case in which we won't be able to redirect: publishing in a script without keys
     (e.g. eval 'return redis.call("publish", "foo", "bar")' 0). This is because we can't issue a
     MOVED/REDIRECT from within a partly executed script. We need to return a different error instead.
     (However, this looks like an uncommon use case for a script.)
   - SUBSCRIBE in REDIRECT mode is redirected to the primary in READWRITE mode. (Rationale: READONLY means that the client is willing to read replicated data. We can assume that replicated publishes are fine as well.)
   - A connection in subscribe mode is killed (or unsubscribed?) on a role change if the client
     is in REDIRECT READWRITE mode.
     Note: This is not the first connection type that is killed on a role change. Connections waiting on a blocked command are already killed today.
I prefer option 3 (with complementary documentation). Although the change is substantial, it is the simplest change I can think of that preserves consistency.
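
The rules of option 3 can be summarized as a small decision table. The Python sketch below is only an illustration of the proposal, not an implementation: `decide_pubsub`, `kill_on_role_change`, and the enums are hypothetical names, the client is assumed to have REDIRECT enabled, and only pub/sub commands are modeled (regular, pattern, and sharded variants are treated alike, as stated above).

```python
from enum import Enum


class Role(Enum):
    PRIMARY = "primary"
    REPLICA = "replica"


class Mode(Enum):
    READONLY = "readonly"
    READWRITE = "readwrite"


class Action(Enum):
    EXECUTE = "execute"    # run the command on this node
    REDIRECT = "redirect"  # reply with a redirect to the primary
    ERROR = "error"        # cannot redirect, reply with a different error


PUBLISH_COMMANDS = {"PUBLISH", "SPUBLISH"}
SUBSCRIBE_COMMANDS = {"SUBSCRIBE", "PSUBSCRIBE", "SSUBSCRIBE"}


def decide_pubsub(command: str, role: Role, mode: Mode, in_script: bool = False) -> Action:
    """Proposed handling of a pub/sub command from a REDIRECT-capable client."""
    cmd = command.upper()
    if cmd not in PUBLISH_COMMANDS | SUBSCRIBE_COMMANDS:
        raise ValueError(f"not a pub/sub command modeled here: {command}")
    if role is Role.PRIMARY:
        return Action.EXECUTE
    if cmd in PUBLISH_COMMANDS:
        # Publishing is treated as "write-like": always redirect, even in
        # READONLY, except inside a partly executed script.
        return Action.ERROR if in_script else Action.REDIRECT
    # Subscribing on a replica is acceptable for READONLY clients only.
    return Action.REDIRECT if mode is Mode.READWRITE else Action.EXECUTE


def kill_on_role_change(subscribed: bool, mode: Mode) -> bool:
    """Whether a connection would be killed (or unsubscribed) on a role change."""
    return subscribed and mode is Mode.READWRITE
```

For example, `decide_pubsub("PUBLISH", Role.REPLICA, Mode.READONLY)` yields `Action.REDIRECT`, while `decide_pubsub("SUBSCRIBE", Role.REPLICA, Mode.READONLY)` yields `Action.EXECUTE`.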