Standalone REDIRECT, Pub/Sub and failover #1780

gmbnomis · 2025-02-25T21:10:52Z

There is a fundamental difference w.r.t. Pub/Sub between standalone and cluster
mode:

- In cluster mode, pub/sub operations are role-agnostic: messages propagate seamlessly across primaries and replicas.
- In standalone mode this is not the case, although publishes are replicated from the primary to its replicas.
  Furthermore, PUBLISH is not a write command. This means that one can publish
  on a read-only replica** and the message will only be seen on that replica.

(This is why sentinel issues a "CLIENT KILL TYPE PUBSUB" in addition to a
"CLIENT KILL TYPE NORMAL" in order to force a reconnect when changing the role
of a node.)

**: With one exception: EVAL_RO does not allow publishing. (Interestingly, in this context PUBLISH is treated as a write command.)

Currently, neither REDIRECT nor the FAILOVER command affects publishing commands or connections in subscribed
mode. The following scenarios are possible:

- A client that is connected to the primary and e.g. only issuing PUBLISH
  commands will be on a replica after a FAILOVER. Now, PUBLISH will only publish
  locally to this replica, i.e. subscribers connecting to the new primary won't
  receive these messages anymore.
- A client that is connected to the primary and is in subscription mode won't
  notice a role switch either. However, since published messages are
  replicated, it will receive messages that were published on the primary (and
  also those published on the replica).

Still, there is a user-visible change: a subscriber connected to another node is not counted
in the reply of a PUBLISH command (the number of clients the message was sent to).

This means that in contrast to a cluster failover, there is a chance that a
standalone failover creates two disjoint pub/sub domains. And, currently,
a client in REDIRECT mode will not be notified about role changes
if the connection is used for pub/sub only. This is in contrast to the "smooth switchover"
idea of REDIRECT, IMHO.
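To make the two scenarios concrete, here is a minimal toy model in Python. It is only a sketch of the behavior described above, not server code: the `Node` class, its fields, and the `failover` helper are invented for illustration. The one rule it encodes is that a publish is delivered to local subscribers and forwarded to replicas only if the node is currently the primary.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """Toy stand-in for a standalone server (hypothetical, for illustration)."""
    name: str
    role: str  # "primary" or "replica"
    replicas: list = field(default_factory=list)
    subscribers: dict = field(default_factory=dict)  # client name -> inbox

    def subscribe(self, client: str) -> list:
        """Register a subscriber and return its inbox."""
        return self.subscribers.setdefault(client, [])

    def _deliver(self, message: str) -> int:
        for inbox in self.subscribers.values():
            inbox.append(message)
        return len(self.subscribers)

    def publish(self, message: str) -> int:
        """Deliver locally; only a primary also replicates the publish."""
        receivers = self._deliver(message)
        if self.role == "primary":
            for replica in self.replicas:
                replica._deliver(message)
        return receivers  # the reply counts local subscribers only


def failover(old_primary: Node, new_primary: Node) -> None:
    """Swap the roles; client connections stay on the node they were on."""
    old_primary.replicas.remove(new_primary)
    old_primary.role, new_primary.role = "replica", "primary"
    new_primary.replicas.append(old_primary)


a = Node("A", "primary")
b = Node("B", "replica")
a.replicas.append(b)

inbox_a = a.subscribe("sub-on-A")    # subscribed before the failover
a.publish("before")                  # publisher is connected to A

failover(a, b)                       # A is now a replica of B
inbox_b = b.subscribe("sub-on-B")    # a new subscriber connects to the new primary

reply_a = a.publish("after")         # publisher is still connected to A
reply_b = b.publish("on-B")          # another client publishes on the new primary

print(reply_a, reply_b)
print(inbox_a)
print(inbox_b)
```

In this model the subscriber on B never sees "after", while the subscriber that stayed on A sees everything, and each reply counts only the subscriber on the node that handled the PUBLISH.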

Solution options/proposals: (for simplicity these proposals don't make a distinction
between "regular", "pattern", and "sharded" variants. The proposals apply to all variants.)

1. Make pub/sub fully agnostic to role (like in cluster mode).

   While ideal, replicating cluster-mode Pub/Sub in standalone is impractical as it would require significant
   architectural changes.
2. Don't change the current behavior and document this limitation of REDIRECT mode.

   This leaves users vulnerable to subtle message-loss edge cases.
3. Issue REDIRECTs and kill client connections if necessary:
- Modify REDIRECT logic to treat PUBLISH as a "write-like" command, redirecting
  it to the primary regardless of the client’s mode (even in READONLY).
  This avoids inconsistencies where publishing on a replica could isolate messages.
  
  Rationale for redirecting always instead of READWRITE only: READONLY means that the client
  is willing to read replicated data. It does not mean that the client is willing to accept domain splits.
  
  Note that there is a case in which we won't be able to redirect: Publishing in a script without keys (e.g. eval 'return redis.call("publish", "foo", "bar")' 0). This is because we can't issue a MOVED/REDIRECT from within a partly executed script. We need to return a different error instead.
  (However, this looks like an uncommon use case for a script.)
- SUBSCRIBE in REDIRECT mode is redirected to the primary in READWRITE mode. (rationale: READONLY means that the client is willing to read replicated data. We can assume that replicated publishes are fine as well.)
- A connection in subscribe mode is killed (or unsubscribed?) on a role change if the client
  is in REDIRECT READWRITE mode.
  
  Note: This is not the first connection type that is killed on a role change. Connections waiting on a blocked command are already killed today.
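The three bullets of this proposal can be sketched as a small decision function. This is a hypothetical illustration in Python, not the server's command dispatch: the names `Action`, `on_pubsub_command`, `kill_on_role_change`, and all flag names are made up, and it assumes that clients without the redirect capability keep today's behavior.

```python
from enum import Enum


class Action(Enum):
    EXECUTE = "execute locally"
    REDIRECT = "reply with a redirect to the primary"
    ERROR = "reply with a different (non-redirect) error"


PUBLISH_CMDS = {"PUBLISH", "SPUBLISH"}
SUBSCRIBE_CMDS = {"SUBSCRIBE", "PSUBSCRIBE", "SSUBSCRIBE"}


def on_pubsub_command(cmd: str, *, is_primary: bool, capa_redirect: bool,
                      readonly: bool, in_script: bool = False) -> Action:
    """Decide what a node does with a pub/sub command under the proposal."""
    cmd = cmd.upper()
    if cmd not in PUBLISH_CMDS | SUBSCRIBE_CMDS:
        raise ValueError(f"not a pub/sub command covered by this sketch: {cmd}")
    if is_primary or not capa_redirect:
        return Action.EXECUTE  # assumption: behavior stays as it is today
    if cmd in PUBLISH_CMDS:
        # Publishing is write-like in READONLY and READWRITE alike, but a
        # partly executed script cannot be answered with a redirect.
        return Action.ERROR if in_script else Action.REDIRECT
    # Subscribing: READONLY clients accept replicated publishes,
    # READWRITE clients are sent to the primary.
    return Action.EXECUTE if readonly else Action.REDIRECT


def kill_on_role_change(*, subscribed: bool, capa_redirect: bool,
                        readonly: bool) -> bool:
    """Third bullet: drop subscribed connections of REDIRECT READWRITE clients."""
    return subscribed and capa_redirect and not readonly


on_replica = dict(is_primary=False, capa_redirect=True)
print(on_pubsub_command("PUBLISH", readonly=True, **on_replica))
print(on_pubsub_command("PUBLISH", readonly=False, in_script=True, **on_replica))
print(on_pubsub_command("SUBSCRIBE", readonly=True, **on_replica))
print(on_pubsub_command("SUBSCRIBE", readonly=False, **on_replica))
```

Whether a killed connection should instead be unsubscribed is left open above, so the sketch only returns a yes/no decision for that case.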

I prefer option 3 (with complementary documentation). Although the change is substantial, it is the simplest change that preserves consistency I can think of.


hpatro · 2025-02-25T23:38:53Z

Thanks for summarizing the tricky aspects of Pub/Sub. I also mostly agree with option 3. Oddly enough, people have been managing up to version 8 of the product without complaining about the disjoint behavior!

gmbnomis · 2025-02-26T00:00:10Z

> ... Oddly enough, people have been managing up to version 8 of the product without complaining about the disjoint behavior!

I am guessing here, but this has probably been addressed in relevant standalone scenarios. I am most familiar with Sentinel deployments. The Sentinel protocol ensures that a client will be disconnected on a role change and on reconnection, the protocol will ensure that the client re-connects to a primary (if it wants to be connected to a primary, which it will when using pub/sub). The FAILOVER command is a rather recent addition and does not work with Sentinel (shameless plug: I would like to change that, see #1292).
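For readers less familiar with Sentinel, the reconnect step can be sketched roughly as follows in Python. This is a simplified illustration, not a client implementation: `find_primary`, `ask_sentinel`, and `role_of` are hypothetical names, with the two hooks standing in for sending `SENTINEL GET-MASTER-ADDR-BY-NAME <service>` to a Sentinel and `ROLE` to the returned node. The addresses and the service name are made up.

```python
from typing import Callable, Iterable, Optional, Tuple

Address = Tuple[str, int]


def find_primary(
    sentinels: Iterable[Address],
    service: str,
    ask_sentinel: Callable[[Address, str], Optional[Address]],
    role_of: Callable[[Address], str],
) -> Address:
    """Return an address that a Sentinel reports and that confirms its own role.

    ask_sentinel and role_of are hypothetical hooks supplied by the caller.
    """
    for sentinel in sentinels:
        try:
            candidate = ask_sentinel(sentinel, service)
        except OSError:
            continue  # this Sentinel is unreachable, try the next one
        # Verify the answer on the node itself before using it.
        if candidate is not None and role_of(candidate) == "master":
            return candidate
    raise ConnectionError(f"no Sentinel returned a verified primary for {service!r}")


# Fake topology for illustration: the first Sentinel is down, the second answers.
roles = {("10.0.0.1", 6379): "slave", ("10.0.0.2", 6379): "master"}


def fake_ask(sentinel: Address, service: str) -> Optional[Address]:
    if sentinel == ("10.0.1.1", 26379):
        raise OSError("unreachable")
    return ("10.0.0.2", 6379)


primary = find_primary(
    [("10.0.1.1", 26379), ("10.0.1.2", 26379)], "mymaster", fake_ask, roles.__getitem__
)
print(primary)
```

A pub/sub client that is disconnected on a role change and reconnects this way ends up on the current primary again, which is why the split does not show up in such deployments.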

As the goal of the new REDIRECT mode is to ensure that the client will notice itself when it is time to connect to another node (without relying on an external entity like Sentinel), I think we are running into this problem right now.

zuiderkwast · 2025-02-26T10:47:30Z

@soloestoy PTAL 👀

soloestoy · 2025-02-26T11:59:00Z

The consideration of these aspects was also integrated into the initial design of REDIRECT. Specifically in #325, I emphasized that "the data access commands (read and write) will be redirected", while other non-data access commands have yet to be addressed. This includes PUBSUB as well as additional commands such as CONFIG, PING, INFO, and others.

As we have previously mentioned the PUBSUB issue in Redis, we prefer to address them separately in discussions. This is because in Cluster mode, PUBSUB-related commands do not trigger -MOVED responses. Seems it's the time to discuss.

I would also like to reference issue #307 by @hpatro. We previously attempted to transition sharded pubsub to standalone mode, utilizing replication streams instead of cluster-wide broadcasts, and to mark SPUBLISH as a write command. This seems quite similar to the third point mentioned above, and its implementation is also straightforward, making it a good option.

BTW, this has been a long-standing issue, not a new problem introduced by REDIRECT. I believe the title should be revised to "How to Ensure PUBSUB Accessibility After Switchover" for better precision.

gmbnomis · 2025-02-26T15:58:41Z

@soloestoy It wasn't my intention to create the impression that REDIRECT somehow introduced this problem (notice that I did not file this as a bug).

But it is easy to get the impression that REDIRECT in its current form is equivalent to MOVED in cluster and thus, addresses all use cases. (At least I had this impression until recently and I did not find the pub/sub problem described explicitly anywhere.)

I hope you also subscribe to my statement that "the goal of the new REDIRECT mode is to ensure that the client will notice itself when it is time to connect to another node". Thus, from my point of view, REDIRECT is not the problem, it will hopefully be the crucial part of the solution to reach this goal.

(As said above, my main interest is on Sentinel deployments. I would like to get to the point where clients do not need to be aware of Sentinel and just use redirects to connect to the current primary. And Sentinel would not need to kill those clients after failover anymore)

> ...Seems it's the time to discuss.

I could not agree more 😀

> BTW, this has been a long-standing issue, not a new problem introduced by REDIRECT. I believe the title should be revised to "How to Ensure PUBSUB Accessibility After Switchover" for better precision.

I don't know if I want to broaden it to that level. Switchover in the form of the FAILOVER command does not care at all about either read/write or pub/sub. Sentinel ensures that accessibility, so one could even argue that this is a solved problem.

The focus of this issue for me is that there should be no external entity necessary in this case.

Shall I change it to "How to enhance REDIRECT to ensure PUBSUB Accessibility After Switchover"?

gmbnomis mentioned this issue Feb 25, 2025

[BUG] "CLIENT CAPA redirect" has no effect in scripts #868

Open
