[BUG] CLIENT CAPA redirect may result in bouncing redirects during failover #821

Closed
gmbnomis opened this issue Jul 24, 2024 · 10 comments · Fixed by #871

Comments

@gmbnomis
Contributor

gmbnomis commented Jul 24, 2024

Describe the bug

A client using the new CLIENT CAPA redirect feature may see bouncing REDIRECT responses
during a failover initiated by the FAILOVER command, more precisely while the failover is in
the FAILOVER_IN_PROGRESS state.

While the current primary is in the FAILOVER_IN_PROGRESS state, server.primary_host is already
set to the designated primary. However, the designated primary still acts as a replica until the PSYNC FAILOVER command is processed.

The primary will therefore already reply with a REDIRECT (to the designated primary) in this situation. But the latter is still a replica and will REDIRECT the client back to the old primary.

To reproduce

This is hard to reproduce since, under normal circumstances, the duration of the
FAILOVER_IN_PROGRESS state is very short: during this phase, a connection from the primary
to the designated primary is opened to send, among other commands, the PSYNC FAILOVER command.

Nevertheless, e.g. in case of a network glitch, this phase may take much longer (there appears
to be no timeout for it).

I produced this behavior by blocking the client when receiving the PSYNC FAILOVER in syncCommand.

Update: The behavior can also be reproduced by letting the replica sleep, see gmbnomis@ca708a4 for a demonstration.

Expected behavior

During the first phase of a failover (FAILOVER_WAIT_FOR_SYNC), writing clients are paused
while reads can still proceed. During the second phase (FAILOVER_IN_PROGRESS), the same
behavior is expected.

Only after the failover procedure finishes, i.e. when the clients are unpaused again, should
REDIRECT replies be sent (since the new primary is then ready to accept writes).

Additional information

Note that it isn't sufficient to just ensure that we aren't in a "failover" when sending the redirect at

    if (!server.cluster_enabled && c->capa & CLIENT_CAPA_REDIRECT && server.primary_host && !mustObeyClient(c) &&

e.g.:

    if (!server.cluster_enabled && c->capa & CLIENT_CAPA_REDIRECT && server.primary_host && server.failover_state == NO_FAILOVER &&
        !obey_client && (is_write_command || (is_read_command && !c->flag.readonly))) {
        addReplyErrorSds(c, sdscatprintf(sdsempty(), "-REDIRECT %s:%d", server.primary_host, server.primary_port));
        return C_OK;
    }

Instead of REDIRECT, the client will see READONLY errors during FAILOVER_IN_PROGRESS (caused by

    /* Don't accept write commands if this is a read only replica. But

), which, as explained in #319, prevents a smooth switch.

We could add a corresponding server.failover_state == NO_FAILOVER clause to the respective if. However, it looks like MASTERDOWN could be triggered during FAILOVER_IN_PROGRESS as well if replica-serve-stale-data is no (I did not test this).
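
For illustration, such a clause on the read-only replica check could look roughly like this (untested and simplified; the actual condition in processCommand also checks the replica-read-only configuration and uses its exact variable names):

    if (server.primary_host && is_write_command && !obey_client &&
        server.failover_state == NO_FAILOVER) {
        addReplyError(c, "-READONLY You can't write against a read only replica.");
        return C_OK;
    }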

It may also be worth considering whether we should change the order of the if clauses (i.e. pause clients "earlier").

@enjoy-binbin
Member

Thanks for the report. @soloestoy please also take a look.

@soloestoy
Member

This is an interesting scenario, and the initial solution seems to be able to maintain the "client pause write" state until FAILOVER_IN_PROGRESS ends. Additionally, there may also be a similar issue when performing cluster failover in cluster mode, right?

@madolson
Member

Additionally, there may also be a similar issue when performing cluster failover in cluster mode, right?

The redirect for CME is based on whether the node owns the slot, not whether the node is a primary. So, there shouldn't be a loop since one of the nodes always owns the slot.

This is an interesting scenario, and the initial solution seems to be able to maintain the "client pause write" state until FAILOVER_IN_PROGRESS ends

This generally sounds about right. We should be entering a pause situation on the primary instead of redirecting during the failover.

@gmbnomis
Contributor Author

gmbnomis commented Jul 24, 2024

Just to avoid a misunderstanding (I may not have made this clear enough above): During standalone FAILOVER, writing clients will be paused during the entire failover and paused clients are unpaused when the failover finishes (successfully or not).

However, if a client issues the first write command since the beginning of the failover during the time window described above (or a read command in the case of the stale-replica condition), the order of the conditions in processCommand (redirect, ..., readonly, ..., stale-replica, ..., block client during pause) actually makes the client pause ineffective in some situations.

So, we might want to check which of those error replies we don't want to send during a failover. Or we could re-order conditions (or do a combination of the two). However, re-ordering looks too scary to me to even give it a try...

(I am not familiar with clustered setups, so I can't say anything about these. I looked into failover handling in more detail because of the sentinel improvement issue redis/redis#13118)

@soloestoy
Member

I have checked cluster failover in cluster mode. Unlike in standalone mode, a cluster failover is initiated by the replica:

  1. The cluster failover command is executed on the replica;
  2. The replica sends a CLUSTERMSG_TYPE_MFSTART message to the primary and enters manual failover state;
  3. Upon receiving the CLUSTERMSG_TYPE_MFSTART message, the primary enters a pause write state and sets a timeout, then responds to the replica with its offset (along with the CLUSTERMSG_FLAG0_PAUSED flag);
  4. The replica waits for its offset to be equal to that of the primary and then initiates a vote;
  5. The replica is elected as the new primary and takes over the slot;
  6. The replica becomes the new primary and broadcasts;
  7. The old primary receives the new routing table, notices the change in slot ownership, automatically becomes a replica of the new primary, and unpauses.

It can be seen that under normal circumstances, the primary does not unpause until the slot is taken over by its replica and it becomes the new primary. Therefore, in standalone mode, the unpause operation should be performed after receiving the reply to the PSYNC FAILOVER command sent to the replica.

@soloestoy
Member

    if (server.failover_state == FAILOVER_IN_PROGRESS) {
        if (psync_result == PSYNC_CONTINUE || psync_result == PSYNC_FULLRESYNC) {
            clearFailoverState();
        } else {
            abortFailover("Failover target rejected psync request");
            return;
        }
    }
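
(For context: the snippet above is the existing handling of the reply to the PSYNC FAILOVER request. A CONTINUE or FULLRESYNC result finishes the failover, clearing the failover state, at which point clients are unpaused; anything else aborts it.)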

We are already doing it this way now. @gmbnomis, can you provide more details about your issue?

@gmbnomis
Contributor Author

@soloestoy Yes, the locations of pausing and unpausing clients are correct. However, "pausing clients" means that there is a flag indicating that clients issuing new commands (after the pause started) should be blocked.

Actually blocking clients while paused happens here in processCommand:

blockPostponeClient(c);

And that check happens later in processCommand than e.g. the check for redirecting:

if (!server.cluster_enabled && c->capa & CLIENT_CAPA_REDIRECT && server.primary_host && !mustObeyClient(c) &&

As you said, clients are paused during the FAILOVER_IN_PROGRESS state. But this does not mean that they are already in a blocked state. So:

  • If a client is already blocked due to a previous command, all is fine.
  • But if a client isn't blocked yet (because it just connected, or it did not send a write command yet) and sends a write command during the time window described in my original issue, processCommand will reply with a REDIRECT, because the part of processCommand that is responsible for blocking the client isn't reached.
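
For reference, the relevant order of the checks in processCommand, as described above, is roughly:

    1. CLIENT_CAPA_REDIRECT check   -> replies -REDIRECT
    2. read-only replica check      -> replies -READONLY
    3. stale-replica check          -> replies -MASTERDOWN
    4. client pause check           -> blockPostponeClient(c)

Because checks 1-3 are evaluated before check 4, a not-yet-blocked client gets one of those error replies even though writes are paused.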

gmbnomis added a commit to gmbnomis/valkey that referenced this issue Jul 25, 2024
gmbnomis added a commit to gmbnomis/valkey that referenced this issue Jul 25, 2024
gmbnomis added a commit to gmbnomis/valkey that referenced this issue Jul 25, 2024
@gmbnomis
Contributor Author

gmbnomis commented Jul 25, 2024

I found a way to force FAILOVER_WAIT_FOR_SYNC and FAILOVER_IN_PROGRESS respectively by letting the replica sleep (taken from the failover tests).

gmbnomis@9a26fae is a test that:

  1. demonstrates the erroneous REDIRECT response (instead of blocking the client) and the MASTERDOWN response if replica-serve-stale-data no is configured.

    It also demonstrates another issue: When entering FAILOVER_IN_PROGRESS, the call to disconnectAllBlockedClients already disconnects blocked clients that were executing a blocking command with an UNBLOCKED response.

    I think this has two problems:

    1. In general, this is too early. There is no primary to connect to yet.
    2. Analogous to the "MOVED" case in cluster mode, I think we should issue a REDIRECT to those clients (when we unblock the clients in general).

    Should I open a separate issue for this case?

  2. verifies the correct responses in FAILOVER_WAIT_FOR_SYNC

@soloestoy
Member

Hi @gmbnomis, I think I understand your point. The main issue is that during the FAILOVER_IN_PROGRESS period, both nodes may be replicas, causing -REDIRECT to bounce back and forth between the two replicas.

I have seen your PR, but it seems too complicated and does not completely solve the problem (please correct me if I am wrong). I believe a simple and effective approach would be: when the primary becomes a replica, i.e., during the FAILOVER_IN_PROGRESS state, pause the client instead of returning -REDIRECT. Then, once the replica truly becomes the primary, i.e., after receiving the response to PSYNC FAILOVER, the suspended clients' commands are resumed.
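
A minimal sketch of that idea, reusing the condition quoted earlier in this issue (illustration only, not necessarily the exact change merged in #871):

    if (!server.cluster_enabled && c->capa & CLIENT_CAPA_REDIRECT && server.primary_host &&
        !obey_client && (is_write_command || (is_read_command && !c->flag.readonly))) {
        if (server.failover_state == FAILOVER_IN_PROGRESS) {
            /* Both nodes may still be replicas at this point; postpone the client
             * instead of replying -REDIRECT. It is resumed (and, if the failover
             * succeeded, then redirected) once the failover finishes or is aborted. */
            blockPostponeClient(c);
        } else {
            addReplyErrorSds(c, sdscatprintf(sdsempty(), "-REDIRECT %s:%d",
                                             server.primary_host, server.primary_port));
        }
        return C_OK;
    }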

@gmbnomis
Contributor Author

gmbnomis commented Aug 5, 2024

Hi @soloestoy,

I have seen your PR, but it seems too complicated and does not completely solve the problem (please correct me if I am wrong).

My PR solves a different problem: handling clients that are already blocked (not postponed) before the failover begins. So yes, it does not solve the problem described here.

madolson moved this to Optional for rc2 in Valkey 8.0 Aug 5, 2024
github-project-automation bot moved this from Optional for rc2 to Done in Valkey 8.0 Aug 14, 2024
mapleFU pushed a commit to mapleFU/valkey that referenced this issue Aug 21, 2024
Fix valkey-io#821

During the `FAILOVER` process, when conditions are met (such as when the
force time is reached or the primary and replica offsets are
consistent), the primary actively becomes the replica and transitions to
the `FAILOVER_IN_PROGRESS` state. After the primary becomes the replica,
and after handshaking and other operations, it will eventually send the
`PSYNC FAILOVER` command to the replica, after which the replica will
become the primary. This means that the upgrade of the replica to the
primary is an asynchronous operation, which implies that during the
`FAILOVER_IN_PROGRESS` state, there may be a period of time where both
nodes are replicas. In this scenario, if a `-REDIRECT` is returned, the
request will be redirected to the replica and then redirected back,
causing back and forth redirection. To avoid this situation, during the
`FAILOVER_IN_PROGRESS` state, we temporarily suspend the clients that
need to be redirected until the replica truly becomes the primary, and
then resume the execution.

---------

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Signed-off-by: mwish <maplewish117@gmail.com>
mapleFU pushed a commit to mapleFU/valkey that referenced this issue Aug 22, 2024
madolson pushed a commit that referenced this issue Sep 2, 2024
madolson pushed a commit that referenced this issue Sep 3, 2024
PingXie added a commit to PingXie/valkey that referenced this issue Sep 14, 2024
PingXie added a commit to PingXie/valkey that referenced this issue Sep 14, 2024