[BUG] CLIENT CAPA redirect may result in bouncing redirects during failover #821

Closed
gmbnomis opened this issue Jul 24, 2024 · 10 comments · Fixed by #871

Comments

@gmbnomis
Contributor

gmbnomis commented Jul 24, 2024

Describe the bug

A client using the new CLIENT CAPA redirect feature may see bouncing REDIRECT responses
during a failover initiated by the FAILOVER command, more precisely while the failover is in
the FAILOVER_IN_PROGRESS state.

While the current primary is in the FAILOVER_IN_PROGRESS state, server.primary_host is already
set to the designated primary. However, the designated primary still acts as a replica until the PSYNC FAILOVER command is processed.

The primary will therefore already reply with a REDIRECT (to the designated primary) in this situation. But the latter is still a replica and will REDIRECT the client back to the old primary.

To reproduce

This is hard to reproduce since, under normal circumstances, the duration of the
FAILOVER_IN_PROGRESS state is very short: during this phase, a connection from the primary
to the designated primary is opened to send, among other commands, the PSYNC FAILOVER command.

Nevertheless, e.g. in case of a network glitch, this phase may take much longer (there appears
to be no timeout for it).

I produced this behavior by blocking the client when receiving the PSYNC FAILOVER in syncCommand.

Update: The behavior can also be reproduced by letting the replica sleep, see gmbnomis@ca708a4 for a demonstration.

Expected behavior

During the first phase of a failover (FAILOVER_WAIT_FOR_SYNC), writing clients are paused
while reads can still proceed. During the second phase (FAILOVER_IN_PROGRESS), the same
behavior is expected.

Only after the failover procedure finishes, i.e. when the clients are unpaused again, should
REDIRECT replies be sent (since the new primary is then ready to accept writes).

Additional information

Note that it isn't sufficient to just ensure that we aren't in a "failover" when sending the redirect at

    if (!server.cluster_enabled && c->capa & CLIENT_CAPA_REDIRECT && server.primary_host && !mustObeyClient(c) &&

e.g.:

    if (!server.cluster_enabled && c->capa & CLIENT_CAPA_REDIRECT && server.primary_host && server.failover_state == NO_FAILOVER &&
        !obey_client && (is_write_command || (is_read_command && !c->flag.readonly))) {
        addReplyErrorSds(c, sdscatprintf(sdsempty(), "-REDIRECT %s:%d", server.primary_host, server.primary_port));
        return C_OK;
    }

Instead of REDIRECT, the client will see READONLY errors during FAILOVER_IN_PROGRESS (caused by

    /* Don't accept write commands if this is a read only replica. But

), which, as explained in #319, prevents a smooth switch.

We could add a corresponding server.failover_state == NO_FAILOVER clause to the respective if. However, it looks like MASTERDOWN could be triggered during FAILOVER_IN_PROGRESS as well if replica-serve-stale-data is no (I did not test this).
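
For illustration, such a clause on the read-only replica check could look roughly like this (untested and simplified; the actual condition in processCommand also checks the replica-read-only configuration and uses its exact variable names):

    if (server.primary_host && is_write_command && !obey_client &&
        server.failover_state == NO_FAILOVER) {
        addReplyError(c, "-READONLY You can't write against a read only replica.");
        return C_OK;
    }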

It may also be worth considering whether we should change the order of the if clauses (i.e. pause clients "earlier").

@enjoy-binbin
Member

Thanks for the report. @soloestoy please also take a look.

@soloestoy
Member

This is an interesting scenario, and the initial solution seems to be able to maintain the "client pause write" state until FAILOVER_IN_PROGRESS ends. Additionally, there may also be a similar issue when performing cluster failover in cluster mode, right?

@madolson
Member

Additionally, there may also be a similar issue when performing cluster failover in cluster mode, right?

The redirect for CME is based on whether the node owns the slot, not whether the node is a primary. So, there shouldn't be a loop since one of the nodes always owns the slot.

This is an interesting scenario, and the initial solution seems to be able to maintain the "client pause write" state until FAILOVER_IN_PROGRESS ends

This generally sounds about right. We should be entering a pause situation on the primary instead of redirecting during the failover.

@gmbnomis
Contributor Author

gmbnomis commented Jul 24, 2024

Just to avoid a misunderstanding (I may not have made this clear enough above): During standalone FAILOVER, writing clients will be paused during the entire failover and paused clients are unpaused when the failover finishes (successfully or not).

However, if a client issues the first write command since the beginning of the failover during the time window described above (or a read command in the case of the stale-replica condition), the order of the conditions in processCommand (redirect, ..., readonly, ..., stale-replica, ..., block client during pause) actually makes the client pause ineffective in some situations.

So, we might want to check which of those error replies we don't want to send during a failover. Or we could re-order conditions (or do a combination of the two). However, re-ordering looks too scary to me to even give it a try...

(I am not familiar with clustered setups, so I can't say anything about these. I looked into failover handling in more detail because of the sentinel improvement issue redis/redis#13118)

@soloestoy
Member

I have checked cluster failover in cluster mode. Unlike in standalone mode, a cluster failover is initiated by the replica:

  1. The cluster failover command is executed on the replica;
  2. The replica sends a CLUSTERMSG_TYPE_MFSTART message to the primary and enters manual failover state;
  3. Upon receiving the CLUSTERMSG_TYPE_MFSTART message, the primary enters a pause write state and sets a timeout, then responds to the replica with its offset (along with the CLUSTERMSG_FLAG0_PAUSED flag);
  4. The replica waits for its offset to be equal to that of the primary and then initiates a vote;
  5. The replica is elected as the new primary and takes over the slot;
  6. The replica becomes the new primary and broadcasts;
  7. The old primary receives the new routing table, notices the change in slot ownership, automatically becomes a replica of the new primary, and unpauses.

It can be seen that under normal circumstances, the primary does not unpause until the slot is taken over by its replica and it becomes the new primary. Therefore, in standalone mode, the unpause operation should be performed after receiving the reply to the PSYNC FAILOVER command sent to the replica.

@soloestoy
Member

    if (server.failover_state == FAILOVER_IN_PROGRESS) {
        if (psync_result == PSYNC_CONTINUE || psync_result == PSYNC_FULLRESYNC) {
            clearFailoverState();
        } else {
            abortFailover("Failover target rejected psync request");
            return;
        }
    }
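
(For context: the snippet above is the existing handling of the reply to the PSYNC FAILOVER request. A CONTINUE or FULLRESYNC result finishes the failover, clearing the failover state, at which point clients are unpaused; anything else aborts it.)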

We are already doing it this way now. @gmbnomis, can you provide more details about your issue?

@gmbnomis
Contributor Author

@soloestoy Yes, the locations of pausing and unpausing clients are correct. However, "pausing clients" means that there is a flag indicating that clients issuing new commands (after the pause started) should be blocked.

Actually blocking clients while paused happens here in processCommand:

blockPostponeClient(c);

And that check happens later in processCommand than e.g. the check for redirecting:

if (!server.cluster_enabled && c->capa & CLIENT_CAPA_REDIRECT && server.primary_host && !mustObeyClient(c) &&

As you said, clients are paused during the FAILOVER_IN_PROGRESS state. But this does not mean that they are already in a blocked state. So:

  • If a client is already blocked due to a previous command, all is fine.
  • But if a client isn't blocked yet (because it just connected, or it did not send a write command yet) and sends a write command during the time window described in my original issue, processCommand will reply with a REDIRECT, because the part of processCommand that is responsible for blocking the client isn't reached.
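
For reference, the relevant order of the checks in processCommand, as described above, is roughly:

    1. CLIENT_CAPA_REDIRECT check   -> replies -REDIRECT
    2. read-only replica check      -> replies -READONLY
    3. stale-replica check          -> replies -MASTERDOWN
    4. client pause check           -> blockPostponeClient(c)

Because checks 1-3 are evaluated before check 4, a not-yet-blocked client gets one of those error replies even though writes are paused.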

gmbnomis added a commit to gmbnomis/valkey that referenced this issue Jul 25, 2024
gmbnomis added a commit to gmbnomis/valkey that referenced this issue Jul 25, 2024
gmbnomis added a commit to gmbnomis/valkey that referenced this issue Jul 25, 2024
@gmbnomis
Contributor Author

gmbnomis commented Jul 25, 2024

I found a way to force FAILOVER_WAIT_FOR_SYNC and FAILOVER_IN_PROGRESS respectively by letting the replica sleep (taken from the failover tests).

gmbnomis@9a26fae is a test that:

  1. demonstrates the erroneous REDIRECT response (instead of blocking the client) and the MASTERDOWN response if replica-serve-stale-data no is configured.

    It also demonstrates another issue: When entering FAILOVER_IN_PROGRESS, the call to disconnectAllBlockedClients already disconnects blocked clients that were executing a blocking command with an UNBLOCKED response.

    I think this has two problems:

    1. In general, this is too early. There is no primary to connect to yet.
    2. Analogous to the "MOVED" case in cluster mode, I think we should issue a REDIRECT to those clients (when we unblock the clients in general).

    Should I open a separate issue for this case?

  2. verifies the correct responses in FAILOVER_WAIT_FOR_SYNC

@soloestoy
Member

Hi @gmbnomis, I think I understand your point. The main issue is that during the FAILOVER_IN_PROGRESS period, both nodes may be replicas, causing -REDIRECT to bounce back and forth between the two replicas.

I have seen your PR, but it seems too complicated and does not completely solve the problem (please correct me if I am wrong). I believe a simple and effective approach would be: when the primary becomes a replica, i.e., during the FAILOVER_IN_PROGRESS state, pause the client instead of returning -REDIRECT. Then, once the replica truly becomes the primary, i.e., after receiving the response to PSYNC FAILOVER, the suspended clients' commands are resumed.
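
A minimal sketch of that idea, reusing the condition quoted earlier in this issue (illustration only, not necessarily the exact change merged in #871):

    if (!server.cluster_enabled && c->capa & CLIENT_CAPA_REDIRECT && server.primary_host &&
        !obey_client && (is_write_command || (is_read_command && !c->flag.readonly))) {
        if (server.failover_state == FAILOVER_IN_PROGRESS) {
            /* Both nodes may still be replicas at this point; postpone the client
             * instead of replying -REDIRECT. It is resumed (and, if the failover
             * succeeded, then redirected) once the failover finishes or is aborted. */
            blockPostponeClient(c);
        } else {
            addReplyErrorSds(c, sdscatprintf(sdsempty(), "-REDIRECT %s:%d",
                                             server.primary_host, server.primary_port));
        }
        return C_OK;
    }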

@gmbnomis
Contributor Author

gmbnomis commented Aug 5, 2024

Hi @soloestoy,

I have seen your PR, but it seems too complicated and does not completely solve the problem (please correct me if I am wrong).

My PR solves a different problem: handling clients that are already blocked (not postponed) before the failover begins. So yes, it does not solve the problem described here.

madolson moved this to Optional for rc2 in Valkey 8.0 Aug 5, 2024
github-project-automation bot moved this from Optional for rc2 to Done in Valkey 8.0 Aug 14, 2024
mapleFU pushed a commit to mapleFU/valkey that referenced this issue Aug 21, 2024
Fix valkey-io#821

During the `FAILOVER` process, when conditions are met (such as when the
force time is reached or the primary and replica offsets are
consistent), the primary actively becomes the replica and transitions to
the `FAILOVER_IN_PROGRESS` state. After the primary becomes the replica,
and after handshaking and other operations, it will eventually send the
`PSYNC FAILOVER` command to the replica, after which the replica will
become the primary. This means that the upgrade of the replica to the
primary is an asynchronous operation, which implies that during the
`FAILOVER_IN_PROGRESS` state, there may be a period of time where both
nodes are replicas. In this scenario, if a `-REDIRECT` is returned, the
request will be redirected to the replica and then redirected back,
causing back and forth redirection. To avoid this situation, during the
`FAILOVER_IN_PROGRESS` state, we temporarily suspend the clients that
need to be redirected until the replica truly becomes the primary, and
then resume the execution.

---------

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Signed-off-by: mwish <maplewish117@gmail.com>
mapleFU pushed a commit to mapleFU/valkey that referenced this issue Aug 22, 2024
madolson pushed a commit that referenced this issue Sep 2, 2024
madolson pushed a commit that referenced this issue Sep 3, 2024
PingXie added a commit to PingXie/valkey that referenced this issue Sep 14, 2024
PingXie added a commit to PingXie/valkey that referenced this issue Sep 14, 2024