Trigger manual failover on SIGTERM / shutdown to cluster primary #1091
base: unstable
Conversation
When a primary disappears, its slots are not served until an automatic failover happens. It takes about n seconds (node timeout plus some seconds). That is too much time for us to not accept writes.

If the host machine is about to shut down for any reason, the processes typically get a SIGTERM and have some time to shut down gracefully. In Kubernetes, this is 30 seconds by default.

When a primary receives a SIGTERM or a SHUTDOWN, let it trigger a failover to one of the replicas as part of the graceful shutdown. This can reduce some unavailability time. For example, the replica normally needs to sense the primary failure within the node timeout before initiating an election; now it can initiate an election quickly, win, and gossip it.

This closes valkey-io#939.
Force-pushed from 42dc65e to 6ab8888.
Codecov report:

Additional details and impacted files
@@            Coverage Diff             @@
##           unstable    #1091      +/-   ##
============================================
+ Coverage     71.10%   71.12%   +0.02%
============================================
  Files           123      123
  Lines         65552    65614      +62
============================================
+ Hits          46610    46669      +59
- Misses        18942    18945       +3
Nice! Thanks for doing this.
The PR description can be updated to explain the solution. Now it is just copy-pasted from the issue. :)
I'm thinking that doing the failover in finishShutdown() is maybe too late. finishShutdown() is only called when all replicas already have a replication offset equal to the primary's (checked by isReadyToShutdown()), or after a timeout (10 seconds). If one replica is very slow, it will delay the failover. I think we can do the manual failover earlier.
This is the sequence:
- SHUTDOWN or SIGTERM calls prepareForShutdown(). Here we pause clients for writing and start waiting for the replicas' offsets.
- In serverCron(), we check isReadyToShutdown(), which checks if all replicas have repl_ack_off == primary_repl_offset. If yes, finishShutdown() is called; otherwise we wait more.
- finishShutdown().

I think we can send CLUSTER FAILOVER FORCE to the first replica which has repl_ack_off == primary_repl_offset. We can do it in isReadyToShutdown() I think. (We can rename it to indicate it does more than check if ready.) Then, we also wait for it to send the failover auth request and the primary votes before isReadyToShutdown() returns true.
What do you think?
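To make that concrete, here is a rough, non-authoritative sketch of what such a check could look like. sendClusterFailoverToReplica() is a hypothetical helper, and the exact placement inside isReadyToShutdown() is an assumption rather than the actual patch; the replica fields follow the hunks quoted later in this review.

/* Sketch only: pick the first fully caught-up replica and ask it to
 * take over, instead of waiting for node-timeout failure detection. */
static int pickReplicaAndStartFailover(void) {
    listIter li;
    listNode *ln;
    listRewind(server.replicas, &li);
    while ((ln = listNext(&li)) != NULL) {
        client *replica = listNodeValue(ln);
        if (replica->repl_data->repl_state == REPLICA_STATE_ONLINE &&
            replica->repl_data->repl_ack_off == server.primary_repl_offset) {
            sendClusterFailoverToReplica(replica); /* hypothetical helper */
            return 1;
        }
    }
    return 0; /* nobody has caught up yet; serverCron() will retry */
}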
src/cluster_legacy.c
Outdated
/* todo: see if this is needed. */
/* This is a failover triggered by my primary, let's count its vote. */
if (server.cluster->mf_is_primary_failover) {
    server.cluster->failover_auth_count++;
}
It's not very good that we add some conditional logic about how we are counting the votes. It makes the voting algorithm more complex to analyze.
I understand we need this special case if the primary exits immediately after sending CLUSTER FAILOVER FORCE. Maybe we can avoid it if the primary waits for the failover auth requests and actually votes, before the primary shuts down..?
Yes, it is not very good, so I added a todo. I am also not sure about this logic; in fact we can skip it and the failover will still work, since I think we will still have enough votes. I added this logic just to open the discussion, so that we can discuss it and see whether we should handle this case.
The issue description is good and very detailed so I copied it; I will update it later.
Yeah, a failover as soon as possible is good, but isn't it true that the primary is only down after it actually exits? So in this case, if a replica is slow and does not have the chance to catch up with the primary, and then the other replica triggers the failover, the slow replica will need a full sync when it does the reconfiguration.
So let me sort it out again: you are suggesting that if one replica has already caught up with the offset, we should trigger a failover immediately? I guess that also makes sense in this case.
I didn't think about this. The replica can't do a psync to the new primary after the failover? If it can't, then maybe you're right that the primary should wait for all replicas, at least for some time, to avoid a full sync. So, wait for all, then trigger the manual failover. If you want, we can add another wait after that (after "finish shutdown"), so the primary can vote for the replica before it exits. Wdyt?
Sorry for the late reply, I somehow missed this thread.
Yes, I think this may happen: if the primary does not flush its output buffer to the slow replica, then when doing the reconfiguration the slow replica may use an old offset to psync with the new primary, which will cause a full sync. This can happen, but the probability should be small since the primary calls flushReplicasOutputBuffers to write as much as possible before shutdown.
As for waiting for the vote, I think both options are OK. Even if we don't wait, I think the replica will have enough votes. If we really want to, we can even wait until the replica successfully becomes primary before exiting... Do you have a final decision? I will do whatever you think is right.
I'm thinking about whether there are any corner cases, like if the cluster is too small to have quorum without the shutting-down primary... If it is simple, I prefer to let the primary wait and vote. Then we can avoid the special vote-counting logic. But if the implementation to wait for the vote turns out to be too complex, then let's just skip the vote. I think it's also fine. Without this feature, we wait for automatic failover, which will also not have the vote from the already shutdown primary.
Force-pushed from 366e082 to 519eb2a.
I am going to skip the vote for now; I tried a bit and it seemed neither easy nor good looking to finish it. Maybe I'll have a better idea later; I will keep it in mind.
I am going to skip the vote for now; I tried a bit and it seemed neither easy nor good looking to finish it. Maybe I'll have a better idea later; I will keep it in mind.
I understand. Simple is better.
But possible data loss is not good. See comments below.
This is an interesting idea. I like the direction we are going in, but I agree with @zuiderkwast that potential data loss is not appealing. We can do both though, IMO: triggering a (graceful) failover as part of shutdown makes sense. Today, we can't forget "myself" nor "my primary" (with the latter being a dynamic state). This adds operational complexity. Imagine that the admin could just send a shutdown to the primary and the handoff would happen on its own.
Force-pushed from 8ee3291 to b06a8c4.
@PingXie Yes, it's a good idea, but this PR is about the scenario where the machine is taken down outside the control of the Valkey admin. For example, in Kubernetes when a worker is shut down, SIGTERM is sent to all processes and it waits for 30 seconds by default. When you shut down your laptop, I believe it's similar: each application gets SIGTERM and has some time to do a graceful shutdown.
LGTM.
@valkey-io/core-team Let's see if we can include this in 8.1. The only blocker now is to find time to discuss it and approve it.
src/cluster_legacy.c
Outdated
}
if (replica->repl_data->repl_state == REPLICA_STATE_ONLINE &&
    replica->name && sdslen(replica->name->ptr) == CLUSTER_NAMELEN &&
    replica->repl_data->repl_ack_off == server.primary_repl_offset) {
I feel that the equal test is too strict. Would it make sense to pick the replica with the highest replication offset? Whether or not we yield the primaryship to it, it is going to become the new primary anyway; the only difference is "how long it takes", I think.
Good point. @zuiderkwast, thoughts?
It's not too strict. The primary has already paused writes and waited for the replicas. It can wait up to 10 seconds (by default) before shutting down, to make sure at least one of the replicas has matched the offset.
If we do a manual failover when the offset doesn't match, there is a risk that another replica has a better offset. The best one will usually become the primary in an automatic failover when they calculate the rank.
Yeah, I'd expect that in the vast majority of cases we will have an exact match, which is great. In the rare or very rare case where we couldn't get a matching replica, are we saying we are OK with giving up on this fast failover path and letting the best replica win after a full cluster_node_timeout? Which I believe is the same as the primary picking the best replica?
Almost the same. The only difference is if the replica didn't ack the offset in time, or it did but the ack didn't reach the primary yet, so the primary doesn't know.
I'm not completely against this idea, but I don't think it's necessary either. It's a corner case. It's easier to explain the feature if the primary selects the best replica only when it is certain.
I am ok with the equal test. I was trying to validate my understanding. I am aligned.
Now that I think of it, the cluster failover command is added last on the replication stream, so the replica will see it only after replicating everything first. Thus it's very safe to do it. :-)
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
LGTM overall
src/replication.c
Outdated
@@ -3626,6 +3626,14 @@ void syncWithPrimary(connection *conn) {
    err = sendCommand(conn, "REPLCONF", "version", VALKEY_VERSION, NULL);
    if (err) goto write_error;

    /* Inform the primary of our (replica) node name. */
    if (server.cluster_enabled) {
        char *argv[] = {"CLIENT", "SETNAME", server.cluster->myself->name};
I don't understand why this isn't a replconf. It seems like this should be more specific information than just setting it as the name.
Replconf needs some more work and a new field in the client struct, etc. Why is it bad to reuse the setname?
Psync and replconf are internal undocumented commands so I don't see any risk. Is it about code complexity?
I thought about reusing REPLCONF as well, but since what we need is the replica's identity, I feel that CLIENT SETNAME works too. Ideally though, I would love to see us deduce the name automatically on the server side. One idea could be looking up the remote address against all the replica nodes after detecting a replica client.
Neither https://valkey.io/commands/replconf/ nor https://valkey.io/commands/psync/ is technically undocumented. I agree they aren't very well documented, but they are part of the protocol and people build tooling around it.
My other concern is that replicas now show up with a weird name that wasn't there before, so people might take dependencies on it. Do we want to be bound to the replica name there forever (we could instead put something more human readable)? It's a bit of a binding decision, and I would rather have flexibility.
EDIT: I originally missed Ping's comment. I agree that inferring the identity on the server side would be ideal; I'm just not sure that's always possible.
REPLCONF has no documented arguments. The page just says "Usage: REPLCONF".
My other concern is that replicas now show up with a weird name that wasn't there before, so people might take dependencies on it.
Good point. The replicas show up in CLIENT LIST.
The replicas show up in CLIENT LIST.
I used SETNAME on purpose. I would like to show it in the client list so that we can quickly locate the node and its associated node id in the replication part (not the cluster part). But it's not that important; sorry I did not mention this impact the first time.
Changed to use REPLCONF.
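For reference, a REPLCONF-based variant of the earlier CLIENT SETNAME hunk might look roughly like the sketch below; the sub-option name follows the PR description further down, and its exact placement in syncWithPrimary() is an assumption rather than the final commit.

    /* Inform the primary of our cluster node id, so it can later address
     * CLUSTER FAILOVER to this replica by node-id. */
    if (server.cluster_enabled) {
        err = sendCommand(conn, "REPLCONF", "set-cluster-node-id",
                          server.cluster->myself->name, NULL);
        if (err) goto write_error;
    }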
 * REPLICAID is currently available only for internal use, so we won't
 * put it into the JSON file. */
I don't think we should just not document it. Someone will find it and use it, so I think removing it becomes a breaking change. All other arguments, including other internal commands, are documented in the json files.
If we document it, I think it will mostly be confusing to users.
Users might think they can send it to the primary like with standalone FAILOVER TO, but it's not replicated so it doesn't work.
I agree though that even if we don't document it, we probably need to keep it around.
All other arguments, including other internal commands, are documented in the json files.
REPLCONF has no arguments in the JSON file. The page just says "Usage: REPLCONF". So this will not be the only undocumented argument.
The high level idea still seems really good to me. Sorry for not commenting sooner; I think some of what I'm suggesting reverts earlier discussions, but the conversation wasn't that clear, so I don't really understand why this design won out.
I don't really like that the primary explicitly picks a replica and then sends it a failover request through the replication buffer shared by all replicas. It seems odd that we'll commit more out-of-band data into the replication stream, and if for any reason that replica doesn't take over, none of the other ones will do it quickly.
Since this isn't backwards compatible anyway, can we not use a ping extension message to just broadcast to all the replicas that one of them should try to take over? I had an informal conversation with another person from Kubernetes who said it would also be nice to mark a node as failing itself and let whichever replica take over.
Alternatively, it also diverges from our failover command design, which relies on the primary sending a PSYNC command with a special flag indicating that it has stepped down, which can get rejected.
Yeah, that is indeed not 100% clean. We are reusing the replication stream for non-replication purposes.
I think this is fine and it falls back to the current behavior: a new primary will be elected after cluster_node_timeout.
And mark itself failed? I think it is a good idea and it helps the cluster converge faster on this node's status too.
Where do we do this today?
Wdym none of the others will do it quickly? Only one can do it quickly, and only the primary can select it. If all replicas try to get elected quickly, none of them gets a majority. That's why there's a random delay in automatic failover, which is what we want to avoid here. If we use the cluster bus, does the primary send it to the chosen replica only? We could use a new message and a cluster bus capability flag instead then. IIRC the idea of using the replication buffer was that the replica will get to it only when the replication is complete. Maybe we were overthinking it?
@PingXie Like we do with SETSLOT? It was a similar discussion. :)
Not exactly. We do want all replicas to execute SETSLOT. The thing in common, though, is that there is no user data involved in either case. I can give you that :)
Btw, my opinion isn't that strong. I don't really mind another mechanism, it just seems weird that we are doing something yet another way.
One of them will have the lowest delay, and yes it will have some random component, but the benefit is that there will be less complexity and only a single failure path. There is a belief at AWS that it's more important to have fewer failover pathways, and they should all reinforce each other since there is a lower chance of bugs there.
Maybe it's just me, but I think it kind of is data related. It's whether or not the given node is able to serve data related to a given slot. I guess you could say this is giving ownership over all of the slots to a given replica. I think our handling of slot state is a bit of a mess, although that's not really related to this PR.
Maybe I'm overthinking it :D
@madolson So with this solution, we just skip the time it takes for the primary to be marked as FAIL. That's about half of the time it takes to get a new primary. If we do that, I think this feature misses the point. Internally, we have a small script that starts valkey-server inside the same container, so valkey-server is a child process of the script. This script catches the SIGTERM from the OS (or Kubernetes) and performs the failover by sending commands. It is very fast. The purpose is to catch a worker reboot (machine, infra level) that is out of the control of the valkey admin (or Kubernetes operator). By any means necessary we want to avoid this 3-second downtime. (I think we use a node timeout of 2 or something.)
It's simplicity at the price of worse availability. We can (and already do) handle this outside of valkey-server, but that's a more complex and non-standard solution that an official valkey-operator probably wouldn't use. It's cleaner to just run valkey-server as it is.
Force-pushed from e95b857 to 9e00910.
I like the update with REPLCONF. Just some small comments.
    return;
}

clusterNode *n = clusterLookupNode(c->argv[j + 1]->ptr, sdslen(c->argv[j + 1]->ptr));
I'm not sure if we need to do more checks here; this will use clusterNode in replication.c. We could also just sdsdup it without any checks.
sdsdup without a check seems OK to me. We do the check when we look up the node on shutdown...
Another possibility is to store the clusterNode pointer in the client struct instead of storing the node-id. Is it safe?
On the other hand, like in set-rdb-client-id, we did check if the client_id is valid:
} else if (!strcasecmp(c->argv[j]->ptr, "set-rdb-client-id")) {
    /* REPLCONF identify <client-id> is used to identify the current replica main channel with existing
     * rdb-connection with the given id. */
    long long client_id = 0;
    if (getLongLongFromObjectOrReply(c, c->argv[j + 1], &client_id, NULL) != C_OK) {
        return;
    }
    if (!lookupRdbClientByID(client_id)) {
        addReplyErrorFormat(c, "Unrecognized RDB client id %lld", client_id);
        return;
    }
    c->repl_data->associated_rdb_client_id = (uint64_t)client_id;
It bothers me... should we check? Or should we check more, like checking whether the node-id is a replica of myself?
Another possibility it to store the clusterNode pointer in the client struct instead of storing the node-id. Is it safe?
This seems safe, but we get stuck in the include header issue again.
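For illustration, a stricter validation in the REPLCONF handler could look roughly like the sketch below. This is not the actual patch: clusterNodeIsMyReplica() and the node_id field in repl_data are hypothetical names, and whether any of these checks are worth keeping is exactly the open question here.

    } else if (!strcasecmp(c->argv[j]->ptr, "set-cluster-node-id")) {
        sds node_id = c->argv[j + 1]->ptr;
        /* Cheap sanity check, mirroring the length test used in the
         * cluster code. */
        if (sdslen(node_id) != CLUSTER_NAMELEN) {
            addReplyErrorFormat(c, "Invalid node id %s", (char *)node_id);
            return;
        }
        /* Optional stricter check: the id must belong to a known node
         * that replicates from myself. */
        clusterNode *n = clusterLookupNode(node_id, sdslen(node_id));
        if (n == NULL || !clusterNodeIsMyReplica(n)) { /* hypothetical helper */
            addReplyErrorFormat(c, "Unrecognized node id %s", (char *)node_id);
            return;
        }
        c->repl_data->node_id = sdsdup(node_id); /* hypothetical field */
    }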
On the other hand, like in set-rdb-client-id, we did check if the client_id is valid
OK, so let's keep sdsdup with simple validation then? No need to change it.
Another possibility it to store the clusterNode pointer in the client struct instead of storing the node-id. Is it safe?
This seems safe, but we get stuck in the include header issue again.
replication.c includes cluster.h. The mess is just server.h, which has prototypes for many C files. Recently @rjd15372 moved module prototypes from server.h to module.h. I think that was a good idea. We could move replication related stuff from server.h to a new file replication.h. It would still need cluster.h though. It's not trivial to reduce the coupling between units.
@@ -1,4 +1,5 @@
# Minimal configuration for testing.
databases 1
This can be removed; in the future we will support multiple DBs.
databases 1
I added this because otherwise it prints this warning locally:
/* in case cluster mode is enabled dbnum must be 1 */
if (server.cluster_enabled && server.dbnum > 1) {
    serverLog(LL_WARNING, "WARNING: Changing databases number from %d to 1 since we are in cluster mode",
              server.dbnum);
    server.dbnum = 1;
}
When a primary disappears, its slots are not served until an automatic
failover happens. It takes about n seconds (node timeout plus some seconds).
It's too much time for us to not accept writes.
If the host machine is about to shutdown for any reason, the processes
typically get a sigterm and have some time to shutdown gracefully. In
Kubernetes, this is 30 seconds by default.
When a primary receives a SIGTERM or a SHUTDOWN, let it trigger a failover
to one of the replicas as part of the graceful shutdown. This can reduce
some unavailability time. For example, the replica normally needs to sense
the primary failure within the node timeout before initiating an election;
now it can initiate an election quickly, win, and gossip it.
The primary does this by sending a CLUSTER FAILOVER command to the replica.
We added a REPLICAID arg to CLUSTER FAILOVER; after receiving the command,
the replica checks whether the node-id is its own, and if not, the command
is ignored. The node-id is reported by the replica through REPLCONF
SET-CLUSTER-NODE-ID during the replication handshake.
New argument for CLUSTER FAILOVER
So the format now becomes CLUSTER FAILOVER [FORCE | TAKEOVER] [REPLICAID node-id].
This arg is not intended for user use, so it will not be added to the JSON
file.
Replica sends REPLCONF SET-CLUSTER-NODE-ID to inform its node-id
During the replication handshake, replica now will use REPLCONF SET-CLUSTER-NODE-ID
to inform the primary of replica node-id.
Primary issues CLUSTER FAILOVER
The primary sends CLUSTER FAILOVER FORCE REPLICAID node-id to all replicas,
because the replication buffer is shared, but only the replica with the
matching id will execute it.
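On the replica side, the check described above could look roughly like the sketch below (illustrative only; assume replicaid points at the parsed REPLICAID argument inside the CLUSTER FAILOVER handler):

    /* Sketch: the command arrives via the shared replication stream, so
     * every replica sees it; only the addressed replica may act on it. */
    if (replicaid != NULL && memcmp(replicaid, myself->name, CLUSTER_NAMELEN) != 0) {
        return; /* addressed to a sibling replica, silently ignore it */
    }
    /* Otherwise proceed with the normal forced manual failover path. */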
Add a new auto-failover-on-shutdown config
People can enable or disable this feature as they like; the default is 0
(disabled).
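For reference, a boolean option like this is typically registered in config.c roughly as in the sketch below; the field name server.auto_failover_on_shutdown and the flags are assumptions, not the actual patch. In a config file it would then be turned on with auto-failover-on-shutdown yes.

    /* Sketch only: default 0 matches "the default is 0" above, i.e. the
     * feature is off unless explicitly enabled. */
    createBoolConfig("auto-failover-on-shutdown", NULL, MODIFIABLE_CONFIG,
                     server.auto_failover_on_shutdown, 0, NULL, NULL),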
This closes #939.