-
Notifications
You must be signed in to change notification settings - Fork 734
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Trigger manual failover on SIGTERM / shutdown to cluster primary #1091
Open
enjoy-binbin
wants to merge
31
commits into
valkey-io:unstable
Choose a base branch
from
enjoy-binbin:shutdown_failover
base: unstable
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+313
−20
Open
Changes from 1 commit
Commits
Show all changes
31 commits
Select commit
Hold shift + click to select a range
6ab8888
Trigger manual failover on SIGTERM / shutdown to cluster primary
enjoy-binbin 4b49f03
fix typo
enjoy-binbin f9ca731
Merge remote-tracking branch 'upstream/unstable' into shutdown_failover
enjoy-binbin df0ef8d
add comment in the test
enjoy-binbin 594fd5a
Merge remote-tracking branch 'upstream/unstable' into shutdown_failover
enjoy-binbin 519eb2a
removing mf_is_primary_failover
enjoy-binbin 32043dd
try to fix test
enjoy-binbin e7b33fa
try to stable the test
enjoy-binbin d6649e5
Move the logic to clusterHandleServerShutdown
enjoy-binbin 64831c9
Adjust the tests
enjoy-binbin b06a8c4
Do shutdown failover only when offset is match
enjoy-binbin 5f7b429
Merge remote-tracking branch 'upstream/unstable' into shutdown_failover
enjoy-binbin e56a360
remove count++ and fix confilct
enjoy-binbin 0ccc4e4
Merge remote-tracking branch 'upstream/unstable' into shutdown_failover
enjoy-binbin c9bfd69
CLUSTER FAILOVER replicaid node-id
enjoy-binbin c8037a1
code review v1
enjoy-binbin d70036b
Merge remote-tracking branch 'upstream/unstable' into shutdown_failover
enjoy-binbin 7d55db6
code review: remove error reply check, add cross-version test
enjoy-binbin 4d5da8a
fix format and add log to debug the test
enjoy-binbin a1f957c
Update valkey.conf
enjoy-binbin 37147e8
minor fixes in regexp, avoid matching the second line
enjoy-binbin bf60ed6
Merge remote-tracking branch 'upstream/unstable' into shutdown_failover
enjoy-binbin 27b6f6d
code review from Ping
enjoy-binbin 61dd999
Merge remote-tracking branch 'upstream/unstable' into shutdown_failover
enjoy-binbin 8423921
Fix test
enjoy-binbin ed8c9bb
Merge remote-tracking branch 'upstream/unstable' into shutdown_failover
enjoy-binbin 9e00910
change to use replconf set-cluster-node-id
enjoy-binbin 8cba555
Update valkey.conf
enjoy-binbin ade48cb
Change nodeid to sds
enjoy-binbin 6b5cf7f
code review from Ping, add the assert <= 128
enjoy-binbin e3fdb7c
Fix the index
enjoy-binbin File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,68 @@ | ||
proc shutdown_on_how {srv_id how} { | ||
enjoy-binbin marked this conversation as resolved.
Show resolved
Hide resolved
|
||
if {$how == "shutdown"} { | ||
catch {R $srv_id shutdown nosave} | ||
} elseif {$how == "sigterm"} { | ||
exec kill -SIGTERM [s -$srv_id process_id] | ||
} | ||
} | ||
|
||
proc test_main {how} { | ||
test "auto-failover-on-shutdown will always pick a best replica and send CLUSTER FAILOVER - $how" { | ||
set primary [srv 0 client] | ||
set replica1 [srv -3 client] | ||
enjoy-binbin marked this conversation as resolved.
Show resolved
Hide resolved
|
||
set replica1_pid [s -3 process_id] | ||
set replica2 [srv -6 client] | ||
set replica2_ip [srv -6 host] | ||
set replica2_port [srv -6 port] | ||
|
||
# Pause a replica so it has no chance to catch up with the offset. | ||
pause_process $replica1_pid | ||
|
||
# Primary write some data to increse the offset. | ||
for {set i 0} {$i < 10} {incr i} { | ||
$primary incr key_991803 | ||
} | ||
|
||
# Wait the replica2 catch up with the offset | ||
wait_replica_acked_ofs $primary $replica2_ip $replica2_port | ||
|
||
# Shutdown the primary. | ||
shutdown_on_how 0 $how | ||
|
||
# Wait for the replica2 to become a primary. | ||
wait_for_condition 1000 50 { | ||
[s -6 role] eq {master} | ||
zuiderkwast marked this conversation as resolved.
Show resolved
Hide resolved
|
||
} else { | ||
puts "s -6 role: [s -6 role]" | ||
fail "Failover does not happened" | ||
} | ||
|
||
# Make sure that the expected logs are printed. | ||
verify_log_message 0 "*Sending CLUSTER FAILOVER FORCE to replica*" 0 | ||
verify_log_message -6 "*Forced failover primary request accepted*" 0 | ||
|
||
resume_process $replica1_pid | ||
} | ||
|
||
test "Unable to find a replica to perform an auto failover - $how" { | ||
set primary [srv -6 client] | ||
set replica1 [srv -3 client] | ||
set replica1_pid [s -3 process_id] | ||
|
||
pause_process $replica1_pid | ||
|
||
$primary client kill type replica | ||
shutdown_on_how 6 $how | ||
wait_for_log_messages -6 {"*Unable to find a replica to perform an auto failover on shutdown*"} 0 1000 10 | ||
|
||
resume_process $replica1_pid | ||
} | ||
} | ||
|
||
start_cluster 3 4 {tags {external:skip cluster} overrides {cluster-ping-interval 1000 cluster-node-timeout 5000 shutdown-timeout 0 auto-failover-on-shutdown yes}} { | ||
test_main "shutdown" | ||
} | ||
|
||
start_cluster 3 4 {tags {external:skip cluster} overrides {cluster-ping-interval 1000 cluster-node-timeout 5000 shutdown-timeout 0 auto-failover-on-shutdown yes}} { | ||
test_main "sigterm" | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not very good that we add some conditional logic about how we are counting the votes. It makes the voting algorithm more complex to analyze.
I understand we need this special case if the primary exits immediately after sending CLUSTER FAILOVER FORCE. Maybe we can avoid it if the primary waits for the failover auth requests and actually votes, before the primary shuts down..?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it is not very good so i add a todo. I am also not sure about this logic, and in fact we can skip this logic and the failover will also work since i think we will still have the enouth votes. I add this logic just trying open the discusstion so that we can discuss it and see if should we handle this case.