NRG (2.11): Mark n.Leader() when complete with applied floor #6518
Conversation
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
node.StepDown(preferredLeader)

// Call actual stepdown.
err = node.StepDown(preferredLeader)
No need to call `StepDown` in a goroutine after a 250ms delay when stepping down a stream/consumer.
This was introduced in #3079 to avoid processing messages while stepping down. That is no longer needed, as the RAFT logic itself takes care of it.
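For illustration, a minimal Go sketch of this simplification, using a hypothetical `node` type rather than the actual nats-server RAFT node: the delayed goroutine is dropped and `StepDown` is called directly, relying on the RAFT layer's own leadership check.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a stand-in for the RAFT node; not the real nats-server type.
type node struct{ leader bool }

// StepDown transfers leadership. The RAFT code itself verifies we are leader,
// so callers do not need to guard the call.
func (n *node) StepDown(preferredLeader string) error {
	if !n.leader {
		return errors.New("not current leader")
	}
	n.leader = false
	fmt.Println("stepped down, preferred new leader:", preferredLeader)
	return nil
}

func main() {
	n := &node{leader: true}

	// Old pattern (removed): defer the stepdown by 250ms in a goroutine so no
	// messages are processed in the meantime, e.g.:
	//   time.AfterFunc(250*time.Millisecond, func() { n.StepDown(preferredLeader) })

	// New pattern: call actual stepdown directly.
	if err := n.StepDown("S2"); err != nil {
		fmt.Println("stepdown failed:", err)
	}
}
```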
return nil
}
// If we are no longer the leader stop trying.
if !mset.isLeader() {
De-flake for `TestJetStreamSuperClusterSourceAndMirrorConsumersLeaderChange`, which would still create a mirror consumer even if the stream was no longer leader after being stepped down.
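A rough sketch of the guard this de-flake adds, with hypothetical stream and setup functions (not the nats-server internals): an asynchronous retry path that stops once the stream has lost leadership instead of still creating the mirror consumer.

```go
package main

import (
	"fmt"
	"time"
)

// stream is a simplified stand-in for the stream state used here.
type stream struct{ leader bool }

func (s *stream) isLeader() bool { return s.leader }

func (s *stream) setupMirrorConsumer() error {
	fmt.Println("creating mirror consumer")
	return nil
}

// retrySetupMirror keeps retrying the mirror consumer setup, but bails out as
// soon as leadership is lost.
func (s *stream) retrySetupMirror() {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		// If we are no longer the leader stop trying.
		if !s.isLeader() {
			return
		}
		if err := s.setupMirrorConsumer(); err == nil {
			return
		}
	}
}

func main() {
	s := &stream{leader: false} // stepped down before the retry fired
	s.retrySetupMirror()        // returns without creating the consumer
	fmt.Println("done")
}
```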
-// Make sure this is not a new consumer with the same name.
-if nca != nil && nca == ca {
+// Make sure this is the same consumer assignment, and not a new consumer with the same name.
+if nca != nil && reflect.DeepEqual(nca, ca) {
De-flakes `TestJetStreamClusterGhostEphemeralsAfterRestart`, which would flake with `Still have %d missing consumers`. This would happen because a server could come online and need to catch up on some consumer deletes, but the leader had made a snapshot. Processing that snapshot overwrote the consumer assignment with the exact same config and RAFT group but different pointers. That would leave these consumer assignments around, whereas they should have been deleted.
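A small runnable illustration of the pointer-identity problem described above, using a simplified `consumerAssignment` stand-in rather than the real struct: after a snapshot rebuild the contents are identical but the pointers differ, so only `reflect.DeepEqual` recognizes it as the same assignment.

```go
package main

import (
	"fmt"
	"reflect"
)

// consumerAssignment is a simplified stand-in, not the real nats-server struct.
type consumerAssignment struct {
	Name  string
	Group string
}

func main() {
	ca := &consumerAssignment{Name: "C1", Group: "R3F"}

	// Snapshot processing rebuilds the assignment: same contents, new pointer.
	nca := &consumerAssignment{Name: "C1", Group: "R3F"}

	fmt.Println("pointer equality:", nca == ca)                  // false
	fmt.Println("deep equality:   ", reflect.DeepEqual(nca, ca)) // true
}
```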
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
LGTM, have discussed this one with Maurice.
Test could fail with:

```
jetstream_cluster_4_test.go:201: Should have dropped message published on "messages.3553" since got error: nats: timeout
```

This was due to a regression introduced by #6518, since we now wait to return `isLeader()` until we're fully up-to-date with our initial log when becoming the new leader. But we could have some entries from a previous leader that weren't applied yet, which would result in timeouts because we wouldn't respond under the new logic. So don't check the `n.Leader()` state, but just the RAFT leader state, to respond in `processJetStreamMsg`. But it could also genuinely time out, in which case the test shouldn't fail either.

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
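A minimal sketch of the distinction this fix relies on, with hypothetical types and field names (not the actual `processJetStreamMsg`): respond based on the plain RAFT leadership state rather than the applied-floor-gated `Leader()`, so entries proposed under a previous leader do not time clients out while the new leader catches up.

```go
package main

import "fmt"

// raftNode is a stand-in for the RAFT node, with assumed field names.
type raftNode struct {
	state        string // "Leader" or "Follower"
	applied      uint64 // last applied index
	initialFloor uint64 // last stored index when we became leader
}

// rawLeader is the plain RAFT leadership check.
func (n *raftNode) rawLeader() bool { return n.state == "Leader" }

// Leader additionally waits for the applied floor (the gated behavior).
func (n *raftNode) Leader() bool { return n.rawLeader() && n.applied >= n.initialFloor }

func processJetStreamMsg(n *raftNode, subject string) {
	// Use the raw RAFT state here: we may still be applying entries from a
	// previous leader, but should not time the client out while doing so.
	if n.rawLeader() {
		fmt.Println("responding for", subject)
	}
}

func main() {
	n := &raftNode{state: "Leader", applied: 5, initialFloor: 10}
	fmt.Println("Leader() (applied floor reached)?", n.Leader()) // false, still catching up
	processJetStreamMsg(n, "messages.1")                         // still responds
}
```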
Follow-up of #6485

After the above PR the RAFT state would become `Leader` as normal, but the call to `updateLeadChange` would be delayed until the server is up-to-date with all stored but unapplied entries from its log. The intention of this is to not respond to read/write requests for meta/stream/consumer until the leader is in a complete state, which ensures consistent handling of requests after leader changes.

However, calls like `isLeader`, `isStreamLeader`, and `isConsumerLeader` would use `n.Leader()`, which returns whether the RAFT node is leader, and they would not use the signal sent to `updateLeadChange`.

To achieve this level of consistency with minimal code changes:

- `n.Leader()` only returns true once the RAFT node is both a leader and all initial entries from its log were applied (sketched below).
- `n.StepDown()` would previously be preceded by an `n.Leader()` check; now that `n.Leader()` no longer just returns whether the RAFT node is leader, this is not possible anymore. Instead we can always just call `n.StepDown()` without checking for being leader, as the RAFT code itself already checks for this as well.

Some test de-flakes are included as well.
Signed-off-by: Maurice van Veen github@mauricevanveen.com
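As a rough sketch of the gating described above, under assumed field names and simplified locking (not the real nats-server RAFT implementation): becoming leader records the last stored index as an applied floor, `Leader()` reports true only once applies reach that floor, and `StepDown()` performs its own leadership check so callers need no guard.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type raft struct {
	mu      sync.Mutex
	leader  bool
	pindex  uint64 // last stored (proposed) index
	applied uint64 // last applied index
	floor   uint64 // applied floor captured on election
}

// becomeLeader records the applied floor: everything stored so far must be
// applied before the node reports itself as a complete leader.
func (n *raft) becomeLeader() {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.leader = true
	n.floor = n.pindex
}

// Applied is called as entries are applied to the state machine.
func (n *raft) Applied(index uint64) {
	n.mu.Lock()
	defer n.mu.Unlock()
	if index > n.applied {
		n.applied = index
	}
}

// Leader reports true only when we are RAFT leader and caught up to the floor.
func (n *raft) Leader() bool {
	n.mu.Lock()
	defer n.mu.Unlock()
	return n.leader && n.applied >= n.floor
}

// StepDown checks leadership itself, so callers need no Leader() guard.
func (n *raft) StepDown() error {
	n.mu.Lock()
	defer n.mu.Unlock()
	if !n.leader {
		return errors.New("not current leader")
	}
	n.leader = false
	return nil
}

func main() {
	n := &raft{pindex: 3}
	n.becomeLeader()
	fmt.Println(n.Leader()) // false: entries 1..3 not applied yet
	for i := uint64(1); i <= 3; i++ {
		n.Applied(i)
	}
	fmt.Println(n.Leader())   // true: applied floor reached
	fmt.Println(n.StepDown()) // <nil>
}
```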