NRG (2.11): Mark n.Leader() when complete with applied floor #6518

MauriceVanVeen · 2025-02-17T19:29:59Z

Follow-up of #6485

After the above PR, the RAFT state would become Leader as normal, but the call to updateLeadChange would be delayed until the server had applied all stored but unapplied entries from its log. The intention is to not respond to read/write requests for meta/stream/consumer until the leader is in a complete state, which ensures consistent handling of requests after leader changes.

However, calls like isLeader, isStreamLeader, and isConsumerLeader would call n.Leader(), which only returned whether the RAFT node is leader. They would not use the signal sent to updateLeadChange.

To achieve this level of consistency with minimal code changes:

n.Leader() only returns true once the RAFT node is leader and all initial entries from its log have been applied.
Calls to n.StepDown() used to be preceded by a check of n.Leader(). Now that n.Leader() no longer simply reports whether the RAFT node is leader, that check is not possible anymore. Instead we can always call n.StepDown() without checking for leadership first, as the RAFT code itself already checks for this.
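
The two points above can be illustrated with a minimal sketch. It is not the nats-server implementation: the `node` type, its fields (`applied`, `floor`), and the `becomeLeader` and `apply` helpers are hypothetical names chosen for illustration. It only shows the idea that `Leader()` stays false until the applied index reaches a floor recorded at election time, and that `StepDown()` validates leadership on its own so callers need no pre-check.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type raftState int

const (
	follower raftState = iota
	leader
)

// errNotLeader is a hypothetical error for this sketch.
var errNotLeader = errors.New("not leader")

// node is a hypothetical, heavily simplified stand-in for a RAFT node.
type node struct {
	mu      sync.Mutex
	state   raftState
	applied uint64 // index of the last entry applied by the upper layer
	floor   uint64 // index that must be applied before leadership counts as complete
}

// becomeLeader switches to the leader state and records the last stored
// index as the floor that still has to be applied.
func (n *node) becomeLeader(lastStored uint64) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.state = leader
	n.floor = lastStored
}

// apply records that entries up to and including index were applied.
func (n *node) apply(index uint64) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.applied = index
}

// Leader reports true only when the node is in the leader state and has
// applied everything up to the floor.
func (n *node) Leader() bool {
	n.mu.Lock()
	defer n.mu.Unlock()
	return n.state == leader && n.applied >= n.floor
}

// StepDown checks the raw RAFT state itself, so callers do not need to
// consult Leader() first.
func (n *node) StepDown() error {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.state != leader {
		return errNotLeader
	}
	n.state = follower
	return nil
}

func main() {
	n := &node{applied: 3}
	n.becomeLeader(5)
	fmt.Println("leader before catching up:", n.Leader())
	n.apply(5)
	fmt.Println("leader after catching up:", n.Leader())
	fmt.Println("first step down:", n.StepDown())
	fmt.Println("second step down:", n.StepDown())
}
```

In this sketch, a node that is elected but not yet caught up can still be asked to step down, because `StepDown()` looks at the raw state rather than at `Leader()`.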

Some test de-flakes are included as well.

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

MauriceVanVeen · 2025-02-17T19:32:10Z

server/jetstream_api.go

-		node.StepDown(preferredLeader)
-
+	// Call actual stepdown.
+	err = node.StepDown(preferredLeader)


There is no need to call StepDown in a goroutine after a 250ms delay when stepping down a stream/consumer.
The delay was introduced in #3079 so that messages would not be processed while stepping down. It is no longer needed, as the RAFT logic itself takes care of that.

MauriceVanVeen · 2025-02-17T19:33:01Z

server/stream.go

-			return nil
-		}
+	// If we are no longer the leader stop trying.
+	if !mset.isLeader() {


De-flake for TestJetStreamSuperClusterSourceAndMirrorConsumersLeaderChange: it would still create a mirror consumer even if the stream was no longer leader after being stepped down.
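
The guard in the diff above follows a common pattern: re-check leadership right before doing leader-only work, so a stepped-down stream does not create a mirror consumer. A minimal sketch of that pattern follows. The `stream` type, its `leader` flag, the `mirrorConsumers` counter, and `setupMirrorConsumer` are hypothetical and only stand in for the real stream code.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// errNotLeader is a hypothetical error for this sketch.
var errNotLeader = errors.New("not leader")

// stream is a hypothetical stand-in for a replicated stream.
type stream struct {
	leader          atomic.Bool
	mirrorConsumers int // number of mirror consumers created so far
}

func (s *stream) isLeader() bool { return s.leader.Load() }

// setupMirrorConsumer does leader-only work. It checks leadership first
// and stops trying if the stream was stepped down in the meantime.
func (s *stream) setupMirrorConsumer() error {
	if !s.isLeader() {
		return errNotLeader
	}
	s.mirrorConsumers++
	return nil
}

func main() {
	s := &stream{}
	s.leader.Store(true)
	fmt.Println("as leader:", s.setupMirrorConsumer())

	// Simulate a step down before the next attempt.
	s.leader.Store(false)
	fmt.Println("after step down:", s.setupMirrorConsumer())
	fmt.Println("mirror consumers created:", s.mirrorConsumers)
}
```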

MauriceVanVeen · 2025-02-17T19:34:51Z

server/consumer.go

-				// Make sure this is not a new consumer with the same name.
-				if nca != nil && nca == ca {
+				// Make sure this is the same consumer assignment, and not a new consumer with the same name.
+				if nca != nil && reflect.DeepEqual(nca, ca) {


De-flakes TestJetStreamClusterGhostEphemeralsAfterRestart, which would flake with "Still have %d missing consumers". This would happen because a server could come online and need to catch up on some consumer deletes. But the leader had made a snapshot, so this server would process that snapshot, which overwrites the consumer assignment with one that has the exact same config and RAFT group but uses different pointers. The pointer comparison then failed, which left these consumer assignments around when they should still have been deleted.
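
The difference between the two comparisons can be shown with a small sketch. The `consumerAssignment` and `raftGroup` types below are hypothetical, trimmed-down stand-ins with only a few fields, and `sameAssignment` is a helper invented for this example. The point is only that two assignments rebuilt from a snapshot can hold identical values behind different pointers, so `==` on the pointers is false while `reflect.DeepEqual` is true.

```go
package main

import (
	"fmt"
	"reflect"
)

// raftGroup and consumerAssignment are hypothetical, simplified types.
type raftGroup struct {
	Name  string
	Peers []string
}

type consumerAssignment struct {
	Name   string
	Stream string
	Group  *raftGroup
}

// sameAssignment compares by value, following pointers and slices.
func sameAssignment(a, b *consumerAssignment) bool {
	return a != nil && b != nil && reflect.DeepEqual(a, b)
}

// newAssignment builds a fresh assignment, as a snapshot restore might.
func newAssignment() *consumerAssignment {
	return &consumerAssignment{
		Name:   "C1",
		Stream: "S1",
		Group:  &raftGroup{Name: "C-R3F-abc", Peers: []string{"n1", "n2", "n3"}},
	}
}

func main() {
	ca := newAssignment()
	nca := newAssignment() // same values, different pointers

	fmt.Println("pointer equality:", nca == ca)
	fmt.Println("deep equality:", sameAssignment(nca, ca))

	// A genuinely different consumer with the same name is still detected.
	nca.Group.Name = "C-R3F-xyz"
	fmt.Println("deep equality after change:", sameAssignment(nca, ca))
}
```

`reflect.DeepEqual` is slower than a pointer comparison, which is usually acceptable on a path like this that runs once per delete rather than per message.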

neilalexander

LGTM, have discussed this one with Maurice.

Test could fail with:

```
jetstream_cluster_4_test.go:201: Should have dropped message published on "messages.3553" since got error: nats: timeout
```

This was due to a regression introduced by #6518, since we now wait to return `isLeader()` until we're fully up-to-date with our initial log when becoming new leader. But we could have some entries from a previous leader that weren't applied yet, which would result in timeouts because we wouldn't respond under the new logic.

So don't check `n.Leader()` state, but just RAFT leader state to respond in `processJetStreamMsg`.

But it could also genuinely timeout, in which case the test shouldn't fail either.

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

MauriceVanVeen added 2 commits February 17, 2025 12:34

NRG: Mark n.Leader() when complete with applied floor

d35b6ad

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

De-flake tests

f7c34a9

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

MauriceVanVeen commented Feb 17, 2025

MauriceVanVeen changed the title ~~NRG: Mark n.Leader() when complete with applied floor~~ NRG (2.11): Mark n.Leader() when complete with applied floor Feb 17, 2025

Don't use n.State() for upper layer leader check

e754e52

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

MauriceVanVeen marked this pull request as ready for review February 19, 2025 12:10

MauriceVanVeen requested a review from a team as a code owner February 19, 2025 12:10

neilalexander approved these changes Feb 19, 2025

derekcollison merged commit 9782bf7 into main Feb 19, 2025
5 checks passed

derekcollison deleted the maurice/nrg-consistency branch February 19, 2025 13:42

MauriceVanVeen mentioned this pull request Mar 10, 2025

[FIXED] (2.11) Respond to new JetStream msg from previous leader #6627

Merged
