Fix desync after errCatchupAbortedNoLeader #5986

MauriceVanVeen · 2024-10-10T16:09:39Z

A related case, where RAFT state was deleted after running into errCatchupTooManyRetries, was fixed previously in #5939.

After hitting that error we shut down and retry. But if no leader has been elected yet, we'd hit "catchup for stream '%s > %s' aborted, no leader", which would again throw away RAFT state. This PR proposes a fix for that case.

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
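To make the described behavior concrete, here is a minimal sketch of the decision, not the actual change in server/jetstream_cluster.go. The error variables are stand-ins declared locally, and `keepStateAndRetry` is a hypothetical helper name. The sketch only says that the two catchup errors discussed here should lead to a retry with RAFT state kept; how the server treats any other error is not covered by this PR description.

```go
package main

import "errors"

// Local stand-ins for the server's catchup errors. The real definitions
// live in the NATS server code and their messages may differ.
var (
	errCatchupTooManyRetries  = errors.New("catchup: too many retries")
	errCatchupAbortedNoLeader = errors.New("catchup: aborted, no leader")
)

// keepStateAndRetry (hypothetical helper) reports whether a failed catchup
// should keep the replica's RAFT state and simply retry later.
// #5939 covered errCatchupTooManyRetries; this PR adds the no-leader case.
// A false result means only that this sketch makes no claim about the error.
func keepStateAndRetry(err error) bool {
	return errors.Is(err, errCatchupTooManyRetries) ||
		errors.Is(err, errCatchupAbortedNoLeader)
}
```

Using `errors.Is` rather than `==` is a choice made for this sketch so that wrapped errors are matched as well.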

server/jetstream_cluster.go

mprimi

Is this a fair summary of the change? (the PR description makes some references and the commit message does not say what the change does):

When aborting catchup due to leader not present, do not wipe replica state

server/jetstream_cluster.go

MauriceVanVeen · 2024-10-10T20:31:46Z

Is this a fair summary of the change?
When aborting catchup due to leader not present, do not wipe replica state

Yes 🙂

server/jetstream_cluster.go

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

derekcollison · 2024-10-11T20:20:44Z

LMK when this is good to go for review.

MauriceVanVeen · 2024-10-11T20:34:41Z

LMK when this is good to go for review.

Awaiting CI to be green, but otherwise good for review

derekcollison

LGTM

Includes:
- #5986
- #5995
- #6000
- #5996
- #6002
- #6003
- #6007

Signed-off-by: Neil Twigg <neil@nats.io>

With #5939 and #5986 we no longer blow away our state, since we can keep retrying. However, if a follower had installed a snapshot from the leader and started processing it, only for the leader to then go offline for an extended period, we could spin: we'd immediately detect there's no leader, stop the RAFT group, recreate it, stop again since there's no leader, and so on.

Prevent spinning by introducing some wait time in between if it's the first time trying, and check before returning whether a leader has become available in the meantime, since we could then still continue.

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
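The wait-then-recheck idea from this commit message can be sketched as follows. This is an illustration under assumptions, not the code from #6050: `abortIfStillNoLeader`, its parameters, and the `hasLeader` callback are hypothetical names, and the error variable is a local stand-in for the server's own.

```go
package main

import (
	"errors"
	"time"
)

// Local stand-in for the server's error; the real message may differ.
var errCatchupAbortedNoLeader = errors.New("catchup aborted, no leader")

// abortIfStillNoLeader (hypothetical helper) is called once a missing leader
// has been detected. On the first attempt it pauses for wait so the caller
// does not spin through stop/recreate cycles. Before returning it checks
// again whether a leader has appeared, in which case processing can continue.
func abortIfStillNoLeader(firstAttempt bool, wait time.Duration, hasLeader func() bool) error {
	if firstAttempt {
		time.Sleep(wait)
	}
	if hasLeader() {
		return nil // A leader became available in the meantime; keep going.
	}
	return errCatchupAbortedNoLeader
}
```

A caller would pass a closure over its own leader check, for example `abortIfStillNoLeader(true, time.Second, func() bool { return leaderKnown })`, where `leaderKnown` is whatever state the caller tracks.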

MauriceVanVeen commented Oct 10, 2024

mprimi reviewed Oct 10, 2024

MauriceVanVeen force-pushed the maurice/desync-after-catchup-no-leader branch from d421d9b to c391623 Compare October 10, 2024 20:30

mprimi reviewed Oct 11, 2024

Fix desync after errCatchupAbortedNoLeader

bf69ce9

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>

MauriceVanVeen force-pushed the maurice/desync-after-catchup-no-leader branch from c391623 to bf69ce9 Compare October 11, 2024 20:02

MauriceVanVeen marked this pull request as ready for review October 11, 2024 20:34

MauriceVanVeen requested a review from a team as a code owner October 11, 2024 20:34

derekcollison approved these changes Oct 11, 2024

derekcollison merged commit d18f743 into main Oct 11, 2024
5 checks passed

derekcollison deleted the maurice/desync-after-catchup-no-leader branch October 11, 2024 21:35

neilalexander mentioned this pull request Oct 16, 2024

Cherry-picks for 2.10.22-RC.3 #6012

Merged

neilalexander added a commit that referenced this pull request Oct 16, 2024

Cherry-picks for 2.10.22-RC.3 (#6012)

d493f0d

MauriceVanVeen mentioned this pull request Oct 29, 2024

[FIXED] Don't spin during snapshot processing with no leader #6050

Merged
