[FIXED] Don't spin during snapshot processing with no leader (#6050)
Since the introduction of #5939 and #5986, we no longer blow away our
state, as we can keep retrying.

However, if a follower had installed a snapshot from the leader and
then started processing it, only for the leader to go offline for an
extended period, we could spin: we'd immediately detect there's no
leader, stop the RAFT group, recreate it, stop again since there's
still no leader, and so on.

Prevent spinning by introducing some wait time if it's the first time
trying, and check before returning whether a leader has become
available in the meantime, since then we can still continue.


Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
derekcollison authored Oct 29, 2024
2 parents c9e24ca + 1be38b9 commit 07c7eda
Showing 1 changed file with 10 additions and 1 deletion.
11 changes: 10 additions & 1 deletion server/jetstream_cluster.go
@@ -8380,7 +8380,16 @@ RETRY:
 	releaseSyncOutSem()
 
 	if n.GroupLeader() == _EMPTY_ {
-		return fmt.Errorf("%w for stream '%s > %s'", errCatchupAbortedNoLeader, mset.account(), mset.name())
+		// Prevent us from spinning if we've installed a snapshot from a leader but there's no leader online.
+		// We wait a bit to check if a leader has come online in the meantime, if so we can continue.
+		var canContinue bool
+		if numRetries == 0 {
+			time.Sleep(startInterval)
+			canContinue = n.GroupLeader() != _EMPTY_
+		}
+		if !canContinue {
+			return fmt.Errorf("%w for stream '%s > %s'", errCatchupAbortedNoLeader, mset.account(), mset.name())
+		}
 	}
 
 	// If we have a sub clear that here.