[FIXED] Don't spin during snapshot processing with no leader (#6050)

Since #5939 and #5986, we no longer blow away our state, because we can keep retrying. However, if a follower installed a snapshot from the leader and started processing it, and the leader then went offline for an extended period, we could spin: we'd immediately detect there's no leader, stop the RAFT group, recreate it, stop again because there's still no leader, and so on.

Prevent the spinning by waiting before returning on the first attempt, then checking again whether a leader has become available, in which case we can still continue.

Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
nats-io · Oct 29, 2024 · 07c7eda
2 parents c9e24ca + 1be38b9
Showing 1 changed file with 10 additions and 1 deletion.
diff --git a/server/jetstream_cluster.go b/server/jetstream_cluster.go
@@ -8380,7 +8380,16 @@ RETRY:
 	releaseSyncOutSem()
 
 	if n.GroupLeader() == _EMPTY_ {
-		return fmt.Errorf("%w for stream '%s > %s'", errCatchupAbortedNoLeader, mset.account(), mset.name())
+		// Prevent us from spinning if we've installed a snapshot from a leader but there's no leader online.
+		// We wait a bit to check if a leader has come online in the meantime, if so we can continue.
+		var canContinue bool
+		if numRetries == 0 {
+			time.Sleep(startInterval)
+			canContinue = n.GroupLeader() != _EMPTY_
+		}
+		if !canContinue {
+			return fmt.Errorf("%w for stream '%s > %s'", errCatchupAbortedNoLeader, mset.account(), mset.name())
+		}
 	}
 
 	// If we have a sub clear that here.