-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix desync after errCatchupAbortedNoLeader #5986
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a fair summary of the change? (the PR description makes some references and the commit message does not say what the change does):
When aborting catchup due to leader not present, do not wipe replica state
d421d9b
to
c391623
Compare
Yes 🙂 |
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
c391623
to
bf69ce9
Compare
LMK when this is good to go for review. |
Awaiting CI to be green, but otherwise good for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
With the introduction of these PRs, #5939 and #5986, we don't blow away our state anymore as we can keep retrying. However, if a follower had installed a snapshot from the leader and would then start processing it, only for the leader to go offline for an extended period, we could spin. Since we'd immediately detect there's no leader, stop the RAFT group, recreate it, stop since no leader, etc. etc. Prevent spinning by introducing some wait time in-between if it's the first time trying, and check before returning if a leader became available since as then we could still continue. Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
Previously a related case of RAFT state being deleted was fixed, when running into
errCatchupTooManyRetries
: #5939After hitting this we shutdown and retry.. but if we have not elected a leader yet we'd hit
"catchup for stream '%s > %s' aborted, no leader"
, which then would again throw away RAFT state. This PR proposes a fix for that case.Signed-off-by: Maurice van Veen github@mauricevanveen.com