Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix desync after errCatchupAbortedNoLeader #5986

Merged
merged 1 commit into from
Oct 11, 2024

Conversation

MauriceVanVeen
Copy link
Member

Previously a related case of RAFT state being deleted was fixed, when running into errCatchupTooManyRetries: #5939

After hitting this we shutdown and retry.. but if we have not elected a leader yet we'd hit "catchup for stream '%s > %s' aborted, no leader", which then would again throw away RAFT state. This PR proposes a fix for that case.

Signed-off-by: Maurice van Veen github@mauricevanveen.com

Copy link
Contributor

@mprimi mprimi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a fair summary of the change? (the PR description makes some references and the commit message does not say what the change does):

When aborting catchup due to leader not present, do not wipe replica state

server/jetstream_cluster.go Outdated Show resolved Hide resolved
@MauriceVanVeen MauriceVanVeen force-pushed the maurice/desync-after-catchup-no-leader branch from d421d9b to c391623 Compare October 10, 2024 20:30
@MauriceVanVeen
Copy link
Member Author

Is this a fair summary of the change?

When aborting catchup due to leader not present, do not wipe replica state

Yes 🙂

server/jetstream_cluster.go Outdated Show resolved Hide resolved
Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
@MauriceVanVeen MauriceVanVeen force-pushed the maurice/desync-after-catchup-no-leader branch from c391623 to bf69ce9 Compare October 11, 2024 20:02
@derekcollison
Copy link
Member

LMK when this is good to go for review.

@MauriceVanVeen
Copy link
Member Author

LMK when this is good to go for review.

Awaiting CI to be green, but otherwise good for review

@MauriceVanVeen MauriceVanVeen marked this pull request as ready for review October 11, 2024 20:34
@MauriceVanVeen MauriceVanVeen requested a review from a team as a code owner October 11, 2024 20:34
Copy link
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@derekcollison derekcollison merged commit d18f743 into main Oct 11, 2024
5 checks passed
@derekcollison derekcollison deleted the maurice/desync-after-catchup-no-leader branch October 11, 2024 21:35
neilalexander added a commit that referenced this pull request Oct 16, 2024
Includes:

- #5986
- #5995 
- #6000
- #5996
- #6002
- #6003
- #6007

Signed-off-by: Neil Twigg <neil@nats.io>
derekcollison added a commit that referenced this pull request Oct 29, 2024
With the introduction of these PRs,
#5939 and
#5986, we don't blow away our
state anymore as we can keep retrying.

However, if a follower had installed a snapshot from the leader and
would then start processing it, only for the leader to go offline for an
extended period, we could spin. Since we'd immediately detect there's no
leader, stop the RAFT group, recreate it, stop since no leader, etc.
etc.

Prevent spinning by introducing some wait time in-between if it's the
first time trying, and check before returning if a leader became
available since as then we could still continue.


Signed-off-by: Maurice van Veen <github@mauricevanveen.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants