Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raft: step down immediately on leader transfer #130123

Merged

Conversation

nvanbenschoten
Copy link
Member

@nvanbenschoten nvanbenschoten commented Sep 4, 2024

Part of #129807.

This commit updates the handling of raft leadership transfers to have the outgoing leader step down immediately after the MsgTimeoutNow is sent. We retain the leadTransferee field because there is still a "pending", reversible transfer state while the leader is catching the follower up on its log. However, once the catch up is complete, the leader now steps down, as it has irrevocably compromised its leadership term by giving the target permission to overthrow it.

If the transfer fails after the MsgTimeoutNow has been sent, the leader (who has stepped down to a follower) must call a new election at a new term in order to reestablish leadership.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nvanbenschoten nvanbenschoten requested review from pav-kv and removed request for pav-kv September 5, 2024 17:38
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/raftLeaderTransfer branch 2 times, most recently from 91ce7d8 to b7fe394 Compare September 5, 2024 19:06
Copy link
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good. Talking about this in the pod meeting yesterday and then coming to review the PR made it a breeze

Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 3 of 3 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/raft/raft.go line 372 at r3 (raw file):

	// leadTransferee, if set, is the id of the leader transfer target during a
	// pending leadership transfer. The value is set while the outgoing leader
	// (this node) is catching the target up on its log. Once the target is caught

Consider adding some words about how this field is used when the leader is catching up the target. You mentioned in yesterday's pod meeting that it's to reject incoming proposals, right?

nit: maybe a [1] pattern that we use elsewhere to not distract from the main idea of the comment.


pkg/raft/raft.go line 376 at r3 (raw file):

// it has irrevocably compromised its leadership term by giving the target
// permission to overthrow it.

ChatGPT couldn't have dramatized this better 😂


pkg/raft/raft.go line 1760 at r3 (raw file):

		// Transfer leadership to third party.
		r.logger.Infof("%x [term %d] starts to transfer leadership to %x", r.id, r.Term, leadTransferee)
		// Transfer leadership should be finished in one electionTimeout, so reset r.electionElapsed.

nit: should we expand this comment to say that if it doesn't, the outgoing leader is free to campaign at a higher term.


pkg/raft/raft.go line 1768 at r3 (raw file):

		} else {
			pr.MsgAppProbesPaused = false
			r.maybeSendAppend(leadTransferee)

Consider adding a comment here that explains how we handle the state transition when we're trying to transfer leadership to a follower that's behind on it's log. Might be worth including a pointer to MsgAppResp where we call transferLeader?

All of this was very easy to follow along because we talked about this yesterday and you've got a great explanation in your commit message, but if I were just reading the code, I'd think this was a bit under documented and subtle.


pkg/raft/raft.go line 2301 at r3 (raw file):

}

func (r *raft) transferLeader(to pb.PeerID) {

Should we add an assertion to say r.state == StateLeader && r.lead == r.id?


pkg/raft/raft.go line 2306 at r3 (raw file):

}

func (r *raft) abortLeaderTransfer() {

Should we add the assertion from above here as well?


pkg/raft/testdata/confchange_v2_replace_leader.txt line 158 at r3 (raw file):

raft-state
----
1: StateFollower (Voter) Term:1 Lead:0

Unfortunate how the Lead: 0 diff came about due to

if s.RaftState == StateFollower && s.Lead == r.id {
// A raft leader's term ends when it is shut down. It'll rejoin its peers as
// a follower when it comes back up, but its Lead and Term field may still
// correspond to its pre-restart leadership term. We expect this to quickly
// be updated when it hears from the new leader, if one was elected in its
// absence, or when it campaigns.
//
// The layers above raft (in particular kvserver) do not handle the case
// where a raft node's state is StateFollower but its lead field points to
// itself. They expect the Lead field to correspond to the current leader,
// which we know we are not. For their benefit, we overwrite the Lead field
// to None.
//
// TODO(arul): the layers above should not conflate Lead with current
// leader. Fix that and get rid of this overwrite.
s.HardState.Lead = None
}


pkg/raft/raft_test.go line 3201 at r4 (raw file):

		}
		require.Equal(t, expState, r.state)
		require.Equal(t, uint64(3), r.Term)

Do you think it's worth adding a variant of this test where the node that receives the MsgTimeoutNow (n1) from n2, T2 hasn't heard about n3 being leader at T3?

Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani)


pkg/raft/raft.go line 372 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Consider adding some words about how this field is used when the leader is catching up the target. You mentioned in yesterday's pod meeting that it's to reject incoming proposals, right?

nit: maybe a [1] pattern that we use elsewhere to not distract from the main idea of the comment.

Done.


pkg/raft/raft.go line 376 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

// it has irrevocably compromised its leadership term by giving the target
// permission to overthrow it.

ChatGPT couldn't have dramatized this better 😂

There's no point writing code comments if you aren't going to make them dramatic 😂


pkg/raft/raft.go line 1760 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: should we expand this comment to say that if it doesn't, the outgoing leader is free to campaign at a higher term.

Done.


pkg/raft/raft.go line 1768 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Consider adding a comment here that explains how we handle the state transition when we're trying to transfer leadership to a follower that's behind on it's log. Might be worth including a pointer to MsgAppResp where we call transferLeader?

All of this was very easy to follow along because we talked about this yesterday and you've got a great explanation in your commit message, but if I were just reading the code, I'd think this was a bit under documented and subtle.

Done.


pkg/raft/raft.go line 2301 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Should we add an assertion to say r.state == StateLeader && r.lead == r.id?

Done.


pkg/raft/raft.go line 2306 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Should we add the assertion from above here as well?

We call this in reset as well, so it doesn't necessarily assume that a leadership transfer is in progress. I'll leave as is for now.


pkg/raft/raft_test.go line 3201 at r4 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Do you think it's worth adding a variant of this test where the node that receives the MsgTimeoutNow (n1) from n2, T2 hasn't heard about n3 being leader at T3?

Done.

Copy link
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 3 of 3 files at r5, 3 of 3 files at r6, 2 of 2 files at r7, 1 of 1 files at r8, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/raft/raft.go line 2306 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We call this in reset as well, so it doesn't necessarily assume that a leadership transfer is in progress. I'll leave as is for now.

Want to add a TODO to add this assertion in the future? Feel free to put my name on it.


pkg/raft/raft.go line 1773 at r6 (raw file):

		} else {
			// If the transfer target is not initially caught up on its log, we don't
			// sent it a MsgTimeoutNow immediately. Instead, we eagerly try to catch

nit: s/sent/send/

Instead of checking for a leader step down multiple times in the middle of
tickHeartbeat, check once on the code path that could cause the leader to
step down (check quorum). This logic dates back to 65b78dc and 3d04e27.

Epic: None
Release note: None
This commit adds assertions on `raft.state` at the beginning of
`tickElection` and `tickHeartbeat`.

Epic: None
Release note: None
Part of cockroachdb#129807.

This commit updates the handling of raft leadership transfers to have the
outgoing leader step down immediately after the MsgTimeoutNow is sent. We
retain the `leadTransferee` field because there is still a "pending",
reversible transfer state while the leader is catching the follower up on
its log. However, once the catch up is complete, the leader now steps down,
as it has irrevocably compromised its leadership term by giving the target
permission to overthrow it.

If the transfer fails after the MsgTimeoutNow has been sent, the leader
(who has stepped down to a follower) must call a new election at a new
term in order to reestablish leadership.

Release note: None
This commit adds a new `TestLeaderTransferDifferentTerms`, which verifies
that a MsgTimeoutNow will only be respected if it is from the current term
or from a new term.

Epic: None
Release note: None
This commit adds a new `TestLeaderTransferStaleFollower`, which verifies
that a MsgTimeoutNow received by a stale follower (a follower still at
an earlier term) will cause the follower to call an election which it
can not win.

Epic: None
Release note: None
Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani)


pkg/raft/raft.go line 2306 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Want to add a TODO to add this assertion in the future? Feel free to put my name on it.

Done.


pkg/raft/raft.go line 1773 at r6 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: s/sent/send/

Done.

@craig craig bot merged commit f3519c0 into cockroachdb:master Sep 6, 2024
22 of 23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants