Prevent disruptive rejoining node #9333
Comments
We expect the follower to receive a heartbeat from the leader within the heartbeat timeout. This is a reasonable assumption to make. If restarting nodes keep disrupting the cluster, the user probably set the heartbeat timeout to some low value. Also, we will have pre-vote enabled in the future. I do not think we should make what the etcd server does today configurable unless I missed something. |
Agree. This should reduce disruptive elections and resolve most issues. |
Copying some things from the email thread here - I think we either misunderstand some logs or there is some bug in etcd:
|
We are hitting this issue from #5468. https://github.com/coreos/etcd/blob/df4aafbbdfe5a96699f01bee66b4ddb04d24444f/raft/raft.go#L806-L822 UPDATE: On second thought, #5468 makes sense in case there is an isolated candidate. Responding with Will enable pre-vote in etcd. |
@gyuho even without pre-vote, the restarted node should not ALWAYS disrupt the leadership. If it is the case, we need to figure out why. Enabling pre-vote might mask the bug. |
@xiang90 Right, loosening advance election timeout ticks should help, in addition to pre-vote. Otherwise, we are just hoping the restarted follower receives a heartbeat before the last tick elapses.
This is clearly a follower (or candidate) with a higher term responding to the leader heartbeat, thus forcing the leader to step down, which is expected in Raft to prevent an isolated follower from being stuck repeating elections. Maybe add more logs around this? |
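To make the step-down path concrete, here is a minimal toy model in Go. It is only a sketch of the term comparison described above, not etcd's raft code: the `node` type and both method names are hypothetical, and message types, logs, and quorum are left out.

```go
package main

import "fmt"

// node is a toy stand-in for a Raft member (hypothetical, not etcd's raft struct).
type node struct {
	term     uint64
	isLeader bool
}

// handleHeartbeat models a member receiving a heartbeat sent at leaderTerm.
// It returns the term carried in its reply. A member whose term is already
// higher keeps that term, so the reply exposes the higher term to the sender.
func (n *node) handleHeartbeat(leaderTerm uint64) uint64 {
	if leaderTerm > n.term {
		n.term = leaderTerm
	}
	return n.term
}

// handleResponse models the leader processing a reply. Seeing a higher term
// forces it to adopt that term and step down.
func (n *node) handleResponse(replyTerm uint64) {
	if replyTerm > n.term {
		n.term = replyTerm
		n.isLeader = false
	}
}

func main() {
	leader := &node{term: 5, isLeader: true}
	// The rejoined member timed out and campaigned, bumping its term to 6.
	rejoined := &node{term: 6}

	reply := rejoined.handleHeartbeat(leader.term)
	leader.handleResponse(reply)

	fmt.Printf("leader=%v term=%d\n", leader.isLeader, leader.term) // leader=false term=6
}
```

In this model the leader loses leadership as soon as the rejoined member has incremented its term, regardless of whether that member could ever win the election.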
Not really. What we should do is account for the connection/initialization time when setting the initial election timeout to a low value after a node restarts. Right now, we set it as low as the heartbeat timeout to shorten the initial leader election, so the timeout might be even shorter than the initialization time, as @wojtek-t pointed out. |
Hi, I was able to reproduce this issue (rejoining node triggers leader election) in version 3.1.13:
2fae52714728f955's logs:
f34ba41f615b4eff's logs:
8cc587412352bc16's logs:
Questions:
|
@wojtek-t @mborsz We will re-investigate this.
Previously, we had only one tick left (
It's because the rebooting follower
And this is where the rejoining follower finds two other peers in the 3-node cluster and fast-forwards 8 ticks, leaving only 2 ticks before it triggers an election. Our expectation was that within those last 2 ticks, the rejoining follower receives a leader heartbeat and does not start a disruptive election. This is useful for cross-datacenter deployments with larger heartbeat intervals: when one tick is 1 second, without fast-forwarding you would wait 10 seconds to find out there is no leader; with fast-forwarding you wait just 2 seconds. But this is still a theoretical scenario, and if it happens very often in your use case, it might make sense to make this configurable. Logs tell that the rejoining follower was expecting a leader heartbeat within 100ms x 2 (200 ms) at
Before we make this configurable for low network latency environments, I want to be sure that it won't mask any other bugs. |
Defaults (heartbeat - 100ms, leader election - 1s). |
By default, etcd sets --initial-election-tick-advance=true, so the local member fast-forwards election ticks to speed up the "initial" leader election trigger. This benefits the case of larger election ticks. For instance, a cross-datacenter deployment may require a longer election timeout of 10 seconds. If true, the local node does not need to wait up to 10 seconds. Instead, it forwards its election ticks to 8 seconds and has only 2 seconds left before leader election. The major assumptions are that either the cluster has no active leader, so advancing ticks enables faster leader election, or the cluster already has an established leader and the rejoining follower is likely to receive heartbeats from the leader after the tick advance and before the election timeout. However, when the network from the leader to the rejoining follower is congested and the follower does not receive a leader heartbeat within the remaining election ticks, a disruptive election has to happen, affecting cluster availability. Disabling this would slow down the initial bootstrap process for cross-datacenter deployments. We don't care too much about the delay in cluster bootstrap, but we do care about the availability of etcd clusters. With "initial-election-tick-advance" set to false, a rejoining node has more chance to receive leader heartbeats before disrupting the cluster. etcd-io/etcd#9333
Original discussion https://groups.google.com/d/msg/etcd-dev/82bPTmzGVM0/EiTINb6dBQAJ
https://github.com/coreos/etcd/blob/b03fd4cbc30eaa9f66faca5df655eaaca56990a0/etcdserver/raft.go#L416-L421
https://github.com/coreos/etcd/blob/b03fd4cbc30eaa9f66faca5df655eaaca56990a0/etcdserver/raft.go#L450-L455
Ideally, the acting leader sends a heartbeat message to this follower before step 3, so that follower resets its election ticks and stays available without becoming a candidate.
advanceTicksForElection helps cross-datacenter deployments with larger election timeouts by speeding up the bootstrap process. However, some prefer high availability, and disabling advanceTicksForElection can increase the availability of a restarting follower node.
/cc @jpbetz