You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Please check the FAQ documentation before raising an issue
Describe the bug (required)
We frequently observe that raft leader election take very long time(20xElectionTimeout). Log from our test:
I1207 14:47:32.645941 319 figure8.go:43] leader store1 isolated, down list: map[store1:{} store4:{} store5:{}]
I1207 14:47:32.645992 319 figure8.go:53] rejoing instance store4
I1207 14:47:32.774480 319 raft.go:150] found leader of term: 117, leader: store5
I1207 14:47:32.779685 319 raft.go:150] found leader of term: 123, leader: store1
I1207 14:47:32.801347 319 figure8.go:45] found isolated leader: store1, down list: map[store1:{} store5:{}], continue
I1207 14:47:32.808862 319 raft.go:150] found leader of term: 123, leader: store1
I1207 14:47:32.845548 319 figure8.go:45] found isolated leader: store1, down list: map[store1:{} store5:{}], continue
I1207 14:47:32.847410 319 raft.go:150] found leader of term: 117, leader: store5
I1207 14:47:32.898852 319 figure8.go:45] found isolated leader: store1, down list: map[store1:{} store5:{}], continue
I1207 14:47:32.902135 319 raft.go:150] found leader of term: 123, leader: store1
I1207 14:47:32.944534 319 raft.go:150] found leader of term: 117, leader: store5
...
I1207 14:47:50.117698 319 figure8.go:45] found isolated leader: store1, down list: map[store1:{} store5:{}], continue
I1207 14:47:50.163188 319 raft.go:150] found leader of term: 123, leader: store1
I1207 14:47:50.168412 319 figure8.go:45] found isolated leader: store1, down list: map[store1:{} store5:{}], continue
I1207 14:47:50.171734 319 raft.go:150] found leader of term: 123, leader: store1
I1207 14:47:51.712303 319 figure8.go:45] found isolated leader: store1, down list: map[store1:{} store5:{}], continue
I1207 14:47:51.723445 319 raft.go:150] found leader of term: 123, leader: store1
I1207 14:47:51.726491 319 raft.go:150] found leader of term: 137, leader: store2
The election timeout was set to 1s, it taks nearly 20s for store2, store3, store4 to elect a new leader after store1 and store5 was isolated. We dig into the log and find that in term 126, store3 and store4 had voted for store2, but store2 report no enough quorum:
store2 log:
1371 I1207 14:47:39.227897 50 RaftPart.cpp:1087] [Port: 9780, Space: 1, Part: 1] Did not get enough votes from election of term 126, isPreVote = 0
1372 I1207 14:47:39.959782 50 RaftPart.cpp:1261] [Port: 9780, Space: 1, Part: 1] Received a VOTING request: space = 1, partition = 1, candidateAddr = store4:9780, term = 127, lastLogId = 187, lastLogTerm = 112, isPreVote = 0
1373 I1207 14:47:39.960505 50 RaftPart.cpp:1289] [Port: 9780, Space: 1, Part: 1] The partition currently is a Candidate, lastLogId 191, lastLogTerm 123, committedLogId 189, term 126
1374 I1207 14:47:39.961179 50 RaftPart.cpp:1324] [Port: 9780, Space: 1, Part: 1] The partition's last term to receive a log is 123, which is newer than the candidate's log 112. So the candidate will be rejected
store3 log:
1366 I1207 14:47:38.695526 51 RaftPart.cpp:1261] [Port: 9780, Space: 1, Part: 1] Received a VOTING request: space = 1, partition = 1, candidateAddr = store2:9780, term = 126, lastLogId = 191, lastLogTerm = 123, isPreVote = 0
1367 I1207 14:47:38.699589 51 RaftPart.cpp:1289] [Port: 9780, Space: 1, Part: 1] The partition currently is a Candidate, lastLogId 190, lastLogTerm 117, committedLogId 189, term 123
1368 I1207 14:47:38.702976 51 RaftPart.cpp:1360] [Port: 9780, Space: 1, Part: 1] The partition will vote for the candidate "store2":9780, isPreVote = 0
1369 I1207 14:47:38.707033 55 RaftPart.cpp:1047] [Port: 9780, Space: 1, Part: 1] Partition's role has changed to Follower during the election, so discard the results
store4 log:
1436 I1207 14:47:38.690505 56 RaftPart.cpp:1261] [Port: 9780, Space: 1, Part: 1] Received a VOTING request: space = 1, partition = 1, candidateAddr = store2:9780, term = 126, lastLogId = 191, lastLogTerm = 123, isPreVote = 0
1437 I1207 14:47:38.691352 56 RaftPart.cpp:1289] [Port: 9780, Space: 1, Part: 1] The partition currently is a Follower, lastLogId 187, lastLogTerm 112, committedLogId 187, term 1261438 I1207 14:47:38.692145 56 RaftPart.cpp:1360] [Port: 9780, Space: 1, Part: 1] The partition will vote for the candidate "store2":9780, isPreVote = 0
1439 I1207 14:47:38.718819 56 RaftPart.cpp:1261] [Port: 9780, Space: 1, Part: 1] Received a VOTING request: space = 1, partition = 1, candidateAddr = store2:9780, term = 126, lastLogId = 191, lastLogTerm = 123, isPreVote = 0
1440 I1207 14:47:38.719914 56 RaftPart.cpp:1289] [Port: 9780, Space: 1, Part: 1] The partition currently is a Follower, lastLogId 187, lastLogTerm 112, committedLogId 187, term 1261441 I1207 14:47:38.720969 56 RaftPart.cpp:1360] [Port: 9780, Space: 1, Part: 1] The partition will vote for the candidate "store2":9780, isPreVote = 0
take a look at the collectNSucceeded() used by raft election:
<< ", Result list size=" << ctx->results.size() << "]";
ctx->promise.setValue(std::move(ctx->results));
}
}
});
}
return ctx->promise.getFuture();
}
} // namespace nebula
and set a breakpoint at line 49, we found that the thrift client of store3 throw an Timeout exception while waiting for AskVote response from store3. We suspect that store3's response might have been block by requests to store1 or store4.
so, the time to start election is way bigger than expected. 2. The collectN runs on same thread.
I dump the network traffic using tcpdump and found that the leader do send an election request to a partitioned raft instance(and no more request to other instances), so I think requests to other raft peers might actually blocked.
Please check the FAQ documentation before raising an issue
Describe the bug (required)
We frequently observe that raft leader election take very long time(20xElectionTimeout). Log from our test:
The election timeout was set to 1s, it taks nearly 20s for store2, store3, store4 to elect a new leader after store1 and store5 was isolated. We dig into the log and find that in term 126, store3 and store4 had voted for store2, but store2 report no enough quorum:
store2 log:
store3 log:
store4 log:
take a look at the
collectNSucceeded()
used by raft election:nebula/src/common/base/CollectNSucceeded-inl.h
Lines 1 to 69 in cd2f4a6
and set a breakpoint at line 49, we found that the thrift client of store3 throw an Timeout exception while waiting for AskVote response from store3. We suspect that store3's response might have been block by requests to store1 or store4.
Your Environments (required)
uname -a
g++ --version
orclang++ --version
lscpu
a3ffc7d8
)How To Reproduce(required)
Steps to reproduce the behavior:
Expected behavior
Additional context
The text was updated successfully, but these errors were encountered: