Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raft leader election taks very long time #3439

Closed
kikimo opened this issue Dec 9, 2021 · 2 comments
Closed

Raft leader election taks very long time #3439

kikimo opened this issue Dec 9, 2021 · 2 comments
Assignees
Labels
type/enhancement Type: make the code neat or more efficient
Milestone

Comments

@kikimo
Copy link
Contributor

kikimo commented Dec 9, 2021

Please check the FAQ documentation before raising an issue

Describe the bug (required)

We frequently observe that raft leader election take very long time(20xElectionTimeout). Log from our test:

I1207 14:47:32.645941     319 figure8.go:43] leader store1 isolated, down list: map[store1:{} store4:{} store5:{}]
I1207 14:47:32.645992     319 figure8.go:53] rejoing instance store4
I1207 14:47:32.774480     319 raft.go:150] found leader of term: 117, leader: store5
I1207 14:47:32.779685     319 raft.go:150] found leader of term: 123, leader: store1
I1207 14:47:32.801347     319 figure8.go:45] found isolated leader: store1, down list: map[store1:{} store5:{}], continue
I1207 14:47:32.808862     319 raft.go:150] found leader of term: 123, leader: store1
I1207 14:47:32.845548     319 figure8.go:45] found isolated leader: store1, down list: map[store1:{} store5:{}], continue
I1207 14:47:32.847410     319 raft.go:150] found leader of term: 117, leader: store5
I1207 14:47:32.898852     319 figure8.go:45] found isolated leader: store1, down list: map[store1:{} store5:{}], continue
I1207 14:47:32.902135     319 raft.go:150] found leader of term: 123, leader: store1
I1207 14:47:32.944534     319 raft.go:150] found leader of term: 117, leader: store5
...
I1207 14:47:50.117698     319 figure8.go:45] found isolated leader: store1, down list: map[store1:{} store5:{}], continue
I1207 14:47:50.163188     319 raft.go:150] found leader of term: 123, leader: store1
I1207 14:47:50.168412     319 figure8.go:45] found isolated leader: store1, down list: map[store1:{} store5:{}], continue
I1207 14:47:50.171734     319 raft.go:150] found leader of term: 123, leader: store1
I1207 14:47:51.712303     319 figure8.go:45] found isolated leader: store1, down list: map[store1:{} store5:{}], continue
I1207 14:47:51.723445     319 raft.go:150] found leader of term: 123, leader: store1
I1207 14:47:51.726491     319 raft.go:150] found leader of term: 137, leader: store2

The election timeout was set to 1s, it taks nearly 20s for store2, store3, store4 to elect a new leader after store1 and store5 was isolated. We dig into the log and find that in term 126, store3 and store4 had voted for store2, but store2 report no enough quorum:

store2 log:

1371 I1207 14:47:39.227897    50 RaftPart.cpp:1087] [Port: 9780, Space: 1, Part: 1] Did not get enough votes from election of term 126, isPreVote = 0
1372 I1207 14:47:39.959782    50 RaftPart.cpp:1261] [Port: 9780, Space: 1, Part: 1] Received a VOTING request: space = 1, partition = 1, candidateAddr = store4:9780, term = 127, lastLogId = 187, lastLogTerm = 112, isPreVote = 0
1373 I1207 14:47:39.960505    50 RaftPart.cpp:1289] [Port: 9780, Space: 1, Part: 1] The partition currently is a Candidate, lastLogId 191, lastLogTerm 123, committedLogId 189, term 126
1374 I1207 14:47:39.961179    50 RaftPart.cpp:1324] [Port: 9780, Space: 1, Part: 1] The partition's last term to receive a log is 123, which is newer than the candidate's log 112. So the candidate will be rejected

store3 log:

1366 I1207 14:47:38.695526    51 RaftPart.cpp:1261] [Port: 9780, Space: 1, Part: 1] Received a VOTING request: space = 1, partition = 1, candidateAddr = store2:9780, term = 126, lastLogId = 191, lastLogTerm = 123, isPreVote = 0
1367 I1207 14:47:38.699589    51 RaftPart.cpp:1289] [Port: 9780, Space: 1, Part: 1] The partition currently is a Candidate, lastLogId 190, lastLogTerm 117, committedLogId 189, term 123
1368 I1207 14:47:38.702976    51 RaftPart.cpp:1360] [Port: 9780, Space: 1, Part: 1] The partition will vote for the candidate "store2":9780, isPreVote = 0
1369 I1207 14:47:38.707033    55 RaftPart.cpp:1047] [Port: 9780, Space: 1, Part: 1] Partition's role has changed to Follower during the election, so discard the results

store4 log:

1436 I1207 14:47:38.690505    56 RaftPart.cpp:1261] [Port: 9780, Space: 1, Part: 1] Received a VOTING request: space = 1, partition = 1, candidateAddr = store2:9780, term = 126, lastLogId = 191, lastLogTerm = 123, isPreVote = 0
1437 I1207 14:47:38.691352    56 RaftPart.cpp:1289] [Port: 9780, Space: 1, Part: 1] The partition currently is a Follower, lastLogId 187, lastLogTerm 112, committedLogId 187, term 1261438 I1207 14:47:38.692145    56 RaftPart.cpp:1360] [Port: 9780, Space: 1, Part: 1] The partition will vote for the candidate "store2":9780, isPreVote = 0
1439 I1207 14:47:38.718819    56 RaftPart.cpp:1261] [Port: 9780, Space: 1, Part: 1] Received a VOTING request: space = 1, partition = 1, candidateAddr = store2:9780, term = 126, lastLogId = 191, lastLogTerm = 123, isPreVote = 0
1440 I1207 14:47:38.719914    56 RaftPart.cpp:1289] [Port: 9780, Space: 1, Part: 1] The partition currently is a Follower, lastLogId 187, lastLogTerm 112, committedLogId 187, term 1261441 I1207 14:47:38.720969    56 RaftPart.cpp:1360] [Port: 9780, Space: 1, Part: 1] The partition will vote for the candidate "store2":9780, isPreVote = 0

take a look at the collectNSucceeded() used by raft election:

/* Copyright (c) 2018 vesoft inc. All rights reserved.
*
* This source code is licensed under Apache 2.0 License.
*/
#pragma once
#include <folly/ExceptionWrapper.h>
namespace nebula {
// Not thread-safe, all futures need to be on the same executor
template <class FutureIter, typename ResultEval>
folly::Future<SucceededResultList<FutureIter>> collectNSucceeded(FutureIter first,
FutureIter last,
size_t n,
ResultEval&& eval) {
using Result = SucceededResultList<FutureIter>;
if (n == 0) {
return folly::Future<Result>(Result());
}
struct Context {
Context(size_t total, ResultEval&& e) : eval(std::forward<ResultEval>(e)), nTotal(total) {}
ResultEval eval;
Result results;
std::atomic<size_t> numCompleted = {0};
std::atomic<size_t> nSucceeded = {0};
folly::Promise<Result> promise;
size_t nTotal;
};
size_t total = size_t(std::distance(first, last));
DCHECK_GE(total, 0U);
if (total < n) {
return folly::Future<Result>(
folly::exception_wrapper(std::runtime_error("Not enough futures")));
}
auto ctx = std::make_shared<Context>(total, std::forward<ResultEval>(eval));
// for each succeeded Future, add to the result list, until
// we have required number of futures, at which point we fulfil
// the promise with the result list
for (size_t index = 0; first != last; ++first, ++index) {
first->setCallback_([n, ctx, index](auto, folly::Try<FutureReturnType<FutureIter>>&& t) {
if (!ctx->promise.isFulfilled()) {
if (!t.hasException()) {
if (ctx->eval(index, t.value())) {
++ctx->nSucceeded;
}
ctx->results.emplace_back(index, std::move(t.value()));
}
if ((++ctx->numCompleted) == ctx->nTotal || ctx->nSucceeded == n) {
// Done
VLOG(2) << "Set Value [completed=" << ctx->numCompleted << ", total=" << ctx->nTotal
<< ", Result list size=" << ctx->results.size() << "]";
ctx->promise.setValue(std::move(ctx->results));
}
}
});
}
return ctx->promise.getFuture();
}
} // namespace nebula

and set a breakpoint at line 49, we found that the thrift client of store3 throw an Timeout exception while waiting for AskVote response from store3. We suspect that store3's response might have been block by requests to store1 or store4.

(rr) p index
$99 = 0
(rr) p t.hasException()
$100 = true
(rr) p t.exception()
$102 = (folly::exception_wrapper &) @0x7f3c0de13158: {static uninit_ = {copy_ = 0x48b3790 <void folly::exception_wrapper::noop_<void, folly::exception_wrapper const*, folly::exception_wrapper*>(folly::exception_wrapper const*, folly::exception_wrapper*)>,
    move_ = 0x48b37a0 <void folly::exception_wrapper::noop_<void, folly::exception_wrapper*, folly::exception_wrapper*>(folly::exception_wrapper*, folly::exception_wrapper*)>,
    delete_ = 0x48b37b0 <void folly::exception_wrapper::noop_<void, folly::exception_wrapper*>(folly::exception_wrapper*)>, throw_ = 0x48b37c0 <void folly::exception_wrapper::noop_<void, folly::exception_wrapper const*>(folly::exception_wrapper const*)>,
    type_ = 0x48b36d0 <folly::exception_wrapper::uninit_type_(folly::exception_wrapper const*)>, get_exception_ = 0x48b37d0 <std::exception const* folly::exception_wrapper::noop_<std::exception const*, folly::exception_wrapper const*>(folly::exception_wrapper const*)>,
    get_exception_ptr_ = 0x48b37e0 <folly::exception_wrapper folly::exception_wrapper::noop_<folly::exception_wrapper, folly::exception_wrapper const*>(folly::exception_wrapper const*)>}, {buff_ = {buff_ = {__data = "0s\213\"\032$\000\000 s\213\"\032$\000",
        __align = {<No data fields>}}}, eptr_ = {ptr_ = {_M_exception_object = 0x241a228b7330}, exception_or_type_ = 39694667313952, static ops_ = {copy_ = 0x48b3800 <folly::exception_wrapper::ExceptionPtr::copy_(folly::exception_wrapper const*, folly::exception_wrapper*)>,
        move_ = 0x48b3890 <folly::exception_wrapper::ExceptionPtr::move_(folly::exception_wrapper*, folly::exception_wrapper*)>, delete_ = 0x48b3870 <folly::exception_wrapper::ExceptionPtr::delete_(folly::exception_wrapper*)>,
        throw_ = 0x48b3830 <folly::exception_wrapper::ExceptionPtr::throw_(folly::exception_wrapper const*)>, type_ = 0x48b36e0 <folly::exception_wrapper::ExceptionPtr::type_(folly::exception_wrapper const*)>,
        get_exception_ = 0x48b3700 <folly::exception_wrapper::ExceptionPtr::get_exception_(folly::exception_wrapper const*)>, get_exception_ptr_ = 0x48b3710 <folly::exception_wrapper::ExceptionPtr::get_exception_ptr_(folly::exception_wrapper const*)>}}, sptr_ = {
      ptr_ = {<std::__shared_ptr<folly::exception_wrapper::SharedPtr::Base, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<folly::exception_wrapper::SharedPtr::Base, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x241a228b7330, _M_refcount = {
            _M_pi = 0x241a228b7320}}, <No data fields>}, static ops_ = {copy_ = 0x48b38c0 <folly::exception_wrapper::SharedPtr::copy_(folly::exception_wrapper const*, folly::exception_wrapper*)>,
        move_ = 0x48b3940 <folly::exception_wrapper::SharedPtr::move_(folly::exception_wrapper*, folly::exception_wrapper*)>, delete_ = 0x48b3a10 <folly::exception_wrapper::SharedPtr::delete_(folly::exception_wrapper*)>,
        throw_ = 0x2b69d34 <folly::exception_wrapper::SharedPtr::throw_(folly::exception_wrapper const*)>, type_ = 0x48b3750 <folly::exception_wrapper::SharedPtr::type_(folly::exception_wrapper const*)>,
        get_exception_ = 0x48b3760 <folly::exception_wrapper::SharedPtr::get_exception_(folly::exception_wrapper const*)>, get_exception_ptr_ = 0x48b3770 <folly::exception_wrapper::SharedPtr::get_exception_ptr_(folly::exception_wrapper const*)>}}},
  vptr_ = 0x5d5cd40 <folly::exception_wrapper::SharedPtr::ops_>}
(rr) p t.exception().what()
$103 = {static npos = <error reading variable: Missing ELF symbol "folly::basic_fbstring<char, std::char_traits<char>, std::allocator<char>, folly::fbstring_core<char> >::npos".>, store_ = {{
      bytes_ = "І\365A\311\002\000\000:\000\000\000\000\000\000\000O\000\000\000\000\000\000\200", small_ = "І\365A\311\002\000\000:\000\000\000\000\000\000\000O\000\000\000\000\000\000\200", ml_ = {
        data_ = 0x2c941f586d0 "N6apache6thrift9transport19TTransportExceptionE: Timed Out", size_ = 58, capacity_ = 9223372036854775887}}, static lastChar = 23, static maxSmallSize = 23, static maxMediumSize = 254, static categoryExtractMask = 192 '\300',
    static kCategoryShift = 56, static capacityExtractMask = 4611686018427387903}}
(rr)

Your Environments (required)

  • OS: uname -a
  • Compiler: g++ --version or clang++ --version
  • CPU: lscpu
  • Commit id (e.g. a3ffc7d8)

How To Reproduce(required)

Steps to reproduce the behavior:

  1. Step 1
  2. Step 2
  3. Step 3

Expected behavior

Additional context

@kikimo kikimo added the type/bug Type: something is unexpected label Dec 9, 2021
@Sophie-Xie Sophie-Xie added this to the v3.0.0 milestone Dec 9, 2021
@critical27 critical27 added type/enhancement Type: make the code neat or more efficient and removed type/bug Type: something is unexpected labels Dec 28, 2021
@critical27 critical27 modified the milestones: v3.0.0, v3.1.0 Dec 28, 2021
@critical27
Copy link
Contributor

I have checked it two weeks ago, there are several place to enhance:

  1. In RaftPart::statusPolling, the interval between time to trigger election is not 1s. Many time is used, for example:
size_t delay = FLAGS_raft_heartbeat_interval_secs * 1000 / 3 + folly::Random::rand32(500);
...
delay = (folly::Random::rand32(1500) + 500);

so, the time to start election is way bigger than expected.
2. The collectN runs on same thread.

@kikimo
Copy link
Contributor Author

kikimo commented Dec 28, 2021

I have checked it two weeks ago, there are several place to enhance:

  1. In RaftPart::statusPolling, the interval between time to trigger election is not 1s. Many time is used, for example:
size_t delay = FLAGS_raft_heartbeat_interval_secs * 1000 / 3 + folly::Random::rand32(500);
...
delay = (folly::Random::rand32(1500) + 500);

so, the time to start election is way bigger than expected. 2. The collectN runs on same thread.

I dump the network traffic using tcpdump and found that the leader do send an election request to a partitioned raft instance(and no more request to other instances), so I think requests to other raft peers might actually blocked.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/enhancement Type: make the code neat or more efficient
Projects
None yet
Development

No branches or pull requests

4 participants