yb-master process stuck during initialization #4410

Closed
hectorgcr opened this issue May 6, 2020 · 0 comments

hectorgcr commented May 6, 2020

We noticed that after upgrading a cluster to 2.1.6.0-b9, two of the yb-master processes never finished the initialization process. We found a couple of threads that were stuck:

Thread 20 (Thread 0x7f30353ab700 (LWP 1088)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f303c8eb85c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /home/yugabyte/yb-software/yugabyte-2.1.6.0-b9-centos-x86_64/linuxbrew/lib/libstdc++.so.6
#2  0x00007f30409f2133 in wait<yb::Synchronizer::WaitUntil(const time_point&)::<lambda()> > (__p=..., __lock=..., this=0x7f30353aa828) at /home/yugabyte/yb-software/yugabyte-2.1.6.0-b9-centos-x86_64/linuxbrew-xxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/condition_variable:98
#3  yb::Synchronizer::WaitUntil (this=this@entry=0x7f30353aa800, time=...) at ../../src/yb/util/async_util.cc:72
#4  0x00007f3047d1e42a in Wait (this=0x7f30353aa800) at ../../src/yb/util/async_util.h:83
#5  yb::client::YBClient::Data::SetMasterServerProxy (this=0x1292580, deadline=..., skip_resolution=<optimized out>, wait_for_leader_election=<optimized out>) at ../../src/yb/client/client-internal.cc:1663
#6  0x00007f3047d0374b in yb::client::YBClientBuilder::DoBuild (this=this@entry=0x1173180, messenger=<optimized out>, client=client@entry=0x7f30353aa9c0) at ../../src/yb/client/client.cc:378
#7  0x00007f3047d03e83 in yb::client::YBClientBuilder::Build (this=this@entry=0x1173180, messenger=<optimized out>) at ../../src/yb/client/client.cc:406
#8  0x00007f3047cecb8b in yb::client::AsyncClientInitialiser::InitClient (this=0x1173180) at ../../src/yb/client/async_initializer.cc:76
#9  0x00007f303c8f07a0 in ?? () from /home/yugabyte/yb-software/yugabyte-2.1.6.0-b9-centos-x86_64/linuxbrew/lib/libstdc++.so.6
#10 0x00007f303c105694 in start_thread (arg=0x7f30353ab700) at pthread_create.c:333
#11 0x00007f303b84241d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Thread 13 (Thread 0x7f3031ba4700 (LWP 1095)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007f303c8ee748 in std::__atomic_futex_unsigned_base::_M_futex_wait_until(unsigned int*, unsigned int, bool, std::chrono::duration<long, std::ratio<1l, 1l> >, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /home/yugabyte/yb-software/yugabyte-2.1.6.0-b9-centos-x86_64/linuxbrew/lib/libstdc++.so.6
#2  0x00007f3049ec80f8 in _M_load_and_test_until (__ns=..., __s=..., __has_timeout=<optimized out>, __mo=<optimized out>, __equal=<optimized out>, __operand=<optimized out>, __assumed=<optimized out>, this=<optimized out>)
    at /home/yugabyte/yb-software/yugabyte-2.1.6.0-b9-centos-x86_64/linuxbrew-xxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/bits/atomic_futex.h:104
#3  _M_load_and_test (__mo=<optimized out>, __equal=<optimized out>, __operand=<optimized out>, __assumed=<optimized out>, this=<optimized out>) at /home/yugabyte/yb-software/yugabyte-2.1.6.0-b9-centos-x86_64/linuxbrew-xxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/bits/atomic_futex.h:122
#4  _M_load_when_equal (__mo=std::memory_order_acquire, __val=1, this=0x128da90) at /home/yugabyte/yb-software/yugabyte-2.1.6.0-b9-centos-x86_64/linuxbrew-xxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/bits/atomic_futex.h:162
#5  wait (this=0x128da80) at /home/yugabyte/yb-software/yugabyte-2.1.6.0-b9-centos-x86_64/linuxbrew-xxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/future:322
#6  _M_get_result (this=<optimized out>) at /home/yugabyte/yb-software/yugabyte-2.1.6.0-b9-centos-x86_64/linuxbrew-xxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/future:681
#7  get (this=<optimized out>) at /home/yugabyte/yb-software/yugabyte-2.1.6.0-b9-centos-x86_64/linuxbrew-xxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/future:889
#8  yb::tablet::TransactionStatusResolver::Impl::Execute (this=this@entry=0x1ab1040) at ../../src/yb/tablet/transaction_status_resolver.cc:113
#9  0x00007f3049ec68e5 in Start (deadline=..., this=0x1ab1040) at ../../src/yb/tablet/transaction_status_resolver.cc:60
#10 yb::tablet::TransactionStatusResolver::Start (this=this@entry=0x1bf43b8, deadline=deadline@entry=...) at ../../src/yb/tablet/transaction_status_resolver.cc:233
#11 0x00007f3049eb1bc7 in TryStartCheckLoadedTransactionsStatus<std::atomic<bool>, bool> (flag_to_set=0x1bf4422, flag_to_check=0x1bf4420, this=0x1bf4000) at ../../src/yb/tablet/transaction_participant.cc:1159
#12 Start (this=0x1bf4000) at ../../src/yb/tablet/transaction_participant.cc:167
#13 yb::tablet::TransactionParticipant::Start (this=<optimized out>) at ../../src/yb/tablet/transaction_participant.cc:1327
#14 0x00007f3049e99db9 in yb::tablet::TabletPeer::InitTabletPeer (this=0x1292850, tablet=..., client_future=..., server_mem_tracker=..., messenger=messenger@entry=0x1294600, proxy_cache=0xf6d340, log=..., metric_entity=..., raft_pool=0x12c5200, tablet_prepare_pool=0x12c5800, retryable_requests=0x0, split_op_id=...)
    at ../../src/yb/tablet/tablet_peer.cc:295
#15 0x00007f304ac50a46 in yb::master::SysCatalogTable::OpenTablet (this=this@entry=0xf5e6c0, metadata=...) at ../../src/yb/master/sys_catalog.cc:535
#16 0x00007f304ac518e0 in yb::master::SysCatalogTable::SetupTablet (this=this@entry=0xf5e6c0, metadata=...) at ../../src/yb/master/sys_catalog.cc:486
#17 0x00007f304ac51fb2 in yb::master::SysCatalogTable::Load (this=0xf5e6c0, fs_manager=0x104c780) at ../../src/yb/master/sys_catalog.cc:266
#18 0x00007f304ab5cede in yb::master::CatalogManager::InitSysCatalogAsync (this=this@entry=0x12c0000, is_first_run=is_first_run@entry=false) at ../../src/yb/master/catalog_manager.cc:1288
#19 0x00007f304ab6a6bd in yb::master::CatalogManager::Init (this=0x12c0000, is_first_run=<optimized out>) at ../../src/yb/master/catalog_manager.cc:546
#20 0x00007f304ac0b91a in yb::master::Master::InitCatalogManager (this=this@entry=0x7ffdc92f1140) at ../../src/yb/master/master.cc:275
#21 0x00007f304ac0ba26 in yb::master::Master::InitCatalogManagerTask (this=0x7ffdc92f1140) at ../../src/yb/master/master.cc:264
#22 0x00007f30409c7654 in yb::ThreadPool::DispatchThread (this=0x12c8a00, permanent=false) at ../../src/yb/util/threadpool.cc:608
#23 0x00007f30409c3fdf in operator() (this=0xf6fb58) at /home/yugabyte/yb-software/yugabyte-2.1.6.0-b9-centos-x86_64/linuxbrew-xxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/functional:2267
#24 yb::Thread::SuperviseThread (arg=0xf6fb00) at ../../src/yb/util/thread.cc:744
#25 0x00007f303c105694 in start_thread (arg=0x7f3031ba4700) at pthread_create.c:333
#26 0x00007f303b84241d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 20 is stuck in this method:

Status YBClient::Data::SetMasterServerProxy(CoarseTimePoint deadline,
                                            bool skip_resolution,
                                            bool wait_for_leader_election) {

  Synchronizer sync;
  SetMasterServerProxyAsync(deadline, skip_resolution,
      wait_for_leader_election, sync.AsStatusCallback());
  return sync.Wait();
}

In other words, it is waiting for a leader election to happen.
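
To make the blocking behaviour concrete, here is a minimal sketch of the "wrap an async callback in a blocking wait" pattern that the snippet above relies on. It is not yb::Synchronizer: SimpleSynchronizer, RunBlocking, the string-based Status and the timeout are all hypothetical, chosen only so the example terminates. If the callback is never invoked, the waiter never wakes up on its own, which is the state Thread 20 is in.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical stand-in for a status type: an empty string means OK.
using Status = std::string;
using StatusCallback = std::function<void(const Status&)>;

// Illustrative blocking wrapper around a callback-based API.
class SimpleSynchronizer {
 public:
  // Callback to hand to the asynchronous operation.
  StatusCallback AsStatusCallback() {
    return [this](const Status& s) {
      {
        std::lock_guard<std::mutex> lock(mutex_);
        status_ = s;
        done_ = true;
      }
      cond_.notify_all();
    };
  }

  // Blocks until the callback fires or the timeout elapses.
  // Returns "TimedOut" if the callback was not invoked in time.
  Status WaitFor(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mutex_);
    if (!cond_.wait_for(lock, timeout, [this] { return done_; })) {
      return "TimedOut";
    }
    return status_;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cond_;
  bool done_ = false;
  Status status_;
};

// Hypothetical async operation run on a worker thread. When never_complete
// is true the callback is never invoked, mimicking a leader lookup that
// cannot finish.
Status RunBlocking(bool never_complete, std::chrono::milliseconds timeout) {
  SimpleSynchronizer sync;
  StatusCallback cb = sync.AsStatusCallback();
  std::thread worker([cb, never_complete] {
    if (!never_complete) cb("");  // report OK
  });
  Status result = sync.WaitFor(timeout);
  worker.join();  // the worker must finish before sync is destroyed
  return result;
}
```

Calling RunBlocking(false, ...) returns the OK status as soon as the worker invokes the callback, while RunBlocking(true, ...) only returns because of the timeout added for this sketch; a wait without a bound would hang.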

Thread 13 is stuck in this call:

if (!rpcs_.RegisterAndStart(
        client::GetTransactionStatus(
            std::min(deadline_, TransactionRpcDeadline()),
            nullptr /* tablet */,
            participant_context_.client_future().get(),
            &req,
            std::bind(&Impl::StatusReceived, this, _1, _2, request_size)),
        &handle_)) {
  Complete(STATUS(Aborted, "Aborted because cannot start RPC"));
}

More specifically, it is blocked in participant_context_.client_future().get(), waiting for the client future to be set.

But the client future only gets set here:

void AsyncClientInitialiser::InitClient() {
  LOG(INFO) << "Starting to init ybclient";
  while (!stopping_) {
    auto result = client_builder_.Build(messenger_);
    if (result.ok()) {
      LOG(INFO) << "Successfully built ybclient";
      client_holder_.reset(result->release());
      client_promise_.set_value(client_holder_.get());
      return;
    }

    LOG(ERROR) << "Failed to initialize client: " << result.status();
    std::this_thread::sleep_for(1s);
  }

  client_promise_.set_value(nullptr);
}

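That happens only after SetMasterServerProxy has returned (it is called from client_builder_.Build, as the Thread 20 trace shows). But no leader election will happen, because the election is waiting for the peer to be started (TabletPeer::Start), which won't happen until TabletPeer initialization (TabletPeer::InitTabletPeer in the Thread 13 trace) returns.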

So, in other words: TransactionStatusResolver is waiting for the client future to be set; the client future won't be set until an election happens; an election cannot happen because TabletPeer must be initialized first; and TabletPeer cannot be initialized because the thread is stuck in TabletPeer::InitTabletPeer, waiting for TransactionStatusResolver to return.
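
The cycle can be reproduced in miniature with a std::promise and std::future. This is only a sketch under stated assumptions: InitSucceeds, peer_started and the integer "client" are hypothetical stand-ins, and the wait is bounded by a timeout so the example returns instead of hanging as the real process does.

```cpp
#include <atomic>
#include <chrono>
#include <future>
#include <thread>

// Returns true if the tablet-init path obtained the client within `timeout`.
// start_peer_before_wait == false reproduces the ordering described above:
// the init path waits for the client before the peer is started.
bool InitSucceeds(bool start_peer_before_wait,
                  std::chrono::milliseconds timeout) {
  std::promise<int> client_promise;  // stands in for client_promise_
  std::shared_future<int> client_future = client_promise.get_future().share();
  std::atomic<bool> peer_started{false};  // "an election can now happen"
  std::atomic<bool> stopping{false};

  // Client-initializer thread: it can only produce a client once the peer
  // has been started, because only then can a leader exist.
  std::thread client_init([&] {
    while (!stopping.load()) {
      if (peer_started.load()) {
        client_promise.set_value(42);
        return;
      }
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  });

  // Tablet-init path, run inline on the calling thread.
  if (start_peer_before_wait) {
    peer_started.store(true);
  }
  // The real code calls get() with no bound; wait_for keeps the sketch finite.
  bool got_client =
      client_future.wait_for(timeout) == std::future_status::ready;

  stopping.store(true);
  client_init.join();
  return got_client;
}
```

With start_peer_before_wait set to true the future becomes ready almost immediately; with false, the wait can only end through the timeout, because the thread that would start the peer is the same thread that is waiting.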

@hectorgcr hectorgcr added kind/bug This issue is a bug area/docdb YugabyteDB core features labels May 6, 2020
spolitov added a commit that referenced this issue May 14, 2020
Summary:
When a transactional tablet is started, it loads active transactions and resolves their status.
Status resolution is initiated after the transaction participant is started and all transactions are loaded.
When the number of active transactions is small, they are loaded before the transaction participant starts.
In this case status resolution is initiated from TransactionParticipant::Start, which is called by Tablet::Start.

But status resolution requires a client, so it waits until the client is able to resolve the master leader.
So when the above scenario happens with the master tablet, we end up with a deadlock:
tablet start waits for client construction, but the client cannot complete its construction because that requires a master leader.

Test Plan: ybd --gtest_filter PgMiniTest.DDLWithRestart

Reviewers: hector, bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D8450
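
The commit summary's point that status resolution runs on whichever path finishes last (loading or participant start) can be sketched with two atomic flags, loosely following the flag_to_set / flag_to_check arguments visible in frame #11 of the Thread 13 trace. This is an illustration only: ParticipantSketch, its members and ResolvedOn are hypothetical names and not the code in transaction_participant.cc.

```cpp
#include <atomic>
#include <string>

// Hypothetical sketch: the resolution step runs on whichever of the two
// paths (loading, start) completes second.
class ParticipantSketch {
 public:
  // Called by the loader once all transactions are loaded.
  void OnLoaded() { TryStartResolution(&loaded_, &started_, "loader"); }

  // Called from the tablet start path.
  void OnStart() { TryStartResolution(&started_, &loaded_, "start"); }

  // Name of the path that ran resolution, or "" if it has not run yet.
  std::string resolved_on() const { return resolved_on_; }

 private:
  void TryStartResolution(std::atomic<bool>* flag_to_set,
                          std::atomic<bool>* flag_to_check,
                          const char* caller) {
    flag_to_set->store(true);
    if (!flag_to_check->load()) return;  // the other path is not done yet
    if (resolution_claimed_.exchange(true)) return;  // other path claimed it
    // In the scenario from this issue, the blocking wait on the client
    // would happen here, on the caller's thread.
    resolved_on_ = caller;
  }

  std::atomic<bool> loaded_{false};
  std::atomic<bool> started_{false};
  std::atomic<bool> resolution_claimed_{false};
  std::string resolved_on_;
};

// Helper for trying both orderings sequentially.
std::string ResolvedOn(bool load_first) {
  ParticipantSketch p;
  if (load_first) {
    p.OnLoaded();
    p.OnStart();
  } else {
    p.OnStart();
    p.OnLoaded();
  }
  return p.resolved_on();
}
```

When loading finishes first, as with a small number of active transactions, the resolution step lands on the start path, which is the thread that must return before the peer can be started.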