
Peer::Close() can get stuck waiting for the pending request semaphore #350

Closed
mbautin opened this issue Jun 30, 2018 · 1 comment
Labels: kind/bug

mbautin commented Jun 30, 2018

The following error was observed on a client request:

GetTableLocations(system_redis.redis, hash_code: NaN, 5) failed: timed out after deadline expired: GetTableLocations(system_redis.redis, hash_code: NaN, 5) passed its deadline 7313896.654s (now: 7313897.511s): Remote error (yb/rpc/outbound_call.cc:376): Service unavailable (yb/rpc/service_pool.cc:170): GetTableLocations request on yb.master.MasterService from ip_address:53662 dropped due to backpressure. The service queue is full, it has 1000 items.

The master in question got into a state where it was not able to process any MasterService or Consensus requests.

List of master threads: https://gist.githubusercontent.com/mbautin/9b018f4ea37ee31aa80fed1b7f9f947f/raw

Looking at the thread dump of one of the threads waiting for the Raft state lock, we can identify the thread holding that lock from the mutex's __owner field:

```
(gdb) where
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f409cd146e3 in __GI___pthread_mutex_lock (mutex=0x183a4f0) at ../nptl/pthread_mutex_lock.c:80
#2  0x00007f40a5b97b7e in __gthread_mutex_lock (__mutex=0x183a4f0) at /n/jenkins/linuxbrew/linuxbrew_2018-03-16T16_38_10/Cellar/gcc/5.5.0/include/c++/5.5.0/x86_64-unknown-linux-gnu/bits/gthr-default.h:748
#3  lock (this=0x183a4f0) at /n/jenkins/linuxbrew/linuxbrew_2018-03-16T16_38_10/Cellar/gcc/5.5.0/include/c++/5.5.0/mutex:135
#4  lock (this=<synthetic pointer>) at /n/jenkins/linuxbrew/linuxbrew_2018-03-16T16_38_10/Cellar/gcc/5.5.0/include/c++/5.5.0/mutex:485
#5  unique_lock (__m=..., this=<synthetic pointer>) at /n/jenkins/linuxbrew/linuxbrew_2018-03-16T16_38_10/Cellar/gcc/5.5.0/include/c++/5.5.0/mutex:415
#6  yb::consensus::ReplicaState::LockForRead (this=0x183a4e0, lock=lock@entry=0x7f4098a933f0) at ../../../../../src/yb/consensus/replica_state.cc:142
#7  0x00007f40a5b7dbac in yb::consensus::RaftConsensus::GetLastOpId (this=0x183a1b0, type=yb::consensus::COMMITTED_OPID, id=0x7f4098a934e0) at ../../../../../src/yb/consensus/raft_consensus.cc:2689
#8  0x00007f40a739d712 in yb::tablet::TabletPeer::GetEarliestNeededLogIndex (this=this@entry=0x183a000, min_index=min_index@entry=0x7f4098a93560) at ../../../../../src/yb/tablet/tablet_peer.cc:619
#9  0x00007f40a739d9c8 in yb::tablet::TabletPeer::GetGCableDataSize (this=0x183a000, retention_size=retention_size@entry=0x7f4098a935a0) at ../../../../../src/yb/tablet/tablet_peer.cc:641
#10 0x00007f40a73958d4 in yb::tablet::LogGCOp::UpdateStats (this=0x2c92e00, stats=0x1a15d88) at ../../../../../src/yb/tablet/tablet_peer_mm_ops.cc:80
#11 0x00007f40a73c3ff6 in yb::MaintenanceManager::FindBestOp (this=this@entry=0x121e100) at ../../../../../src/yb/tablet/maintenance_manager.cc:290
#12 0x00007f40a73c5d92 in yb::MaintenanceManager::RunSchedulerThread (this=0x121e100) at ../../../../../src/yb/tablet/maintenance_manager.cc:216
#13 0x00007f40a1551156 in operator() (this=0x14d0438) at /n/jenkins/linuxbrew/linuxbrew_2018-03-16T16_38_10/Cellar/gcc/5.5.0/include/c++/5.5.0/functional:2267
#14 yb::Thread::SuperviseThread (arg=<optimized out>) at ../../../../../src/yb/util/thread.cc:602
#15 0x00007f409cd12694 in start_thread (arg=0x7f4098a94700) at pthread_create.c:333
#16 0x00007f409ca543cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) frame 4
#4  lock (this=<synthetic pointer>) at /n/jenkins/linuxbrew/linuxbrew_2018-03-16T16_38_10/Cellar/gcc/5.5.0/include/c++/5.5.0/mutex:485
485     /n/jenkins/linuxbrew/linuxbrew_2018-03-16T16_38_10/Cellar/gcc/5.5.0/include/c++/5.5.0/mutex: No such file or directory.
(gdb) locals
Undefined command: "locals".  Try "help".
(gdb) local
Undefined command: "local".  Try "help".
(gdb) info locals
No locals.
(gdb) frame 3
#3  lock (this=0x183a4f0) at /n/jenkins/linuxbrew/linuxbrew_2018-03-16T16_38_10/Cellar/gcc/5.5.0/include/c++/5.5.0/mutex:135
135     in /n/jenkins/linuxbrew/linuxbrew_2018-03-16T16_38_10/Cellar/gcc/5.5.0/include/c++/5.5.0/mutex
(gdb) info locals
No locals.
(gdb) print _M_mutex
$1 = {__data = {__lock = 2, __count = 0, __owner = 27886, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000\356l\000\000\001", '\000' <repeats 26 times>, __align = 2}
```

The second thread is executing Peer::Close() and is waiting for the semaphore:

```
#0  0x00007f409cd1918a in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x18e15b0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=sem@entry=0x18e15b0, abstime=0x0) at sem_waitcommon.c:111
#2  0x00007f409cd19231 in __new_sem_wait_slow (sem=0x18e15b0, abstime=0x0) at sem_waitcommon.c:181
#3  0x00007f40a1470e92 in yb::Semaphore::Acquire (this=this@entry=0x18e15b0) at ../../../../../src/yb/util/semaphore.cc:55
#4  0x00007f40a5b4a565 in lock (this=0x18e15b0) at ../../../../../src/yb/util/semaphore.h:70
#5  lock_guard (__m=..., this=<synthetic pointer>) at /n/jenkins/linuxbrew/linuxbrew_2018-03-16T16_38_10/Cellar/gcc/5.5.0/include/c++/5.5.0/mutex:386
#6  yb::consensus::Peer::Close (this=0x18e1400) at ../../../../../src/yb/consensus/consensus_peers.cc:393
#7  0x00007f40a5b73159 in yb::consensus::PeerManager::Close (this=0x2cd4480) at ../../../../../src/yb/consensus/peer_manager.cc:128
#8  0x00007f40a5b7c324 in yb::consensus::RaftConsensus::BecomeReplicaUnlocked (this=this@entry=0x183a1b0) at ../../../../../src/yb/consensus/raft_consensus.cc:836
#9  0x00007f40a5b7d185 in yb::consensus::RaftConsensus::HandleTermAdvanceUnlocked (this=this@entry=0x183a1b0, new_term=new_term@entry=6) at ../../../../../src/yb/consensus/raft_consensus.cc:2821
#10 0x00007f40a5b7f6d1 in yb::consensus::RaftConsensus::NotifyTermChange (this=0x183a1b0, term=6) at ../../../../../src/yb/consensus/raft_consensus.cc:1015
#11 0x00007f40a5b57e34 in yb::consensus::PeerMessageQueue::NotifyObserversOfTermChangeTask (this=<optimized out>, term=6) at ../../../../../src/yb/consensus/consensus_queue.cc:1037
#12 0x00007f40a15549c4 in yb::ThreadPool::DispatchThread (this=0x17e2a00, permanent=false) at ../../../../../src/yb/util/threadpool.cc:608
#13 0x00007f40a1551156 in operator() (this=0x14d1f08) at /n/jenkins/linuxbrew/linuxbrew_2018-03-16T16_38_10/Cellar/gcc/5.5.0/include/c++/5.5.0/functional:2267
#14 yb::Thread::SuperviseThread (arg=<optimized out>) at ../../../../../src/yb/util/thread.cc:602
#15 0x00007f409cd12694 in start_thread (arg=0x7f40903a2700) at pthread_create.c:333
#16 0x00007f409ca543cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
```

This semaphore should be released when we receive a response from the peer, even if that response is a network error or a timeout. The deadline by which the response should have been received has long passed. So either we failed to release the semaphore, or we never handled the response because it got stuck in one of the queues.

One possibility is that the thread running Peer::Close() is doing so on a Raft thread pool token, and that same token is also needed to handle the response in Peer::ProcessResponse(). That is where the semaphore would be released, but we never get to that point.
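
Below is a minimal, self-contained sketch of how such a deadlock could arise, assuming a serial thread pool token on which both Close() and the response handler are queued. All names are illustrative (this is not the actual YugabyteDB code), and std::binary_semaphore (C++20) stands in for yb::Semaphore:

```cpp
// Illustrative only: a serial token's tasks run strictly one at a
// time, so if Close() blocks on the pending-request semaphore, the
// queued response handler that would release it can never run.
#include <functional>
#include <iostream>
#include <semaphore>  // C++20
#include <vector>

int main() {
  // Stand-in for the Peer's pending-request semaphore: taken when a
  // request is sent, released when its response is processed.
  std::binary_semaphore pending_request(1);
  pending_request.acquire();  // A request is currently in flight.

  // Tasks queued on one serial thread pool token execute in order.
  std::vector<std::function<void()>> serial_token_queue = {
      [&] {  // Stand-in for Peer::Close()
        std::cout << "Close(): waiting for pending request semaphore\n";
        pending_request.acquire();  // Blocks forever: the releasing task
                                    // below is stuck behind this one.
        std::cout << "Close(): acquired\n";
      },
      [&] {  // Stand-in for Peer::ProcessResponse()
        std::cout << "ProcessResponse(): releasing semaphore\n";
        pending_request.release();
      },
  };

  for (auto& task : serial_token_queue) {
    task();  // Hangs on the first task and never reaches the second.
  }
}
```

Because the token is serial, the second task can only start after the first one returns, which never happens.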

mbautin self-assigned this Jun 30, 2018
kmuthukk added the kind/bug label Jun 30, 2018
yugabyte-ci pushed a commit that referenced this issue Jul 7, 2018
Summary: This is a port from KUDU-699 (apache/kudu@a32ea34).

Test Plan:
New unit test.

Reviewers: sergei, mikhail

Reviewed By: mikhail

Subscribers: kannan, ybase, bharat

Differential Revision: https://phabricator.dev.yugabyte.com/D5095
hectorgcr assigned hectorgcr and unassigned mbautin Jul 7, 2018
yugabyte-ci pushed a commit that referenced this issue Jul 9, 2018
Summary:
This reverts commit 4f445a3.

We saw a huge performance degradation with this commit. Reverting for now until we find a better solution.

Test Plan: Verified that code builds correctly

Reviewers: mikhail, sergei, kannan

Reviewed By: kannan

Subscribers: ybase, bharat

Differential Revision: https://phabricator.dev.yugabyte.com/D5124
hectorgcr (Contributor) commented

We found a performance regression with commit 4f445a3. Reverting it and reopening the issue.

rao-vasireddy reopened this Jul 9, 2018
mbautin pushed a commit that referenced this issue Jul 13, 2018
Summary: This is a port from KUDU-699 (apache/kudu@a32ea34).

Test Plan:
New unit test.

Reviewers: sergei, mikhail

Reviewed By: mikhail

Subscribers: kannan, ybase, bharat

Differential Revision: https://phabricator.dev.yugabyte.com/D5095
yugabyte-ci pushed a commit that referenced this issue Jul 14, 2018
Summary:
We were submitting a call to the Raft thread pool to invoke SendNextRequest even when a request was already pending, which is unnecessary.

This is a followup commit for github issue #350.
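
A hypothetical sketch of the idea behind this optimization (class and member names such as request_pending_ are illustrative, not the actual commit): skip the thread pool submission when a request is already outstanding and let the response path drive the next send.

```cpp
#include <atomic>
#include <functional>
#include <iostream>

// Illustrative Peer-like class: SignalRequest() only schedules a send
// on the Raft thread pool when no request is currently in flight.
class PeerSketch {
 public:
  // Called whenever there is new data to replicate to this peer.
  void SignalRequest() {
    // exchange() returns the previous value: if a request was already
    // pending, the in-flight response path will send the next one.
    if (request_pending_.exchange(true)) {
      return;
    }
    SubmitToRaftPool([this] { SendNextRequest(); });
  }

  // Called from the RPC callback when the outstanding response arrives.
  void ProcessResponse() {
    request_pending_.store(false);
    // ... handle the response; possibly send the next request ...
  }

 private:
  void SendNextRequest() { std::cout << "sending UpdateConsensus RPC\n"; }

  // Stand-in for submitting a task to the Raft thread pool.
  void SubmitToRaftPool(const std::function<void()>& task) { task(); }

  std::atomic<bool> request_pending_{false};
};

int main() {
  PeerSketch peer;
  peer.SignalRequest();    // Schedules a send.
  peer.SignalRequest();    // No-op: a request is already pending.
  peer.ProcessResponse();  // Response handled.
  peer.SignalRequest();    // Schedules the next send.
}
```

Using an atomic exchange keeps the check-and-set race-free without having to take the Peer's lock.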

Test Plan:
Ran
```
/bin/yb-ctl destroy; sleep 2; ./bin/yb-ctl start; sleep 10; time java -jar java/yb-loadtester/target/yb-sample-apps.jar -workload CassandraKeyValue -num_threads_read 0 -num_threads_write 128 -nodes 127.0.0.1:9042,127.0.0.2:9042,127.0.0.3:9042 --num_writes 5000000
```
10 times, execution time went from 190s to 175s.

Reviewers: pritam.damania, bogdan, mikhail

Reviewed By: mikhail

Subscribers: ybase, bharat

Differential Revision: https://phabricator.dev.yugabyte.com/D5152
yugabyte-ci pushed a commit that referenced this issue Jul 16, 2018
Summary:
Revert "Improve performance when sending peer's requests (#350)"
This reverts commit 5a062dd54a568c29903f2262e413d5f66c0f7cb8.

Revert "ENG-3422: #350 Remove semaphore from Peer class"
This reverts commit ebd5508b07dd541d48c425977cf4f56621b44357.

Test Plan: Verify that it builds

Reviewers: sergei, bogdan, mikhail

Reviewed By: mikhail

Subscribers: bharat, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D5159
yugabyte-ci pushed a commit that referenced this issue Jul 16, 2018
Summary: With this change we avoid blocking in Peer::Close() by having the Peer create a reference to itself when it cannot acquire the semaphore. This guarantees that the object stays alive until the pending request has been processed.

Test Plan:
Unit tests.
Java workloads

Reviewers: sergei, bogdan, mikhail

Reviewed By: mikhail

Subscribers: kannan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D5158
yugabyte-ci pushed a commit that referenced this issue Aug 3, 2018
Summary:
The Peer class uses a semaphore to enforce that there is only one outstanding request. Peer::Close() waits to acquire this semaphore before cleaning up its resources; by waiting on it, we guarantee that we won't deallocate resources while another thread is using them. But in some clusters we have found that we sometimes wait on this semaphore forever, most likely because we are waiting for a request response that never arrives and never times out.

Our first attempt was to port apache/kudu@a32ea34, but it caused a huge performance degradation (~30%) in write-only workloads. Our TSAN tests also found that the fix is not correct: Peer::Close() could finish and the Peer object could be destroyed before proxy_->UpdateAsync() is called, since the lock has already been released.

With this change, Peer::Close() marks the state of the peer as closed and then waits for the semaphore only when the Peer is not waiting for a request response. This does not block indefinitely, because as soon as the thread processing requests sees that the Peer has been closed, it refrains from sending more requests.

If Peer::Close() gets called while we are waiting for a response (I have added waiting_for_response_ to track this), then we avoid waiting on the semaphore and free most of the resources, since they are not being used by another thread. Before exiting Peer::Close() we create an extra reference to the Peer object by calling shared_from_this() and storing the result in shared_from_this_. This guarantees that the object stays alive until we process the response we are waiting for. As soon as Peer::ProcessResponse() or Peer::ProcessRemoteBootstrapResponse() sees that the Peer has been closed, it erases the extra reference stored in shared_from_this_ and returns.

For correctness, we hold the spinlock while calling proxy_->UpdateAsync(), but this means we now have to make that call on a different thread, since UpdateAsync() is not truly asynchronous: in some cases it can invoke Peer::ProcessResponse() in the same call stack, which would deadlock because ProcessResponse() acquires the same lock that is held while calling proxy_->UpdateAsync().
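
A condensed, hypothetical sketch of this scheme follows. The members waiting_for_response_ and shared_from_this_ follow the names above; everything else (including the class name) is illustrative, and the semaphore path and the deferred proxy_->UpdateAsync() call are omitted for brevity.

```cpp
#include <iostream>
#include <memory>
#include <mutex>

class PeerSketch : public std::enable_shared_from_this<PeerSketch> {
 public:
  // A request goes out; a response (or error/timeout) is now expected.
  void SendRequest() {
    std::lock_guard<std::mutex> l(lock_);
    if (closed_) return;  // Closed peers stop issuing requests.
    waiting_for_response_ = true;
    // The real code issues proxy_->UpdateAsync() from another thread
    // while this lock is held; omitted here.
  }

  void Close() {
    std::lock_guard<std::mutex> l(lock_);
    closed_ = true;
    if (waiting_for_response_) {
      // Do not block: park a self-reference so the object outlives the
      // in-flight RPC, then return immediately.
      shared_from_this_ = shared_from_this();
      return;
    }
    ReleaseResourcesUnlocked();
  }

  // Invoked by the RPC layer when the outstanding response arrives.
  void ProcessResponse() {
    std::shared_ptr<PeerSketch> keep_alive;
    {
      std::lock_guard<std::mutex> l(lock_);
      waiting_for_response_ = false;
      if (closed_) {
        ReleaseResourcesUnlocked();
        // Move the keep-alive reference out; once it and the lock go
        // out of scope, the Peer may be destroyed.
        keep_alive = std::move(shared_from_this_);
        return;
      }
      // ... otherwise handle the response, maybe send the next request ...
    }
  }

 private:
  void ReleaseResourcesUnlocked() { std::cout << "resources released\n"; }

  std::mutex lock_;
  bool closed_ = false;
  bool waiting_for_response_ = false;
  std::shared_ptr<PeerSketch> shared_from_this_;
};

int main() {
  auto peer = std::make_shared<PeerSketch>();
  PeerSketch* rpc_callback_target = peer.get();  // Held by the in-flight RPC.

  peer->SendRequest();  // A request is now outstanding.
  peer->Close();        // Returns right away; keep-alive reference stored.
  peer.reset();         // Caller drops its reference; the Peer stays alive.

  rpc_callback_target->ProcessResponse();  // Frees resources; the Peer is
                                           // destroyed with the last reference.
}
```

The key property is that Close() never blocks while an RPC is outstanding: ownership is parked in the Peer itself and handed to the response path, which is the only place the object can safely be torn down.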

Test Plan:
Unit tests.
Java workloads

Reviewers: sergei, bogdan, mikhail

Reviewed By: mikhail

Subscribers: kannan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D5158