Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: merges can deadlock #27442

Closed
benesch opened this issue Jul 12, 2018 · 0 comments
Closed

storage: merges can deadlock #27442

benesch opened this issue Jul 12, 2018 · 0 comments
Assignees
Labels
A-kv-distribution Relating to rebalancing and leasing.

Comments

@benesch
Copy link
Contributor

benesch commented Jul 12, 2018

Consider the following sequence of events:

  1. LHS initiates merge with RHS.
  2. RHS loses its lease.
  3. RHS reacquires its lease, notices merge is in progress, and installs a mergeCompleteCh.
  4. User sends a request to RHS. Let's call it a Get. The request blocks on mergeCompleteCh.
  5. LHS sends GetSnapshotForMerge to RHS. GetSnapshotForMerge gets stuck in the command queue waiting for the Get to succeed.
  6. Deadlock!

@bdarnell has suggested the following solution. Instead of blocking on the mergeCompleteCh in the replica, return a MergeInProgressError. Handle this error in the store by reaching into the replica for its mergeCompleteCh, and block there. Once the channel closes, retry the request. This neatly avoids deadlock by allowing blocked commands to exit the command queue without requiring the caller to retry them.

@benesch benesch added the A-kv-distribution Relating to rebalancing and leasing. label Jul 12, 2018
@benesch benesch self-assigned this Jul 12, 2018
benesch added a commit to benesch/cockroach that referenced this issue Jul 12, 2018
The chaos added in 2ed1515 to force lease renewals while a merge is in
progress is too disruptive and makes the test occasionally time out. See
issue cockroachdb#27442 for an analysis. Skip the test for now.

Release note: None
craig bot pushed a commit that referenced this issue Jul 12, 2018
27441: storage: introduce a min idle time in the replica scanner r=spencerkimball a=spencerkimball

This prevents busy looping in the scanner in the event there are more
replicas on a store than can be processed in the scanner target
interval (default = `10s`). This change also adjusts the default max
idle time to `1s`; the previous value of `100ms` was unnecessarily
short.

Release note (performance improvement): on stores with 100s of
thousands of replicas, this limits the scanner from running incessantly.

27449: storage: skip TestStoreRangeMergeWithData again r=bdarnell,a-robinson a=benesch

The chaos added in 2ed1515 to force lease renewals while a merge is in
progress is too disruptive and makes the test occasionally time out. See
issue #27442 for an analysis. Skip the test for now.

Release note: None

Co-authored-by: Spencer Kimball <spencer.kimball@gmail.com>
Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
benesch added a commit to benesch/cockroach that referenced this issue Jul 13, 2018
Merges could deadlock if the lease on the right-hand range was
transferred or renewed during the merge transaction and the new
leaseholder received any request before the GetSnapshotForMerge request.
See cockroachdb#27442 and the comments within for details. Avoid the deadlock by
moving the responsibility for waiting for a merge to complete from the
replica to the store.

Add a unit test that runs several get requests concurrently with several
merges to reliably test for regressions. The problem previously only
presented when stressing TestStoreRangeMergeWithData.

Fix cockroachdb#27442.

Release note: None
benesch added a commit to benesch/cockroach that referenced this issue Jul 15, 2018
Merges could deadlock if the lease on the right-hand range was
transferred or renewed during the merge transaction and the new
leaseholder received any request before the GetSnapshotForMerge request.
See cockroachdb#27442 and the comments within for details. Avoid the deadlock by
moving the responsibility for waiting for a merge to complete from the
replica to the store.

Add a unit test that runs several get requests concurrently with several
merges to reliably test for regressions. The problem previously only
presented when stressing TestStoreRangeMergeWithData.

Fix cockroachdb#27442.

Release note: None
craig bot pushed a commit that referenced this issue Jul 16, 2018
27454: storage: avoid deadlock during merges r=bdarnell a=benesch

Merges could deadlock if the lease on the right-hand range was
transferred or renewed during the merge transaction and the new
leaseholder received any request before the GetSnapshotForMerge request.
See #27442 and the comments within for details. Avoid the deadlock by
moving the responsibility for waiting for a merge to complete from the
replica to the store.

Add a unit test that runs several get requests concurrently with several
merges to reliably test for regressions. The problem previously only
presented when stressing TestStoreRangeMergeWithData.

Fix #27442.
Fix #27454.

Release note: None

27519: lint: speed up TestCrlfmt r=tschottdorf a=benesch

Pass the -fast flag to crlfmt, which instructs it to skip running
goimports/gofmt. It takes over 90s to run crlfmt without the -fast flag;
with the flag, it takes only 2s.

Note that this patch does not reduce lint coverage. The linter already
runs gofmt in a separate test. Goimports does not enforce any particular
ordering on the imports, beyond the sorting that gofmt mandates, so
there's nothing to be lost by skipping it.

Release note: None

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
@craig craig bot closed this as completed in #27454 Jul 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-distribution Relating to rebalancing and leasing.
Projects
None yet
Development

No branches or pull requests

1 participant