Fix deadlock in InboundLedgers and NetworkOPs #5124

Bronek · 2024-09-09T10:01:12Z

High Level Overview of Change

Fixes a deadlock bug in 2.2.2 release.

Context of Change

Functions acquireAsync and NetworkOPsImp::recvValidation contains lock.lock() which will deadlock if an exception is thrown while the (non-recursive) mutex is owned by the calling thread, that is before lock.unlock()

In 2.2.2 release we switched some operations from synchronous to asynchronous, guarded by mutexes, but failed to account for exception safety of the critical section, which may result in a deadlock if an exception is thrown before lock.unlock() inside the try block. It is extremely unlikely that a deadlock will actually occur in practice, since the possible exceptions in this section of code are scarce.

The solution adopted in this PR is to move ScopedUnlock from LedgerMaster.cpp to basics/scope.h (adjusting casing to match other utilities in this file, to scope_unlock) and to use this type to create a RAII unlock, removing the problematic lock.lock().

An alternative solution might be to abstract away the "pending operation" with the corresponding mutex, and then use it where appropriate, but I wanted this change to generalise an existing lower-level utility (i.e. scoped unlock) rather than write a new one. The "pending operation" approach could be further generalised into structured asynchronous operations, which would place it in a much larger PR.

Type of Change

Bug fix (non-breaking change which fixes an issue)

codecov · 2024-09-09T12:36:39Z

Codecov Report

Attention: Patch coverage is 80.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 77.9%. Comparing base (c5c0e70) to head (b19505a).
Report is 2 commits behind head on develop.

Files with missing lines	Patch %	Lines
src/xrpld/app/misc/NetworkOPs.cpp	0.0%	2 Missing ⚠️
src/xrpld/app/ledger/detail/LedgerMaster.cpp	75.0%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff            @@
##           develop   #5124     +/-   ##
=========================================
+ Coverage     77.7%   77.9%   +0.2%     
=========================================
  Files          779     782      +3     
  Lines        66015   66614    +599     
  Branches      8156    8140     -16     
=========================================
+ Hits         51261   51887    +626     
+ Misses       14754   14727     -27

Files with missing lines	Coverage Δ
include/xrpl/basics/scope.h	`100.0% <100.0%> (ø)`
src/xrpld/app/ledger/detail/InboundLedgers.cpp	`38.6% <100.0%> (ø)`
src/xrpld/app/ledger/detail/LedgerMaster.cpp	`43.9% <75.0%> (-0.4%)`	⬇️
src/xrpld/app/misc/NetworkOPs.cpp	`70.1% <0.0%> (+<0.1%)`	⬆️

... and 12 files with indirect coverage changes

thejohnfreeman

Is it just me, or do these changes not actually change any behavior?

In InboundLedgers::acquireAsync(), we have this sequence before:

142: unlock().
143: call acquire(). If it does not throw, continue to line 157. If it does throw, one of the exception handlers is entered (no exception can skip both handlers), neither throws (they just log warnings), and execution continues to line 157. Either way, we end up at line 157.
157: lock().

Now, I prefer a RAII type like scope_unlock, but I think the only difference in behavior it produces here is that the call to lock() is moved before the exception handler is entered (if one is entered). But that is not a material difference. Is that right?

Very similar story for NetworkOPsImp::recvValidation.

include/xrpl/basics/scope.h

Review feedback Co-authored-by: John Freeman <jfreeman08@gmail.com>

Bronek · 2024-09-11T11:20:37Z

Is it just me, or do these changes not actually change any behavior?

In InboundLedgers::acquireAsync(), we have this sequence before:

142: unlock().

143: call acquire(). If it does not throw, continue to line 157. If it does throw, one of the exception handlers is entered (no exception can skip both handlers), neither throws (they just log warnings), and execution continues to line 157. Either way, we end up at line 157.

157: lock().

The change in behaviour is if we have an exception before unlock() at the line 142, e.g. inside insert(). If this happens, then the lock() outside of the try-catch block will deadlock (because the mutex is already owned, by the very same thread).

Now, I prefer a RAII type like scope_unlock, but I think the only difference in behavior it produces here is that the call to lock() is moved before the exception handler is entered (if one is entered). But that is not a material difference. Is that right?

Very similar story for NetworkOPsImp::recvValidation.

Yes, same applies - there's an insert() before unlock() and if it throws, then we will deadlock.

The RAII unlock_scope fixes both possible deadlocks, because it removes the problematic lock() below the try-catch blocks.

ximinez

This may be redundant with the CanProcess class in #5126. I'm biased, but I prefer my solution because it hides all the locking in a single class. I'm open to being convinced otherwise, though.

Bronek · 2024-09-13T11:57:23Z

This may be redundant with the CanProcess class in #5126. I'm biased, but I prefer my solution because it hides all the locking in a single class. I'm open to being convinced otherwise, though.

you are right that your PR solves this problem, however given the number of other issues that your PR solves, and the associated code churn, it might be a little until it is tested and approved. In the meantime, if we get this merged and, some time after, your PR merged as well, we will eventually end up with unlock_scope moved from LedgerMaster.cpp to scope.h which is in itself (I think) an improvement.

ximinez

LGTM

Edit: I left one small suggested change to a comment, but whether you take it or not, this is good to go.

ximinez · 2024-11-06T18:28:58Z

include/xrpl/basics/scope.h

+        }  // mut gets locked here.
+
+    }  // mut gets unlocked here
+    @endcode


Just to drive the point home:

Suggested change

} // mut gets locked here.

} // mut gets unlocked here

@endcode

} // mut gets locked here.

... do some more stuff with it locked ...

} // mut gets unlocked here

@endcode

Bronek requested review from mtrippled and ximinez September 9, 2024 10:07

Bronek marked this pull request as ready for review September 9, 2024 10:08

Bronek marked this pull request as draft September 9, 2024 10:49

Bronek added 2 commits September 9, 2024 13:11

Fix deadlock in InboundLedgers and NetworkOPs

c1ad525

Replace ScopedUnlock with scope_unlock

b68f842

Bronek force-pushed the bugfix/deadlock_fix branch from b07ab0f to b68f842 Compare September 9, 2024 12:11

Bronek marked this pull request as ready for review September 9, 2024 12:13

thejohnfreeman requested changes Sep 11, 2024

View reviewed changes

include/xrpl/basics/scope.h Outdated Show resolved Hide resolved

Update include/xrpl/basics/scope.h

8a7a883

Review feedback Co-authored-by: John Freeman <jfreeman08@gmail.com>

thejohnfreeman approved these changes Sep 11, 2024

View reviewed changes

ximinez requested changes Sep 12, 2024

View reviewed changes

Bronek and others added 2 commits November 5, 2024 19:34

Merge branch 'develop' into bugfix/deadlock_fix

d676b45

Merge branch 'develop' into bugfix/deadlock_fix

a5f08fb

ximinez approved these changes Nov 6, 2024

View reviewed changes

Merge branch 'develop' into bugfix/deadlock_fix

b19505a

ximinez added the Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required. label Nov 6, 2024

ximinez merged commit 9e48fc0 into XRPLF:develop Nov 6, 2024
18 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix deadlock in InboundLedgers and NetworkOPs #5124

Fix deadlock in InboundLedgers and NetworkOPs #5124

Bronek commented Sep 9, 2024 •

edited

Loading

codecov bot commented Sep 9, 2024 •

edited

Loading

thejohnfreeman left a comment

Bronek commented Sep 11, 2024 •

edited

Loading

ximinez left a comment

Bronek commented Sep 13, 2024 •

edited

Loading

ximinez left a comment •

edited

Loading

ximinez Nov 6, 2024

Fix deadlock in InboundLedgers and NetworkOPs #5124

Fix deadlock in InboundLedgers and NetworkOPs #5124

Conversation

Bronek commented Sep 9, 2024 • edited Loading

High Level Overview of Change

Context of Change

Type of Change

codecov bot commented Sep 9, 2024 • edited Loading

Codecov Report

thejohnfreeman left a comment

Choose a reason for hiding this comment

Bronek commented Sep 11, 2024 • edited Loading

ximinez left a comment

Choose a reason for hiding this comment

Bronek commented Sep 13, 2024 • edited Loading

ximinez left a comment • edited Loading

Choose a reason for hiding this comment

ximinez Nov 6, 2024

Choose a reason for hiding this comment

Bronek commented Sep 9, 2024 •

edited

Loading

codecov bot commented Sep 9, 2024 •

edited

Loading

Bronek commented Sep 11, 2024 •

edited

Loading

Bronek commented Sep 13, 2024 •

edited

Loading

ximinez left a comment •

edited

Loading