refactor: Change recursive_mutex to mutex in DatabaseRotatingImp #5276
Conversation
- Follow-up to #4989, which stated "Ideally, the code should be rewritten so it doesn't hold the mutex during the callback and the mutex should be changed back to a regular mutex."
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
## develop #5276 +/- ##
=======================================
Coverage 78.1% 78.2%
=======================================
Files 790 790
Lines 67646 67653 +7
Branches 8165 8160 -5
=======================================
+ Hits 52863 52881 +18
+ Misses 14783 14772 -11
* Use a second mutex to protect the backends from modification
* Remove a bunch of warning comments
// backendMutex_ is only needed when the *Backend_ members are modified.
// Reads are protected by the general mutex_.
std::mutex backendMutex_;
As this sounds like a typical single-writer, one-or-more-readers scenario, is it possible to use a single shared_mutex here instead of these two mutexes?
It's possible, but there are risks. The biggest one is that I'd have to take a shared_lock at the start of rotateWithLock, and upgrade it to a unique_lock after the callback. If there is somehow ever a second caller to that function, or even a different caller that upgrades the lock, there is a potential deadlock.
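A minimal sketch of the hazard being described, assuming a plain std::shared_mutex (illustrative, not the PR's code): the shared lock must be dropped before the unique lock can be taken, which opens a race window, and two callers that instead tried to upgrade while both still holding shared ownership would deadlock.

```
#include <mutex>
#include <shared_mutex>

std::shared_mutex mutex_;

void rotateSketch()
{
    std::shared_lock readLock(mutex_);
    // ... run the callback while holding shared ownership ...
    readLock.unlock(); // std::shared_mutex cannot upgrade; must release first

    // Window: a second caller can acquire the lock here and rotate first.
    std::unique_lock writeLock(mutex_);
    // ... swap the backends under exclusive ownership ...
}
```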
@bthomee @vvysokikh1 Ok, it took waaaaaaay longer than it should have because I kept trying clever things that didn't work or turned out to be unsupported, but I rewrote the locking, changed to a shared mutex, and I think I've got a pretty foolproof solution here. And a unit test to exercise it.
But don't take my word for it. The point of code reviews is to spot the stuff I didn't consider.
I think your solution does not completely solve the issue. It's still technically possible to deadlock: calling rotateWithLock from inside the callback will deadlock on your new mutex.
If it's good enough for now, please add comments to rotateWithLock() warning users against calling rotateWithLock(), directly or indirectly, from the callback.
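A sketch of the sort of warning comment being requested; the wording and the signature are illustrative assumptions, not the PR's actual code:

```
#include <functional>
#include <memory>
#include <string>

struct Backend; // hypothetical stand-in for the node store backend type

// WARNING: Do not call rotateWithLock(), directly or indirectly, from inside
// the callback `f`. The rotation holds mutex_, so a re-entrant call will
// deadlock.
bool rotateWithLock(
    std::function<std::unique_ptr<Backend>(std::string const& writableName)> const& f);
```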
* upstream/develop: Updates Conan dependencies (5256)
Force-pushed 913df26 to 9f564bc
- Rewrite the locking in DatabaseRotatingImp::rotateWithLock to use a shared_lock, and write a unit test to show (as much as possible) that it won't deadlock.
Force-pushed 13fb47c to d912b50
* upstream/develop: fix: Do not allow creating Permissioned Domains if credentials are not enabled (5275) fix: issues in `simulate` RPC (5265)
std::unique_lock writeLock(mutex_);
if (!rotating)
{
    // Once this flag is set, we're committed to doing the work and
    // returning true.
    rotating = true;
}
else
{
    // This should only be reachable through unit tests.
    XRPL_ASSERT(
        unitTest_,
        "ripple::NodeStore::DatabaseRotatingImp::rotateWithLock "
        "unit testing");
    return false;
}
Why do we need to lock the mutex here? I would assume we can make rotating an atomic bool and use compare_exchange to switch this flag safely.
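A minimal sketch of this suggestion, with hypothetical names: compare_exchange lets exactly one caller flip the flag without taking the mutex.

```
#include <atomic>

std::atomic<bool> rotating_{false};

bool tryStartRotation()
{
    bool expected = false;
    // Atomically sets rotating_ to true only if it was false; a concurrent
    // caller sees the exchange fail and can bail out instead of blocking.
    return rotating_.compare_exchange_strong(expected, true);
}
```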
> Why do we need to lock the mutex here? I would assume we can make rotating an atomic bool and use compare_exchange to switch this flag safely.

I got rid of rotating entirely.
auto const writableBackend = [&] {
    std::shared_lock readLock(mutex_);
    XRPL_ASSERT(
        rotating,
        "ripple::NodeStore::DatabaseRotatingImp::rotateWithLock rotating "
        "flag set");

    return writableBackend_;
}();

auto newBackend = f(writableBackend->getName());
I don't think this lambda and read lock are actually required with the current implementation. We are only using the write lock before (which might be switched to an atomic) and after. Assuming the previous synchronization block switches the rotating flag, there should be no other 'write' thread able to proceed and capture writeLock while we are here.
// This should only be reachable through unit tests.
XRPL_ASSERT(
    unitTest_,
    "ripple::NodeStore::DatabaseRotatingImp::rotateWithLock "
    "unit testing");
return false;
I think this comment doesn't work. It can be reached not only by unit tests, but also by an accidental concurrent call to rotateWithLock or an indirect call to it from the callback.
> I think this comment doesn't work. It can be reached not only by unit tests, but also by an accidental concurrent call to rotateWithLock or an indirect call to it from the callback.

It will only be intentionally reachable through tests. An accidental concurrent call is not intentional. Will update.
// "Shared mutexes do not support direct transition from shared to unique | ||
// ownership mode: the shared lock has to be relinquished with | ||
// unlock_shared() before exclusive ownership may be obtained with lock()." | ||
mutable std::shared_timed_mutex mutex_; |
What is the reason for choosing a timed mutex here? I believe shared_mutex would be enough.
> What is the reason for choosing a timed mutex here? I believe shared_mutex would be enough.

No longer relevant, but I had been working with try_lock_for for a while, and missed changing this back.
// Because of the "rotating" flag, there should be no way any other thread
// is waiting at this point. As long as they all release the shared_lock
// before taking the unique_lock (which they have to, because upgrading is
For read/write threads, you can also consider using a boost::upgrade_lock [1] wrapping a boost::upgrade_mutex [2] here instead of std::shared_lock with a std::shared_(timed)_mutex, as it supports upgrading a read lock to a write lock without unlocking in between. Read-only threads can then use boost::shared_lock while wrapping that same boost::upgrade_mutex.
[1] https://www.boost.org/doc/libs/1_86_0/doc/html/thread/synchronization.html#thread.synchronization.locks.upgrade_lock
[2] https://www.boost.org/doc/libs/1_86_0/doc/html/thread/synchronization.html#thread.synchronization.mutex_types.upgrade_mutex
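A sketch of the pattern these links describe, with hypothetical names: upgrade ownership coexists with shared owners until the upgrade, so no unlock/relock gap is needed.

```
#include <boost/thread/locks.hpp>
#include <boost/thread/shared_mutex.hpp>

boost::upgrade_mutex mutex_;

void rotateSketch()
{
    // Shared with readers, but only one thread may hold upgrade ownership.
    boost::upgrade_lock<boost::upgrade_mutex> readLock(mutex_);
    // ... read state, run the callback ...

    // Atomic transition to exclusive ownership, with no unlocked window.
    boost::upgrade_to_unique_lock<boost::upgrade_mutex> writeLock(readLock);
    // ... swap the backends ...
}

void readerSketch()
{
    boost::shared_lock<boost::upgrade_mutex> lock(mutex_); // plain read access
    // ... read state ...
}
```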
Thanks for those links. That's exactly what I was looking for previously, but (obviously) didn't find. I'll update again ASAP.
* upstream/develop: fix: Amendment to add transaction flag checking functionality for Credentials (5250) fix: Omit superfluous setCurrentThreadName call in GRPCServer.cpp (5280)
- Use a boost::upgrade_mutex, which implements a clean read -> write lock upgrade.
boost::upgrade_lock readLock(mutex_, boost::defer_lock);
Nit: I would suggest renaming this from readLock, as this is not actually a read lock but a unique lock that allows other shared_locks. The name is a bit confusing, as it suggests multiple threads can enter, while this is not the case.
> Nit: I would suggest renaming this from readLock, as this is not actually a read lock but a unique lock that allows other shared_locks. The name is a bit confusing, as it suggests multiple threads can enter, while this is not the case.

Done. Also added a comment, since a lot of folks are unfamiliar with this class.
savedState.lastRotated = lastRotated;
state_db_.setState(savedState);

clearCaches(validatedSeq);
clearCaches(validatedSeq) is already called on line 363 - is calling it here again needed?
> clearCaches(validatedSeq) is already called on line 363 - is calling it here again needed?

I believe so, because some time could have passed between the two calls, and this helps clear out any entries that were added in that time.
bool const unitTest_;

// Implements the "UpgradeLockable" concept
// https://www.boost.org/doc/libs/1_86_0/doc/html/thread/synchronization.html#thread.synchronization.mutex_concepts.upgrade_lockable
Suggested change:
- // https://www.boost.org/doc/libs/1_86_0/doc/html/thread/synchronization.html#thread.synchronization.mutex_concepts.upgrade_lockable
+ // https://www.boost.org/doc/libs/1_83_0/doc/html/thread/synchronization.html#thread.synchronization.mutex_concepts.upgrade_lockable

We're currently using 1.83. Keeping it as-is is fine, but this is simply more accurate.
> We're currently using 1.83. Keeping it as-is is fine, but this is simply more accurate.

I updated it to link to the latest release because I don't expect it to change much, and don't want to have to chase version numbers.
Really don't know how to say this without sounding like a complete joy-kill, but 1.87 is the latest...
// This function should be the only one taking any kind of unique/write
// lock, and should only be called once at a time by its synchronous caller.
What are the consequences if the synchronous caller calls it multiple times? Can we protect against that, or at least detect it, and would it make sense to do so?
> What are the consequences if the synchronous caller calls it multiple times? Can we protect against that, or at least detect it, and would it make sense to do so?

If it calls it sequentially, then it'll just rotate multiple times, which would be dumb. If it calls it in parallel, all but one will fail. If it calls it recursively through the callback, the recursive call will fail.
I've added test cases that demonstrate all three of those possibilities, though the focus is on the latter two.
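A hedged sketch of how the parallel and recursive cases can fail fast rather than block; the try_to_lock approach and names are assumptions, not necessarily the PR's exact mechanism:

```
#include <boost/thread/locks.hpp>
#include <boost/thread/shared_mutex.hpp>

boost::upgrade_mutex mutex_;

bool rotateSketch()
{
    // Only one thread can hold upgrade ownership; a concurrent or re-entrant
    // caller fails to acquire it and returns false instead of deadlocking.
    boost::upgrade_lock<boost::upgrade_mutex> lock(mutex_, boost::try_to_lock);
    if (!lock.owns_lock())
        return false;
    // ... rotation work ...
    return true;
}
```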
// boost::upgrade_mutex guarantees that only one thread can have "upgrade
// ownership" at a time, so this is 100% safe, and guaranteed to avoid
// deadlock.
boost::upgrade_to_unique_lock writeLock(readLock);
Is there a risk that readers keep arriving such that the shared lock is always held for reading by at least one of them, thereby starving the writer?
Yeah, I think such a situation is theoretically possible. I'm not sure it's actually the case here, but if we decide on preventing it, there are no standard C++ or Boost locks that provide writer fairness. If we decide to ensure this, something like the following would be required:

```
#include <atomic>
#include <thread>
#include <boost/thread/shared_mutex.hpp>

boost::shared_mutex mutex_;
std::atomic<bool> writerWaiting_{false};

void anyReadFunction() {
    while (writerWaiting_.load(std::memory_order_acquire)) {
        std::this_thread::yield(); // Back off if a writer is waiting
    }
    boost::shared_lock lock(mutex_);
    // ...
}

void rotateWithLock(/* ... */) {
    boost::upgrade_lock<boost::shared_mutex> upgradeLock(mutex_);
    // ...
    writerWaiting_.store(true, std::memory_order_release); // should do it RAII-way, this is an example
    boost::upgrade_to_unique_lock<boost::shared_mutex> writeLock(upgradeLock);
    // ...
    writerWaiting_.store(false, std::memory_order_release);
}
```
The description for shared_mutex includes:
> Note the lack of reader-writer priority policies in shared_mutex. This is due to an algorithm credited to Alexander Terekhov which lets the OS decide which thread is the next to get the lock without caring whether a unique lock or shared lock is being sought. This results in a complete lack of reader or writer starvation. It is simply fair.

Since upgrade_mutex is an extension, I think it's safe to assume it uses the same algorithm.
I don't think this is applicable. Bart mentions the situation where the writer would never have a chance to acquire the lock, since there are always some read threads holding it (they can always enter, as other read locks won't prevent it).
This is a highly theoretical situation though; I don't think we can hit it here.
EDIT: looking into the implementation of upgrade_mutex, I can see that this seems to be addressed there. Not an issue.
Force-pushed 8b90537 to 217bc1d
I do not like that the code, which is meant to be more robust, has apparently become more brittle by the switch to …
I think that we do not need to pass a lambda to …
* upstream/develop: chore: Fix small typos in protocol files (5279) chore: Rename missing-commits job, and combine nix job files (5268) docs: Add a summary of the git commit message rules (5283)
- Follow-up to #4989, which stated "Ideally, the code should be rewritten so it doesn't hold the mutex during the callback and the mutex should be changed back to a regular mutex."
- Change the order of the rotation operation. Before: Write the state table, then update the backend pointers, all under lock. If rippled crashes between these two steps, the old archive DB folder would be orphaned and need to be cleaned up manually. After: Update the backend pointers under lock, then write the state table outside of lock. This opens a small window where data can be written to the new DB before the state is updated. If rippled crashes between these two steps, the new writable DB folder would be orphaned, and any data written will be lost. That data will need to be downloaded from peers on startup along with any other data missed while rippled is offline.
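A small sketch of the reordering described in the commit message above; all names here are hypothetical stand-ins, not the PR's code:

```
#include <memory>
#include <mutex>
#include <string>

struct Backend {};
struct StateDB { void setState(std::string const&) {} };

std::mutex mutex_;
std::shared_ptr<Backend> writableBackend_;
StateDB stateDB;

// After the change: swap the pointers under the lock, then persist the state
// outside it. A crash between the two steps now orphans the new writable DB
// folder rather than the old archive DB folder.
void commitRotation(std::shared_ptr<Backend> newBackend, std::string const& newState)
{
    {
        std::lock_guard lock(mutex_);
        writableBackend_ = std::move(newBackend); // step 1: swap under lock
    }
    stateDB.setState(newState); // step 2: write state table, no lock held
}
```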
Force-pushed e668acd to 4986662
{
    std::lock_guard lock(mutex_);
    auto const …
The rotate function accepts a function, and then creates and uses another temporary function, which seems needlessly complicated. Can this be simplified?
Yes. I could revert 4986662 (#5276), which I wrote that way to make all those local variables const.
Co-authored-by: Bart <bthomee@users.noreply.github.com>
This reverts commit 4986662.
* upstream/develop: fix: Replace charge() by fee_.update() in OnMessage functions (5269) docs: ensure build_type and CMAKE_BUILD_TYPE match (5274)
High Level Overview of Change
Follow-up to #4989, which stated "Ideally, the code should be rewritten so it doesn't hold the mutex during the callback and the mutex should be changed back to a regular mutex."
This rewrites the code so that the lock is not held during the callback. Instead it locks twice, once before, and once after. This is safe due to the structure of the code, but is checked after the second lock. This allows mutex_ to be changed back to a regular mutex.
Context of Change
From #4989:
Type of Change
Test Plan
Testing can be the same as that for #4989, plus ensure that there are no regressions.
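For illustration, a minimal sketch of the lock-twice structure described in the overview; the names, signature, and the pointer-identity check are assumptions, not the PR's exact code:

```
#include <functional>
#include <memory>
#include <mutex>
#include <string>

struct Backend { std::string name; };

std::mutex mutex_;
std::shared_ptr<Backend> writableBackend_;

bool rotateWithLock(
    std::function<std::unique_ptr<Backend>(std::string const&)> const& f)
{
    std::shared_ptr<Backend> writable;
    {
        std::lock_guard lock(mutex_); // first lock: read current state
        writable = writableBackend_;
    }

    // The callback runs with no lock held, so it may freely use the database.
    auto newBackend = f(writable->name);

    {
        std::lock_guard lock(mutex_); // second lock: commit the rotation
        // Check that nothing rotated underneath us while unlocked.
        if (writableBackend_ != writable)
            return false;
        writableBackend_ = std::move(newBackend);
    }
    return true;
}
```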