refactor: Change recursive_mutex to mutex in DatabaseRotatingImp #5276
Conversation
- Follow-up to #4989, which stated "Ideally, the code should be rewritten so it doesn't hold the mutex during the callback and the mutex should be changed back to a regular mutex."
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
## develop #5276 +/- ##
=======================================
Coverage 78.1% 78.2%
=======================================
Files 790 790
Lines 67646 67653 +7
Branches 8165 8160 -5
=======================================
+ Hits 52863 52881 +18
+ Misses 14783 14772 -11
* Use a second mutex to protect the backends from modification
* Remove a bunch of warning comments
// backendMutex_ is only needed when the *Backend_ members are modified.
// Reads are protected by the general mutex_.
std::mutex backendMutex_;
As this sounds like a typical single-writer, one-or-more-readers scenario, is it possible to use a single shared_mutex here instead of these two mutexes?
It's possible, but there are risks. The biggest one is that I'd have to take a shared_lock at the start of rotateWithLock, and upgrade it to a unique_lock after the callback. If there is somehow ever a second caller to that function, or even a different caller that upgrades the lock, there is a potential deadlock.
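A minimal sketch of the hazard being described, assuming a plain std::shared_mutex (illustrative, not the PR's code): the shared lock must be dropped before the unique lock can be taken, which opens a race window, and two callers that instead tried to upgrade while both still holding shared ownership would deadlock.

```
#include <mutex>
#include <shared_mutex>

std::shared_mutex mutex_;

void rotateSketch()
{
    std::shared_lock readLock(mutex_);
    // ... run the callback while holding shared ownership ...
    readLock.unlock(); // std::shared_mutex cannot upgrade; must release first

    // Window: a second caller can acquire the lock here and rotate first.
    std::unique_lock writeLock(mutex_);
    // ... swap the backends under exclusive ownership ...
}
```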
@bthomee @vvysokikh1 Ok, it took waaaaaaay longer than it should have because I kept trying clever things that didn't work or turned out to be unsupported, but I rewrote the locking, changed to a shared mutex, and I think I've got a pretty foolproof solution here. And a unit test to exercise it.
But don't take my word for it. The point of code reviews is to spot the stuff I didn't consider.
I think your solution does not completely solve the issue. It's still technically possible to deadlock: calling rotateWithLock from inside the callback will deadlock on your new mutex.
If it's good enough for now, please add comments to rotateWithLock() warning users against calling rotateWithLock(), directly or indirectly, from the callback.
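A sketch of the sort of warning comment being requested; the wording and the signature are illustrative assumptions, not the PR's actual code:

```
#include <functional>
#include <memory>
#include <string>

struct Backend; // hypothetical stand-in for the node store backend type

// WARNING: Do not call rotateWithLock(), directly or indirectly, from inside
// the callback `f`. The rotation holds mutex_, so a re-entrant call will
// deadlock.
bool rotateWithLock(
    std::function<std::unique_ptr<Backend>(std::string const& writableName)> const& f);
```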
* upstream/develop: Updates Conan dependencies (5256)
Force-pushed 913df26 to 9f564bc
- Rewrite the locking in DatabaseRotatingImp::rotateWithLock to use a shared_lock, and write a unit test to show (as much as possible) that it won't deadlock.
Force-pushed 13fb47c to d912b50
* upstream/develop: fix: Do not allow creating Permissioned Domains if credentials are not enabled (5275) fix: issues in `simulate` RPC (5265)
std::unique_lock writeLock(mutex_);
if (!rotating)
{
    // Once this flag is set, we're committed to doing the work and
    // returning true.
    rotating = true;
}
else
{
    // This should only be reachable through unit tests.
    XRPL_ASSERT(
        unitTest_,
        "ripple::NodeStore::DatabaseRotatingImp::rotateWithLock "
        "unit testing");
    return false;
}
Why do we need to lock the mutex here? I would assume we can make rotating an atomic bool and use compare_exchange to switch this flag safely.
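A minimal sketch of this suggestion, with hypothetical names: compare_exchange lets exactly one caller flip the flag without taking the mutex.

```
#include <atomic>

std::atomic<bool> rotating_{false};

bool tryStartRotation()
{
    bool expected = false;
    // Atomically sets rotating_ to true only if it was false; a concurrent
    // caller sees the exchange fail and can bail out instead of blocking.
    return rotating_.compare_exchange_strong(expected, true);
}
```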
> Why do we need to lock the mutex here? I would assume we can make rotating an atomic bool and use compare_exchange to switch this flag safely.

I got rid of rotating entirely.
auto const writableBackend = [&] {
    std::shared_lock readLock(mutex_);
    XRPL_ASSERT(
        rotating,
        "ripple::NodeStore::DatabaseRotatingImp::rotateWithLock rotating "
        "flag set");

    return writableBackend_;
}();

auto newBackend = f(writableBackend->getName());
I don't think this lambda and read lock are actually required with the current implementation. We are only using the write lock before (which might be switched to an atomic) and after. Assuming the previous synchronization block switches the rotating flag, there should be no other 'write' thread able to proceed and capture writeLock while we are here.
// This should only be reachable through unit tests.
XRPL_ASSERT(
    unitTest_,
    "ripple::NodeStore::DatabaseRotatingImp::rotateWithLock "
    "unit testing");
return false;
I think this comment doesn't work. It can be reached not only by unit tests, but also by an accidental concurrent call to rotateWithLock or an indirect call to it from the callback.
> I think this comment doesn't work. It can be reached not only by unit tests, but also by an accidental concurrent call to rotateWithLock or an indirect call to it from the callback.

It will only be intentionally reachable through tests. An accidental concurrent call is not intentional. Will update.
// "Shared mutexes do not support direct transition from shared to unique | ||
// ownership mode: the shared lock has to be relinquished with | ||
// unlock_shared() before exclusive ownership may be obtained with lock()." | ||
mutable std::shared_timed_mutex mutex_; |
What is the reason for choosing a timed mutex here? I believe shared_mutex would be enough.
> What is the reason for choosing a timed mutex here? I believe shared_mutex would be enough.

No longer relevant, but I had been working with try_lock_for for a while, and missed changing this back.
// Because of the "rotating" flag, there should be no way any other thread
// is waiting at this point. As long as they all release the shared_lock
// before taking the unique_lock (which they have to, because upgrading is
For read/write threads, you can also consider using a boost::upgrade_lock [1] wrapping a boost::upgrade_mutex [2] here instead of std::shared_lock with a std::shared_(timed)_mutex, as it supports upgrading a read lock to a write lock without unlocking in between. Read-only threads can then use boost::shared_lock while wrapping that same boost::upgrade_mutex.
[1] https://www.boost.org/doc/libs/1_86_0/doc/html/thread/synchronization.html#thread.synchronization.locks.upgrade_lock
[2] https://www.boost.org/doc/libs/1_86_0/doc/html/thread/synchronization.html#thread.synchronization.mutex_types.upgrade_mutex
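A sketch of the pattern these links describe, with hypothetical names: upgrade ownership coexists with shared owners until the upgrade, so no unlock/relock gap is needed.

```
#include <boost/thread/locks.hpp>
#include <boost/thread/shared_mutex.hpp>

boost::upgrade_mutex mutex_;

void rotateSketch()
{
    // Shared with readers, but only one thread may hold upgrade ownership.
    boost::upgrade_lock<boost::upgrade_mutex> readLock(mutex_);
    // ... read state, run the callback ...

    // Atomic transition to exclusive ownership, with no unlocked window.
    boost::upgrade_to_unique_lock<boost::upgrade_mutex> writeLock(readLock);
    // ... swap the backends ...
}

void readerSketch()
{
    boost::shared_lock<boost::upgrade_mutex> lock(mutex_); // plain read access
    // ... read state ...
}
```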
Thanks for those links. That's exactly what I was looking for previously, but (obviously) didn't find. I'll update again ASAP.
* upstream/develop: fix: Amendment to add transaction flag checking functionality for Credentials (5250) fix: Omit superfluous setCurrentThreadName call in GRPCServer.cpp (5280)
- Use a boost::upgrade_mutex, which implements a clean read -> write lock upgrade.
boost::upgrade_lock readLock(mutex_, boost::defer_lock);
Nit: I would suggest renaming this from readLock, as this is not actually a read lock but a unique lock that allows other shared_locks. The name is a bit confusing, as it suggests multiple threads can enter, while this is not the case.
> Nit: I would suggest renaming this from readLock, as this is not actually a read lock but a unique lock that allows other shared_locks. The name is a bit confusing, as it suggests multiple threads can enter, while this is not the case.

Done. Also added a comment, since a lot of folks are unfamiliar with this class.
savedState.lastRotated = lastRotated;
state_db_.setState(savedState);

clearCaches(validatedSeq);
clearCaches(validatedSeq) is already called on line 363 - is calling it here again needed?
> clearCaches(validatedSeq) is already called on line 363 - is calling it here again needed?

I believe so, because some time could have passed between the two calls, and this helps clear out any entries that were added in that time.
bool const unitTest_;

// Implements the "UpgradeLockable" concept
// https://www.boost.org/doc/libs/1_86_0/doc/html/thread/synchronization.html#thread.synchronization.mutex_concepts.upgrade_lockable
Suggested change:
- // https://www.boost.org/doc/libs/1_86_0/doc/html/thread/synchronization.html#thread.synchronization.mutex_concepts.upgrade_lockable
+ // https://www.boost.org/doc/libs/1_83_0/doc/html/thread/synchronization.html#thread.synchronization.mutex_concepts.upgrade_lockable

We're currently using 1.83. Keeping it as-is is fine, but this is simply more accurate.
> We're currently using 1.83. Keeping it as-is is fine, but this is simply more accurate.

I updated it to link to the latest release because I don't expect it to change much, and don't want to have to chase version numbers.
Really don't know how to say this without sounding like a complete joy-kill, but 1.87 is the latest...
// This function should be the only one taking any kind of unique/write
// lock, and should only be called once at a time by its synchronous caller.
What are the consequences if the synchronous caller calls it multiple times? Can we protect against that, or at least detect it, and would it make sense to do so?
> What are the consequences if the synchronous caller calls it multiple times? Can we protect against that, or at least detect it, and would it make sense to do so?

If it calls it sequentially, then it'll just rotate multiple times, which would be dumb. If it calls it in parallel, all but one will fail. If it calls it recursively through the callback, the recursive call will fail.
I've added test cases that demonstrate all three of those possibilities, though the focus is on the latter two.
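A hedged sketch of how the parallel and recursive cases can fail fast rather than block; the try_to_lock approach and names are assumptions, not necessarily the PR's exact mechanism:

```
#include <boost/thread/locks.hpp>
#include <boost/thread/shared_mutex.hpp>

boost::upgrade_mutex mutex_;

bool rotateSketch()
{
    // Only one thread can hold upgrade ownership; a concurrent or re-entrant
    // caller fails to acquire it and returns false instead of deadlocking.
    boost::upgrade_lock<boost::upgrade_mutex> lock(mutex_, boost::try_to_lock);
    if (!lock.owns_lock())
        return false;
    // ... rotation work ...
    return true;
}
```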
// boost::upgrade_mutex guarantees that only one thread can have "upgrade
// ownership" at a time, so this is 100% safe, and guaranteed to avoid
// deadlock.
boost::upgrade_to_unique_lock writeLock(readLock);
Is there a risk that readers keep arriving such that the shared lock is always held for reading by at least one of them, thereby starving the writer?
Yeah, I think such a situation is theoretically possible. I'm not sure it's actually the case here, but if we decide on preventing it, there are no standard C++ or Boost locks that provide writer fairness. If we decide to ensure this, something like the following would be required:

```
#include <atomic>
#include <thread>
#include <boost/thread/shared_mutex.hpp>

boost::shared_mutex mutex_;
std::atomic<bool> writerWaiting_{false};

void anyReadFunction() {
    while (writerWaiting_.load(std::memory_order_acquire)) {
        std::this_thread::yield(); // Back off if a writer is waiting
    }
    boost::shared_lock lock(mutex_);
    // ...
}

void rotateWithLock(/* ... */) {
    boost::upgrade_lock<boost::shared_mutex> upgradeLock(mutex_);
    // ...
    writerWaiting_.store(true, std::memory_order_release); // should do it RAII-way, this is an example
    boost::upgrade_to_unique_lock<boost::shared_mutex> writeLock(upgradeLock);
    // ...
    writerWaiting_.store(false, std::memory_order_release);
}
```
The description for shared_mutex includes:
> Note the lack of reader-writer priority policies in shared_mutex. This is due to an algorithm credited to Alexander Terekhov which lets the OS decide which thread is the next to get the lock without caring whether a unique lock or shared lock is being sought. This results in a complete lack of reader or writer starvation. It is simply fair.

Since upgrade_mutex is an extension, I think it's safe to assume it uses the same algorithm.
I don't think this is applicable. Bart mentions the situation where the writer would never have a chance to acquire the lock, since there are always some read threads holding it (they can always enter, as other read locks won't prevent it).
This is a highly theoretical situation though; I don't think we can hit it here.
EDIT: looking into the implementation of upgrade_mutex, I can see that this seems to be addressed there. Not an issue.
Force-pushed 8b90537 to 217bc1d
I do not like that the code, which is meant to be more robust, has apparently become more brittle by the switch to …
I think that we do not need to pass a lambda to …
* upstream/develop: chore: Fix small typos in protocol files (5279) chore: Rename missing-commits job, and combine nix job files (5268) docs: Add a summary of the git commit message rules (5283)
- Follow-up to #4989, which stated "Ideally, the code should be rewritten so it doesn't hold the mutex during the callback and the mutex should be changed back to a regular mutex."
- Change the order of the rotation operation. Before: Write the state table, then update the backend pointers, all under lock. If rippled crashes between these two steps, the old archive DB folder would be orphaned and need to be cleaned up manually. After: Update the backend pointers under lock, then write the state table outside of lock. This opens a small window where data can be written to the new DB before the state is updated. If rippled crashes between these two steps, the new writable DB folder would be orphaned, and any data written will be lost. That data will need to be downloaded from peers on startup along with any other data missed while rippled is offline.
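A small sketch of the reordering described in the commit message above; all names here are hypothetical stand-ins, not the PR's code:

```
#include <memory>
#include <mutex>
#include <string>

struct Backend {};
struct StateDB { void setState(std::string const&) {} };

std::mutex mutex_;
std::shared_ptr<Backend> writableBackend_;
StateDB stateDB;

// After the change: swap the pointers under the lock, then persist the state
// outside it. A crash between the two steps now orphans the new writable DB
// folder rather than the old archive DB folder.
void commitRotation(std::shared_ptr<Backend> newBackend, std::string const& newState)
{
    {
        std::lock_guard lock(mutex_);
        writableBackend_ = std::move(newBackend); // step 1: swap under lock
    }
    stateDB.setState(newState); // step 2: write state table, no lock held
}
```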
Force-pushed e668acd to 4986662
{
    std::lock_guard lock(mutex_);
    auto const …
The rotate function accepts a function, and then creates and uses another temporary function, which seems needlessly complicated. Can this be simplified?
Yes. I could revert 4986662 (#5276), which I wrote that way to make all those local variables const.
Co-authored-by: Bart <bthomee@users.noreply.github.com>
This reverts commit 4986662.
* upstream/develop: fix: Replace charge() by fee_.update() in OnMessage functions (5269) docs: ensure build_type and CMAKE_BUILD_TYPE match (5274)
High Level Overview of Change
Follow-up to #4989, which stated "Ideally, the code should be rewritten so it doesn't hold the mutex during the callback and the mutex should be changed back to a regular mutex."
This rewrites the code so that the lock is not held during the callback. Instead it locks twice, once before, and once after. This is safe due to the structure of the code, but is checked after the second lock. This allows mutex_ to be changed back to a regular mutex.
Context of Change
From #4989:
Type of Change
Test Plan
Testing can be the same as that for #4989, plus ensure that there are no regressions.
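For illustration, a minimal sketch of the lock-twice structure described in the overview; the names, signature, and the pointer-identity check are assumptions, not the PR's exact code:

```
#include <functional>
#include <memory>
#include <mutex>
#include <string>

struct Backend { std::string name; };

std::mutex mutex_;
std::shared_ptr<Backend> writableBackend_;

bool rotateWithLock(
    std::function<std::unique_ptr<Backend>(std::string const&)> const& f)
{
    std::shared_ptr<Backend> writable;
    {
        std::lock_guard lock(mutex_); // first lock: read current state
        writable = writableBackend_;
    }

    // The callback runs with no lock held, so it may freely use the database.
    auto newBackend = f(writable->name);

    {
        std::lock_guard lock(mutex_); // second lock: commit the rotation
        // Check that nothing rotated underneath us while unlocked.
        if (writableBackend_ != writable)
            return false;
        writableBackend_ = std::move(newBackend);
    }
    return true;
}
```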