fix: prevent setDatabase starvation which leads to no updates of currently loaded database #613
Conversation
This is a preview of the changelog of the next release. If this branch is not up-to-date with the current main branch, the changelog may not be accurate. Rebase your branch on the main branch to get the most accurate changelog. Note that this might contain changes that are on main, but not yet released.
Changelog: 0.2.24 (2024-10-14)
Bug Fixes
Thanks! I still have a few questions:
- Is my understanding correct that what happened on PPX is that: a) SILO didn't manage to switch to the new version (for a reason I don't understand) and b) we somehow removed the data dir that was hence still in use (assuming the new one would be used, but it actually wasn't)?
- How does the switch of versions happen? (Maybe there are some docs, in that case no need to re-explain.) It's cool that you allow loading of new data without downtime; it's non-trivial, and there are quite a few pitfalls as a result.
- What happens to new requests that are received while the db is being loaded? Are they rejected until the load is successful? (You mention that the old version stays alive until the last request of the old version has completed.)
- What exactly happened in the PPX situation? The switch simply never happened? Why wasn't it retried? Will it be retried in the future, given this PR?
- We should not remove old data version dirs until the change has happened successfully; this is why we hit a SILO error on PPX instead of the issue just being that we show old data, IIUC. How can we tell that the old data version is no longer used, so that we can safely delete the unused one? (There's an issue for adding locks: Implement lock for currently loaded directories #612.)
We should create an issue here in SILO to link to, and link to the Loculus Slack discussion (instead of just saying that there is a discussion without an explicit link).
LGTM
Let's either create an issue that the changelog can reference or squash-merge this, such that the PR number ends up in the changelog.
SPDLOG_INFO(
   "New database with version {} successfully loaded.",
   "Did not load new database with version {} successfully.",
We have log statements in every catch. Should we maybe simply add a clarifying "failed to load new database with version xyz" to the SPDLOG_ERROR(ex.what())?
SILO checks a certain directory every two seconds. It expects data versions in there (folders named after a timestamp). If it finds a folder with a higher number than the currently loaded one, it attempts to load its contents.
They simply keep running. Loading the new DB happens in the background. Once the DB is loaded, the current instance is swapped out and new requests will see the new DB.
Is there something in the logs? It would actually be good to know.
Currently, you cannot tell from the filesystem which data version is loaded. You could ask the running SILO API.
Yes, that is correct.
To switch the dataset, the "switcher" requests a unique lock. It is guaranteed that when it gets the lock, all other threads that also want this lock will block until we release our unique lock.
Requests are still being served from the old state. The load and switch are two separate things.
The load of the new dataset was successful, as can be seen in the log messages, but the switcher was starved while waiting for its lock. It would retry if it failed, but the thread wanting to switch never actually failed; it was just waiting indefinitely for a lock. The requirement of a unique lock was unnecessarily expensive and complicated. Now, we no longer use a lock, but instead just atomically swap the old dataset for the new one. We do not really care whether a racy request still gets the old dataset (the header will show the updated dataset version after the switch was successful). Currently running queries rely on the fact that the data will stay in memory until they have finished running. This is achieved by changing the return type to a shared_ptr.
I added #612 for this.
@@ -137,6 +137,11 @@ void silo_api::DatabaseDirectoryWatcher::checkDirectoryForData(Poco::Timer& /*ti
try {
database_mutex.setDatabase(silo::Database::loadDatabaseState(most_recent_database_state->first
Is there a timeout on this? When the incident happened, is this where it got stuck indefinitely? Just to help me understand what happened.
Does this mean that swapping can be blocked if there happens to be continuously more than one active request? This would require a lot of requests and/or long ones, but could happen at high load. Update: IIUC, this is sort of what happened; at least, something prevented the lock from being acquired.
Does the new shared pointer implementation avoid the problem altogether, in that new requests are served with the new version as soon as it has loaded successfully, even if there are still active old-version requests? Just checking my understanding here.
void silo_api::DatabaseMutex::setDatabase(silo::Database&& new_database) {
   const std::unique_lock lock(mutex);
This is the line where it got stuck before. A log message before would have helped a lot: "Acquiring unique lock to switch out datasets"
This works as long as there's only one instance running; if there are multiple pods being load balanced, this would not be foolproof. But in our current PPX setup it seems like a useful thing to try.
Would it make sense to rename the classes to remove the
Yes, precisely. Instead of waiting for all requests to finish, new requests now immediately get the new version. The lifetime management is done by the shared_ptr. (Without it, we would not know when we could release, i.e. remove from memory, the old database.)
This might be too hard to do, but just as an idea, it'd be neat if it was possible to test that the version switch now works as expected:
- Loading version 1
- Faking/making running request A, that blocks until a later version becomes available
- Make version 2 available
- Make new request B and assert it uses version 2
- Assert that request A completes successfully with data from version 1
I just did that, it works. I added a 20-second sleep to the info endpoint and called it with curl, while preprocessing a new data version:
You can see that the request arrived before the new data version was loaded. Then the new data version was loaded successfully, then the request finished and still shows the old data version.
Nice @fengelniederhammer! I guess it's not straightforward to turn this into an automatic test 😄
I don't know how. You need to delay requests and make sure that the reload happens (and is finished!) during a running request. That seems pretty impossible to me.
Summary
Resolves an issue mentioned on Slack, where the swap-over to the latest folder can sometimes fail.
Before, the setDatabase call requested a write mutex. This was unnecessarily expensive. Instead, it is enough to just atomically load and store the current database. The lifetime of a database used by currently running transactions is now managed by a shared_ptr, which will correctly deallocate the old state once the last reference to the database is dropped (i.e. the last query using the old state has finished).
PR Checklist
- [ ] All necessary documentation has been adapted or there is an issue to do so.
- [ ] The implemented feature is covered by an appropriate test.