
fix: prevent setDatabase starvation which leads to no updates of currently loaded database #613

Merged: 1 commit into main on Oct 15, 2024

Conversation

@Taepper commented Oct 14, 2024

Summary

Resolves an issue mentioned on Slack, where the swap-over to the latest folder could sometimes fail.

Before, the setDatabase call requested a write-mutex. This was unnecessarily expensive. Instead, it is enough to atomically load and store the current database. Database lifetimes for currently running transactions are now managed by a shared_ptr, which correctly deallocates the old state once the last reference to the database is dropped (i.e. the last query using the old state has finished).
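
A minimal sketch of the new scheme, assuming C++20's std::atomic<std::shared_ptr> (the class name DatabaseHolder and the exact mechanism are illustrative, not the actual SILO code):

#include <atomic>
#include <memory>

// Sketch only: the setter never has to wait for running queries, and the old
// Database is destroyed automatically once the last query holding a
// reference to it finishes.
class DatabaseHolder {
   std::atomic<std::shared_ptr<silo::Database>> database;

  public:
   void setDatabase(std::shared_ptr<silo::Database> new_database) {
      database.store(std::move(new_database));  // never blocks on readers
   }

   std::shared_ptr<silo::Database> getDatabase() const {
      return database.load();  // each query pins its own snapshot
   }
};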

PR Checklist

- [ ] All necessary documentation has been adapted or there is an issue to do so.
- [ ] The implemented feature is covered by an appropriate test.


github-actions bot commented Oct 14, 2024

This is a preview of the changelog of the next release. If this branch is not up-to-date with the current main branch, the changelog may not be accurate. Rebase your branch on the main branch to get the most accurate changelog.

Note that this might contain changes that are on main, but not yet released.

Changelog:

0.2.24 (2024-10-14)

Bug Fixes

  • prevent setDatabase starvation which leads to no updates of currently loaded database (8a6197d)

@corneliusroemer left a comment

Thanks! I still have a few questions:

  1. Is my understanding correct that what happened on PPX is that: a) SILO didn't manage to switch to the new version (for a reason I don't understand), and b) we somehow removed the data dir that was hence still in use (assuming the new one would be used, which it actually wasn't)?
  2. How does the switch of versions happen? (Maybe there are some docs, in that case no need to reexplain.) It's cool that you allow loading of new data without downtime; it's non-trivial, and there are quite a few pitfalls as a result.
  3. What happens to new requests that are received while the db is being loaded? Are they rejected until the load is successful? (You mention that the old version stays alive until the last request of the old version has completed.)
  4. What exactly happened in the PPX situation? The switch simply never happened? Why wasn't it retried? Will it be retried in the future, given this PR?
  5. We should not remove old data version dirs until the change has happened successfully; this is why we hit a SILO error on PPX instead of the issue being just showing old data, IIUC. How can we tell the old data version is no longer used, so that we can safely delete the unused one? (There's an issue for adding locks: Implement lock for currently loaded directories #612)

@corneliusroemer

We should create an issue here in SILO to link to, and link to the Loculus Slack discussion (instead of just saying that there is a discussion without an explicit link).

@fengelniederhammer left a comment

LGTM

Let's either create an issue that the changelog can reference or squash-merge this, such that the PR number ends up in the changelog.

Comment on lines 157 to +158
SPDLOG_INFO(
"New database with version {} successfully loaded.",
"Did not load new database with version {} successfully.",

We have log statements in every catch. Should we maybe simply add a clarifying "failed to load new database with version xyz" to the SPDLOG_ERROR(ex.what());?
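
I.e., something along these lines (a sketch of the suggestion; `version` stands in for whatever identifies the data version at that point in the code):

// Sketch: fold the context into the existing error log in the catch block.
} catch (const std::exception& ex) {
   SPDLOG_ERROR("Failed to load new database with version {}: {}", version, ex.what());
}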

@fengelniederhammer

2. How does the switch of versions happen? (Maybe there are some docs, in that case no need to reexplain.) It's cool that you allow loading of new data without downtime; it's non-trivial, and there are quite a few pitfalls as a result.

SILO checks a certain directory every two seconds. It expects data versions in there (folders named after a timestamp). If it finds a folder with a higher number than the currently loaded one, it attempts to load its contents.
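
A sketch of that polling step (not the actual SILO code; handling of non-numeric folder names is omitted):

#include <cstdint>
#include <filesystem>
#include <optional>
#include <string>

// Called every two seconds (in SILO, via a Poco::Timer). Returns the path of
// the newest data version folder if it is newer than the currently loaded one.
std::optional<std::filesystem::path> findNewerDataVersion(
   const std::filesystem::path& watched_directory,
   uint64_t currently_loaded_timestamp
) {
   std::optional<std::filesystem::path> most_recent;
   uint64_t most_recent_timestamp = currently_loaded_timestamp;
   for (const auto& entry : std::filesystem::directory_iterator(watched_directory)) {
      if (!entry.is_directory()) {
         continue;
      }
      // Data version folders are named after a timestamp, e.g. "1728979861"
      const uint64_t timestamp = std::stoull(entry.path().filename().string());
      if (timestamp > most_recent_timestamp) {
         most_recent_timestamp = timestamp;
         most_recent = entry.path();
      }
   }
   return most_recent;
}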

3. What happens to new requests that are received while the db is being loaded? Are they rejected until the load is successful? (You mention that the old version stays alive until the last request of the old version has completed.)

They simply keep running. Loading the new DB happens in the background. Once the DB is loaded, the current instance is swapped and new requests will see the new DB.

4. What exactly happened in the PPX situation? The switch simply never happened? Why wasn't it retried? Will it be retried in the future, given this PR?

Is there something in the logs? It would actually be good to know.

5. We should not remove old data version dirs until the change has happened successfully; this is why we hit a SILO error on PPX instead of the issue being just showing old data, IIUC. How can we tell the old data version is no longer used, so that we can safely delete the unused one?

Currently, you cannot tell from the filesystem which data version is loaded. You could ask the running SILO API (e.g. via the data-version header it returns on responses).

@Taepper commented Oct 15, 2024

Is my understanding correct that what happened on PPX is that: a) SILO didn't manage to switch to the new version (for a reason I don't understand), and b) we somehow removed the data dir that was hence still in use (assuming the new one would be used, which it actually wasn't)?

Yes, that is correct.

How does the switch of versions happen? (Maybe there are some docs, in that case no need to reexplain.) It's cool that you allow loading of new data without downtime; it's non-trivial, and there are quite a few pitfalls as a result.

To switch the dataset, the "switcher" requests a unique lock. It is guaranteed that when it gets the lock, all other threads that also want this lock will block until we release our unique lock.
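
For illustration, the old reader/writer scheme was roughly as follows (a sketch, not the actual SILO code):

#include <mutex>
#include <shared_mutex>

std::shared_mutex database_access;  // sketch: guards the loaded database

void runQuery() {
   const std::shared_lock read_lock(database_access);  // many of these may overlap
   // ... evaluate the query against the current database ...
}

void switchDatabase() {
   // Granted only once no shared_lock is held. If new queries keep arriving
   // so that shared locks always overlap, this line can wait indefinitely
   // (writer starvation; writer priority is implementation-defined).
   const std::unique_lock write_lock(database_access);
   // ... swap in the new database ...
}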

What happens to new requests that are received while the db is being loaded? Are they rejected until the load is successful? (You mention that the old version stays alive until the last request of the old version has completed.)

Requests are still being served from the old state. The load and switch are two separate things.

What exactly happened in the PPX situation? The switch simply never happened? Why wasn't it retried? Will it be retried in the future, given this PR?

The load of the new dataset was successful, as can be seen in the log messages, but the switcher was starved while waiting for its lock. It would retry if it failed, but the thread wanting to switch never actually failed; it was just waiting indefinitely for a lock.

The requirement of a unique lock was unnecessarily expensive and complicated. Now, we no longer need a lock; instead we just atomically swap the old dataset for the new dataset. We do not really care whether a racy request still gets the old dataset or not (the header will show the updated dataset version after the switch has succeeded). Currently running queries rely on the fact that the data will stay in memory until they have finished running. This is achieved by changing the return type to a shared_ptr, which only releases its memory once all references to the old dataset are gone.

We should not remove old data version dirs until the change has happened successfully; this is why we hit a SILO error on PPX instead of the issue being just showing old data, IIUC. How can we tell the old data version is no longer used, so that we can safely delete the unused one?

I added #612 for this.

@@ -137,6 +137,11 @@ void silo_api::DatabaseDirectoryWatcher::checkDirectoryForData(Poco::Timer& /*ti
try {
database_mutex.setDatabase(silo::Database::loadDatabaseState(most_recent_database_state->first

Is there a timeout on this? When the incident happened, is this where it got stuck indefinitely? Just to help me understand what happened.

@corneliusroemer commented Oct 15, 2024

Loading the new DB happens in the background. Once the DB is loaded, the current instance is swapped and new requests will see the new DB.

Does this mean that swapping can be blocked if there happens to be continuously more than one active request? This would require a lot of requests and/or long ones, but could happen at high load.

Update:

IIUC, this is sort of what happened; at least, something prevented the lock from being acquired.

The requirement of a unique lock was unnecessarily expensive and complicated. Now, we no longer need a lock; instead we just atomically swap the old dataset for the new dataset. We do not really care whether a racy request still gets the old dataset or not (the header will show the updated dataset version after the switch has succeeded). Currently running queries rely on the fact that the data will stay in memory until they have finished running. This is achieved by changing the return type to a shared_ptr, which only releases its memory once all references to the old dataset are gone.

Does the new shared pointer implementation avoid the problem altogether in that new requests are served with the new version as soon as it has loaded successfully, even if there are still active old version requests? Just checking my understanding here.

void silo_api::DatabaseMutex::setDatabase(silo::Database&& new_database) {
const std::unique_lock lock(mutex);

This is the line where it got stuck before. A log message before would have helped a lot: "Acquiring unique lock to switch out datasets"
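
I.e., roughly (a sketch of the old code with the suggested log line added):

void silo_api::DatabaseMutex::setDatabase(silo::Database&& new_database) {
   // Suggested log line: with it, the logs would have shown the switcher
   // stuck here waiting for the lock instead of going silent.
   SPDLOG_INFO("Acquiring unique lock to switch out datasets");
   const std::unique_lock lock(mutex);
   // ...
}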

@corneliusroemer

You could ask the running SILO API.

This works as long as there's only one instance running; if there are multiple pods being load-balanced, this would not be foolproof. But in our current PPX setup it seems like a useful thing to try.

@corneliusroemer

Would it make sense to rename the classes to remove the "mutex" from their names? IIUC, this PR drops the use of a mutex.

@Taepper commented Oct 15, 2024

Does the new shared pointer implementation avoid the problem altogether in that new requests are served with the new version as soon as it has loaded successfully, even if there are still active old version requests? Just checking my understanding here.

Yes, precisely. Instead of waiting for all requests to finish, new requests now immediately get the new version. The lifetime management is done by the shared_ptr. (Without it, we would not know when we could release, i.e. remove from memory, the old database.)
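
Reusing the hypothetical DatabaseHolder from the sketch in the PR description (names still illustrative):

// Lifetime sketch: the old database is freed only when the last
// shared_ptr copy disappears.
std::shared_ptr<silo::Database> old_db = holder.getDatabase();  // a long-running query pins the old state
holder.setDatabase(std::make_shared<silo::Database>(/* newly loaded state */));
// New requests now see the new database; old_db stays valid until the
// query finishes and the last reference is dropped.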

@corneliusroemer left a comment

This might be too hard to do, but just as an idea, it'd be neat if it were possible to test that the version switch now works as expected:

  1. Load version 1
  2. Fake/make a running request A that blocks until a later version becomes available
  3. Make version 2 available
  4. Make a new request B and assert it uses version 2
  5. Assert that request A completes successfully with data from version 1

@fengelniederhammer

This might be too hard to do, but just as an idea, it'd be neat if it were possible to test that the version switch now works as expected:

  1. Load version 1
  2. Fake/make a running request A that blocks until a later version becomes available
  3. Make version 2 available
  4. Make a new request B and assert it uses version 2
  5. Assert that request A completes successfully with data from version 1

I just did that; it works. I added a 20-second sleep to the info endpoint and called it with curl while preprocessing a new data version.
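
The delay was along these lines (a sketch; the actual handler name and signature are assumed):

#include <chrono>
#include <thread>

// Test-only change: delay the /info response so that a reload can complete
// while this request is still in flight.
void InfoHandler::handleRequest(/* request, response */) {
   std::this_thread::sleep_for(std::chrono::seconds(20));
   // ... respond with database info as usual ...
}

With that in place, the curl request and the server logs looked like this: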

$ curl 'localhost:8081/info' -vv
*   Trying 127.0.0.1:8081...
* Connected to localhost (127.0.0.1) port 8081 (#0)
> GET /info HTTP/1.1
> Host: localhost:8081
> User-Agent: curl/7.81.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Date: Tue, 15 Oct 2024 08:11:02 GMT
< Connection: Close
< X-Request-Id: 8693f837-3608-45d1-86a9-16d57f1af6b3
< data-version: 1728979701
< Content-Type: application/json
< 
* Closing connection 0
{"nBitmapsSize":3931,"numberOfPartitions":11,"sequenceCount":100,"totalSize":26589508}⏎  
[2024-10-15 10:10:30.935] [logger] [info] [database_directory_watcher.cpp:140] New database with version output/1728979701 successfully loaded.
[2024-10-15 10:11:02.467] [logger] [info] [logging_request_handler.cpp:15] Request Id [8693f837-3608-45d1-86a9-16d57f1af6b3] - Handling GET /info
[2024-10-15 10:11:14.127] [logger] [info] [database_directory_watcher.cpp:136] New data version detected: output/1728979861
[2024-10-15 10:11:14.127] [logger] [info] [database_config.cpp:211] Reading database config from output/1728979861/database_config.yaml
[2024-10-15 10:11:14.137] [logger] [info] [database.cpp:550] Finished loading partitions from output/1728979861/aa_sequences.silo
[2024-10-15 10:11:14.137] [logger] [debug] [database.cpp:668] build - initializing nucleotide sequences
[2024-10-15 10:11:14.631] [logger] [debug] [database.cpp:712] build - initializing amino acid sequences
[2024-10-15 10:11:14.882] [logger] [debug] [database.cpp:557] Loading partition data
[2024-10-15 10:11:16.962] [logger] [info] [database.cpp:580] Finished loading partition data
[2024-10-15 10:11:16.962] [logger] [debug] [database.cpp:728] Set data version to {timestamp: 1728979861, serializationVersion: 3}
[2024-10-15 10:11:16.962] [logger] [info] [database.cpp:583] Finished loading data_version from output/1728979861/data_version.silo
[2024-10-15 10:11:16.974] [logger] [info] [database.cpp:586] Database info after loading: sequence count: 100, total size: 26'589'508, N bitmaps size: 3'931
[2024-10-15 10:11:16.974] [logger] [info] [database_directory_watcher.cpp:140] New database with version output/1728979861 successfully loaded.
[2024-10-15 10:11:23.069] [logger] [info] [logging_request_handler.cpp:21] Request Id [8693f837-3608-45d1-86a9-16d57f1af6b3] - Responding with status code 200

You can see that the request arrived before the new data version was loaded. The new data version was then loaded successfully; afterwards the request finished and still reported the old data version.

@Taepper merged commit 24e4db6 into main on Oct 15, 2024
9 of 10 checks passed
@Taepper deleted the database-mutex-fix branch on October 15, 2024, 09:31
@corneliusroemer

Nice @fengelniederhammer! I guess it's not straightforward to turn this into an automatic test 😄

@fengelniederhammer

Nice @fengelniederhammer! I guess it's not straightforward to turn this into an automatic test 😄

I don't know how. You need to delay requests and make sure that the reload happens (and is finished!) during a running request. That seems pretty impossible to me.
