fix: prevent setDatabase starvation which leads to no updates of currently loaded database #613
Conversation
This is a preview of the changelog of the next release. If this branch is not up-to-date with the current main branch, the changelog may not be accurate. Rebase your branch on the main branch to get the most accurate changelog. Note that this might contain changes that are on main, but not yet released.
Changelog: 0.2.24 (2024-10-14)
Bug Fixes
Thanks! I still have a few questions:
- Is my understanding correct that what happened on PPX is that: a) SILO didn't manage to switch to the new version (for a reason I don't understand) and b) we somehow removed the data dir that was hence still in use (assuming the new one would be used, but it actually wasn't)?
- How does the switch of versions happen? (Maybe there are some docs, in that case no need to re-explain.) It's cool that you allow loading of new data without downtime; it's non-trivial, and there are quite a few pitfalls as a result.
- What happens to new requests that are received while the db is being loaded? Are they rejected until the load is successful? (You mention that the old version stays alive until the last request of the old version has completed.)
- What exactly happened in the PPX situation? The switch simply never happened? Why wasn't it retried? Will it be retried in the future, given this PR?
- We should not remove old data version dirs until the change has happened successfully; this is why we hit a SILO error on PPX instead of the issue just being that we show old data, IIUC. How can we tell that the old data version is no longer used, so that we can safely delete the unused one? (There's an issue for adding locks: Implement lock for currently loaded directories #612.)
We should create an issue here in SILO to link to, and link to the Loculus Slack discussion (instead of just saying that there is a discussion without an explicit link).
LGTM
Let's either create an issue that the changelog can reference or squash-merge this, such that the PR number ends up in the changelog.
SPDLOG_INFO(
   "New database with version {} successfully loaded.",
   "Did not load new database with version {} successfully.",
We have log statements in every catch. Should we maybe simply add a clarifying "failed to load new database with version xyz" to the SPDLOG_ERROR(ex.what())?
SILO checks a certain directory every two seconds. It expects data versions in there (folders named after a timestamp). If it finds a folder with a higher number than the currently loaded one, it attempts to load its contents.
They simply keep running. Loading the new DB happens in the background. Once the DB is loaded, the current instance is swapped out and new requests will see the new DB.
Is there something in the logs? It would actually be good to know.
Currently, you cannot tell from the filesystem which data version is loaded. You could ask the running SILO API.
Yes, that is correct.
To switch the dataset, the "switcher" requests a unique lock. It is guaranteed that when it gets the lock, all other threads that also want this lock will block until we release our unique lock.
Requests are still being served from the old state. The load and switch are two separate things.
The load of the new dataset was successful, as can be seen in the log messages, but the switcher was starved while waiting for its lock. It would retry if it failed, but the thread wanting to switch never actually failed; it was just waiting indefinitely for a lock. The requirement of a unique lock was unnecessarily expensive and complicated. Now, we no longer use a lock, but instead just atomically swap the old dataset for the new one. We do not really care whether a racy request still gets the old dataset (the header will show the updated dataset version after the switch was successful). Currently running queries rely on the fact that the data will stay in memory until they have finished running. This is achieved by changing the return type to a shared_ptr.
I added #612 for this.
@@ -137,6 +137,11 @@ void silo_api::DatabaseDirectoryWatcher::checkDirectoryForData(Poco::Timer& /*ti
try {
database_mutex.setDatabase(silo::Database::loadDatabaseState(most_recent_database_state->first
Is there a timeout on this? When the incident happened, is this where it got stuck indefinitely? Just to help me understand what happened.
Does this mean that swapping can be blocked if there happens to be continuously more than one active request? This would require a lot of requests and/or long ones, but could happen at high load. Update: IIUC, this is sort of what happened; at least, something prevented the lock from being acquired.
Does the new shared pointer implementation avoid the problem altogether, in that new requests are served with the new version as soon as it has loaded successfully, even if there are still active old-version requests? Just checking my understanding here.
void silo_api::DatabaseMutex::setDatabase(silo::Database&& new_database) {
   const std::unique_lock lock(mutex);
This is the line where it got stuck before. A log message before would have helped a lot: "Acquiring unique lock to switch out datasets"
This works as long as there's only one instance running; if there are multiple pods being load balanced, this would not be foolproof. But in our current PPX setup it seems like a useful thing to try.
Would it make sense to rename the classes to remove the
Yes, precisely. Instead of waiting for all requests to finish, new requests now immediately get the new version. The lifetime management is done by the shared_ptr. (Without it, we would not know when we could release, i.e. remove from memory, the old database.)
This might be too hard to do, but just as an idea, it'd be neat if it was possible to test that the version switch now works as expected:
- Loading version 1
- Faking/making running request A, that blocks until a later version becomes available
- Make version 2 available
- Make new request B and assert it uses version 2
- Assert that request A completes successfully with data from version 1
I just did that, it works. I added a 20-second sleep to the info endpoint and called it with curl, while preprocessing a new data version:
You can see that the request arrived before the new data version was loaded. Then the new data version was loaded successfully, then the request finished and still shows the old data version.
Nice @fengelniederhammer! I guess it's not straightforward to turn this into an automatic test 😄
I don't know how. You need to delay requests and make sure that the reload happens (and is finished!) during a running request. That seems pretty impossible to me.
Summary
Resolves an issue mentioned on Slack, where the swap-over to the latest folder can sometimes fail.
Before, the setDatabase call requested a write mutex. This was unnecessarily expensive. Instead, it is enough to just atomically load and store the current database. The lifetime of a database used by currently running transactions is now managed by a shared_ptr, which will correctly deallocate the old state once the last reference to the database is dropped (i.e. the last query using the old state has finished).
PR Checklist
- [ ] All necessary documentation has been adapted or there is an issue to do so.
- [ ] The implemented feature is covered by an appropriate test.