Race condition in Pool #1293

Closed · madadam opened this issue Jun 22, 2021 · 1 comment · Fixed by #1320

Comments

madadam (Contributor) commented Jun 22, 2021

I think I found a bug in Pool which sometimes causes unexpected PoolTimedOut errors. Here is a minimal reproducible test case:

#[tokio::test(flavor = "multi_thread")]
async fn race_condition_reproduction_case() {
    // a single connection and a short timeout make a missed wakeup
    // fail fast with PoolTimedOut
    let pool: sqlx::Pool<sqlx::Sqlite> = sqlx::pool::PoolOptions::new()
        .max_connections(1)
        .connect_timeout(std::time::Duration::from_secs(1))
        .connect(":memory:")
        .await
        .unwrap();

    sqlx::query("CREATE TABLE foo (name TEXT, value TEXT)")
        .execute(&pool)
        .await
        .unwrap();

    sqlx::query("CREATE TABLE bar (name TEXT, value TEXT)")
        .execute(&pool)
        .await
        .unwrap();
}

The failure is not deterministic - the test might need to be run several times before the failure is observed.

Note: I decreased the connect timeout to make the test fail faster, but the failure still occurs even with the default timeout. The failure is most likely when max_connections is 1, but I think it can still occur whenever the number of sequential queries exceeds max_connections.

What I think happens is a race condition. The first query acquires a connection and then releases it, but the release happens in a spawned task, so the second query might call acquire first. The acquire gets to this point (the check for an idle connection) before a context switch happens; then the release of the first query finishes and wakes a Waiter if there is one (there are none at this point). Control then switches back to the acquire, which runs the poll_fn and pushes a new waiter onto the queue. That waiter is never woken, because the wakeup already happened, and eventually the call times out.
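Here is one schedule that can produce the timeout, as I understand the code (the exact ordering is my reading, not something I traced):

    release (spawned task)                  acquire (second query)
    ----------------------                  ----------------------
                                            checks idle_conns: empty
    pushes the connection to idle_conns
    wakes the registered waiters (none)
                                            Waiter::push_new registers a waiter
                                            waiter is never woken; acquire
                                            eventually hits connect_timeout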

A simple fix which seems to work for me is to check whether idle_conns is still empty before pushing the waiter:

                future::poll_fn(|cx| -> Poll<()> {
                    if !self.idle_conns.is_empty() {
                        return Poll::Ready(());
                    }

                    let waiter = waiter.get_or_insert_with(|| Waiter::push_new(cx, &self.waiters));

                    // ...
                }),

EDIT: after a bit more testing, this doesn't seem to be a sufficient fix. The only synchronization between idle_conns.push and idle_conns.is_empty is via the waiters, but in cases like the one I described there are no waiters yet, so depending on timing the idle_conns.is_empty() call might still return true and the problem persists ☹️

EDIT 2: it seems that moving the idle_conns.is_empty() check after the let waiter = waiter.get_or_insert_with(... line does the trick. That appears to eliminate the last race condition - namely the one where the is_empty check runs first, then release runs and notifies the (still empty) waiter queue, and only then is Waiter::push_new called.
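For reference, a sketch of that reordering, adapted from the snippet above (the elided part is unchanged):

                future::poll_fn(|cx| -> Poll<()> {
                    // Register the waiter first, so a concurrent release either
                    // sees it and wakes it, or has already pushed an idle connection.
                    let waiter = waiter.get_or_insert_with(|| Waiter::push_new(cx, &self.waiters));

                    // Only then re-check for an idle connection; doing the check
                    // after push_new closes the window described above.
                    if !self.idle_conns.is_empty() {
                        return Poll::Ready(());
                    }

                    // ...
                }),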

abonander (Collaborator) commented

I opened #1320 to just redo the pool internals using futures_intrusive::Semaphore like I should've done a while ago.
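
For illustration, here is a minimal sketch of how a semaphore-based pool closes this race: the permit itself records "a connection is available", so a release can never be missed by a concurrent acquire. This is only a sketch, not the actual #1320 implementation - TinyPool, PoolGuard, and the field names are invented for the example, and it assumes futures_intrusive's sync::Semaphore API (Semaphore::new(is_fair, permits), acquire(n) returning an RAII SemaphoreReleaser).

use std::sync::Mutex;

use futures_intrusive::sync::{Semaphore, SemaphoreReleaser};

struct TinyPool<C> {
    // One permit per pooled connection; waiting for a permit and checking
    // for an idle connection become the same atomic operation.
    semaphore: Semaphore,
    idle: Mutex<Vec<C>>,
}

impl<C> TinyPool<C> {
    fn new(conns: Vec<C>) -> Self {
        let permits = conns.len();
        TinyPool {
            // `true` = fair (FIFO) wakeups.
            semaphore: Semaphore::new(true, permits),
            idle: Mutex::new(conns),
        }
    }

    async fn acquire(&self) -> PoolGuard<'_, C> {
        // If a release happened before we got here, the permit is already
        // available and this completes immediately - no lost wakeup.
        let permit = self.semaphore.acquire(1).await;
        let conn = self
            .idle
            .lock()
            .unwrap()
            .pop()
            .expect("a permit implies an idle connection");
        PoolGuard { pool: self, conn: Some(conn), _permit: permit }
    }
}

struct PoolGuard<'a, C> {
    pool: &'a TinyPool<C>,
    conn: Option<C>,
    // Field drops run after Drop::drop, so the permit is released (and a
    // waiter woken, if any) only after the connection is back in `idle`.
    _permit: SemaphoreReleaser<'a>,
}

impl<C> Drop for PoolGuard<'_, C> {
    fn drop(&mut self) {
        self.pool.idle.lock().unwrap().push(self.conn.take().unwrap());
    }
}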
