Race condition in Pool #1293

Closed · madadam opened this issue Jun 22, 2021 · 1 comment · Fixed by #1320

Comments

madadam (Contributor) commented Jun 22, 2021

I think I found a bug in Pool which sometimes causes unexpected PoolTimedOut errors. Here is a minimal reproducible test case:

#[tokio::test(flavor = "multi_thread")]
async fn race_condition_reproduction_case() {
    // a single connection and a short timeout make a missed wakeup
    // fail fast with PoolTimedOut
    let pool: sqlx::Pool<sqlx::Sqlite> = sqlx::pool::PoolOptions::new()
        .max_connections(1)
        .connect_timeout(std::time::Duration::from_secs(1))
        .connect(":memory:")
        .await
        .unwrap();

    sqlx::query("CREATE TABLE foo (name TEXT, value TEXT)")
        .execute(&pool)
        .await
        .unwrap();

    sqlx::query("CREATE TABLE bar (name TEXT, value TEXT)")
        .execute(&pool)
        .await
        .unwrap();
}

The failure is not deterministic - the test might need to be run several times before the failure is observed.

Note: I decreased the connect timeout to make the test fail faster, but the failure still occurs even with the default timeout. The failure is most likely when max_connections is 1, but I think it can still occur whenever the number of sequential queries exceeds max_connections.

What I think happens is a race condition. The first query acquires a connection and then releases it, but the release happens in a spawned task, so the second query might call acquire first. The acquire gets to this point (the check for an idle connection) before a context switch happens; then the release of the first query finishes and wakes a Waiter if there is one (there are none at this point). Control then switches back to the acquire, which runs the poll_fn and pushes a new waiter onto the queue. That waiter is never woken, because the wakeup already happened, and eventually the call times out.
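Here is one schedule that can produce the timeout, as I understand the code (the exact ordering is my reading, not something I traced):

    release (spawned task)                  acquire (second query)
    ----------------------                  ----------------------
                                            checks idle_conns: empty
    pushes the connection to idle_conns
    wakes the registered waiters (none)
                                            Waiter::push_new registers a waiter
                                            waiter is never woken; acquire
                                            eventually hits connect_timeout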

A simple fix which seems to work for me is to check whether idle_conns is still empty before pushing the waiter:

                future::poll_fn(|cx| -> Poll<()> {
                    if !self.idle_conns.is_empty() {
                        return Poll::Ready(());
                    }

                    let waiter = waiter.get_or_insert_with(|| Waiter::push_new(cx, &self.waiters));

                    // ...
                }),

EDIT: after a bit more testing, this doesn't seem to be a sufficient fix. The only synchronization between idle_conns.push and idle_conns.is_empty is via the waiters, but in cases like the one I described there are no waiters yet, so depending on timing the idle_conns.is_empty() call might still return true and the problem persists ☹️

EDIT 2: it seems that moving the idle_conns.is_empty() check after the let waiter = waiter.get_or_insert_with(... line does the trick. That appears to eliminate the last race condition - namely the one where the is_empty check runs first, then release runs and notifies the (still empty) waiter queue, and only then is Waiter::push_new called.
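For reference, a sketch of that reordering, adapted from the snippet above (the elided part is unchanged):

                future::poll_fn(|cx| -> Poll<()> {
                    // Register the waiter first, so a concurrent release either
                    // sees it and wakes it, or has already pushed an idle connection.
                    let waiter = waiter.get_or_insert_with(|| Waiter::push_new(cx, &self.waiters));

                    // Only then re-check for an idle connection; doing the check
                    // after push_new closes the window described above.
                    if !self.idle_conns.is_empty() {
                        return Poll::Ready(());
                    }

                    // ...
                }),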

abonander (Collaborator) commented

I opened #1320 to just redo the pool internals using futures_intrusive::Semaphore like I should've done a while ago.
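
For illustration, here is a minimal sketch of how a semaphore-based pool closes this race: the permit itself records "a connection is available", so a release can never be missed by a concurrent acquire. This is only a sketch, not the actual #1320 implementation - TinyPool, PoolGuard, and the field names are invented for the example, and it assumes futures_intrusive's sync::Semaphore API (Semaphore::new(is_fair, permits), acquire(n) returning an RAII SemaphoreReleaser).

use std::sync::Mutex;

use futures_intrusive::sync::{Semaphore, SemaphoreReleaser};

struct TinyPool<C> {
    // One permit per pooled connection; waiting for a permit and checking
    // for an idle connection become the same atomic operation.
    semaphore: Semaphore,
    idle: Mutex<Vec<C>>,
}

impl<C> TinyPool<C> {
    fn new(conns: Vec<C>) -> Self {
        let permits = conns.len();
        TinyPool {
            // `true` = fair (FIFO) wakeups.
            semaphore: Semaphore::new(true, permits),
            idle: Mutex::new(conns),
        }
    }

    async fn acquire(&self) -> PoolGuard<'_, C> {
        // If a release happened before we got here, the permit is already
        // available and this completes immediately - no lost wakeup.
        let permit = self.semaphore.acquire(1).await;
        let conn = self
            .idle
            .lock()
            .unwrap()
            .pop()
            .expect("a permit implies an idle connection");
        PoolGuard { pool: self, conn: Some(conn), _permit: permit }
    }
}

struct PoolGuard<'a, C> {
    pool: &'a TinyPool<C>,
    conn: Option<C>,
    // Field drops run after Drop::drop, so the permit is released (and a
    // waiter woken, if any) only after the connection is back in `idle`.
    _permit: SemaphoreReleaser<'a>,
}

impl<C> Drop for PoolGuard<'_, C> {
    fn drop(&mut self) {
        self.pool.idle.lock().unwrap().push(self.conn.take().unwrap());
    }
}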
