revert recent VirtualFile asyncification changes #5291

Merged
problame merged 2 commits into main from problame/revert-virtualfile-async-changes on Sep 12, 2023

Conversation

@problame (Contributor) commented on Sep 12, 2023

Motivation

We observed two "indigestion" events on staging, each shortly after restarting pageserver-0.eu-west-1.aws.neon.build. It has ~8k tenants.

The indigestion manifests as Timeline::get calls failing with the error `exceeded evict iter limit`.
The error comes from page_cache.rs: it was unable to find a free page and hence failed with that error.

The indigestion events started occurring after we started deploying builds that contained the following commits:

[~/src/neon]: git log --oneline c0ed362790caa368aa65ba57d352a2f1562fd6bf..15eaf78083ecff62b7669091da1a1c8b4f60ebf8
15eaf7808 Disallow block_in_place and Handle::block_on (#5101)
a18d6d9ae Make File opening in VirtualFile async-compatible (#5280)
76cc87398 Use tokio locks in VirtualFile and turn with_file into macro (#5247)

The second and third commit are interesting.
They add .await points to the VirtualFile code.

Background

On the read path, which is the dominant user of the page cache & VirtualFile during pageserver restart, Timeline::get, page_cache, and VirtualFile interact as follows:

  1. Timeline::get tries to read from a layer
  2. This read goes through the page cache.
  3. If we have a page miss (which is known to be common after restart), page_cache uses find_victim to find an empty slot, and once it has found a slot, it gives exclusive ownership of it to the caller through a PageWriteGuard.
  4. The caller is supposed to fill the write guard with data from the underlying backing store, i.e., the layer VirtualFile.
  5. So, we call into `VirtualFile::read_at` to fill the write guard.

The find_victim method finds an empty slot using a basic implementation of the clock page replacement algorithm.
Slots that are currently in use (PageReadGuard / PageWriteGuard) cannot become victims.
If there have been too many iterations, find_victim gives up with the error `exceeded evict iter limit`.
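
For illustration, here is a minimal, self-contained sketch of a clock-style victim search with an iteration limit, in the spirit of what is described above; all names, types, and constants are made up for the example and are not the actual page_cache.rs code:

```rust
use std::sync::atomic::{AtomicU8, AtomicUsize, Ordering};
use std::sync::RwLock;

const NUM_SLOTS: usize = 8;
const MAX_EVICT_ITER: usize = 10 * NUM_SLOTS;

struct Slot {
    usage_count: AtomicU8,
    // A held lock stands in for a live PageReadGuard / PageWriteGuard:
    // such a slot is pinned and cannot become a victim.
    data: RwLock<Option<Vec<u8>>>,
}

struct Cache {
    slots: Vec<Slot>,
    clock_hand: AtomicUsize,
}

impl Cache {
    fn find_victim(&self) -> Result<usize, &'static str> {
        let mut iters = 0;
        loop {
            iters += 1;
            if iters > MAX_EVICT_ITER {
                // the error that bubbles up to Timeline::get during the indigestion
                return Err("exceeded evict iter limit");
            }
            let idx = self.clock_hand.fetch_add(1, Ordering::Relaxed) % NUM_SLOTS;
            let slot = &self.slots[idx];
            // Pinned slots are skipped: try_write() fails while a guard is alive.
            let Ok(mut data) = slot.data.try_write() else { continue };
            // Second chance: only a slot whose usage count has decayed to zero
            // becomes the victim; otherwise decrement and keep scanning.
            if slot.usage_count.load(Ordering::Relaxed) > 0 {
                slot.usage_count.fetch_sub(1, Ordering::Relaxed);
                continue;
            }
            *data = None; // evict; the caller now owns slot `idx` exclusively
            return Ok(idx);
        }
    }
}

fn main() {
    let cache = Cache {
        slots: (0..NUM_SLOTS)
            .map(|_| Slot { usage_count: AtomicU8::new(0), data: RwLock::new(None) })
            .collect(),
        clock_hand: AtomicUsize::new(0),
    };
    println!("victim slot: {:?}", cache.find_victim());
}
```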

Root Cause For Indigestion

The second and third commit quoted in the "Motivation" section introduced .await points in the VirtualFile code.
These enable tokio to preempt us and schedule another future while we hold the PageWriteGuard and are calling VirtualFile::read_at.
This was not possible before these commits, because there simply were no await points that weren't Poll::Ready immediately.
With the offending commits, there is now actual usage of tokio::sync::RwLock to protect the VirtualFile file descriptor cache.
And we know from other experiments that, during the post-restart "rush", the VirtualFile fd cache is too small, i.e., all slots are taken by ongoing VirtualFile operations and cannot be victims.
So, we can assume that the RwLock::write().await calls in VirtualFile's find_victim_slot will yield control to the executor.

The above can lead to a pathological situation if we have N runnable tokio tasks, each wanting to do Timeline::get, but only M page cache slots, with N >> M.
Suppose M of the N tasks win a PageWriteGuard and get preempted at some .await point inside VirtualFile::read_at.
Now suppose tokio schedules the remaining N-M tasks for fairness, then schedules the first M tasks again.
Each of the N-M tasks will run find_victim() until it hits the exceeded evict iter limit.
Why? Because the first M tasks took all the slots and are still holding them tight through their PageWriteGuard.
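
The scenario is easy to reproduce in miniature. The following self-contained toy (made-up names and numbers, not pageserver code) spawns N tasks competing for M slots; the M winners hold their guard across an .await, and the rest spin over the slots until they give up, typically printing M Ok(..) results and N-M `exceeded evict iter limit` errors:

```rust
use std::sync::Arc;
use tokio::sync::Mutex;
use tokio::time::{sleep, Duration};

#[tokio::main(flavor = "current_thread")]
async fn main() {
    const M: usize = 4; // "page cache slots"
    const N: usize = 16; // runnable tasks that all want a slot
    let slots: Arc<Vec<Mutex<()>>> = Arc::new((0..M).map(|_| Mutex::new(())).collect());

    let mut handles = Vec::new();
    for task in 0..N {
        let slots = Arc::clone(&slots);
        handles.push(tokio::spawn(async move {
            // "find_victim": scan for an unpinned slot, give up after a few laps.
            for _ in 0..10 * M {
                for slot in slots.iter() {
                    if let Ok(_guard) = slot.try_lock() {
                        // Holding the guard across an .await pins the slot, just
                        // like a PageWriteGuard held across the new await points
                        // inside VirtualFile::read_at.
                        sleep(Duration::from_millis(10)).await;
                        return Ok(task);
                    }
                }
                tokio::task::yield_now().await;
            }
            Err("exceeded evict iter limit")
        }));
    }
    for h in handles {
        println!("{:?}", h.await.unwrap());
    }
}
```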

The result is massive wastage of CPU time in find_victim().
The effort to find a page is futile, but each of the N-M tasks still attempts it.

This delays the time when tokio gets around to schedule the first M tasks again.
Eventually, tokio will schedule them, they will make progress, fill the PageWriteGuard, release it.
But in the meantime, the N-M tasks have already bailed with error exceeded evict iter limit.

Eventually, higher level mechanisms will retry for the N-M tasks, and this time, there won't be as many concurrent tasks wanting to do Timeline::get.
So, it will shake out.

But, it's a massive indigestion until then.

This PR

This PR reverts the offending commits until we find a proper fix.

    Revert "Use tokio locks in VirtualFile and turn with_file into macro (#5247)"
    
    This reverts commit 76cc87398c58aa8856083bd3b17403af56715b17.


    Revert "Make File opening in VirtualFile async-compatible (#5280)"
    
    This reverts commit a18d6d9ae3e1f395042eed39f06685dcaeb2ecc6.

@problame problame marked this pull request as ready for review September 12, 2023 15:12
@problame problame requested review from a team as code owners September 12, 2023 15:12
@problame problame requested review from bojanserafimov, hlinnaka, koivunej and arpad-m and removed request for a team and bojanserafimov September 12, 2023 15:12
@github-actions

2466 tests run: 2353 passed, 0 failed, 113 skipped (full report)


Flaky tests (3)

Postgres 16

  • test_partial_evict_tenant: release

Postgres 14

  • test_download_remote_layers_api[local_fs]: debug
  • test_get_tenant_size_with_multiple_branches: release

Code coverage (full report)

  • functions: 53.1% (7668 of 14453 functions)
  • lines: 81.0% (44780 of 55281 lines)

The comment gets automatically updated with the latest test results
f781df5 at 2023-09-12T15:15:51.308Z :recycle:

@arpad-m (Member) left a comment

I've read through the writeup and I think I understand it well enough to agree that this is the probable cause, most importantly the slot_guard = slot.inner.write().await; case.

We added new await points, and they revealed this issue once we were in highly I/O-bound situations with many parallel tasks.

@problame problame merged commit ab1f37e into main Sep 12, 2023
@problame problame deleted the problame/revert-virtualfile-async-changes branch September 12, 2023 15:38
arpad-m pushed a commit that referenced this pull request Sep 29, 2023
#5319)

It is wasteful to cycle through the page cache slots trying to find a
victim slot if all the slots are currently un-evictable because a read /
write guard is alive.

We suspect this wasteful cycling to be the root cause for an
"indigestion" we observed in staging (#5291).
The hypothesis is that we `.await` after we get ahold of a read / write
guard, and that tokio actually deschedules us in favor of another
future.
If that other future then needs a page slot, it can't get ours because
we're holding the guard.
Repeat this, and eventually, the other future(s) will find themselves
doing `find_victim` until they hit `exceeded evict iter limit`.

The `find_victim` is wasteful and CPU-starves the futures that are
already holding the read/write guard. A `yield` inside `find_victim`
could mitigate the starvation, but wouldn't fix the wasting of CPU
cycles.

So instead, this PR queues waiters behind a tokio semaphore that counts
evictable slots.
The downside is that this stops the clock page replacement if we have 0
evictable slots.

Also, as explained by the big block comment in `find_victims`, the
semaphore doesn't fully prevent starvation, because we can't make
tokio prioritize those tasks executing `find_victim` that have been
trying the longest.

Implementation
===============
We need to acquire the semaphore permit before locking the slot.
Otherwise, we could deadlock / discover that all permits are gone and
would have to relinquish the slot, having moved forward the Clock LRU
without making progress.
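
A minimal sketch of that ordering, using a plain tokio::sync::Semaphore and illustrative names (not the actual patch):

```rust
use tokio::sync::{Semaphore, SemaphorePermit};

struct PageCache {
    // One permit per slot that is currently evictable (i.e., no live
    // PageReadGuard / PageWriteGuard).
    evictable_slots: Semaphore,
}

impl PageCache {
    async fn find_victim(&self) -> SemaphorePermit<'_> {
        // Acquire the permit *before* touching the clock hand or any slot lock.
        // With zero evictable slots, waiters queue here instead of spinning
        // until `exceeded evict iter limit`.
        let permit = self
            .evictable_slots
            .acquire()
            .await
            .expect("semaphore is never closed in this sketch");
        // ... the clock scan over the slots would run only now, with the permit
        // held, so the clock hand never advances without being able to make
        // progress.
        permit
    }
}

#[tokio::main]
async fn main() {
    let cache = PageCache { evictable_slots: Semaphore::new(8) };
    let permit = cache.find_victim().await;
    // Dropping a page guard in the real code makes the slot evictable again;
    // in this sketch, that corresponds to releasing the permit.
    drop(permit);
}
```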

The downside is that we never get full throughput for read-heavy
workloads, because, until the reader coalesces onto an existing permit,
it'll hold its own permit.


Addendum To Root-Cause Analysis In #5291
========================================

Since merging that PR, @arpad-m pointed out that we couldn't have
reached the `slot.write().await` with his patches: the VirtualFile
slots can't all have been write-locked, because we only hold
them locked while the IO is ongoing, and the IO is still done with
synchronous system calls in that patch set. So, we can have had at most
$number_of_executor_threads slots locked at any given time.
I count 3 tokio runtimes that do `Timeline::get`, each with 8 executor
threads in our deployment => $number_of_executor_threads = 3*8 = 24.
But the virtual file cache has 100 slots.

We both agree that nothing changed about the core hypothesis, i.e.,
additional await points inside VirtualFile caused higher concurrency,
resulting in exhaustion of page cache slots.
But we'll need to reproduce the issue and investigate further to truly
understand the root cause, or find out whether (and why) we were indeed
using all 100 VirtualFile slots.

TODO: could it be compaction that needs to hold guards on many
VirtualFiles in its iterators?
problame added a commit that referenced this pull request Oct 4, 2023
problame added a commit that referenced this pull request Oct 5, 2023
problame added a commit that referenced this pull request Oct 10, 2023
problame added a commit that referenced this pull request Oct 26, 2023
problame added a commit that referenced this pull request Nov 8, 2023
problame added a commit that referenced this pull request Nov 8, 2023
problame added a commit that referenced this pull request Nov 20, 2023
problame added a commit that referenced this pull request Nov 28, 2023
problame added a commit that referenced this pull request Nov 28, 2023
problame added a commit that referenced this pull request Nov 29, 2023
problame added a commit that referenced this pull request Nov 29, 2023
problame added a commit that referenced this pull request Nov 29, 2023
Squashed commit of the following:

commit 5ec61ce
Author: Christian Schwarz <christian@neon.tech>
Date:   Wed Nov 29 16:17:12 2023 +0000

    bump

commit 34c33d1
Author: Christian Schwarz <me@cschwarz.com>
Date:   Mon Nov 20 14:38:29 2023 +0000

    bump

commit 8fa6b76
Author: Christian Schwarz <me@cschwarz.com>
Date:   Mon Nov 20 11:47:19 2023 +0000

    bump

commit 6c359a4
Author: Christian Schwarz <me@cschwarz.com>
Date:   Mon Nov 20 11:33:58 2023 +0000

    use neondatabase/tokio-epoll-uring#25

commit 7d484b0
Author: Christian Schwarz <me@cschwarz.com>
Date:   Tue Aug 29 19:13:38 2023 +0000

    use WIP tokio_epoll_uring open_at for async VirtualFile::open

    This makes Delta/Image ::load fns fully tokio-epoll-uring

commit 51b26b1
Author: Christian Schwarz <me@cschwarz.com>
Date:   Tue Aug 29 12:24:30 2023 +0000

    use `tokio_epoll_uring` for read path

commit a4e6f0c
Author: Christian Schwarz <me@cschwarz.com>
Date:   Wed Nov 8 12:36:34 2023 +0000

    Revert "revert recent VirtualFile asyncification changes (#5291)"

    This reverts commit ab1f37e.

    fixes #5479
problame added a commit that referenced this pull request Dec 1, 2023
problame added a commit that referenced this pull request Dec 1, 2023
This reverts commit ab1f37e.

fixes #5479

(cherry picked from commit 25e005c)
problame added a commit that referenced this pull request Dec 7, 2023
problame added a commit that referenced this pull request Dec 8, 2023
problame added a commit that referenced this pull request Dec 11, 2023
problame added a commit that referenced this pull request Dec 11, 2023
problame added a commit that referenced this pull request Jan 9, 2024
problame added a commit that referenced this pull request Jan 9, 2024
problame added a commit that referenced this pull request Jan 9, 2024
problame added a commit that referenced this pull request Jan 9, 2024
problame added a commit that referenced this pull request Jan 9, 2024
problame added a commit that referenced this pull request Jan 11, 2024
)

This reverts commit ab1f37e.
Thereby
fixes #5479

Updated Analysis
================

The problem with the original patch was that it, for the first time,
exposed the `VirtualFile` code to tokio task concurrency instead of just
thread-based concurrency. That caused the VirtualFile file descriptor
cache to start thrashing, effectively grinding the system to a halt.

Details
-------

At the time of the original patch, we had a _lot_ of runnable tasks in
the pageserver.
The symptom that prompted the revert (now being reverted in this PR) is
that our production systems fell into a valley of zero goodput, high
CPU, and zero disk IOPS shortly after PS restart.
We lay out the root cause for that behavior in this subsection.

At the time, there was no concurrency limit on the number of concurrent
initial logical size calculations.
Initial size calculation was initiated for all timelines within the
first 10 minutes as part of consumption metrics collection.
On a PS with 20k timelines, we'd thus have 20k runnable tasks.

Before the original patch, the `VirtualFile` code never returned
`Poll::Pending`.
That meant that once we entered it, the calling tokio task would not
yield to the tokio executor until we were done performing the
VirtualFile operation, i.e., doing a blocking IO system call.

The original patch switched the VirtualFile file descriptor cache's
synchronization primitives to those from `tokio::sync`.
It did not change that we were doing synchronous IO system calls.
And the cache had more slots than we have tokio executor threads.
So, these primitives never actually needed to return `Poll::Pending`.
But, the tokio scheduler makes tokio sync primitives return `Pending`
*artificially*, as a mechanism for the scheduler to get back into
control more often
([example](https://docs.rs/tokio/1.35.1/src/tokio/sync/batch_semaphore.rs.html#570)).
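
For illustration, a self-contained example of that behavior (not pageserver code): the lock below is never contended, yet the loop still yields periodically because tokio charges each acquisition against the task's cooperative budget (on the order of a hundred operations per scheduling slice):

```rust
use tokio::sync::Mutex;

#[tokio::main]
async fn main() {
    let lock = Mutex::new(0u64);

    // Uncontended, but some of these awaits still return Poll::Pending once
    // the cooperative budget is spent, handing control back to the scheduler.
    for _ in 0..10_000 {
        *lock.lock().await += 1;
    }

    // Opting out with tokio::task::unconstrained disables the budget, so the
    // same loop runs without any artificial yields.
    tokio::task::unconstrained(async {
        for _ in 0..10_000 {
            *lock.lock().await += 1;
        }
    })
    .await;

    println!("final value: {}", *lock.lock().await);
}
```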

So, the new reality was that VirtualFile calls could now yield to the
tokio executor.
Tokio would pick one of the other 19999 runnable tasks to run.
These tasks were also using VirtualFile.
So, we now had a lot more concurrency in that area of the code.

The problem with more concurrency was that caches started thrashing,
most notably the VirtualFile file descriptor cache: each time a task
would be rescheduled, it would want to do its next VirtualFile
operation. For that, it would first need to evict another (task's)
VirtualFile fd from the cache to make room for its own fd. It would then
do one VirtualFile operation before hitting an await point and yielding
to the executor again. The executor would run the other 19999 tasks for
fairness before circling back to the first task, which would find its fd
evicted.
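
A toy model of that access pattern, using the slot and task counts quoted in this thread (illustrative only, not pageserver code), shows how the fd cache degenerates to a 100% miss rate once the number of interleaved tasks exceeds the number of slots:

```rust
use std::collections::VecDeque;

/// Misses for `tasks` tasks doing `ops_per_task` fd-cache accesses each,
/// interleaved round-robin, against an LRU cache with `cache_slots` entries.
fn misses(cache_slots: usize, tasks: usize, ops_per_task: usize) -> usize {
    let mut cache: VecDeque<usize> = VecDeque::new(); // front = next eviction victim
    let mut misses = 0;
    for _ in 0..ops_per_task {
        // tokio runs the other runnable tasks "for fairness" before circling
        // back, so the accesses interleave round-robin like this.
        for task in 0..tasks {
            if let Some(pos) = cache.iter().position(|&fd| fd == task) {
                let fd = cache.remove(pos).unwrap();
                cache.push_back(fd); // hit: refresh recency
            } else {
                misses += 1; // miss: an open(2) in the real system
                if cache.len() == cache_slots {
                    cache.pop_front(); // evict another task's fd
                }
                cache.push_back(task);
            }
        }
    }
    misses
}

fn main() {
    // 100 fd slots vs. ~24 executor threads' worth of interleaving:
    // only the 24 cold-start misses.
    println!("24 tasks:    {} misses", misses(100, 24, 10));
    // 100 fd slots vs. 20k runnable tasks: every access is a miss.
    println!("20000 tasks: {} misses", misses(100, 20_000, 10));
}
```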

The other cache that would theoretically be impacted in a similar way is
the pageserver's `PageCache`.
However, for initial logical size calculation, it seems much less
relevant in experiments, likely because of that workload's random access
pattern.

Fixes
=====

We fixed the above problems by
- raising VirtualFile cache sizes
  - neondatabase/cloud#8351
- changing code to ensure forward-progress once cache slots have been
acquired
  - #5480
  - #5482
  - tbd: #6065
- reducing the amount of runnable tokio tasks
  - #5578
  - #6000
- fix bugs that caused unnecessary concurrency induced by connection
handlers
  - #5993

I manually verified that this PR doesn't negatively affect startup
performance as follows:
create a pageserver in production configuration, with 20k
tenants/timelines, 9 tiny L0 layer files each; start it, and observe

```
INFO Startup complete (368.009s since start) elapsed_ms=368009
```

I further verified in that same setup that, when using `pagebench`'s
getpage benchmark at as-fast-as-possible request rate against 5k of the
20k tenants, the achieved throughput is identical. The VirtualFile cache
isn't thrashing in that case.

Future Work
===========

We are still exposed to the cache thrashing risk from outside factors,
e.g., request concurrency is unbounded, and initial size calculation
skips the concurrency limiter when we establish a walreceiver
connection.

Once we start thrashing, we will degrade non-gracefully, i.e., encounter
a valley as was seen with the original patch.

However, we have sufficient means to deal with that unlikely situation:
1. we have dashboards & metrics to monitor & alert on cache thrashing
2. we can react by scaling the bottleneck resources (cache size) or by
manually shedding load through tenant relocation

Potential systematic solutions are future work:
* global concurrency limiting
* per-tenant rate limiting => #5899
* pageserver-initiated load shedding

Related Issues
==============

This PR unblocks the introduction of tokio-epoll-uring for asynchronous
disk IO ([Epic](#4744)).
problame added a commit that referenced this pull request Jan 11, 2024
problame added a commit that referenced this pull request Jan 11, 2024
problame added a commit that referenced this pull request Jan 11, 2024