revert recent VirtualFile asyncification changes #5291
Conversation
This reverts commit a18d6d9.
2466 tests run: 2353 passed, 0 failed, 113 skipped (full report)
Code coverage (full report)
The comment gets automatically updated with the latest test results.
f781df5 at 2023-09-12T15:15:51.308Z :recycle:
I've read through the writeup and I think I understand it well enough to agree that this is the probable cause, most importantly the `slot_guard = slot.inner.write().await;` case.
We added new await points, but they only revealed this issue in highly I/O-bound situations with many parallel tasks.
(#5319) It is wasteful to cycle through the page cache slots trying to find a victim slot if all the slots are currently un-evictable because a read/write guard is alive. We suspect this wasteful cycling to be the root cause of an "indigestion" we observed in staging (#5291). The hypothesis is that we `.await` after we get ahold of a read/write guard, and that tokio actually deschedules us in favor of another future. If that other future then needs a page slot, it can't get ours because we're holding the guard. Repeat this, and eventually the other future(s) will find themselves doing `find_victim` until they hit `exceeded evict iter limit`. The `find_victim` is wasteful and CPU-starves the futures that are already holding the read/write guard. A `yield` inside `find_victim` could mitigate the starvation, but wouldn't fix the wasting of CPU cycles. So instead, this PR queues waiters behind a tokio semaphore that counts evictable slots. The downside is that this stops the clock page replacement if we have 0 evictable slots. Also, as explained by the big block comment in `find_victims`, the semaphore doesn't fully prevent starvation, because we can't make tokio prioritize those tasks executing `find_victim` that have been trying the longest.

Implementation
==============

We need to acquire the semaphore permit before locking the slot. Otherwise, we could deadlock / discover that all permits are gone and would have to relinquish the slot, having moved the clock LRU forward without making progress. The downside is that we never get full throughput for read-heavy workloads, because, until the reader coalesces onto an existing permit, it holds its own permit.

Addendum To Root-Cause Analysis In #5291
========================================

Since merging that PR, @arpad-m pointed out that we couldn't have reached the `slot.write().await` with his patches, because the VirtualFile slots can't have all been write-locked: we only hold them locked while the IO is ongoing, and the IO is still done with synchronous system calls in that patch set, so we can have had at most $number_of_executor_threads locked at any given time. I count 3 tokio runtimes that do `Timeline::get`, each with 8 executor threads in our deployment => $number_of_executor_threads = 3*8 = 24. But the virtual file cache has 100 slots. We both agree that nothing changed about the core hypothesis, i.e., additional await points inside VirtualFile caused higher concurrency, resulting in exhaustion of page cache slots. But we'll need to reproduce the issue and investigate further to truly understand the root cause, or find out that & why we were indeed using 100 VirtualFile slots. TODO: could it be compaction that needs to hold guards of many VirtualFile's in its iterators?
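As a rough illustration of the queuing scheme described above, here is a minimal sketch using `tokio::sync::Semaphore`. The type and method names (`PageCache`, `acquire_slot`) are made up for illustration and do not match the real `page_cache.rs`; the point is only "acquire the permit before touching any slot":

```rust
use tokio::sync::{Semaphore, SemaphorePermit};

// Hypothetical simplification: the semaphore's permit count mirrors the
// number of currently evictable page-cache slots.
struct PageCache {
    evictable: Semaphore,
    // ... slot array elided ...
}

impl PageCache {
    fn new(num_slots: usize) -> Self {
        // Initially every slot is evictable, so every slot contributes a permit.
        Self { evictable: Semaphore::new(num_slots) }
    }

    /// Waiters queue here instead of spinning in `find_victim`.
    async fn acquire_slot(&self) -> SemaphorePermit<'_> {
        // Take the permit *before* touching any slot lock, so we never sit on
        // a slot while discovering that no evictable slot exists.
        let permit = self
            .evictable
            .acquire()
            .await
            .expect("semaphore is never closed in this sketch");
        // ... the real code would now run the clock hand over the slots;
        // a victim is guaranteed to exist because we hold a permit ...
        permit
    }
}

#[tokio::main]
async fn main() {
    let cache = PageCache::new(100);
    let permit = cache.acquire_slot().await;
    // Dropping the permit models the slot becoming evictable again.
    drop(permit);
}
```

Dropping the returned permit stands in for the slot's read/write guard being released, at which point the slot counts as evictable again.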
…)"" This reverts commit 0b30036.
…)"" This reverts commit fef7018.
Squashed commit of the following:

commit 5ec61ce
Author: Christian Schwarz <christian@neon.tech>
Date: Wed Nov 29 16:17:12 2023 +0000

    bump

commit 34c33d1
Author: Christian Schwarz <me@cschwarz.com>
Date: Mon Nov 20 14:38:29 2023 +0000

    bump

commit 8fa6b76
Author: Christian Schwarz <me@cschwarz.com>
Date: Mon Nov 20 11:47:19 2023 +0000

    bump

commit 6c359a4
Author: Christian Schwarz <me@cschwarz.com>
Date: Mon Nov 20 11:33:58 2023 +0000

    use neondatabase/tokio-epoll-uring#25

commit 7d484b0
Author: Christian Schwarz <me@cschwarz.com>
Date: Tue Aug 29 19:13:38 2023 +0000

    use WIP tokio_epoll_uring open_at for async VirtualFile::open

    This makes Delta/Image ::load fns fully tokio-epoll-uring

commit 51b26b1
Author: Christian Schwarz <me@cschwarz.com>
Date: Tue Aug 29 12:24:30 2023 +0000

    use `tokio_epoll_uring` for read path

commit a4e6f0c
Author: Christian Schwarz <me@cschwarz.com>
Date: Wed Nov 8 12:36:34 2023 +0000

    Revert "revert recent VirtualFile asyncification changes (#5291)"

    This reverts commit ab1f37e.

fixes #5479
…)"" This reverts commit c06279a.
…)"" This reverts commit c06279a.
This reverts commit ab1f37e. Thereby fixes #5479.

Updated Analysis
================

The problem with the original patch was that it, for the first time, exposed the `VirtualFile` code to tokio task concurrency instead of just thread-based concurrency. That caused the VirtualFile file descriptor cache to start thrashing, effectively grinding the system to a halt.

Details
-------

At the time of the original patch, we had a _lot_ of runnable tasks in the pageserver. The symptom that prompted the revert (now being reverted in this PR) is that our production systems fell into a valley of zero goodput, high CPU, and zero disk IOPS shortly after PS restart. We lay out the root cause for that behavior in this subsection.

At the time, there was no concurrency limit on the number of concurrent initial logical size calculations. Initial size calculation was initiated for all timelines within the first 10 minutes as part of consumption metrics collection. On a PS with 20k timelines, we'd thus have 20k runnable tasks.

Before the original patch, the `VirtualFile` code never returned `Poll::Pending`. That meant that once we entered it, the calling tokio task would not yield to the tokio executor until we were done performing the VirtualFile operation, i.e., doing a blocking IO system call.

The original patch switched the VirtualFile file descriptor cache's synchronization primitives to those from `tokio::sync`. It did not change that we were doing synchronous IO system calls, and the cache had more slots than we have tokio executor threads. So, these primitives never actually needed to return `Poll::Pending`. But the tokio scheduler makes tokio sync primitives return `Pending` *artificially*, as a mechanism for the scheduler to get back into control more often ([example](https://docs.rs/tokio/1.35.1/src/tokio/sync/batch_semaphore.rs.html#570)).

So, the new reality was that VirtualFile calls could now yield to the tokio executor. Tokio would pick one of the other 19999 runnable tasks to run. These tasks were also using VirtualFile. So, we now had a lot more concurrency in that area of the code.

The problem with more concurrency was that caches started thrashing, most notably the VirtualFile file descriptor cache: each time a task would be rescheduled, it would want to do its next VirtualFile operation. For that, it would first need to evict another (task's) VirtualFile fd from the cache to make room for its own fd. It would then do one VirtualFile operation before hitting an await point and yielding to the executor again. The executor would run the other 19999 tasks for fairness before circling back to the first task, which would find its fd evicted.

The other cache that would theoretically be impacted in a similar way is the pageserver's `PageCache`. However, for initial logical size calculation, it seems much less relevant in experiments, likely because of the random access nature of initial logical size calculation.
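To make the "artificial `Pending`" point concrete, here is a small demonstration (not pageserver code) of how even an uncontended `tokio::sync::RwLock` acquisition is an await point at which tokio's cooperative budget can deschedule the task:

```rust
use tokio::sync::RwLock;

// Illustrative only: even an *uncontended* tokio::sync::RwLock has to be
// `.await`ed, and tokio's cooperative budget can turn that await into a real
// yield, letting other runnable tasks in.
#[tokio::main(flavor = "current_thread")]
async fn main() {
    let lock = RwLock::new(0u64);

    // Stand-in for "one of the other 19999 runnable tasks".
    let other_task = tokio::spawn(async {
        println!("other task got scheduled");
    });

    // Stand-in for a task doing many fd-cache operations back to back.
    for _ in 0..10_000 {
        let mut guard = lock.write().await; // await point: the scheduler may switch tasks here
        *guard += 1;
    }

    other_task.await.unwrap();
}
```

With `std::sync`-style locking there is no await point in that loop, so on a single executor thread the spawned task could not run until the loop finished; with `tokio::sync`, the runtime gets opportunities to interleave other tasks.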
Fixes
=====

We fixed the above problems by

- raising VirtualFile cache sizes
  - neondatabase/cloud#8351
- changing code to ensure forward-progress once cache slots have been acquired
  - #5480
  - #5482
  - tbd: #6065
- reducing the amount of runnable tokio tasks
  - #5578
  - #6000
- fixing bugs that caused unnecessary concurrency induced by connection handlers
  - #5993

I manually verified that this PR doesn't negatively affect startup performance as follows: create a pageserver in production configuration, with 20k tenants/timelines, 9 tiny L0 layer files each; start it, and observe

```
INFO Startup complete (368.009s since start) elapsed_ms=368009
```

I further verified in that same setup that, when using `pagebench`'s getpage benchmark at as-fast-as-possible request rate against 5k of the 20k tenants, the achieved throughput is identical. The VirtualFile cache isn't thrashing in that case.

Future Work
===========

We are still exposed to the cache-thrashing risk from outside factors, e.g., request concurrency is unbounded, and initial size calculation skips the concurrency limiter when we establish a walreceiver connection. Once we start thrashing, we will degrade non-gracefully, i.e., encounter a valley as was seen with the original patch. However, we have sufficient means to deal with that unlikely situation:

1. we have dashboards & metrics to monitor & alert on cache thrashing
2. we can react by scaling the bottleneck resources (cache size) or by manually shedding load through tenant relocation

Potential systematic solutions are future work:

* global concurrency limiting
* per-tenant rate limiting => #5899
* pageserver-initiated load shedding

Related Issues
==============

This PR unblocks the introduction of tokio-epoll-uring for asynchronous disk IO ([Epic](#4744)).
…)"" This reverts commit d6598e0.
Motivation
We observed two "indigestion" events on staging, each shortly after restarting `pageserver-0.eu-west-1.aws.neon.build`. It has ~8k tenants.

The indigestion manifests as `Timeline::get` calls failing with `exceeded evict iter limit`. The error is from `page_cache.rs`; it was unable to find a free page and hence failed with the error.

The indigestion events started occurring after we started deploying builds that contained the following commits:

The second and third commit are interesting. They add `.await` points to the VirtualFile code.
Background
On the read path, which is the dominant user of page cache & VirtualFile during pageserver restart, `Timeline::get`, `page_cache`, and VirtualFile interact as follows:

The page cache uses `find_victim` to find an empty slot, and once it has found a slot, it gives exclusive ownership of it to the caller through a `PageWriteGuard`. The caller then fills that slot with data read through `VirtualFile`.

The `find_victim` method finds an empty slot using a basic implementation of the clock page replacement algorithm. Slots that are currently in use (`PageReadGuard`/`PageWriteGuard`) cannot become victims. If there have been too many iterations, `find_victim` gives up with error `exceeded evict iter limit`.
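For readers unfamiliar with the algorithm, a heavily simplified, hypothetical sketch of such a clock-style `find_victim` with an iteration limit might look like the following; slot fields, the constant, and the exact accounting are made up and differ from the real `page_cache.rs`:

```rust
/// Hypothetical upper bound on sweeps before giving up.
const EVICT_ITER_LIMIT: usize = 10;

struct Slot {
    in_use: bool,     // stand-in for an outstanding PageReadGuard/PageWriteGuard
    referenced: bool, // the clock algorithm's "second chance" bit
}

fn find_victim(slots: &mut [Slot], clock_hand: &mut usize) -> Result<usize, &'static str> {
    // Bound the search: give up after a fixed number of sweeps over the slots.
    for _ in 0..EVICT_ITER_LIMIT * slots.len() {
        let idx = *clock_hand % slots.len();
        *clock_hand += 1;

        let slot = &mut slots[idx];
        if slot.in_use {
            // Guarded slots can never become victims.
            continue;
        }
        if slot.referenced {
            // Second chance: clear the bit and move on.
            slot.referenced = false;
            continue;
        }
        return Ok(idx);
    }
    Err("exceeded evict iter limit")
}

fn main() {
    let mut slots = vec![
        Slot { in_use: true, referenced: false },
        Slot { in_use: false, referenced: true },
    ];
    let mut hand = 0;
    // Prints Ok(1): slot 1 becomes the victim after its referenced bit is cleared.
    println!("{:?}", find_victim(&mut slots, &mut hand));
}
```

The key property for the discussion below is that in-use slots are simply skipped, so if every slot is guarded, the loop burns through its entire iteration budget and fails.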
.Root Cause For Indigestion
The second and third commit quoted in the "Motivation" section introduced `.await` points in the VirtualFile code. These enable tokio to preempt us and schedule another future while we hold the `PageWriteGuard` and are calling `VirtualFile::read_at`. This was not possible before these commits, because there simply were no await points that weren't `Poll::Ready` immediately.

With the offending commits, there is now actual usage of `tokio::sync::RwLock` to protect the VirtualFile file descriptor cache. And we know from other experiments that, during the post-restart "rush", the VirtualFile fd cache is too small, i.e., all slots are taken by ongoing VirtualFile operations and cannot be victims. So, assume that VirtualFile's `find_victim_slot`'s `RwLock::write().await` calls will yield control to the executor.

The above can lead to the pathological situation if we have N runnable tokio tasks, each wanting to do `Timeline::get`, but only M slots, N >> M. Suppose M of the N tasks win a `PageWriteGuard` and get preempted at some `.await` point inside `VirtualFile::read_at`. Now suppose tokio schedules the remaining N-M tasks for fairness, then schedules the first M tasks again. Each of the N-M tasks will run `find_victim()` until it hits the `exceeded evict iter limit`. Why? Because the first M tasks took all the slots and are still holding them tight through their `PageWriteGuard`.

The result is massive wastage of CPU time in `find_victim()`. The effort to find a page is futile, but each of the N-M tasks still attempts it. This delays the time when tokio gets around to scheduling the first M tasks again. Eventually, tokio will schedule them, they will make progress, fill the `PageWriteGuard`, and release it. But in the meantime, the N-M tasks have already bailed with error `exceeded evict iter limit`.

Eventually, higher-level mechanisms will retry for the N-M tasks, and this time there won't be as many concurrent tasks wanting to do `Timeline::get`. So, it will shake out. But it's a massive indigestion until then.
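The N >> M pathology can be imitated with a toy simulation outside the pageserver. The sketch below is purely illustrative: the M slots are modeled by a semaphore, the `PageWriteGuard` by an owned permit, and the new VirtualFile await points by an explicit `yield_now`, so the counter roughly corresponds to the futile `find_victim` work described above:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use tokio::sync::Semaphore;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    const N_TASKS: usize = 64; // N runnable tasks, all wanting a slot
    const M_SLOTS: usize = 8;  // M slots, N >> M

    let slots = Arc::new(Semaphore::new(M_SLOTS));
    let futile_searches = Arc::new(AtomicUsize::new(0));

    let mut handles = Vec::new();
    for _ in 0..N_TASKS {
        let slots = slots.clone();
        let futile = futile_searches.clone();
        handles.push(tokio::spawn(async move {
            loop {
                match slots.clone().try_acquire_owned() {
                    Ok(_guard) => {
                        // "Holding the PageWriteGuard": yield while the slot is
                        // taken, like the new await points inside VirtualFile.
                        tokio::task::yield_now().await;
                        return;
                    }
                    Err(_) => {
                        // "find_victim": all slots are guarded right now, so
                        // this search cannot succeed.
                        futile.fetch_add(1, Ordering::Relaxed);
                        tokio::task::yield_now().await;
                    }
                }
            }
        }));
    }

    for h in handles {
        h.await.unwrap();
    }
    println!(
        "futile victim searches: {}",
        futile_searches.load(Ordering::Relaxed)
    );
}
```

In the real system the futile searches are far more expensive than a counter increment, which is why they CPU-starve the tasks that are already holding guards.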
This PR
This PR reverts the offending commits until we find a proper fix.