VirtualFile::open_with_options is wasteful under pressure #6065

Open
Tracked by #5479
problame opened this issue Dec 7, 2023 · 0 comments
Labels: a/performance Area: relates to performance of the system · c/storage/pageserver Component: storage: pageserver

problame commented Dec 7, 2023

`VirtualFile::open_with_options` claims the victim slot, does the `open` system call, and populates the slot, but then doesn't return a `FileGuard`; instead it returns a `VirtualFile`, i.e., it drops its `SlotGuard` immediately.

Under cache pressure, it's likely the slot will get re-used and the underlying file closed before the caller gets around to using it, wasting the open call.
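
For illustration, here is a minimal sketch of the shape of the problem, using simplified, hypothetical types (`Slot`, the static `CACHE`, the field names) rather than the actual pageserver code:

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::path::Path;
use std::sync::Mutex;

// One fd-cache slot: the open file plus a generation tag that is bumped
// whenever the slot is re-used for a different file.
struct Slot {
    file: Option<File>,
    tag: u64,
}

// Tiny stand-in for the fd cache (the real cache has many slots and
// per-slot locks).
static CACHE: Mutex<[Slot; 2]> = Mutex::new([
    Slot { file: None, tag: 0 },
    Slot { file: None, tag: 0 },
]);

struct VirtualFile {
    slot_idx: usize,
    tag: u64,
    // ...plus path and open options, so the file can be re-opened on a miss
}

fn open_with_options(path: &Path, opts: &OpenOptions) -> io::Result<VirtualFile> {
    let mut cache = CACHE.lock().unwrap(); // stands in for the SlotGuard
    let slot_idx = 0; // pretend slot 0 was chosen as the victim
    let file = opts.open(path)?; // the open(2) we would like to amortize
    cache[slot_idx].file = Some(file);
    cache[slot_idx].tag += 1;
    Ok(VirtualFile {
        slot_idx,
        tag: cache[slot_idx].tag,
    })
    // The lock (our SlotGuard stand-in) is released here, before the
    // caller has issued a single read. Under pressure another call can
    // now evict slot 0 and close the fd, wasting the open(2) above.
}
```

Returning something that keeps the slot pinned (e.g. the FileGuard) until the caller's first operation completes would avoid the wasted open.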


This is kind of low priority because we want to increase VirtualFile cache size significantly: https://github.com/neondatabase/cloud/issues/8351

problame self-assigned this Dec 7, 2023
problame added c/storage/safekeeper Component: storage: safekeeper, c/storage/pageserver Component: storage: pageserver and removed c/storage/safekeeper Component: storage: safekeeper labels Dec 7, 2023
problame added a commit that referenced this issue Dec 7, 2023
This helps with identifying thrashing.

While reading the code, I also found a case where we waste
work in a cache pressure situation: #6065

refs neondatabase/cloud#8351
problame added a commit that referenced this issue Dec 7, 2023
This helps with identifying thrashing.

I don't love the name, but, there is already "close-by-replace".

While reading the code, I also found a case where we waste
work in a cache pressure situation: #6065

refs neondatabase/cloud#8351
problame added a commit that referenced this issue Dec 7, 2023
#6066)

This helps with identifying thrashing.

I don't love the name, but, there is already "close-by-replace".

While reading the code, I also found a case where we waste
work in a cache pressure situation:
#6065

refs neondatabase/cloud#8351
problame added m/good_first_issue Moment: when doing your first Neon contributions, a/performance Area: relates to performance of the system and removed m/good_first_issue Moment: when doing your first Neon contributions labels Dec 18, 2023
problame added a commit that referenced this issue Jan 11, 2024

This reverts commit ab1f37e.
Thereby
fixes #5479

Updated Analysis
================

The problem with the original patch was that it, for the first time,
exposed the `VirtualFile` code to tokio task concurrency instead of just
thread-based concurrency. That caused the VirtualFile file descriptor
cache to start thrashing, effectively grinding the system to a halt.

Details
-------

At the time of the original patch, we had a _lot_ of runnable tasks in
the pageserver.
The symptom that prompted the revert (now being reverted in this PR) is
that our production systems fell into a valley of zero goodput, high
CPU, and zero disk IOPS shortly after PS restart.
We lay out the root cause for that behavior in this subsection.

At the time, there was no concurrency limit on the number of concurrent
initial logical size calculations.
Initial size calculation was initiated for all timelines within the
first 10 minutes as part of consumption metrics collection.
On a PS with 20k timelines, we'd thus have 20k runnable tasks.
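
One of the fixes listed further down is to bound this concurrency. A minimal sketch of that pattern using `tokio::sync::Semaphore` (the names, the limit, and the stub types are illustrative, not the actual pageserver code):

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Hypothetical stand-ins for pageserver types.
type TimelineId = u64;
async fn calculate_initial_logical_size(_timeline: TimelineId) {
    // ...walk layer files, read pages, sum up sizes...
}

// Illustrative limit; the point is that it is much smaller than the
// number of timelines.
const INITIAL_SIZE_CALC_CONCURRENCY: usize = 8;

async fn calculate_initial_sizes(timelines: Vec<TimelineId>) {
    let limiter = Arc::new(Semaphore::new(INITIAL_SIZE_CALC_CONCURRENCY));
    let mut tasks = Vec::new();
    for timeline in timelines {
        let limiter = Arc::clone(&limiter);
        tasks.push(tokio::spawn(async move {
            // A task only starts doing IO once it holds a permit, so at
            // most a handful of the 20k timelines compete for the fd
            // cache at any moment.
            let _permit = limiter.acquire_owned().await.expect("semaphore is never closed");
            calculate_initial_logical_size(timeline).await;
            // permit dropped here, unblocking the next timeline
        }));
    }
    for task in tasks {
        let _ = task.await;
    }
}
```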

Before the original patch, the `VirtualFile` code never returned
`Poll::Pending`.
That meant that once we entered it, the calling tokio task would not
yield to the tokio executor until we were done performing the
VirtualFile operation, i.e., doing a blocking IO system call.

The original patch switched the VirtualFile file descriptor cache's
synchronization primitives to those from `tokio::sync`.
It did not change that we were doing synchronous IO system calls.
And the cache had more slots than we had tokio executor threads.
So, these primitives never actually needed to return `Poll::Pending`.
But, the tokio scheduler makes tokio sync primitives return `Pending`
*artificially*, as a mechanism for the scheduler to get back into
control more often
([example](https://docs.rs/tokio/1.35.1/src/tokio/sync/batch_semaphore.rs.html#570)).
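
To make that concrete, here is a sketch of the post-patch shape of a read path (illustrative only, not the actual `VirtualFile` code): the slot is guarded by a `tokio::sync` primitive while the IO itself remains a blocking system call.

```rust
use std::os::unix::fs::FileExt;
use tokio::sync::Mutex;

// Illustrative only: a tokio::sync lock in front of a still-synchronous read.
async fn read_block(slot: &Mutex<std::fs::File>, offset: u64) -> std::io::Result<Vec<u8>> {
    let mut buf = vec![0u8; 8192];
    // Potential yield point: even when the lock is free, tokio's
    // cooperative-scheduling budget can make this return Poll::Pending,
    // handing the worker thread back to the scheduler -- which, with
    // 20k runnable tasks, means running many other tasks first.
    let file = slot.lock().await;
    // Once we get here, it is still a synchronous pread(2).
    file.read_exact_at(&mut buf, offset)?;
    Ok(buf)
}
```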

So, the new reality was that VirtualFile calls could now yield to the
tokio executor.
Tokio would pick one of the other 19999 runnable tasks to run.
These tasks were also using VirtualFile.
So, we now had a lot more concurrency in that area of the code.

The problem with more concurrency was that caches started thrashing,
most notably the VirtualFile file descriptor cache: each time a task
would be rescheduled, it would want to do its next VirtualFile
operation. For that, it would first need to evict another (task's)
VirtualFile fd from the cache to make room for its own fd. It would then
do one VirtualFile operation before hitting an await point and yielding
to the executor again. The executor would run the other 19999 tasks for
fairness before circling back to the first task, which would find its fd
evicted.
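
A back-of-the-envelope simulation of this dynamic (the round-robin scheduling model and the numbers are illustrative, not measurements):

```rust
// Each task does one fd-cache access per scheduling turn; eviction is a
// simple round-robin clock. Once there are more runnable tasks than
// slots, a task's fd is always gone by the time it runs again.
fn simulate(tasks: usize, slots: usize, rounds: usize) -> f64 {
    let mut cache: Vec<Option<usize>> = vec![None; slots]; // slot -> owning task
    let mut next_victim = 0usize;
    let (mut hits, mut accesses) = (0u64, 0u64);
    for _ in 0..rounds {
        for task in 0..tasks {
            accesses += 1;
            if cache.iter().any(|s| *s == Some(task)) {
                hits += 1; // fd still cached: no open(2)/close(2) needed
            } else {
                cache[next_victim] = Some(task); // evict someone else's fd
                next_victim = (next_victim + 1) % slots;
            }
        }
    }
    hits as f64 / accesses as f64
}

fn main() {
    // Few tasks sharing 100 slots: hit rate close to 1.
    println!("hit rate, 64 tasks:  {:.2}", simulate(64, 100, 100));
    // 20_000 runnable tasks on the same 100 slots: hit rate ~0.
    println!("hit rate, 20k tasks: {:.2}", simulate(20_000, 100, 10));
}
```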

The other cache that would theoretically be impacted in a similar way is
the pageserver's `PageCache`.
However, for initial logical size calculation, it seems much less
relevant in experiments, likely because of the random access nature of
initial logical size calculation.

Fixes
=====

We fixed the above problems by
- raising VirtualFile cache sizes
  - neondatabase/cloud#8351
- changing code to ensure forward-progress once cache slots have been acquired (see the sketch after this list)
  - #5480
  - #5482
  - tbd: #6065
- reducing the amount of runnable tokio tasks
  - #5578
  - #6000
- fixing bugs that caused unnecessary concurrency induced by connection handlers
  - #5993
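
The general shape of the forward-progress fixes referenced above is to keep holding the claimed slot while the operation runs, instead of releasing it in between. A simplified, synchronous sketch with hypothetical types (`Slot`, `with_pinned_fd`), not the actual implementation:

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;
use std::sync::Mutex;

// Hypothetical fd-cache slot.
struct Slot {
    file: Option<File>,
}

// Once the slot has been claimed and the file opened, the slot stays
// locked for the duration of the operation, so no other task can close
// the fd underneath us.
fn with_pinned_fd<R>(
    slot: &Mutex<Slot>,
    open: impl FnOnce() -> io::Result<File>,
    op: impl FnOnce(&File) -> io::Result<R>,
) -> io::Result<R> {
    let mut guard = slot.lock().unwrap();
    if guard.file.is_none() {
        guard.file = Some(open()?); // pay open(2) at most once per miss
    }
    // The slot is still locked here; forward progress is guaranteed
    // because eviction cannot happen until `op` returns.
    op(guard.file.as_ref().expect("just populated"))
}

// Example use: read one block through the pinned fd.
fn read_block(slot: &Mutex<Slot>, path: &std::path::Path, offset: u64) -> io::Result<Vec<u8>> {
    with_pinned_fd(
        slot,
        || File::open(path),
        |file| {
            let mut buf = vec![0u8; 8192];
            file.read_exact_at(&mut buf, offset)?;
            Ok(buf)
        },
    )
}
```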

I manually verified that this PR doesn't negatively affect startup
performance as follows:
create a pageserver in production configuration, with 20k
tenants/timelines, 9 tiny L0 layer files each; start it, and observe

```
INFO Startup complete (368.009s since start) elapsed_ms=368009
```

I further verified in that same setup that, when using `pagebench`'s
getpage benchmark at as-fast-as-possible request rate against 5k of the
20k tenants, the achieved throughput is identical. The VirtualFile cache
isn't thrashing in that case.

Future Work
===========

We are still exposed to the cache thrashing risk from outside factors,
e.g., request concurrency is unbounded, and initial size calculation
skips the concurrency limiter when we establish a walreceiver
connection.

Once we start thrashing, we will degrade non-gracefully, i.e., encounter
a valley as was seen with the original patch.

However, we have sufficient means to deal with that unlikely situation:
1. we have dashboards & metrics to monitor & alert on cache thrashing
2. we can react by scaling the bottleneck resources (cache size) or by
manually shedding load through tenant relocation

Potential systematic solutions are future work:
* global concurrency limiting
* per-tenant rate limiting => #5899
* pageserver-initiated load shedding

Related Issues
==============

This PR unblocks the introduction of tokio-epoll-uring for asynchronous
disk IO ([Epic](#4744)).