revert recent VirtualFile asyncification changes #5291

Merged
problame merged 2 commits into main from problame/revert-virtualfile-async-changes on Sep 12, 2023

Conversation

@problame (Contributor) commented on Sep 12, 2023

Motivation

We observed two "indigestion" events on staging, each shortly after restarting pageserver-0.eu-west-1.aws.neon.build. It has ~8k tenants.

The indigestion manifests as Timeline::get calls failing with the error `exceeded evict iter limit`.
The error comes from page_cache.rs: it was unable to find a free page and hence failed with that error.

The indigestion events started occurring after we started deploying builds that contained the following commits:

[~/src/neon]: git log --oneline c0ed362790caa368aa65ba57d352a2f1562fd6bf..15eaf78083ecff62b7669091da1a1c8b4f60ebf8
15eaf7808 Disallow block_in_place and Handle::block_on (#5101)
a18d6d9ae Make File opening in VirtualFile async-compatible (#5280)
76cc87398 Use tokio locks in VirtualFile and turn with_file into macro (#5247)

The second and third commit are interesting.
They add .await points to the VirtualFile code.

Background

On the read path, which is the dominant user of the page cache & VirtualFile during pageserver restart, Timeline::get, page_cache, and VirtualFile interact as follows:

  1. Timeline::get tries to read from a layer
  2. This read goes through the page cache.
  3. If we have a page miss (which is known to be common after restart), page_cache uses find_victim to find an empty slot, and once it has found a slot, it gives exclusive ownership of it to the caller through a PageWriteGuard.
  4. The caller is supposed to fill the write guard with data from the underlying backing store, i.e., the layer VirtualFile.
  5. So, we call into `VirtualFile::read_at` to fill the write guard.

The find_victim method finds an empty slot using a basic implementation of the clock page replacement algorithm.
Slots that are currently in use (PageReadGuard / PageWriteGuard) cannot become victims.
If there have been too many iterations, find_victim gives up with the error `exceeded evict iter limit`.
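
For illustration, here is a minimal, self-contained sketch of a clock-style victim search with an iteration limit, in the spirit of what is described above; all names, types, and constants are made up for the example and are not the actual page_cache.rs code:

```rust
use std::sync::atomic::{AtomicU8, AtomicUsize, Ordering};
use std::sync::RwLock;

const NUM_SLOTS: usize = 8;
const MAX_EVICT_ITER: usize = 10 * NUM_SLOTS;

struct Slot {
    usage_count: AtomicU8,
    // A held lock stands in for a live PageReadGuard / PageWriteGuard:
    // such a slot is pinned and cannot become a victim.
    data: RwLock<Option<Vec<u8>>>,
}

struct Cache {
    slots: Vec<Slot>,
    clock_hand: AtomicUsize,
}

impl Cache {
    fn find_victim(&self) -> Result<usize, &'static str> {
        let mut iters = 0;
        loop {
            iters += 1;
            if iters > MAX_EVICT_ITER {
                // the error that bubbles up to Timeline::get during the indigestion
                return Err("exceeded evict iter limit");
            }
            let idx = self.clock_hand.fetch_add(1, Ordering::Relaxed) % NUM_SLOTS;
            let slot = &self.slots[idx];
            // Pinned slots are skipped: try_write() fails while a guard is alive.
            let Ok(mut data) = slot.data.try_write() else { continue };
            // Second chance: only a slot whose usage count has decayed to zero
            // becomes the victim; otherwise decrement and keep scanning.
            if slot.usage_count.load(Ordering::Relaxed) > 0 {
                slot.usage_count.fetch_sub(1, Ordering::Relaxed);
                continue;
            }
            *data = None; // evict; the caller now owns slot `idx` exclusively
            return Ok(idx);
        }
    }
}

fn main() {
    let cache = Cache {
        slots: (0..NUM_SLOTS)
            .map(|_| Slot { usage_count: AtomicU8::new(0), data: RwLock::new(None) })
            .collect(),
        clock_hand: AtomicUsize::new(0),
    };
    println!("victim slot: {:?}", cache.find_victim());
}
```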

Root Cause For Indigestion

The second and third commit quoted in the "Motivation" section introduced .await points in the VirtualFile code.
These enable tokio to preempt us and schedule another future while we hold the PageWriteGuard and are calling VirtualFile::read_at.
This was not possible before these commits, because there simply were no await points that weren't Poll::Ready immediately.
With the offending commits, there is now actual usage of tokio::sync::RwLock to protect the VirtualFile file descriptor cache.
And we know from other experiments that, during the post-restart "rush", the VirtualFile fd cache is too small, i.e., all slots are taken by ongoing VirtualFile operations and cannot be victims.
So, we can assume that the RwLock::write().await calls in VirtualFile's find_victim_slot will yield control to the executor.

The above can lead to a pathological situation if we have N runnable tokio tasks, each wanting to do Timeline::get, but only M page cache slots, with N >> M.
Suppose M of the N tasks win a PageWriteGuard and get preempted at some .await point inside VirtualFile::read_at.
Now suppose tokio schedules the remaining N-M tasks for fairness, then schedules the first M tasks again.
Each of the N-M tasks will run find_victim() until it hits the exceeded evict iter limit.
Why? Because the first M tasks took all the slots and are still holding them tight through their PageWriteGuard.
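
The scenario is easy to reproduce in miniature. The following self-contained toy (made-up names and numbers, not pageserver code) spawns N tasks competing for M slots; the M winners hold their guard across an .await, and the rest spin over the slots until they give up, typically printing M Ok(..) results and N-M `exceeded evict iter limit` errors:

```rust
use std::sync::Arc;
use tokio::sync::Mutex;
use tokio::time::{sleep, Duration};

#[tokio::main(flavor = "current_thread")]
async fn main() {
    const M: usize = 4; // "page cache slots"
    const N: usize = 16; // runnable tasks that all want a slot
    let slots: Arc<Vec<Mutex<()>>> = Arc::new((0..M).map(|_| Mutex::new(())).collect());

    let mut handles = Vec::new();
    for task in 0..N {
        let slots = Arc::clone(&slots);
        handles.push(tokio::spawn(async move {
            // "find_victim": scan for an unpinned slot, give up after a few laps.
            for _ in 0..10 * M {
                for slot in slots.iter() {
                    if let Ok(_guard) = slot.try_lock() {
                        // Holding the guard across an .await pins the slot, just
                        // like a PageWriteGuard held across the new await points
                        // inside VirtualFile::read_at.
                        sleep(Duration::from_millis(10)).await;
                        return Ok(task);
                    }
                }
                tokio::task::yield_now().await;
            }
            Err("exceeded evict iter limit")
        }));
    }
    for h in handles {
        println!("{:?}", h.await.unwrap());
    }
}
```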

The result is massive wastage of CPU time in find_victim().
The effort to find a page is futile, but each of the N-M tasks still attempts it.

This delays the time when tokio gets around to schedule the first M tasks again.
Eventually, tokio will schedule them, they will make progress, fill the PageWriteGuard, release it.
But in the meantime, the N-M tasks have already bailed with error exceeded evict iter limit.

Eventually, higher level mechanisms will retry for the N-M tasks, and this time, there won't be as many concurrent tasks wanting to do Timeline::get.
So, it will shake out.

But, it's a massive indigestion until then.

This PR

This PR reverts the offending commits until we find a proper fix.

    Revert "Use tokio locks in VirtualFile and turn with_file into macro (#5247)"
    
    This reverts commit 76cc87398c58aa8856083bd3b17403af56715b17.


    Revert "Make File opening in VirtualFile async-compatible (#5280)"
    
    This reverts commit a18d6d9ae3e1f395042eed39f06685dcaeb2ecc6.

@problame problame marked this pull request as ready for review September 12, 2023 15:12
@problame problame requested review from a team as code owners September 12, 2023 15:12
@problame problame requested review from bojanserafimov, hlinnaka, koivunej and arpad-m and removed request for a team and bojanserafimov September 12, 2023 15:12
@github-actions

2466 tests run: 2353 passed, 0 failed, 113 skipped (full report)


Flaky tests (3)

Postgres 16

  • test_partial_evict_tenant: release

Postgres 14

  • test_download_remote_layers_api[local_fs]: debug
  • test_get_tenant_size_with_multiple_branches: release

Code coverage (full report)

  • functions: 53.1% (7668 of 14453 functions)
  • lines: 81.0% (44780 of 55281 lines)

The comment gets automatically updated with the latest test results
f781df5 at 2023-09-12T15:15:51.308Z :recycle:

@arpad-m (Member) left a comment

I've read through the writeup and I think I understand it well enough to agree that this is the probable cause, most importantly the slot_guard = slot.inner.write().await; case.

We added new await points, and they revealed this issue once we were in highly I/O-bound situations with many parallel tasks.

@problame problame merged commit ab1f37e into main Sep 12, 2023
@problame problame deleted the problame/revert-virtualfile-async-changes branch September 12, 2023 15:38
arpad-m pushed a commit that referenced this pull request Sep 29, 2023
#5319)

It is wasteful to cycle through the page cache slots trying to find a
victim slot if all the slots are currently un-evictable because a read /
write guard is alive.

We suspect this wasteful cycling to be the root cause for an
"indigestion" we observed in staging (#5291).
The hypothesis is that we `.await` after we get ahold of a read / write
guard, and that tokio actually deschedules us in favor of another
future.
If that other future then needs a page slot, it can't get ours because
we're holding the guard.
Repeat this, and eventually, the other future(s) will find themselves
doing `find_victim` until they hit `exceeded evict iter limit`.

The `find_victim` is wasteful and CPU-starves the futures that are
already holding the read/write guard. A `yield` inside `find_victim`
could mitigate the starvation, but wouldn't fix the wasting of CPU
cycles.

So instead, this PR queues waiters behind a tokio semaphore that counts
evictable slots.
The downside is that this stops the clock page replacement if we have 0
evictable slots.

Also, as explained by the big block comment in `find_victims`, the
semaphore doesn't fully prevent starvation, because we can't make
tokio prioritize those tasks executing `find_victim` that have been
trying the longest.

Implementation
===============
We need to acquire the semaphore permit before locking the slot.
Otherwise, we could deadlock / discover that all permits are gone and
would have to relinquish the slot, having moved forward the Clock LRU
without making progress.
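
A minimal sketch of that ordering, using a plain tokio::sync::Semaphore and illustrative names (not the actual patch):

```rust
use tokio::sync::{Semaphore, SemaphorePermit};

struct PageCache {
    // One permit per slot that is currently evictable (i.e., no live
    // PageReadGuard / PageWriteGuard).
    evictable_slots: Semaphore,
}

impl PageCache {
    async fn find_victim(&self) -> SemaphorePermit<'_> {
        // Acquire the permit *before* touching the clock hand or any slot lock.
        // With zero evictable slots, waiters queue here instead of spinning
        // until `exceeded evict iter limit`.
        let permit = self
            .evictable_slots
            .acquire()
            .await
            .expect("semaphore is never closed in this sketch");
        // ... the clock scan over the slots would run only now, with the permit
        // held, so the clock hand never advances without being able to make
        // progress.
        permit
    }
}

#[tokio::main]
async fn main() {
    let cache = PageCache { evictable_slots: Semaphore::new(8) };
    let permit = cache.find_victim().await;
    // Dropping a page guard in the real code makes the slot evictable again;
    // in this sketch, that corresponds to releasing the permit.
    drop(permit);
}
```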

The downside is that we never get full throughput for read-heavy
workloads, because, until the reader coalesces onto an existing permit,
it'll hold its own permit.


Addendum To Root-Cause Analysis In #5291
========================================

Since merging that PR, @arpad-m pointed out that we couldn't have
reached the `slot.write().await` with his patches: the VirtualFile
slots can't all have been write-locked, because we only hold
them locked while the IO is ongoing, and the IO is still done with
synchronous system calls in that patch set. So, we can have had at most
$number_of_executor_threads slots locked at any given time.
I count 3 tokio runtimes that do `Timeline::get`, each with 8 executor
threads in our deployment => $number_of_executor_threads = 3*8 = 24.
But the virtual file cache has 100 slots.

We both agree that nothing changed about the core hypothesis, i.e.,
additional await points inside VirtualFile caused higher concurrency,
resulting in exhaustion of page cache slots.
But we'll need to reproduce the issue and investigate further to truly
understand the root cause, or find out whether (and why) we were indeed
using all 100 VirtualFile slots.

TODO: could it be compaction that needs to hold guards on many
VirtualFiles in its iterators?
problame added a commit that referenced this pull request Oct 4, 2023
problame added a commit that referenced this pull request Oct 5, 2023
problame added a commit that referenced this pull request Oct 10, 2023
problame added a commit that referenced this pull request Oct 26, 2023
problame added a commit that referenced this pull request Nov 8, 2023
problame added a commit that referenced this pull request Nov 8, 2023
problame added a commit that referenced this pull request Nov 20, 2023
problame added a commit that referenced this pull request Nov 28, 2023
problame added a commit that referenced this pull request Nov 28, 2023
problame added a commit that referenced this pull request Nov 29, 2023
problame added a commit that referenced this pull request Nov 29, 2023
problame added a commit that referenced this pull request Nov 29, 2023
Squashed commit of the following:

commit 5ec61ce
Author: Christian Schwarz <christian@neon.tech>
Date:   Wed Nov 29 16:17:12 2023 +0000

    bump

commit 34c33d1
Author: Christian Schwarz <me@cschwarz.com>
Date:   Mon Nov 20 14:38:29 2023 +0000

    bump

commit 8fa6b76
Author: Christian Schwarz <me@cschwarz.com>
Date:   Mon Nov 20 11:47:19 2023 +0000

    bump

commit 6c359a4
Author: Christian Schwarz <me@cschwarz.com>
Date:   Mon Nov 20 11:33:58 2023 +0000

    use neondatabase/tokio-epoll-uring#25

commit 7d484b0
Author: Christian Schwarz <me@cschwarz.com>
Date:   Tue Aug 29 19:13:38 2023 +0000

    use WIP tokio_epoll_uring open_at for async VirtualFile::open

    This makes Delta/Image ::load fns fully tokio-epoll-uring

commit 51b26b1
Author: Christian Schwarz <me@cschwarz.com>
Date:   Tue Aug 29 12:24:30 2023 +0000

    use `tokio_epoll_uring` for read path

commit a4e6f0c
Author: Christian Schwarz <me@cschwarz.com>
Date:   Wed Nov 8 12:36:34 2023 +0000

    Revert "revert recent VirtualFile asyncification changes (#5291)"

    This reverts commit ab1f37e.

    fixes #5479
problame added a commit that referenced this pull request Dec 1, 2023
problame added a commit that referenced this pull request Dec 1, 2023
This reverts commit ab1f37e.

fixes #5479

(cherry picked from commit 25e005c)
problame added a commit that referenced this pull request Dec 7, 2023
problame added a commit that referenced this pull request Dec 8, 2023
problame added a commit that referenced this pull request Dec 11, 2023
problame added a commit that referenced this pull request Dec 11, 2023
problame added a commit that referenced this pull request Jan 9, 2024
problame added a commit that referenced this pull request Jan 9, 2024
problame added a commit that referenced this pull request Jan 9, 2024
problame added a commit that referenced this pull request Jan 9, 2024
problame added a commit that referenced this pull request Jan 9, 2024
problame added a commit that referenced this pull request Jan 11, 2024
)

This reverts commit ab1f37e.
Thereby
fixes #5479

Updated Analysis
================

The problem with the original patch was that it, for the first time,
exposed the `VirtualFile` code to tokio task concurrency instead of just
thread-based concurrency. That caused the VirtualFile file descriptor
cache to start thrashing, effectively grinding the system to a halt.

Details
-------

At the time of the original patch, we had a _lot_ of runnable tasks in
the pageserver.
The symptom that prompted the revert (now being reverted in this PR) is
that our production systems fell into a valley of zero goodput, high
CPU, and zero disk IOPS shortly after PS restart.
We lay out the root cause for that behavior in this subsection.

At the time, there was no concurrency limit on the number of concurrent
initial logical size calculations.
Initial size calculation was initiated for all timelines within the
first 10 minutes as part of consumption metrics collection.
On a PS with 20k timelines, we'd thus have 20k runnable tasks.

Before the original patch, the `VirtualFile` code never returned
`Poll::Pending`.
That meant that once we entered it, the calling tokio task would not
yield to the tokio executor until we were done performing the
VirtualFile operation, i.e., doing a blocking IO system call.

The original patch switched the VirtualFile file descriptor cache's
synchronization primitives to those from `tokio::sync`.
It did not change that we were doing synchronous IO system calls.
And the cache had more slots than we have tokio executor threads.
So, these primitives never actually needed to return `Poll::Pending`.
But, the tokio scheduler makes tokio sync primitives return `Pending`
*artificially*, as a mechanism for the scheduler to get back into
control more often
([example](https://docs.rs/tokio/1.35.1/src/tokio/sync/batch_semaphore.rs.html#570)).
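
For illustration, a self-contained example of that behavior (not pageserver code): the lock below is never contended, yet the loop still yields periodically because tokio charges each acquisition against the task's cooperative budget (on the order of a hundred operations per scheduling slice):

```rust
use tokio::sync::Mutex;

#[tokio::main]
async fn main() {
    let lock = Mutex::new(0u64);

    // Uncontended, but some of these awaits still return Poll::Pending once
    // the cooperative budget is spent, handing control back to the scheduler.
    for _ in 0..10_000 {
        *lock.lock().await += 1;
    }

    // Opting out with tokio::task::unconstrained disables the budget, so the
    // same loop runs without any artificial yields.
    tokio::task::unconstrained(async {
        for _ in 0..10_000 {
            *lock.lock().await += 1;
        }
    })
    .await;

    println!("final value: {}", *lock.lock().await);
}
```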

So, the new reality was that VirtualFile calls could now yield to the
tokio executor.
Tokio would pick one of the other 19999 runnable tasks to run.
These tasks were also using VirtualFile.
So, we now had a lot more concurrency in that area of the code.

The problem with more concurrency was that caches started thrashing,
most notably the VirtualFile file descriptor cache: each time a task
would be rescheduled, it would want to do its next VirtualFile
operation. For that, it would first need to evict another (task's)
VirtualFile fd from the cache to make room for its own fd. It would then
do one VirtualFile operation before hitting an await point and yielding
to the executor again. The executor would run the other 19999 tasks for
fairness before circling back to the first task, which would find its fd
evicted.
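
A toy model of that access pattern, using the slot and task counts quoted in this thread (illustrative only, not pageserver code), shows how the fd cache degenerates to a 100% miss rate once the number of interleaved tasks exceeds the number of slots:

```rust
use std::collections::VecDeque;

/// Misses for `tasks` tasks doing `ops_per_task` fd-cache accesses each,
/// interleaved round-robin, against an LRU cache with `cache_slots` entries.
fn misses(cache_slots: usize, tasks: usize, ops_per_task: usize) -> usize {
    let mut cache: VecDeque<usize> = VecDeque::new(); // front = next eviction victim
    let mut misses = 0;
    for _ in 0..ops_per_task {
        // tokio runs the other runnable tasks "for fairness" before circling
        // back, so the accesses interleave round-robin like this.
        for task in 0..tasks {
            if let Some(pos) = cache.iter().position(|&fd| fd == task) {
                let fd = cache.remove(pos).unwrap();
                cache.push_back(fd); // hit: refresh recency
            } else {
                misses += 1; // miss: an open(2) in the real system
                if cache.len() == cache_slots {
                    cache.pop_front(); // evict another task's fd
                }
                cache.push_back(task);
            }
        }
    }
    misses
}

fn main() {
    // 100 fd slots vs. ~24 executor threads' worth of interleaving:
    // only the 24 cold-start misses.
    println!("24 tasks:    {} misses", misses(100, 24, 10));
    // 100 fd slots vs. 20k runnable tasks: every access is a miss.
    println!("20000 tasks: {} misses", misses(100, 20_000, 10));
}
```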

The other cache that would theoretically be impacted in a similar way is
the pageserver's `PageCache`.
However, for initial logical size calculation, it seems much less
relevant in experiments, likely because of that workload's random access
pattern.

Fixes
=====

We fixed the above problems by
- raising VirtualFile cache sizes
  - neondatabase/cloud#8351
- changing code to ensure forward-progress once cache slots have been
acquired
  - #5480
  - #5482
  - tbd: #6065
- reducing the amount of runnable tokio tasks
  - #5578
  - #6000
- fix bugs that caused unnecessary concurrency induced by connection
handlers
  - #5993

I manually verified that this PR doesn't negatively affect startup
performance as follows:
create a pageserver in production configuration, with 20k
tenants/timelines, 9 tiny L0 layer files each; start it, and observe

```
INFO Startup complete (368.009s since start) elapsed_ms=368009
```

I further verified in that same setup that, when using `pagebench`'s
getpage benchmark at as-fast-as-possible request rate against 5k of the
20k tenants, the achieved throughput is identical. The VirtualFile cache
isn't thrashing in that case.

Future Work
===========

We are still exposed to the cache thrashing risk from outside factors,
e.g., request concurrency is unbounded, and initial size calculation
skips the concurrency limiter when we establish a walreceiver
connection.

Once we start thrashing, we will degrade non-gracefully, i.e., encounter
a valley as was seen with the original patch.

However, we have sufficient means to deal with that unlikely situation:
1. we have dashboards & metrics to monitor & alert on cache thrashing
2. we can react by scaling the bottleneck resources (cache size) or by
manually shedding load through tenant relocation

Potential systematic solutions are future work:
* global concurrency limiting
* per-tenant rate limiting => #5899
* pageserver-initiated load shedding

Related Issues
==============

This PR unblocks the introduction of tokio-epoll-uring for asynchronous
disk IO ([Epic](#4744)).
problame added a commit that referenced this pull request Jan 11, 2024
problame added a commit that referenced this pull request Jan 11, 2024
problame added a commit that referenced this pull request Jan 11, 2024