Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use tokio locks in VirtualFile and turn with_file into macro #5247

Merged
merged 11 commits into from
Sep 11, 2023

Conversation

arpad-m
Copy link
Member

@arpad-m arpad-m commented Sep 7, 2023

Problem

For #4743, we want to convert everything up to the actual I/O operations of VirtualFile to async fn.

Summary of changes

This PR is the last change in a series of changes to VirtualFile: #5189, #5190, #5195, #5203, and #5224.

It does the last preparations before the I/O operations are actually made async. We are doing the following things:

  • First, we change the locks for the file descriptor cache to tokio's locks that support Send. This is important when one wants to hold locks across await points (which we want to do), otherwise the Future won't be Send. Also, one shouldn't generally block in async code as executors don't like that.
  • Due to the lock change, we now take an approach for the VirtualFile destructors similar to the one proposed by page_cache: use try_write in drop_buffers_for_immutable #5122 for the page cache, to use try_write. Similarly to the situation in the linked PR, one can make an argument that if we are in the destructor and the slot has not been reused yet, we are the only user accessing the slot due to owning the lock mutably. It is still possible that we are not obtaining the lock, but the only cause for that is the clock algorithm touching the slot, which should be quite an unlikely occurence. For the instance of try_write failing, we spawn an async task to destroy the lock. As just argued however, most of the time the code path where we spawn the task should not be visited.
  • Lastly, we split with_file into a macro part, and a function part that contains most of the logic. The function part returns a lock object, that the macro uses. The macro exists to perform the operation in a more compact fashion, saving code from putting the lock into a variable and then doing the operation while measuring the time to run it. We take the locks approach because Rust has no support for async closures. One can make normal closures return a future, but that approach gets into lifetime issues the moment you want to pass data to these closures via parameters that has a lifetime (captures work). For details, see this and this link. In Make VirtualFile::{open, open_with_options, create,sync_all,with_file} async fn #5224, we ran into a similar problem with the test_files function, and we ended up passing the path and the OpenOptions by-value instead of by-ref, at the expense of a few extra copies. This can be done as the data is cheaply copyable, and we are in test code. But here, we are not, and while File::try_clone exists, it issues system calls internally. Also, it would allocate an entirely new file descriptor, something that the fd cache was built to prevent.
  • We change the STORAGE_IO_TIME metrics to support async.

Part of #4743.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

Sadly Rust doesn't support closures yet that take on parameters with
lifetimes.
@arpad-m arpad-m requested review from a team as code owners September 7, 2023 23:46
@arpad-m arpad-m requested review from knizhnik and shanyp and removed request for a team September 7, 2023 23:46
@github-actions
Copy link

github-actions bot commented Sep 7, 2023

1644 tests run: 1570 passed, 0 failed, 74 skipped (full report)


Code coverage (full report)

  • functions: 53.0% (7583 of 14311 functions)
  • lines: 81.5% (44787 of 54967 lines)

The comment gets automatically updated with the latest test results
e1518e2 at 2023-09-11T14:58:24.907Z :recycle:

@hlinnaka
Copy link
Contributor

hlinnaka commented Sep 8, 2023

Lastly, we switch with_file to a macro. Mostly we do this because Rust has no support for async closures. One can make normal closures return a future, but that approach gets into lifetime issues the moment you want to pass data to these closures that has a lifetime.

Another alternative to closures is to introduce a new guard object. So instead of:

self.with_file("foo", |file|.sync_all()).await?

The callers would look like this:

{
    let file_guard = self.with_file();
    file_guard.sync_all().await?
    // `file_guard` is dropped
}

I guess in this case, the guard object would be just slot_guard, or a wrapper around it.

@arpad-m
Copy link
Member Author

arpad-m commented Sep 9, 2023

The idea with the guards was really great, I've moved most of the logic away from the macro into a function again that returns locks. There is still a macro to save repetition. I've also added back the metrics.

In the worst case, if it turns out that we really can't use the macro with async ops, we can always remove it again: now it's pretty cheap to do so (there is still some minor level of repetition that it saves).

Copy link
Member

@koivunej koivunej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding request changes for this thread at least: https://github.com/neondatabase/neon/pull/5247/files#r1321117325

@koivunej koivunej dismissed their stale review September 11, 2023 12:39

I still have issues with the tt-muncher and lack of Arc (unless I missed it) but now the metrics are back, and this needs to be rebased.

This is closer to what we had before. Again, sad that there is no
support for async closures :).
@arpad-m
Copy link
Member Author

arpad-m commented Sep 11, 2023

I still have issues with the tt-muncher and lack of Arc (unless I missed it) but now the metrics are back, and this needs to be rebased.

regarding the tt muncher I have now implemented your suggestion to not use the dot syntax any more. For the lack of Arc, it's not really needed to use Arcs here, instead I used lock objects as suggested by Heikki (already done that at the time you made that comment). For the rebase, I've done it via a git merge.

Copy link
Member

@koivunej koivunej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha, it worked! I'm okay with this now, even if the Arc changes might be still required. Also drop spawns a task, it should be good.

@arpad-m arpad-m merged commit 76cc873 into main Sep 11, 2023
@arpad-m arpad-m deleted the arpad/virtual_file_async_6 branch September 11, 2023 15:35
arpad-m added a commit that referenced this pull request Sep 11, 2023
## Problem

Previously, we were using `observe_closure_duration` in `VirtualFile`
file opening code, but this doesn't support async open operations, which
we want to use as part of #4743.

## Summary of changes

* Move the duration measurement from the `with_file` macro into a
`observe_duration` macro.
* Some smaller drive-by fixes to replace the old strings with the new
variant names introduced by #5273

Part of #4743, follow-up of #5247.
problame added a commit that referenced this pull request Sep 12, 2023
problame added a commit that referenced this pull request Sep 12, 2023
Motivation
==========

We observed two "indigestion" events on staging, each shortly after
restarting `pageserver-0.eu-west-1.aws.neon.build`. It has ~8k tenants.

The indigestion manifests as `Timeline::get` calls failing with
`exceeded evict iter limit` .
The error is from `page_cache.rs`; it was unable to find a free page and
hence failed with the error.

The indigestion events started occuring after we started deploying
builds that contained the following commits:

```
[~/src/neon]: git log --oneline c0ed362..15eaf78
091da1a1c8b4f60ebf8
15eaf78 Disallow block_in_place and Handle::block_on (#5101)
a18d6d9 Make File opening in VirtualFile async-compatible (#5280)
76cc873 Use tokio locks in VirtualFile and turn with_file into macro (#5247)
```

The second and third commit are interesting.
They add .await points to the VirtualFile code.

Background
==========

On the read path, which is the dominant user of page cache & VirtualFile
during pageserver restart, `Timeline::get` `page_cache` and VirtualFile
interact as follows:

1. Timeline::get tries to read from a layer
2. This read goes through the page cache.
3. If we have a page miss (which is known to be common after restart),
page_cache uses `find_victim` to find an empty slot, and once it has
found a slot, it gives exclusive ownership of it to the caller through a
`PageWriteGuard`.
4. The caller is supposed to fill the write guard with data from the
underlying backing store, i.e., the layer `VirtualFile`.
5. So, we call into `VirtualFile::read_at`` to fill the write guard.

The `find_victim` method finds an empty slot using a basic
implementation of clock page replacement algorithm.
Slots that are currently in use (`PageReadGuard` / `PageWriteGuard`)
cannot become victims.
If there have been too many iterations, `find_victim` gives up with
error `exceeded evict iter limit`.

Root Cause For Indigestion
==========================

The second and third commit quoted in the "Motivation" section
introduced `.await` points in the VirtualFile code.
These enable tokio to preempt us and schedule another future __while__
we hold the `PageWriteGuard` and are calling `VirtualFile::read_at`.
This was not possible before these commits, because there simply were no
await points that weren't Poll::Ready immediately.
With the offending commits, there is now actual usage of
`tokio::sync::RwLock` to protect the VirtualFile file descriptor cache.
And we __know__ from other experiments that, during the post-restart
"rush", the VirtualFile fd cache __is__ too small, i.e., all slots are
taken by _ongoing_ VirtualFile operations and cannot be victims.
So, assume that VirtualFile's `find_victim_slot`'s
`RwLock::write().await` calls _will_ yield control to the executor.

The above can lead to the pathological situation if we have N runnable
tokio tasks, each wanting to do `Timeline::get`, but only M slots, N >>
M.
Suppose M of the N tasks win a PageWriteGuard and get preempted at some
.await point inside `VirtualFile::read_at`.
Now suppose tokio schedules the remaining N-M tasks for fairness, then
schedules the first M tasks again.
Each of the N-M tasks will run `find_victim()` until it hits the
`exceeded evict iter limit`.
Why? Because the first M tasks took all the slots and are still holding
them tight through their `PageWriteGuard`.

The result is massive wastage of CPU time in `find_victim()`.
The effort to find a page is futile, but each of the N-M tasks still
attempts it.

This delays the time when tokio gets around to schedule the first M
tasks again.
Eventually, tokio will schedule them, they will make progress, fill the
`PageWriteGuard`, release it.
But in the meantime, the N-M tasks have already bailed with error
`exceeded evict iter limit`.

Eventually, higher level mechanisms will retry for the N-M tasks, and
this time, there won't be as many concurrent tasks wanting to do
`Timeline::get`.
So, it will shake out.

But, it's a massive indigestion until then.

This PR
=======

This PR reverts the offending commits until we find a proper fix.

```
    Revert "Use tokio locks in VirtualFile and turn with_file into macro (#5247)"
    
    This reverts commit 76cc873.


    Revert "Make File opening in VirtualFile async-compatible (#5280)"
    
    This reverts commit a18d6d9.
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants