test_bulk_insert / walingest generally is slower with tokio-epoll-uring #7124

Closed
7 tasks done
problame opened this issue Mar 14, 2024 · 5 comments
Assignees: problame
Labels: a/performance (Area: relates to performance of the system), c/storage/pageserver (Component: storage: pageserver)

Comments


problame commented Mar 14, 2024

High-Level

Problem

The test_bulk_ingest benchmark shows about 2x lower throughput with tokio-epoll-uring compared to std-fs.
That's why we temporarily disabled it in #7238.

The reason for this regression is that the benchmark runs on a system without memory pressure and thus std-fs writes don't block on disk IO but only copy the data into the kernel page cache.
tokio-epoll-uring cannot beat that at this time, and possibly never will.
(However, under memory pressure, std-fs would stall the executor thread on kernel page cache writeback disk IO. That's why we want to use tokio-epoll-uring. And we likely want to use O_DIRECT in the future, at which point std-fs becomes an absolute show-stopper.)

Further, bulk walingest is a pathological case because the InMemoryLayer into which ingest happens uses EphemeralFile. EphemeralFile flushes its mutable_tail every 8k of WAL. And it does not do double-buffering / pipelining, i.e., while flushing the buffer, no new WAL can be ingested.
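To make the blocking concrete, here is a minimal sketch of such a single-buffer write path (type and method names are illustrative, not the actual `EphemeralFile` code; assumes `tokio`): every time the 8k tail fills up, the caller has to await the flush before any further WAL bytes can be appended.

```rust
use tokio::io::AsyncWriteExt;

const TAIL_SZ: usize = 8192; // one PAGE_SZ of buffered WAL

struct SingleBufferedFile {
    file: tokio::fs::File,
    mutable_tail: Vec<u8>,
}

impl SingleBufferedFile {
    /// Append `buf`; whenever the tail fills up, ingest is blocked on the flush.
    async fn write_blob(&mut self, mut buf: &[u8]) -> std::io::Result<()> {
        while !buf.is_empty() {
            let free = TAIL_SZ - self.mutable_tail.len();
            let n = free.min(buf.len());
            self.mutable_tail.extend_from_slice(&buf[..n]);
            buf = &buf[n..];
            if self.mutable_tail.len() == TAIL_SZ {
                // No double-buffering: ingest stalls here until the write returns.
                self.file.write_all(&self.mutable_tail).await?;
                self.mutable_tail.clear();
            }
        }
        Ok(())
    }
}
```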

More elaborate analysis: https://neondatabase.notion.site/Why-test_bulk_ingest-is-slower-with-tokio-epoll-uring-918c5e619df045a7bd7b5f806cfbd53f?pvs=4

Solution

Short-term fix: increase the buffer sizes on the write path.

Long-term fix: introduce double-buffering on the write paths (delta, EphemeralFile, Image): open up a new BytesMut and flush the old one in the background.
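A rough sketch of what that double-buffering could look like (hypothetical types, not the planned implementation; assumes `tokio` and the `bytes` crate): when the current buffer fills up, a fresh `BytesMut` is swapped in and keeps accepting writes while the full one is flushed by a background task, with at most one flush in flight.

```rust
use std::sync::Arc;

use bytes::BytesMut;
use tokio::io::AsyncWriteExt;
use tokio::sync::Mutex;

struct DoubleBufferedWriter {
    current: BytesMut,
    // Join handle of the in-flight flush of the previous buffer, if any.
    inflight: Option<tokio::task::JoinHandle<std::io::Result<()>>>,
    file: Arc<Mutex<tokio::fs::File>>,
    capacity: usize,
}

impl DoubleBufferedWriter {
    async fn write(&mut self, buf: &[u8]) -> std::io::Result<()> {
        self.current.extend_from_slice(buf);
        if self.current.len() >= self.capacity {
            // At most one flush in flight: wait for the previous one first.
            if let Some(prev) = self.inflight.take() {
                prev.await.expect("flush task panicked")?;
            }
            // Swap in a fresh buffer; ingest can continue into it immediately.
            let full = std::mem::replace(
                &mut self.current,
                BytesMut::with_capacity(self.capacity),
            );
            let file = Arc::clone(&self.file);
            self.inflight = Some(tokio::spawn(async move {
                file.lock().await.write_all(&full).await
            }));
        }
        Ok(())
    }
}
```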

References

Preliminaries

  1. run-benchmarks
@problame problame added c/storage/pageserver Component: storage: pageserver a/performance Area: relates to performance of the system labels Mar 14, 2024
@problame problame self-assigned this Mar 14, 2024
problame added a commit that referenced this issue Mar 14, 2024
…metrization doesn't work (#7113)

# Problem

While investigating #7124, I noticed that the benchmark was always using
the `DEFAULT_*` `virtual_file_io_engine`, i.e., `tokio-epoll-uring` as
of #7077.

The fundamental problem is that the `control_plane` code has its own
view of `PageServerConfig`, which, I believe, will always be a subset of
the real pageserver's `pageserver/src/config.rs`.

For the `virtual_file_io_engine` and `get_vectored_impl` parametrization
of the test suite, we were constructing a dict on the Python side that
contained these parameters, then handed it to
`control_plane::PageServerConfig`'s derived `serde::Deserialize`.
The default in serde is to ignore unknown fields, so the Deserialize
impl silently ignored the fields.
In consequence, the fields weren't propagated to the `pageserver --init`
call, and the tests ended up using the
`pageserver/src/config.rs::DEFAULT_` values for the respective options
all the time.
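A minimal reproduction of that silent-drop behavior (hypothetical structs, and JSON via `serde_json` just for brevity; not the real `control_plane::PageServerConfig`): serde accepts the unknown field without error unless `deny_unknown_fields` is set.

```rust
use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct LenientConf {
    id: u32,
}

#[derive(Deserialize, Debug)]
#[serde(deny_unknown_fields)]
struct StrictConf {
    id: u32,
}

fn main() {
    let json = r#"{ "id": 1, "virtual_file_io_engine": "tokio-epoll-uring" }"#;

    // Succeeds: the unknown field is silently dropped.
    let lenient: LenientConf = serde_json::from_str(json).unwrap();
    println!("{lenient:?}");

    // Fails with "unknown field `virtual_file_io_engine`", surfacing the mismatch.
    let strict: Result<StrictConf, _> = serde_json::from_str(json);
    assert!(strict.is_err());
}
```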

Tests that explicitly used overrides in `env.pageserver.start()` and
similar were not affected by this.

But it means that all the test suite runs with parametrization
didn't properly exercise the parametrized code paths.

# Changes

- use `serde(deny_unknown_fields)` to expose the problem
  - With this change, the Python tests that override
    `virtual_file_io_engine` and `get_vectored_impl` fail on
    `pageserver --init`, exposing the problem.
- use destructuring to uncover the issue in the future (see the sketch after this list)
- fix the issue by adding the missing fields to the `control_plane`
  crate's `PageServerConf`
  - A better solution would be for the control plane to re-use a struct
    provided by the pageserver crate, so that everything is in one place in
    `pageserver/src/config.rs`, but our config parsing code is (almost)
    beyond repair anyway.
- fix the `pageserver_virtual_file_io_engine` to be responsive to the
env var
  - => required to make parametrization work in benchmarks
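For illustration, the destructuring idiom mentioned above looks roughly like this (hypothetical types, not the actual `control_plane` code): because the pattern lists every field and has no `..` catch-all, adding a field to the source struct later stops compiling at this site instead of being silently dropped again.

```rust
struct PageServerConf {
    id: u32,
    virtual_file_io_engine: String,
}

struct InitArgs {
    id: u32,
    virtual_file_io_engine: String,
}

fn to_init_args(conf: PageServerConf) -> InitArgs {
    // Exhaustive destructuring: a new field on PageServerConf makes this
    // pattern a compile error, forcing an explicit decision about it.
    let PageServerConf {
        id,
        virtual_file_io_engine,
    } = conf;
    InitArgs {
        id,
        virtual_file_io_engine,
    }
}
```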

# Testing

Before merging this PR, I re-ran the regression tests & CI with the full
matrix of `virtual_file_io_engine` and `tokio-epoll-uring`, see
9c7ea36
problame (Contributor, Author) commented:

Last week:

  • finished implementation & benchmarked (WIP branch problame/investigate-slow-test_bulk_insert)

This week:

  • benchmark on i3en.3xlarge
  • make reviewable PR

problame (Contributor, Author) commented:

Status update:

  • no progress, ENOTIME

This week:

  • benchmark on i3en.3xlarge
  • make reviewable PR

bayandin added a commit that referenced this issue Mar 26, 2024
## Problem
`test_bulk_insert` has become too slow, and it fails constantly:
#7124

## Summary of changes
- Skip `test_bulk_insert` until it's fixed
problame (Contributor, Author) commented:

Update:

Posted my WIP branch (minus the part that does page-cache pre-warming) with benchmark results as WIP PR
#7273

Need feedback on whether we like the approach I took, or whether I should be less ambitious and simply extend EphemeralFile::write_blob's struct Writer further.


jcsp commented Apr 15, 2024

This week: merge buffer size increase, ahead of rollout next week.


jcsp commented Apr 22, 2024

Blocked by incidents last week, same plan this week

problame added a commit that referenced this issue Apr 26, 2024
@problame problame changed the title test_bulk_insert / bulk walingest generally is 2x slower with tokio-epoll-uring test_bulk_insert / walingest generally is slower with tokio-epoll-uring Apr 26, 2024
problame added a commit that referenced this issue Apr 26, 2024
part of #7124

Changes
-------

This PR replaces the `EphemeralFile::write_blob`-specific `struct Writer`
with re-use of `owned_buffers_io::write::BufferedWriter`.

Further, it restructures the code to cleanly separate

* the high-level aspect of EphemeralFile's write_blob / read_blk API
* the page-caching aspect
* the aspect of IO
  * performing buffered write IO to an underlying VirtualFile
  * serving reads from either the VirtualFile or the buffer if it hasn't
    been flushed yet (sketched below)
* the annoying "feature" that reads past the end of the written range
  are allowed and expected to return zeroed memory, as long as one remains
  within one PAGE_SZ
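For illustration, a hedged sketch of that read-serving behavior (names and the in-memory representation are made up for the example, not the actual `EphemeralFile`/`BufferedWriter` code): a block read is served from the flushed data if the block was already written out, otherwise it is stitched from flushed data plus the pending buffer, and anything past the written range within the same PAGE_SZ block reads back as zeroes.

```rust
const PAGE_SZ: usize = 8192;

/// Grossly simplified: `flushed` stands in for bytes already written to the
/// underlying VirtualFile, `pending` for bytes still in the write buffer.
struct SketchEphemeralFile {
    flushed: Vec<u8>,
    pending: Vec<u8>,
}

impl SketchEphemeralFile {
    fn read_blk(&self, blknum: usize) -> [u8; PAGE_SZ] {
        let mut page = [0u8; PAGE_SZ];
        let start = blknum * PAGE_SZ;
        if start + PAGE_SZ <= self.flushed.len() {
            // Block is entirely flushed: serve it from the "file".
            page.copy_from_slice(&self.flushed[start..start + PAGE_SZ]);
        } else {
            // Block straddles or lies past the flush point: stitch together
            // flushed data and the unflushed buffer; bytes past the written
            // range stay zeroed, as the API allows within one PAGE_SZ.
            let total = [self.flushed.as_slice(), self.pending.as_slice()].concat();
            let end = total.len().min(start + PAGE_SZ);
            if start < end {
                page[..end - start].copy_from_slice(&total[start..end]);
            }
        }
        page
    }
}
```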
problame added a commit that referenced this issue Apr 26, 2024
part of #7124

# Problem

(Re-stating the problem from #7124 for posterity)

The `test_bulk_ingest` benchmark shows about 2x lower throughput with
`tokio-epoll-uring` compared to `std-fs`.
That's why we temporarily disabled it in #7238.

The reason for this regression is that the benchmark runs on a system
without memory pressure and thus std-fs writes don't block on disk IO
but only copy the data into the kernel page cache.
`tokio-epoll-uring` cannot beat that at this time, and possibly never.
(However, under memory pressure, std-fs would stall the executor thread
on kernel page cache writeback disk IO. That's why we want to use
`tokio-epoll-uring`. And we likely want to use O_DIRECT in the future,
at which point std-fs becomes an absolute show-stopper.)

More elaborate analysis:
https://neondatabase.notion.site/Why-test_bulk_ingest-is-slower-with-tokio-epoll-uring-918c5e619df045a7bd7b5f806cfbd53f?pvs=4

# Changes

This PR increases the buffer size of `blob_io` and `EphemeralFile` from
PAGE_SZ=8k to 64k.

Longer-term, we probably want to do double-buffering / pipelined IO.

# Resource Usage

We currently do not flush the buffer when freezing the InMemoryLayer.
That means a single Timeline can have multiple 64k buffers alive, esp if
flushing is slow.
This poses an OOM risk.

We should either bound the number of frozen layers (#7317),
or change the freezing code to flush the buffer and drop the
allocation.

However, that's future work.

# Performance

(Measurements done on i3en.3xlarge.)

The `test_bulk_insert.py` benchmark is too noisy, even with instance storage: it
varies by 30-40%. I suspect that's due to compaction. Raising the amount of
data by 10x doesn't help with the noisiness.

So, I used the `bench_ingest` benchmark from @jcsp's #7409,
specifically the `ingest-small-values/ingest 128MB/100b seq` and
`ingest-small-values/ingest 128MB/100b seq, no delta` benchmarks.

| buffer size | io engine         | seq | seq, no delta |
|-------------|-------------------|-----|---------------|
| 8k  | std-fs            | 55  | 165           |
| 8k  | tokio-epoll-uring | 37  | 107           |
| 64k | std-fs            | 55  | 180           |
| 64k | tokio-epoll-uring | 48  | 164           |

The `8k` is from before this PR, the `64k` is with this PR.
The values are the throughput reported by the benchmark (MiB/s).

We see that this PR gets `tokio-epoll-uring` from 67% to 87% of `std-fs`
performance in the `seq` benchmark. Notably, `seq` appears to hit some
other bottleneck at `55 MiB/s`. CC'ing #7418 due to the apparent
bottlenecks in writing delta layers.

For `seq, no delta`, this PR gets `tokio-epoll-uring` from 64% to 91% of
`std-fs` performance.