test_bulk_insert / walingest generally is slower with tokio-epoll-uring #7124

Closed
7 tasks done
problame opened this issue Mar 14, 2024 · 5 comments
Assignees: problame
Labels: a/performance (Area: relates to performance of the system), c/storage/pageserver (Component: storage: pageserver)

Comments


problame commented Mar 14, 2024

High-Level

Problem

The test_bulk_ingest benchmark shows about 2x lower throughput with tokio-epoll-uring compared to std-fs.
That's why we temporarily disabled it in #7238.

The reason for this regression is that the benchmark runs on a system without memory pressure and thus std-fs writes don't block on disk IO but only copy the data into the kernel page cache.
tokio-epoll-uring cannot beat that at this time, and possibly never will.
(However, under memory pressure, std-fs would stall the executor thread on kernel page cache writeback disk IO. That's why we want to use tokio-epoll-uring. And we likely want to use O_DIRECT in the future, at which point std-fs becomes an absolute show-stopper.)

Further, bulk walingest is a pathological case because the InMemoryLayer into which ingest happens uses EphemeralFile. EphemeralFile flushes its mutable_tail every 8k of WAL. And it does not do double-buffering / pipelining, i.e., while flushing the buffer, no new WAL can be ingested.
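To make the blocking concrete, here is a minimal sketch of such a single-buffer write path (type and method names are illustrative, not the actual `EphemeralFile` code; assumes `tokio`): every time the 8k tail fills up, the caller has to await the flush before any further WAL bytes can be appended.

```rust
use tokio::io::AsyncWriteExt;

const TAIL_SZ: usize = 8192; // one PAGE_SZ of buffered WAL

struct SingleBufferedFile {
    file: tokio::fs::File,
    mutable_tail: Vec<u8>,
}

impl SingleBufferedFile {
    /// Append `buf`; whenever the tail fills up, ingest is blocked on the flush.
    async fn write_blob(&mut self, mut buf: &[u8]) -> std::io::Result<()> {
        while !buf.is_empty() {
            let free = TAIL_SZ - self.mutable_tail.len();
            let n = free.min(buf.len());
            self.mutable_tail.extend_from_slice(&buf[..n]);
            buf = &buf[n..];
            if self.mutable_tail.len() == TAIL_SZ {
                // No double-buffering: ingest stalls here until the write returns.
                self.file.write_all(&self.mutable_tail).await?;
                self.mutable_tail.clear();
            }
        }
        Ok(())
    }
}
```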

More elaborate analysis: https://neondatabase.notion.site/Why-test_bulk_ingest-is-slower-with-tokio-epoll-uring-918c5e619df045a7bd7b5f806cfbd53f?pvs=4

Solution

Short-term fix: increase the buffer sizes on the write path.

Long-term fix: introduce double-buffering on the write paths (delta, EphemeralFile, Image): open up a new BytesMut and flush the old one in the background.
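A rough sketch of what that double-buffering could look like (hypothetical types, not the planned implementation; assumes `tokio` and the `bytes` crate): when the current buffer fills up, a fresh `BytesMut` is swapped in and keeps accepting writes while the full one is flushed by a background task, with at most one flush in flight.

```rust
use std::sync::Arc;

use bytes::BytesMut;
use tokio::io::AsyncWriteExt;
use tokio::sync::Mutex;

struct DoubleBufferedWriter {
    current: BytesMut,
    // Join handle of the in-flight flush of the previous buffer, if any.
    inflight: Option<tokio::task::JoinHandle<std::io::Result<()>>>,
    file: Arc<Mutex<tokio::fs::File>>,
    capacity: usize,
}

impl DoubleBufferedWriter {
    async fn write(&mut self, buf: &[u8]) -> std::io::Result<()> {
        self.current.extend_from_slice(buf);
        if self.current.len() >= self.capacity {
            // At most one flush in flight: wait for the previous one first.
            if let Some(prev) = self.inflight.take() {
                prev.await.expect("flush task panicked")?;
            }
            // Swap in a fresh buffer; ingest can continue into it immediately.
            let full = std::mem::replace(
                &mut self.current,
                BytesMut::with_capacity(self.capacity),
            );
            let file = Arc::clone(&self.file);
            self.inflight = Some(tokio::spawn(async move {
                file.lock().await.write_all(&full).await
            }));
        }
        Ok(())
    }
}
```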

References

Preliminaries

  1. run-benchmarks
@problame problame added c/storage/pageserver Component: storage: pageserver a/performance Area: relates to performance of the system labels Mar 14, 2024
@problame problame self-assigned this Mar 14, 2024
problame added a commit that referenced this issue Mar 14, 2024
…metrization doesn't work (#7113)

# Problem

While investigating #7124, I noticed that the benchmark was always using
the `DEFAULT_*` `virtual_file_io_engine`, i.e., `tokio-epoll-uring` as
of #7077.

The fundamental problem is that the `control_plane` code has its own
view of `PageServerConfig`, which, I believe, will always be a subset of
the real pageserver's `pageserver/src/config.rs`.

For the `virtual_file_io_engine` and `get_vectored_impl` parametrization
of the test suite, we were constructing a dict on the Python side that
contained these parameters, then handed it to
`control_plane::PageServerConfig`'s derived `serde::Deserialize`.
The default in serde is to ignore unknown fields, so the Deserialize
impl silently ignored the fields.
In consequence, the fields weren't propagated to the `pageserver --init`
call, and the tests ended up using the
`pageserver/src/config.rs::DEFAULT_` values for the respective options
all the time.
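A minimal reproduction of that silent-drop behavior (hypothetical structs, and JSON via `serde_json` just for brevity; not the real `control_plane::PageServerConfig`): serde accepts the unknown field without error unless `deny_unknown_fields` is set.

```rust
use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct LenientConf {
    id: u32,
}

#[derive(Deserialize, Debug)]
#[serde(deny_unknown_fields)]
struct StrictConf {
    id: u32,
}

fn main() {
    let json = r#"{ "id": 1, "virtual_file_io_engine": "tokio-epoll-uring" }"#;

    // Succeeds: the unknown field is silently dropped.
    let lenient: LenientConf = serde_json::from_str(json).unwrap();
    println!("{lenient:?}");

    // Fails with "unknown field `virtual_file_io_engine`", surfacing the mismatch.
    let strict: Result<StrictConf, _> = serde_json::from_str(json);
    assert!(strict.is_err());
}
```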

Tests that explicitly used overrides in `env.pageserver.start()` and
similar were not affected by this.

But it means that all the test suite runs with parametrization
didn't properly exercise the parametrized code paths.

# Changes

- use `serde(deny_unknown_fields)` to expose the problem
  - With this change, the Python tests that override
    `virtual_file_io_engine` and `get_vectored_impl` fail on
    `pageserver --init`, exposing the problem.
- use destructuring to uncover the issue in the future (see the sketch after this list)
- fix the issue by adding the missing fields to the `control_plane`
  crate's `PageServerConf`
  - A better solution would be for the control plane to re-use a struct
    provided by the pageserver crate, so that everything is in one place in
    `pageserver/src/config.rs`, but our config parsing code is (almost)
    beyond repair anyway.
- fix the `pageserver_virtual_file_io_engine` to be responsive to the
env var
  - => required to make parametrization work in benchmarks
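For illustration, the destructuring idiom mentioned above looks roughly like this (hypothetical types, not the actual `control_plane` code): because the pattern lists every field and has no `..` catch-all, adding a field to the source struct later stops compiling at this site instead of being silently dropped again.

```rust
struct PageServerConf {
    id: u32,
    virtual_file_io_engine: String,
}

struct InitArgs {
    id: u32,
    virtual_file_io_engine: String,
}

fn to_init_args(conf: PageServerConf) -> InitArgs {
    // Exhaustive destructuring: a new field on PageServerConf makes this
    // pattern a compile error, forcing an explicit decision about it.
    let PageServerConf {
        id,
        virtual_file_io_engine,
    } = conf;
    InitArgs {
        id,
        virtual_file_io_engine,
    }
}
```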

# Testing

Before merging this PR, I re-ran the regression tests & CI with the full
matrix of `virtual_file_io_engine` and `tokio-epoll-uring`, see
9c7ea36
problame (Contributor, Author) commented:

Last week:

  • finished implementation & benchmarked (WIP branch problame/investigate-slow-test_bulk_insert)

This week:

  • benchmark on i3en.3xlarge
  • make reviewable PR

problame (Contributor, Author) commented:

Status update:

  • no progress, ENOTIME

This week:

  • benchmark on i3en.3xlarge
  • make reviewable PR

bayandin added a commit that referenced this issue Mar 26, 2024
## Problem
`test_bulk_insert` has become too slow, and it fails constantly:
#7124

## Summary of changes
- Skip `test_bulk_insert` until it's fixed
problame (Contributor, Author) commented:

Update:

Posted my WIP branch (minus the part that does page-cache pre-warming) with benchmark results as WIP PR
#7273

Need feedback on whether we like the approach I took, or whether I should be less ambitious and simply extend EphemeralFile::write_blob's struct Writer further.


jcsp commented Apr 15, 2024

This week: merge buffer size increase, ahead of rollout next week.


jcsp commented Apr 22, 2024

Blocked by incidents last week, same plan this week

problame added a commit that referenced this issue Apr 26, 2024
@problame problame changed the title test_bulk_insert / bulk walingest generally is 2x slower with tokio-epoll-uring test_bulk_insert / walingest generally is slower with tokio-epoll-uring Apr 26, 2024
problame added a commit that referenced this issue Apr 26, 2024
part of #7124

Changes
-------

This PR replaces the `EphemeralFile::write_blob`-specific `struct Writer`
with re-use of `owned_buffers_io::write::BufferedWriter`.

Further, it restructures the code to cleanly separate

* the high-level aspect of EphemeralFile's write_blob / read_blk API
* the page-caching aspect
* the aspect of IO
  * performing buffered write IO to an underlying VirtualFile
  * serving reads from either the VirtualFile or the buffer if it hasn't
    been flushed yet (sketched below)
* the annoying "feature" that reads past the end of the written range
  are allowed and expected to return zeroed memory, as long as one remains
  within one PAGE_SZ
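For illustration, a hedged sketch of that read-serving behavior (names and the in-memory representation are made up for the example, not the actual `EphemeralFile`/`BufferedWriter` code): a block read is served from the flushed data if the block was already written out, otherwise it is stitched from flushed data plus the pending buffer, and anything past the written range within the same PAGE_SZ block reads back as zeroes.

```rust
const PAGE_SZ: usize = 8192;

/// Grossly simplified: `flushed` stands in for bytes already written to the
/// underlying VirtualFile, `pending` for bytes still in the write buffer.
struct SketchEphemeralFile {
    flushed: Vec<u8>,
    pending: Vec<u8>,
}

impl SketchEphemeralFile {
    fn read_blk(&self, blknum: usize) -> [u8; PAGE_SZ] {
        let mut page = [0u8; PAGE_SZ];
        let start = blknum * PAGE_SZ;
        if start + PAGE_SZ <= self.flushed.len() {
            // Block is entirely flushed: serve it from the "file".
            page.copy_from_slice(&self.flushed[start..start + PAGE_SZ]);
        } else {
            // Block straddles or lies past the flush point: stitch together
            // flushed data and the unflushed buffer; bytes past the written
            // range stay zeroed, as the API allows within one PAGE_SZ.
            let total = [self.flushed.as_slice(), self.pending.as_slice()].concat();
            let end = total.len().min(start + PAGE_SZ);
            if start < end {
                page[..end - start].copy_from_slice(&total[start..end]);
            }
        }
        page
    }
}
```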
problame added a commit that referenced this issue Apr 26, 2024
part of #7124

# Problem

(Re-stating the problem from #7124 for posterity)

The `test_bulk_ingest` benchmark shows about 2x lower throughput with
`tokio-epoll-uring` compared to `std-fs`.
That's why we temporarily disabled it in #7238.

The reason for this regression is that the benchmark runs on a system
without memory pressure and thus std-fs writes don't block on disk IO
but only copy the data into the kernel page cache.
`tokio-epoll-uring` cannot beat that at this time, and possibly never.
(However, under memory pressure, std-fs would stall the executor thread
on kernel page cache writeback disk IO. That's why we want to use
`tokio-epoll-uring`. And we likely want to use O_DIRECT in the future,
at which point std-fs becomes an absolute show-stopper.)

More elaborate analysis:
https://neondatabase.notion.site/Why-test_bulk_ingest-is-slower-with-tokio-epoll-uring-918c5e619df045a7bd7b5f806cfbd53f?pvs=4

# Changes

This PR increases the buffer size of `blob_io` and `EphemeralFile` from
PAGE_SZ=8k to 64k.

Longer-term, we probably want to do double-buffering / pipelined IO.

# Resource Usage

We currently do not flush the buffer when freezing the InMemoryLayer.
That means a single Timeline can have multiple 64k buffers alive, esp if
flushing is slow.
This poses an OOM risk.

We should either bound the number of frozen layers (#7317),
or change the freezing code to flush the buffer and drop the
allocation.

However, that's future work.

# Performance

(Measurements done on i3en.3xlarge.)

The `test_bulk_insert.py` benchmark is too noisy, even with instance storage: it
varies by 30-40%. I suspect that's due to compaction. Raising the amount of
data by 10x doesn't help with the noisiness.

So, I used the `bench_ingest` benchmark from @jcsp's #7409,
specifically the `ingest-small-values/ingest 128MB/100b seq` and
`ingest-small-values/ingest 128MB/100b seq, no delta` benchmarks.

| buffer size | io engine         | seq | seq, no delta |
|-------------|-------------------|-----|---------------|
| 8k  | std-fs            | 55  | 165           |
| 8k  | tokio-epoll-uring | 37  | 107           |
| 64k | std-fs            | 55  | 180           |
| 64k | tokio-epoll-uring | 48  | 164           |

The `8k` is from before this PR, the `64k` is with this PR.
The values are the throughput reported by the benchmark (MiB/s).

We see that this PR gets `tokio-epoll-uring` from 67% to 87% of `std-fs`
performance in the `seq` benchmark. Notably, `seq` appears to hit some
other bottleneck at `55 MiB/s`. CC'ing #7418 due to the apparent
bottlenecks in writing delta layers.

For `seq, no delta`, this PR gets `tokio-epoll-uring` from 64% to 91% of
`std-fs` performance.