# test_bulk_insert / walingest generally is slower with tokio-epoll-uring #7124
…metrization doesn't work (#7113)

# Problem

While investigating #7124, I noticed that the benchmark was always using the `DEFAULT_*` `virtual_file_io_engine`, i.e., `tokio-epoll-uring` as of #7077.

The fundamental problem is that the `control_plane` code has its own view of `PageServerConfig`, which, I believe, will always be a subset of the real pageserver's `pageserver/src/config.rs`.

For the `virtual_file_io_engine` and `get_vectored_impl` parametrization of the test suite, we were constructing a dict on the Python side that contained these parameters, then handed it to `control_plane::PageServerConfig`'s derived `serde::Deserialize`. The default in serde is to ignore unknown fields, so the Deserialize impl silently ignored the fields. In consequence, the fields weren't propagated to the `pageserver --init` call, and the tests ended up using the `pageserver/src/config.rs::DEFAULT_` values for the respective options all the time.

Tests that explicitly used overrides in `env.pageserver.start()` and similar were not affected by this. But it means that all the test suite runs with parametrization didn't properly exercise the code path.

# Changes

- use `serde(deny_unknown_fields)` to expose the problem
  - With this change, the Python tests that override `virtual_file_io_engine` and `get_vectored_impl` fail on `pageserver --init`, exposing the problem.
- use destructuring to uncover the issue in the future
- fix the issue by adding the missing fields to the `control_plane` crate's `PageServerConf`
  - A better solution would be for control plane to re-use a struct provided by the pageserver crate, so that everything is in one place in `pageserver/src/config.rs`, but our config parsing code is (almost) beyond repair anyway.
- fix the `pageserver_virtual_file_io_engine` to be responsive to the env var
  - required to make parametrization work in benchmarks

# Testing

Before merging this PR, I re-ran the regression tests & CI with the full matrix of `virtual_file_io_engine` and `tokio-epoll-uring`, see 9c7ea36
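For illustration only, here is a minimal sketch of how `serde(deny_unknown_fields)` turns a silently dropped override into a hard error. The struct and field set are made up for the sketch (not the actual `control_plane` struct), and it assumes `serde` with the derive feature and `serde_json` are available:

```rust
use serde::Deserialize;

// Without `deny_unknown_fields`, serde's default is to silently drop fields it
// doesn't recognize -- which is how the test-suite overrides got lost.
#[derive(Deserialize, Debug)]
#[serde(deny_unknown_fields)]
struct PageServerConf {
    id: u64,
    listen_pg_addr: String,
    // The kind of fields that were missing before the fix; hypothetical types.
    virtual_file_io_engine: Option<String>,
    get_vectored_impl: Option<String>,
}

fn main() {
    // An override dict like the one the Python test suite builds:
    let overrides = r#"{ "id": 1, "listen_pg_addr": "127.0.0.1:64000",
                         "virtual_file_io_engine": "tokio-epoll-uring" }"#;
    // With the fields declared on the struct this parses. If the fields were
    // absent from the struct, `deny_unknown_fields` would make this an error
    // instead of silently ignoring the overrides.
    let conf: PageServerConf = serde_json::from_str(overrides).expect("config should parse");
    println!("{conf:?}");
}
```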
## Problem

`test_bulk_insert` becomes too slow, and it fails constantly: #7124

## Summary of changes

- Skip `test_bulk_insert` until it's fixed
Update: Posted my WIP branch (minus the part that does page-cache pre-warming) with benchmark results as a WIP PR. Need feedback on whether we like the approach I took, or whether I should be less ambitious and simply …

This week: merge buffer size increase, ahead of rollout next week.

Blocked by incidents last week; same plan this week.
part of #7124

Changes
-------

This PR replaces the `EphemeralFile::write_blob`-specific `struct Writer` with re-use of `owned_buffers_io::write::BufferedWriter`.

Further, it restructures the code to cleanly separate

* the high-level aspect of EphemeralFile's write_blob / read_blk API
* the page-caching aspect
* the aspect of IO
  * performing buffered write IO to an underlying VirtualFile
  * serving reads from either the VirtualFile or the buffer if it hasn't been flushed yet
  * the annoying "feature" that reads past the end of the written range are allowed and expected to return zeroed memory, as long as one remains within one PAGE_SZ
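As a rough sketch of the read/write behavior described above (made-up names, not the actual `owned_buffers_io` or `EphemeralFile` API): writes append to an in-memory tail buffer that is flushed to the underlying file when full, and reads are served from the file for flushed offsets, from the buffer for the unflushed tail, and zero-filled past the written end within a page.

```rust
const PAGE_SZ: usize = 8192;

// Hypothetical sketch, not the pageserver implementation.
struct SketchEphemeralFile {
    flushed: Vec<u8>, // stands in for bytes already written to the VirtualFile
    tail: Vec<u8>,    // the not-yet-flushed write buffer
    capacity: usize,  // flush threshold of the buffered writer
}

impl SketchEphemeralFile {
    fn write_blob(&mut self, data: &[u8]) {
        self.tail.extend_from_slice(data);
        if self.tail.len() >= self.capacity {
            // In the real code this would be a buffered write to the VirtualFile.
            self.flushed.extend_from_slice(&self.tail);
            self.tail.clear();
        }
    }

    /// Read one PAGE_SZ block: flushed bytes come from the "file", unflushed
    /// bytes from the buffer, and anything past the written end is zeroed.
    fn read_blk(&self, blknum: usize) -> [u8; PAGE_SZ] {
        let mut page = [0u8; PAGE_SZ];
        for i in 0..PAGE_SZ {
            let off = blknum * PAGE_SZ + i;
            page[i] = if off < self.flushed.len() {
                self.flushed[off]
            } else if off - self.flushed.len() < self.tail.len() {
                self.tail[off - self.flushed.len()]
            } else {
                0
            };
        }
        page
    }
}
```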
part of #7124

# Problem

(Re-stating the problem from #7124 for posterity)

The `test_bulk_ingest` benchmark shows about 2x lower throughput with `tokio-epoll-uring` compared to `std-fs`. That's why we temporarily disabled it in #7238.

The reason for this regression is that the benchmark runs on a system without memory pressure and thus std-fs writes don't block on disk IO but only copy the data into the kernel page cache. `tokio-epoll-uring` cannot beat that at this time, and possibly never.

(However, under memory pressure, std-fs would stall the executor thread on kernel page cache writeback disk IO. That's why we want to use `tokio-epoll-uring`. And we likely want to use O_DIRECT in the future, at which point std-fs becomes an absolute show-stopper.)

More elaborate analysis: https://neondatabase.notion.site/Why-test_bulk_ingest-is-slower-with-tokio-epoll-uring-918c5e619df045a7bd7b5f806cfbd53f?pvs=4

# Changes

This PR increases the buffer size of `blob_io` and `EphemeralFile` from PAGE_SZ=8k to 64k.

Longer-term, we probably want to do double-buffering / pipelined IO.

# Resource Usage

We currently do not flush the buffer when freezing the InMemoryLayer. That means a single Timeline can have multiple 64k buffers alive, especially if flushing is slow. This poses an OOM risk.

We should either bound the number of frozen layers (#7317), or change the freezing code to flush the buffer and drop the allocation. However, that's future work.

# Performance

(Measurements done on i3en.3xlarge.)

The `test_bulk_insert.py` benchmark is too noisy, even with instance storage: it varies by 30-40%. I suspect that's due to compaction. (Raising the amount of data by 10x doesn't help with the noisiness.)

So, I used the `bench_ingest` from @jcsp's #7409, specifically the `ingest-small-values/ingest 128MB/100b seq` and `ingest-small-values/ingest 128MB/100b seq, no delta` benchmarks.

| buffer size | io engine         | seq | seq, no delta |
|-------------|-------------------|-----|---------------|
| 8k          | std-fs            | 55  | 165           |
| 8k          | tokio-epoll-uring | 37  | 107           |
| 64k         | std-fs            | 55  | 180           |
| 64k         | tokio-epoll-uring | 48  | 164           |

The `8k` rows are from before this PR, the `64k` rows are with this PR. The values are the throughput reported by the benchmark (MiB/s).

We see that this PR gets `tokio-epoll-uring` from 67% to 87% of `std-fs` performance in the `seq` benchmark. Notably, `seq` appears to hit some other bottleneck at `55 MiB/s`. CC'ing #7418 due to the apparent bottlenecks in writing delta layers.

For `seq, no delta`, this PR gets `tokio-epoll-uring` from 64% to 91% of `std-fs` performance.
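As a back-of-the-envelope illustration (not from the PR) of why the buffer size matters here: each time the write buffer fills, a flush IO is issued, so the number of flushes per ingest scales inversely with the buffer size.

```rust
// Rough sketch: number of buffer flushes caused by a 128 MiB ingest at the
// old (8 KiB) and new (64 KiB) buffer sizes.
fn main() {
    let total: u64 = 128 * 1024 * 1024; // 128 MiB, as in the bench_ingest runs
    for buffer_size in [8 * 1024u64, 64 * 1024] {
        // one flush each time the write buffer fills up
        println!("{:>2} KiB buffer: {:>5} flushes", buffer_size / 1024, total / buffer_size);
    }
    // prints 16384 flushes for 8 KiB vs 2048 flushes for 64 KiB
}
```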
# High-Level

## Problem

The `test_bulk_ingest` benchmark shows about 2x lower throughput with `tokio-epoll-uring` compared to `std-fs`. That's why we temporarily disabled it in #7238.

The reason for this regression is that the benchmark runs on a system without memory pressure and thus std-fs writes don't block on disk IO but only copy the data into the kernel page cache. `tokio-epoll-uring` cannot beat that at this time, and possibly never.

(However, under memory pressure, std-fs would stall the executor thread on kernel page cache writeback disk IO. That's why we want to use `tokio-epoll-uring`. And we likely want to use O_DIRECT in the future, at which point std-fs becomes an absolute show-stopper.)

Further, bulk walingest is a pathological case because the `InMemoryLayer` into which ingest happens uses `EphemeralFile`. `EphemeralFile` flushes its `mutable_tail` every 8k of WAL. And it does not do double-buffering / pipelining, i.e., while flushing the buffer, no new WAL can be ingested.

More elaborate analysis: https://neondatabase.notion.site/Why-test_bulk_ingest-is-slower-with-tokio-epoll-uring-918c5e619df045a7bd7b5f806cfbd53f?pvs=4
## Solution

Short-term fix: increase buffer sizes on the write path.

Long-term fix: introduce double-buffering on the write paths (delta, EphemeralFile, Image): open up a new BytesMut and flush the old one in the background.
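A minimal sketch of that double-buffering idea, assuming a tokio runtime; `DoubleBufferedWriter` and `flush_to_disk` are made-up names for illustration, not the actual pageserver code:

```rust
use bytes::BytesMut;
use tokio::task::JoinHandle;

/// Sketch: while one full buffer is being flushed in the background,
/// new WAL can keep being appended to a fresh buffer.
struct DoubleBufferedWriter {
    buf: BytesMut,
    capacity: usize,
    inflight_flush: Option<JoinHandle<std::io::Result<()>>>,
}

impl DoubleBufferedWriter {
    fn new(capacity: usize) -> Self {
        Self { buf: BytesMut::with_capacity(capacity), capacity, inflight_flush: None }
    }

    async fn write(&mut self, data: &[u8]) -> std::io::Result<()> {
        self.buf.extend_from_slice(data);
        if self.buf.len() >= self.capacity {
            // Wait for the previous flush (keeps at most two buffers alive),
            // then hand the full buffer to a background task and start a fresh one.
            if let Some(prev) = self.inflight_flush.take() {
                prev.await.expect("flush task panicked")?;
            }
            let full = std::mem::replace(&mut self.buf, BytesMut::with_capacity(self.capacity));
            self.inflight_flush = Some(tokio::spawn(flush_to_disk(full.freeze())));
        }
        Ok(())
    }
}

/// Placeholder for the actual write to the underlying file (e.g. via tokio-epoll-uring).
async fn flush_to_disk(_data: bytes::Bytes) -> std::io::Result<()> {
    Ok(())
}
```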
# References

## Preliminaries

- `virtual_file_io_engine` and `get_vectored_impl` parametrization doesn't work #7113

## Short-Term: Increase Buffer Sizes