
pageserver: ingest pre-serialized batches of values #9579

Merged

merged 14 commits into main from vlad/raw-buf-from-decoder on Nov 6, 2024

Conversation

@VladLazar VladLazar (Contributor) commented Oct 30, 2024

Problem

#9524 split the decoding and interpretation step from ingestion.
The output of the first phase is a wal_decoder::models::InterpretedWalRecord. Before this patch set that struct
contained a list of Value instances.

We wish to lift the decoding and interpretation step to the safekeeper, but it would be nice if the safekeeper
gave us a batch containing the raw data instead of actual values.

Summary of changes

Main goal here is to make InterpretedWalRecord hold a raw buffer which contains pre-serialized Values.
For this we do:

  1. Add a SerializedValueBatch type. This is inmemory_layer::SerializedBatch plus extra functionality for extending batches, for observing values on shard 0, and for tests.
  2. Replace inmemory_layer::SerializedBatch with SerializedValueBatch.
  3. Make DatadirModification maintain a SerializedValueBatch.
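To make the shape concrete, here is a minimal, hypothetical Rust sketch of a batch of pre-serialized values. The field names, and the use of plain integers for keys and LSNs, are assumptions for illustration, not the actual types from this PR:

```rust
// Illustrative sketch only: a batch that stores values pre-serialized in one
// raw buffer, with per-value metadata pointing into that buffer.
#[derive(Default)]
struct ValueMeta {
    key: u64,            // stand-in for the real key type (assumption)
    lsn: u64,            // stand-in for the real Lsn type (assumption)
    batch_offset: usize, // where this value's bytes start in `raw`
    len: usize,          // how many bytes the serialized value occupies
    will_init: bool,     // whether the value initializes its page
}

#[derive(Default)]
struct SerializedValueBatch {
    raw: Vec<u8>,             // all serialized values, back to back
    metadata: Vec<ValueMeta>, // one entry per value
    max_lsn: u64,
}

impl SerializedValueBatch {
    /// Append one already-serialized value to the batch.
    fn put(&mut self, key: u64, lsn: u64, value: &[u8], will_init: bool) {
        let batch_offset = self.raw.len();
        self.raw.extend_from_slice(value);
        self.metadata.push(ValueMeta { key, lsn, batch_offset, len: value.len(), will_init });
        self.max_lsn = self.max_lsn.max(lsn);
    }

    /// Extend this batch with another one, shifting the other's offsets.
    fn extend(&mut self, mut other: SerializedValueBatch) {
        let base = self.raw.len();
        self.raw.append(&mut other.raw);
        for m in &mut other.metadata {
            m.batch_offset += base;
        }
        self.metadata.append(&mut other.metadata);
        self.max_lsn = self.max_lsn.max(other.max_lsn);
    }

    fn len(&self) -> usize {
        self.metadata.len()
    }
}

fn main() {
    let mut a = SerializedValueBatch::default();
    a.put(1, 10, b"hello", true);
    let mut b = SerializedValueBatch::default();
    b.put(2, 20, b"world", false);
    a.extend(b);
    assert_eq!(a.len(), 2);
    assert_eq!(a.metadata[1].batch_offset, 5); // "world" starts after "hello"
    assert_eq!(a.max_lsn, 20);
    println!("batch holds {} values, {} raw bytes", a.len(), a.raw.len());
}
```

The point of the raw-buffer layout is that the safekeeper can eventually ship the buffer over the wire and the pageserver can ingest it without materializing each value individually.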

DatadirModification changes

DatadirModification now maintains a SerializedValueBatch and extends
it as new WAL records come in (to avoid flushing to disk on every record).
In turn, this cascaded into a number of modifications to DatadirModification:

  1. Replace pending_data_pages and pending_zero_data_pages with
    pending_data_batch.
  2. Removal of pending_zero_data_pages and its cousin
    on_wal_record_end
  3. Rename pending_bytes to pending_metadata_bytes since this is what
    it tracks now.
  4. Adapting of various utility methods like len,
    approx_pending_bytes and has_dirty_data_pages.
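A hedged sketch of how the renamed accounting might fit together; apart from the names mentioned above, the fields and method bodies are invented for illustration:

```rust
// Illustrative sketch: metadata writes are tracked separately from the
// serialized data batch, and the utility methods combine the two.
struct DatadirModification {
    pending_metadata_bytes: usize,   // renamed from pending_bytes in this PR
    pending_data_batch_bytes: usize, // raw length of the pending batch (assumption)
}

impl DatadirModification {
    fn approx_pending_bytes(&self) -> usize {
        // Approximate: the two counters are maintained independently.
        self.pending_metadata_bytes + self.pending_data_batch_bytes
    }

    fn has_dirty_data(&self) -> bool {
        self.pending_data_batch_bytes > 0
    }
}

fn main() {
    let m = DatadirModification {
        pending_metadata_bytes: 128,
        pending_data_batch_bytes: 4096,
    };
    assert_eq!(m.approx_pending_bytes(), 4224);
    assert!(m.has_dirty_data());
    println!("{}", m.approx_pending_bytes());
}
```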

Removal of pending_zero_data_pages and the optimisation associated
with it ((1) and (2)) deserves more detail.

Previously all zero data pages went through pending_zero_data_pages.
We wrote zero data pages when filling gaps caused by relation extension
(case A) and when handling special wal records (case B). If it happened
that the same WAL record contained a non zero write for an entry in
pending_zero_data_pages we skipped the zero write.

Case A: We handle this differently now. When ingesting the SerializedValueBatch
associated with one PG WAL record, we identify the gaps and fill them in one go.
Essentially, we move from a per-key process (gaps were filled after each new key)
to a per-record process. Hence, the optimisation is not required anymore.
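The per-record gap computation can be sketched as follows. The function name and signature are assumptions for illustration, not the PR's actual API, but the idea matches the description above: given the relation's old size and the set of blocks one record wrote, compute every block that must be zero-filled, once for the whole record:

```rust
use std::collections::BTreeSet;

/// Illustrative sketch: blocks between the old relation size and the highest
/// block written by this record that the record itself did not write need a
/// zero page.
fn gaps_to_zero_fill(old_nblocks: u32, written: &BTreeSet<u32>) -> Vec<u32> {
    let Some(&max_written) = written.iter().next_back() else {
        return Vec::new(); // the record wrote nothing, so there are no gaps
    };
    (old_nblocks..=max_written)
        .filter(|b| !written.contains(b))
        .collect()
}

fn main() {
    // Relation had 3 blocks (0..=2); one record writes blocks 5 and 7.
    let written: BTreeSet<u32> = [5, 7].into_iter().collect();
    let gaps = gaps_to_zero_fill(3, &written);
    assert_eq!(gaps, vec![3, 4, 6]); // filled in one go for the whole record
    println!("{:?}", gaps);
}
```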

Case B: When the handling of a special record needs to zero out a key,
it just adds that to the current batch. I inspected the code, and I
don't think the optimisation kicked in here.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

github-actions bot commented Oct 30, 2024

5354 tests run: 5132 passed, 0 failed, 222 skipped (full report)


Flaky tests (2)

Postgres 17

Code coverage* (full report)

  • functions: 31.6% (7810 of 24733 functions)
  • lines: 49.2% (61658 of 125363 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
95df95a at 2024-11-01T19:12:09.642Z

@VladLazar VladLazar force-pushed the vlad/raw-buf-from-decoder branch 6 times, most recently from 241f79d to 4f13b27 on October 31, 2024 10:32
@VladLazar VladLazar changed the title from "Vlad/raw buf from decoder" to "pageserver: ingest pre-serialized batches of values" on Oct 31, 2024
@VladLazar VladLazar force-pushed the vlad/raw-buf-from-decoder branch from 4f13b27 to 940aa58 on October 31, 2024 10:47
Base automatically changed from vlad/move-decoding-logic-to-decoder-module to main on October 31, 2024 10:47
@VladLazar VladLazar force-pushed the vlad/raw-buf-from-decoder branch from 940aa58 to 3659ea3 on October 31, 2024 10:50
@VladLazar VladLazar marked this pull request as ready for review on October 31, 2024 10:52
@VladLazar VladLazar requested a review from a team as a code owner on October 31, 2024 10:52
This will replace inmemory_layer::SerializedBatch and also be
present in the wire format between safekeepers and pageservers.
Little refactor to clean-up for the next commit.
This is a bigger bite than I would have liked, but this stuff is
tightly coupled.

@VladLazar VladLazar force-pushed the vlad/raw-buf-from-decoder branch from 3659ea3 to 2e3c703 on October 31, 2024 11:39
@VladLazar VladLazar (Contributor, Author) commented Oct 31, 2024

Later edit: re-benchmarked everything after I pushed some optimisations.
test_bulk_insert WAL ingestion is a hacked-up version of test_ingest_real_wal which
ingests the WAL from the test_bulk_insert Python perf test. There's some significant improvement there,
but otherwise things remain pretty much in line with main.

main @ https://github.com/neondatabase/neon/commit/411c3aa0d62a4d8a2e18b43dc03b677bf1969d66

### Rust ingest bench

ingest-small-values/ingest 128MB/100b seq
                        time:   [1.2113 s 1.2141 s 1.2170 s]
                        thrpt:  [105.18 MiB/s 105.43 MiB/s 105.67 MiB/s]
ingest-small-values/ingest 128MB/100b rand
                        time:   [1.6820 s 1.6850 s 1.6883 s]
                        thrpt:  [75.815 MiB/s 75.963 MiB/s 76.102 MiB/s]
ingest-small-values/ingest 128MB/100b rand-1024keys
                        time:   [1.0901 s 1.0940 s 1.0984 s]
                        thrpt:  [116.53 MiB/s 117.00 MiB/s 117.43 MiB/s]
ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [377.23 ms 377.53 ms 377.84 ms]
                        thrpt:  [338.76 MiB/s 339.05 MiB/s 339.32 MiB/s]
ingest-big-values/ingest 128MB/8k seq
                        time:   [344.30 ms 345.69 ms 347.09 ms]
                        thrpt:  [368.78 MiB/s 370.28 MiB/s 371.77 MiB/s]
ingest-big-values/ingest 128MB/8k seq, no delta
                        time:   [111.97 ms 112.24 ms 112.71 ms]
                        thrpt:  [1.1091 GiB/s 1.1137 GiB/s 1.1163 GiB/s]

ingest-small-values/ingest 128MB/100b seq
                        time:   [1.1245 s 1.1560 s 1.1887 s]
                        thrpt:  [107.68 MiB/s 110.73 MiB/s 113.83 MiB/s]
ingest-small-values/ingest 128MB/100b rand
                        time:   [1.6656 s 1.6723 s 1.6777 s]
                        thrpt:  [76.294 MiB/s 76.543 MiB/s 76.850 MiB/s]
ingest-small-values/ingest 128MB/100b rand-1024keys
                        time:   [1.1204 s 1.1238 s 1.1281 s]
                        thrpt:  [113.47 MiB/s 113.90 MiB/s 114.25 MiB/s]
ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [386.98 ms 387.41 ms 387.84 ms]
                        thrpt:  [330.03 MiB/s 330.40 MiB/s 330.77 MiB/s]
ingest-big-values/ingest 128MB/8k seq
                        time:   [368.86 ms 369.41 ms 369.93 ms]
                        thrpt:  [346.01 MiB/s 346.50 MiB/s 347.01 MiB/s]
ingest-big-values/ingest 128MB/8k seq, no delta
                        time:   [129.26 ms 129.56 ms 130.03 ms]
                        thrpt:  [984.35 MiB/s 987.96 MiB/s 990.22 MiB/s]

### test_bulk_insert

test_bulk_insert[vanilla-release-pg16].insert: 23.096 s
test_bulk_insert[vanilla-release-pg16].data_size: 713 MB
test_bulk_insert[vanilla-release-pg16].wal_size: 608 MB
test_bulk_insert[vanilla-release-pg16].wal_written: 1,377 MB
test_bulk_insert[vanilla-release-pg16].compaction: 0.000 s
test_bulk_insert[neon-release-pg16].insert: 23.742 s
test_bulk_insert[neon-release-pg16].pageserver_writes: 5,409 MB
test_bulk_insert[neon-release-pg16].peak_mem: 609 MB
test_bulk_insert[neon-release-pg16].size: 0 MB
test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB
test_bulk_insert[neon-release-pg16].num_files_uploaded: 8
test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB
test_bulk_insert[neon-release-pg16].wal_recovery: 14.874 s
test_bulk_insert[neon-release-pg16].compaction: 0.029 s

### test_sharded_ingest

test_sharded_ingest[release-pg16].wal_ingest: 117.173 s
test_sharded_ingest[release-pg16].pageserver_ingest: 117.285 s
test_sharded_ingest[release-pg16].wal_written: 6,911 MB

### test_bulk_insert WAL ingestion

done in 31.614424216s

origin/vlad/raw-buf-from-decoder @ https://github.com/neondatabase/neon/pull/9579/commits/95df95a81ae8caa819cb3bd2ba95367a27c64f2f

### Rust ingest bench

ingest-big-values/ingest 128MB/8k seq
                        time:   [343.11 ms 344.16 ms 345.20 ms]
                        thrpt:  [370.80 MiB/s 371.92 MiB/s 373.06 MiB/s]
ingest-big-values/ingest 128MB/8k seq, no delta
                        time:   [120.81 ms 125.04 ms 129.02 ms]
                        thrpt:  [992.12 MiB/s 1023.7 MiB/s 1.0347 GiB/s]

ingest-small-values/ingest 128MB/100b seq
                        time:   [1.1041 s 1.1149 s 1.1255 s]
                        thrpt:  [113.73 MiB/s 114.81 MiB/s 115.94 MiB/s]
ingest-small-values/ingest 128MB/100b rand
                        time:   [1.8446 s 1.8478 s 1.8510 s]
                        thrpt:  [69.152 MiB/s 69.273 MiB/s 69.392 MiB/s]
ingest-small-values/ingest 128MB/100b rand-1024keys
                        time:   [1.1174 s 1.1210 s 1.1246 s]
                        thrpt:  [113.82 MiB/s 114.18 MiB/s 114.56 MiB/s]
ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [384.06 ms 384.94 ms 386.24 ms]
                        thrpt:  [331.40 MiB/s 332.52 MiB/s 333.28 MiB/s]
ingest-big-values/ingest 128MB/8k seq
                        time:   [350.69 ms 353.71 ms 356.89 ms]
                        thrpt:  [358.65 MiB/s 361.88 MiB/s 364.99 MiB/s]
ingest-big-values/ingest 128MB/8k seq, no delta
                        time:   [117.31 ms 117.49 ms 117.84 ms]
                        thrpt:  [1.0608 GiB/s 1.0639 GiB/s 1.0656 GiB/s]

### test_bulk_insert

test_bulk_insert[vanilla-release-pg16].insert: 21.765 s
test_bulk_insert[vanilla-release-pg16].data_size: 713 MB
test_bulk_insert[vanilla-release-pg16].wal_size: 608 MB
test_bulk_insert[vanilla-release-pg16].wal_written: 1,377 MB
test_bulk_insert[vanilla-release-pg16].compaction: 0.000 s
test_bulk_insert[neon-release-pg16].insert: 23.638 s
test_bulk_insert[neon-release-pg16].pageserver_writes: 5,409 MB
test_bulk_insert[neon-release-pg16].peak_mem: 589 MB
test_bulk_insert[neon-release-pg16].size: 0 MB
test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB
test_bulk_insert[neon-release-pg16].num_files_uploaded: 8
test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB
test_bulk_insert[neon-release-pg16].wal_recovery: 15.503 s
test_bulk_insert[neon-release-pg16].compaction: 0.030 s

### test_sharded_ingest

test_sharded_ingest[release-pg16-1].wal_ingest: 120.262 s
test_sharded_ingest[release-pg16-1].pageserver_ingest: 120.375 s
test_sharded_ingest[release-pg16-1].wal_written: 6,911 MB
test_sharded_ingest[release-pg16-8].wal_ingest: 119.777 s
test_sharded_ingest[release-pg16-8].pageserver_ingest: 119.901 s
test_sharded_ingest[release-pg16-8].wal_written: 6,911 MB
test_sharded_ingest[release-pg16-32].wal_ingest: 139.059 s
test_sharded_ingest[release-pg16-32].pageserver_ingest: 139.122 s
test_sharded_ingest[release-pg16-32].wal_written: 6,911 MB

### test_bulk_insert WAL ingestion

done in 28.382027011s

@erikgrinaker erikgrinaker (Contributor) left a comment:
This appears fine to me, but the changes to pgdatadir_mapping.rs and walingest.rs are worth another look by someone more intimately familiar with this.

Resolved review threads:

  • libs/wal_decoder/src/decoder.rs
  • libs/wal_decoder/src/serialized_batch.rs (4 threads)
  • libs/wal_decoder/src/models.rs
  • pageserver/src/pgdatadir_mapping.rs
@VladLazar VladLazar requested a review from jcsp November 1, 2024 12:47
@skyzh skyzh (Member) left a comment:
Feel free to address the comments, or maybe it doesn't matter, since SerializedValueBatch::default looks like a light operation (a zero-sized Vec doesn't allocate extra memory?)
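The reviewer's parenthetical can be checked directly: Rust's standard library guarantees that Vec::new performs no heap allocation, so a default batch built from empty vectors is indeed cheap to construct. A tiny sketch:

```rust
fn main() {
    // Vec::new (and therefore Vec::default) does not allocate;
    // capacity stays 0 until the first push.
    let v: Vec<u8> = Vec::new();
    assert_eq!(v.capacity(), 0);
    println!("capacity before any push: {}", v.capacity());
}
```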

Resolved review threads:

  • pageserver/src/pgdatadir_mapping.rs (3 threads)
@VladLazar VladLazar (Contributor, Author) commented:
Benchmarked things again after I added some optimisations (results here). I think this is good to merge. Pending an okay from John for the DatadirModification changes.

@VladLazar VladLazar merged commit 4dfa0c2 into main Nov 6, 2024
81 checks passed
@VladLazar VladLazar deleted the vlad/raw-buf-from-decoder branch November 6, 2024 14:10