
pageserver: ingest pre-serialized batches of values #9579

Merged

merged 14 commits into main from vlad/raw-buf-from-decoder on Nov 6, 2024

Conversation

@VladLazar VladLazar (Contributor) commented Oct 30, 2024

Problem

#9524 split the decoding and interpretation step from ingestion.
The output of the first phase is a wal_decoder::models::InterpretedWalRecord. Before this patch set that struct
contained a list of Value instances.

We wish to lift the decoding and interpretation step to the safekeeper, but it would be nice if the safekeeper
gave us a batch containing the raw data instead of actual values.

Summary of changes

Main goal here is to make InterpretedWalRecord hold a raw buffer which contains pre-serialized Values.
For this we do:

  1. Add a SerializedValueBatch type. This is inmemory_layer::SerializedBatch plus extra functionality for extending batches, for observing values on shard 0, and for tests.
  2. Replace inmemory_layer::SerializedBatch with SerializedValueBatch.
  3. Make DatadirModification maintain a SerializedValueBatch.
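To make the shape concrete, here is a minimal, hypothetical Rust sketch of a batch of pre-serialized values. The field names, and the use of plain integers for keys and LSNs, are assumptions for illustration, not the actual types from this PR:

```rust
// Illustrative sketch only: a batch that stores values pre-serialized in one
// raw buffer, with per-value metadata pointing into that buffer.
#[derive(Default)]
struct ValueMeta {
    key: u64,            // stand-in for the real key type (assumption)
    lsn: u64,            // stand-in for the real Lsn type (assumption)
    batch_offset: usize, // where this value's bytes start in `raw`
    len: usize,          // how many bytes the serialized value occupies
    will_init: bool,     // whether the value initializes its page
}

#[derive(Default)]
struct SerializedValueBatch {
    raw: Vec<u8>,             // all serialized values, back to back
    metadata: Vec<ValueMeta>, // one entry per value
    max_lsn: u64,
}

impl SerializedValueBatch {
    /// Append one already-serialized value to the batch.
    fn put(&mut self, key: u64, lsn: u64, value: &[u8], will_init: bool) {
        let batch_offset = self.raw.len();
        self.raw.extend_from_slice(value);
        self.metadata.push(ValueMeta { key, lsn, batch_offset, len: value.len(), will_init });
        self.max_lsn = self.max_lsn.max(lsn);
    }

    /// Extend this batch with another one, shifting the other's offsets.
    fn extend(&mut self, mut other: SerializedValueBatch) {
        let base = self.raw.len();
        self.raw.append(&mut other.raw);
        for m in &mut other.metadata {
            m.batch_offset += base;
        }
        self.metadata.append(&mut other.metadata);
        self.max_lsn = self.max_lsn.max(other.max_lsn);
    }

    fn len(&self) -> usize {
        self.metadata.len()
    }
}

fn main() {
    let mut a = SerializedValueBatch::default();
    a.put(1, 10, b"hello", true);
    let mut b = SerializedValueBatch::default();
    b.put(2, 20, b"world", false);
    a.extend(b);
    assert_eq!(a.len(), 2);
    assert_eq!(a.metadata[1].batch_offset, 5); // "world" starts after "hello"
    assert_eq!(a.max_lsn, 20);
    println!("batch holds {} values, {} raw bytes", a.len(), a.raw.len());
}
```

The point of the raw-buffer layout is that the safekeeper can eventually ship the buffer over the wire and the pageserver can ingest it without materializing each value individually.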

DatadirModification changes

DatadirModification now maintains a SerializedValueBatch and extends
it as new WAL records come in (to avoid flushing to disk on every record).
In turn, this cascaded into a number of modifications to DatadirModification:

  1. Replace pending_data_pages and pending_zero_data_pages with
    pending_data_batch.
  2. Removal of pending_zero_data_pages and its cousin
    on_wal_record_end
  3. Rename pending_bytes to pending_metadata_bytes since this is what
    it tracks now.
  4. Adapting of various utility methods like len,
    approx_pending_bytes and has_dirty_data_pages.
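A hedged sketch of how the renamed accounting might fit together; apart from the names mentioned above, the fields and method bodies are invented for illustration:

```rust
// Illustrative sketch: metadata writes are tracked separately from the
// serialized data batch, and the utility methods combine the two.
struct DatadirModification {
    pending_metadata_bytes: usize,   // renamed from pending_bytes in this PR
    pending_data_batch_bytes: usize, // raw length of the pending batch (assumption)
}

impl DatadirModification {
    fn approx_pending_bytes(&self) -> usize {
        // Approximate: the two counters are maintained independently.
        self.pending_metadata_bytes + self.pending_data_batch_bytes
    }

    fn has_dirty_data(&self) -> bool {
        self.pending_data_batch_bytes > 0
    }
}

fn main() {
    let m = DatadirModification {
        pending_metadata_bytes: 128,
        pending_data_batch_bytes: 4096,
    };
    assert_eq!(m.approx_pending_bytes(), 4224);
    assert!(m.has_dirty_data());
    println!("{}", m.approx_pending_bytes());
}
```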

Removal of pending_zero_data_pages and the optimisation associated
with it ((1) and (2)) deserves more detail.

Previously all zero data pages went through pending_zero_data_pages.
We wrote zero data pages when filling gaps caused by relation extension
(case A) and when handling special wal records (case B). If it happened
that the same WAL record contained a non zero write for an entry in
pending_zero_data_pages we skipped the zero write.

Case A: We handle this differently now. When ingesting the SerializedValueBatch
associated with one PG WAL record, we identify the gaps and fill them in one go.
Essentially, we move from a per-key process (gaps were filled after each new key)
to a per-record process. Hence, the optimisation is not required anymore.
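The per-record gap computation can be sketched as follows. The function name and signature are assumptions for illustration, not the PR's actual API, but the idea matches the description above: given the relation's old size and the set of blocks one record wrote, compute every block that must be zero-filled, once for the whole record:

```rust
use std::collections::BTreeSet;

/// Illustrative sketch: blocks between the old relation size and the highest
/// block written by this record that the record itself did not write need a
/// zero page.
fn gaps_to_zero_fill(old_nblocks: u32, written: &BTreeSet<u32>) -> Vec<u32> {
    let Some(&max_written) = written.iter().next_back() else {
        return Vec::new(); // the record wrote nothing, so there are no gaps
    };
    (old_nblocks..=max_written)
        .filter(|b| !written.contains(b))
        .collect()
}

fn main() {
    // Relation had 3 blocks (0..=2); one record writes blocks 5 and 7.
    let written: BTreeSet<u32> = [5, 7].into_iter().collect();
    let gaps = gaps_to_zero_fill(3, &written);
    assert_eq!(gaps, vec![3, 4, 6]); // filled in one go for the whole record
    println!("{:?}", gaps);
}
```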

Case B: When the handling of a special record needs to zero out a key,
it just adds that to the current batch. I inspected the code, and I
don't think the optimisation kicked in here.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

github-actions bot commented Oct 30, 2024

5354 tests run: 5132 passed, 0 failed, 222 skipped (full report)


Flaky tests (2)

Postgres 17

Code coverage* (full report)

  • functions: 31.6% (7810 of 24733 functions)
  • lines: 49.2% (61658 of 125363 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
95df95a at 2024-11-01T19:12:09.642Z

@VladLazar VladLazar force-pushed the vlad/raw-buf-from-decoder branch 6 times, most recently from 241f79d to 4f13b27 on October 31, 2024 10:32
@VladLazar VladLazar changed the title from "Vlad/raw buf from decoder" to "pageserver: ingest pre-serialized batches of values" on Oct 31, 2024
@VladLazar VladLazar force-pushed the vlad/raw-buf-from-decoder branch from 4f13b27 to 940aa58 on October 31, 2024 10:47
Base automatically changed from vlad/move-decoding-logic-to-decoder-module to main on October 31, 2024 10:47
@VladLazar VladLazar force-pushed the vlad/raw-buf-from-decoder branch from 940aa58 to 3659ea3 on October 31, 2024 10:50
@VladLazar VladLazar marked this pull request as ready for review on October 31, 2024 10:52
@VladLazar VladLazar requested a review from a team as a code owner on October 31, 2024 10:52
This will replace inmemory_layer::SerializedBatch and also be
present in the wire format between safekeepers and pageservers.
Little refactor to clean-up for the next commit.
This is a bigger bite than I would have liked, but this stuff is
tightly coupled.

@VladLazar VladLazar force-pushed the vlad/raw-buf-from-decoder branch from 3659ea3 to 2e3c703 on October 31, 2024 11:39
@VladLazar VladLazar (Contributor, Author) commented Oct 31, 2024

Later edit: re-benchmarked everything after I pushed some optimisations.
test_bulk_insert WAL ingestion is a hacked-up version of test_ingest_real_wal which
ingests the WAL from the test_bulk_insert Python perf test. There's some significant improvement there,
but otherwise things remain pretty much in line with main.

main @ https://github.com/neondatabase/neon/commit/411c3aa0d62a4d8a2e18b43dc03b677bf1969d66

### Rust ingest bench

ingest-small-values/ingest 128MB/100b seq
                        time:   [1.2113 s 1.2141 s 1.2170 s]
                        thrpt:  [105.18 MiB/s 105.43 MiB/s 105.67 MiB/s]
ingest-small-values/ingest 128MB/100b rand
                        time:   [1.6820 s 1.6850 s 1.6883 s]
                        thrpt:  [75.815 MiB/s 75.963 MiB/s 76.102 MiB/s]
ingest-small-values/ingest 128MB/100b rand-1024keys
                        time:   [1.0901 s 1.0940 s 1.0984 s]
                        thrpt:  [116.53 MiB/s 117.00 MiB/s 117.43 MiB/s]
ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [377.23 ms 377.53 ms 377.84 ms]
                        thrpt:  [338.76 MiB/s 339.05 MiB/s 339.32 MiB/s]
ingest-big-values/ingest 128MB/8k seq
                        time:   [344.30 ms 345.69 ms 347.09 ms]
                        thrpt:  [368.78 MiB/s 370.28 MiB/s 371.77 MiB/s]
ingest-big-values/ingest 128MB/8k seq, no delta
                        time:   [111.97 ms 112.24 ms 112.71 ms]
                        thrpt:  [1.1091 GiB/s 1.1137 GiB/s 1.1163 GiB/s]

ingest-small-values/ingest 128MB/100b seq
                        time:   [1.1245 s 1.1560 s 1.1887 s]
                        thrpt:  [107.68 MiB/s 110.73 MiB/s 113.83 MiB/s]
ingest-small-values/ingest 128MB/100b rand
                        time:   [1.6656 s 1.6723 s 1.6777 s]
                        thrpt:  [76.294 MiB/s 76.543 MiB/s 76.850 MiB/s]
ingest-small-values/ingest 128MB/100b rand-1024keys
                        time:   [1.1204 s 1.1238 s 1.1281 s]
                        thrpt:  [113.47 MiB/s 113.90 MiB/s 114.25 MiB/s]
ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [386.98 ms 387.41 ms 387.84 ms]
                        thrpt:  [330.03 MiB/s 330.40 MiB/s 330.77 MiB/s]
ingest-big-values/ingest 128MB/8k seq
                        time:   [368.86 ms 369.41 ms 369.93 ms]
                        thrpt:  [346.01 MiB/s 346.50 MiB/s 347.01 MiB/s]
ingest-big-values/ingest 128MB/8k seq, no delta
                        time:   [129.26 ms 129.56 ms 130.03 ms]
                        thrpt:  [984.35 MiB/s 987.96 MiB/s 990.22 MiB/s]

### test_bulk_insert

test_bulk_insert[vanilla-release-pg16].insert: 23.096 s
test_bulk_insert[vanilla-release-pg16].data_size: 713 MB
test_bulk_insert[vanilla-release-pg16].wal_size: 608 MB
test_bulk_insert[vanilla-release-pg16].wal_written: 1,377 MB
test_bulk_insert[vanilla-release-pg16].compaction: 0.000 s
test_bulk_insert[neon-release-pg16].insert: 23.742 s
test_bulk_insert[neon-release-pg16].pageserver_writes: 5,409 MB
test_bulk_insert[neon-release-pg16].peak_mem: 609 MB
test_bulk_insert[neon-release-pg16].size: 0 MB
test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB
test_bulk_insert[neon-release-pg16].num_files_uploaded: 8
test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB
test_bulk_insert[neon-release-pg16].wal_recovery: 14.874 s
test_bulk_insert[neon-release-pg16].compaction: 0.029 s

### test_sharded_ingest

test_sharded_ingest[release-pg16].wal_ingest: 117.173 s
test_sharded_ingest[release-pg16].pageserver_ingest: 117.285 s
test_sharded_ingest[release-pg16].wal_written: 6,911 MB

### test_bulk_insert WAL ingestion

done in 31.614424216s

origin/vlad/raw-buf-from-decoder @ https://github.com/neondatabase/neon/pull/9579/commits/95df95a81ae8caa819cb3bd2ba95367a27c64f2f

### Rust ingest bench

ingest-big-values/ingest 128MB/8k seq
                        time:   [343.11 ms 344.16 ms 345.20 ms]
                        thrpt:  [370.80 MiB/s 371.92 MiB/s 373.06 MiB/s]
ingest-big-values/ingest 128MB/8k seq, no delta
                        time:   [120.81 ms 125.04 ms 129.02 ms]
                        thrpt:  [992.12 MiB/s 1023.7 MiB/s 1.0347 GiB/s]

ingest-small-values/ingest 128MB/100b seq
                        time:   [1.1041 s 1.1149 s 1.1255 s]
                        thrpt:  [113.73 MiB/s 114.81 MiB/s 115.94 MiB/s]
ingest-small-values/ingest 128MB/100b rand
                        time:   [1.8446 s 1.8478 s 1.8510 s]
                        thrpt:  [69.152 MiB/s 69.273 MiB/s 69.392 MiB/s]
ingest-small-values/ingest 128MB/100b rand-1024keys
                        time:   [1.1174 s 1.1210 s 1.1246 s]
                        thrpt:  [113.82 MiB/s 114.18 MiB/s 114.56 MiB/s]
ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [384.06 ms 384.94 ms 386.24 ms]
                        thrpt:  [331.40 MiB/s 332.52 MiB/s 333.28 MiB/s]
ingest-big-values/ingest 128MB/8k seq
                        time:   [350.69 ms 353.71 ms 356.89 ms]
                        thrpt:  [358.65 MiB/s 361.88 MiB/s 364.99 MiB/s]
ingest-big-values/ingest 128MB/8k seq, no delta
                        time:   [117.31 ms 117.49 ms 117.84 ms]
                        thrpt:  [1.0608 GiB/s 1.0639 GiB/s 1.0656 GiB/s]

### test_bulk_insert

test_bulk_insert[vanilla-release-pg16].insert: 21.765 s
test_bulk_insert[vanilla-release-pg16].data_size: 713 MB
test_bulk_insert[vanilla-release-pg16].wal_size: 608 MB
test_bulk_insert[vanilla-release-pg16].wal_written: 1,377 MB
test_bulk_insert[vanilla-release-pg16].compaction: 0.000 s
test_bulk_insert[neon-release-pg16].insert: 23.638 s
test_bulk_insert[neon-release-pg16].pageserver_writes: 5,409 MB
test_bulk_insert[neon-release-pg16].peak_mem: 589 MB
test_bulk_insert[neon-release-pg16].size: 0 MB
test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB
test_bulk_insert[neon-release-pg16].num_files_uploaded: 8
test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB
test_bulk_insert[neon-release-pg16].wal_recovery: 15.503 s
test_bulk_insert[neon-release-pg16].compaction: 0.030 s

### test_sharded_ingest

test_sharded_ingest[release-pg16-1].wal_ingest: 120.262 s
test_sharded_ingest[release-pg16-1].pageserver_ingest: 120.375 s
test_sharded_ingest[release-pg16-1].wal_written: 6,911 MB
test_sharded_ingest[release-pg16-8].wal_ingest: 119.777 s
test_sharded_ingest[release-pg16-8].pageserver_ingest: 119.901 s
test_sharded_ingest[release-pg16-8].wal_written: 6,911 MB
test_sharded_ingest[release-pg16-32].wal_ingest: 139.059 s
test_sharded_ingest[release-pg16-32].pageserver_ingest: 139.122 s
test_sharded_ingest[release-pg16-32].wal_written: 6,911 MB

### test_bulk_insert WAL ingestion

done in 28.382027011s

@erikgrinaker erikgrinaker (Contributor) left a comment:
This appears fine to me, but the changes to pgdatadir_mapping.rs and walingest.rs are worth another look by someone more intimately familiar with this.

Resolved review threads:

  • libs/wal_decoder/src/decoder.rs
  • libs/wal_decoder/src/serialized_batch.rs (4 threads)
  • libs/wal_decoder/src/models.rs
  • pageserver/src/pgdatadir_mapping.rs
@VladLazar VladLazar requested a review from jcsp November 1, 2024 12:47
@skyzh skyzh (Member) left a comment:
Feel free to address the comments, or maybe it doesn't matter, since SerializedValueBatch::default looks like a light operation (a zero-sized Vec doesn't allocate extra memory?)
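The reviewer's parenthetical can be checked directly: Rust's standard library guarantees that Vec::new performs no heap allocation, so a default batch built from empty vectors is indeed cheap to construct. A tiny sketch:

```rust
fn main() {
    // Vec::new (and therefore Vec::default) does not allocate;
    // capacity stays 0 until the first push.
    let v: Vec<u8> = Vec::new();
    assert_eq!(v.capacity(), 0);
    println!("capacity before any push: {}", v.capacity());
}
```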

Resolved review threads:

  • pageserver/src/pgdatadir_mapping.rs (3 threads)
@VladLazar VladLazar (Contributor, Author) commented:
Benchmarked things again after I added some optimisations (results here). I think this is good to merge. Pending an okay from John for the DatadirModification changes.

@VladLazar VladLazar merged commit 4dfa0c2 into main Nov 6, 2024
81 checks passed
@VladLazar VladLazar deleted the vlad/raw-buf-from-decoder branch November 6, 2024 14:10