pageserver: lift decoding and interpreting of wal into wal_decoder #9524
Conversation
5354 tests run: 5132 passed, 0 failed, 222 skipped (full report).
Code coverage: collected from Rust tests only (full report).
The comment gets automatically updated with the latest test results: c5b2c78 at 2024-10-31T10:44:37.184Z.
A new type, `InterpretedWalRecord`, is added to `wal_decoder::models`. It contains everything the pageserver requires in order to ingest a WAL record. The highlights are `metadata_record`, an optional special record type that needs dedicated handling, and `blocks`, which stores the key/value pairs to be persisted to storage. This type is produced from a raw PG WAL record by `wal_decoder::models::InterpretedWalRecord::from_bytes`. The rest of this commit separates decoding and interpretation of the PG WAL record from its application in `WalIngest::ingest_record`.
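For orientation, a heavily simplified sketch of the shape described above. Field names (`metadata_record`, `blocks`) come from the description; all types and the placeholder `from_bytes` body are illustrative, not the actual `wal_decoder` code:

```rust
/// Special record types that need dedicated handling during ingest.
/// Variants elided in this sketch.
#[derive(Debug)]
enum MetadataRecord {}

/// One key/value pair destined for storage. Illustrative types only.
#[derive(Debug)]
struct BlockWrite {
    key: u64,
    lsn: u64,
    value: Vec<u8>,
}

/// Everything the pageserver needs in order to ingest one WAL record.
#[derive(Debug, Default)]
struct InterpretedWalRecord {
    /// Optional special record requiring dedicated handling.
    metadata_record: Option<MetadataRecord>,
    /// Key/value pairs to be persisted to storage.
    blocks: Vec<BlockWrite>,
}

impl InterpretedWalRecord {
    /// Stand-in for the decode-and-interpret entry point: the real
    /// `from_bytes` parses a raw PG WAL record; this placeholder only
    /// illustrates the shape of the output.
    fn from_bytes(raw: &[u8]) -> InterpretedWalRecord {
        let _ = raw;
        InterpretedWalRecord::default()
    }
}
```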
Flushing an initial high-level review. Regardless of the ingest sharding, this is a nice cleanup.
I don't see any added test coverage in `wal_decoder`. I assume we have some coverage in `walingest.rs`, but is it worth moving some of it over to `wal_decoder` and improving it? Can be deferred until a later PR.
I assume most of the `MetadataRecord::from_decoded()` code was just a simple move? Anything in particular that needs a closer review?
I agree. Unit testing for decoding is lacking, but that was the case before as well. The best coverage we have right now is from the integration pg regress tests. Decoding is a good candidate for fuzzing, but we don't have the right utils in place and I don't wish to sink too much time into it right now. Or has that changed with your WAL generation work?
Yup, just a move. Nothing comes to mind; this PR is largely just moving things around and adding types.
LGTM modulo comments.
> The best coverage we have right now is from the integration pg regress tests. Decoding is a good candidate for fuzzing, but we don't have the right utils in place and I don't wish to sink too much time into it right now. Or has that changed with your WAL generation work?
Nah, nothing there that would really help right now.
Since the input comes from Postgres, which we can presumably trust to generally produce well-formed WAL, I guess the main concern is to handle the happy path -- and we get lots of incidental coverage from anything that writes WAL. Fine to punt this for later.
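For context, a minimal sketch of the property such a fuzz target would eventually assert: arbitrary input bytes may be rejected, but decoding must never panic. The `toy_decode` stand-in and its 4-byte length header are invented for illustration; this is not the project's decoder or fuzzing setup:

```rust
use std::panic;

/// Toy decoder: treats the first 4 bytes as a little-endian payload
/// length. Malformed input yields `Err`, never a panic.
fn toy_decode(raw: &[u8]) -> Result<u32, &'static str> {
    let hdr: [u8; 4] = raw
        .get(..4)
        .ok_or("truncated header")?
        .try_into()
        .map_err(|_| "truncated header")?;
    let len = u32::from_le_bytes(hdr) as usize;
    if raw.len() < 4 + len {
        return Err("truncated payload");
    }
    Ok(len as u32)
}

/// One fuzz iteration: feed arbitrary bytes in, require that the decoder
/// returns (Ok or Err) rather than panicking. A cargo-fuzz target would
/// wrap essentially this check.
fn fuzz_one(input: &[u8]) -> bool {
    panic::catch_unwind(|| {
        let _ = toy_decode(input);
    })
    .is_ok()
}
```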
This reverts commit 138f3bc.
Overall LGTM :)
## Problem

#9524 split the decoding and interpretation step from ingestion. The output of the first phase is a `wal_decoder::models::InterpretedWalRecord`. Before this patch set, that struct contained a list of `Value` instances. We wish to lift the decoding and interpretation step to the safekeeper, but it would be nice if the safekeeper gave us a batch containing the raw data instead of actual values.

## Summary of changes

The main goal here is to make `InterpretedWalRecord` hold a raw buffer which contains pre-serialized Values. For this we:

1. Add a `SerializedValueBatch` type. This is `inmemory_layer::SerializedBatch` with some extra functionality for extension, observing values for shard 0, and tests.
2. Replace `inmemory_layer::SerializedBatch` with `SerializedValueBatch`.
3. Make `DatadirModification` maintain a `SerializedValueBatch`.

### `DatadirModification` changes

`DatadirModification` now maintains a `SerializedValueBatch` and extends it as new WAL records come in (to avoid flushing to disk on every record). In turn, this cascaded into a number of modifications to `DatadirModification`:

1. Replace `pending_data_pages` and `pending_zero_data_pages` with `pending_data_batch`.
2. Remove `pending_zero_data_pages` and its cousin `on_wal_record_end`.
3. Rename `pending_bytes` to `pending_metadata_bytes`, since this is what it tracks now.
4. Adapt various utility methods like `len`, `approx_pending_bytes`, and `has_dirty_data_pages`.

The removal of `pending_zero_data_pages` and the optimisation associated with it ((1) and (2)) deserves more detail. Previously, all zero data pages went through `pending_zero_data_pages`. We wrote zero data pages when filling gaps caused by relation extension (case A) and when handling special WAL records (case B). If the same WAL record happened to contain a non-zero write for an entry in `pending_zero_data_pages`, we skipped the zero write.

Case A: We handle this differently now. When ingesting the `SerializedValueBatch` associated with one PG WAL record, we identify the gaps and fill them in one go. Essentially, we move from a per-key process (gaps were filled after each new key) to a per-record process. Hence, the optimisation is not required anymore.

Case B: When the handling of a special record needs to zero out a key, it just adds that to the current batch. I inspected the code, and I don't think the optimisation kicked in here.
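To make the batch idea concrete, here is a toy sketch of a batch holding pre-serialized values in one raw buffer, with a metadata vector recording where each value lives. All names, fields, and methods are invented for illustration; the real `SerializedValueBatch` is richer:

```rust
/// Toy batch of pre-serialized values. `raw` holds the serialized bytes
/// back to back; `metadata` records (key, lsn, offset, len) per value.
#[derive(Default)]
struct SerializedValueBatch {
    raw: Vec<u8>,
    metadata: Vec<(u64, u64, usize, usize)>,
}

impl SerializedValueBatch {
    /// Serialize one value into the shared buffer and record its location.
    fn put(&mut self, key: u64, lsn: u64, value: &[u8]) {
        let offset = self.raw.len();
        self.raw.extend_from_slice(value);
        self.metadata.push((key, lsn, offset, value.len()));
    }

    /// Extend this batch with another one, fixing up buffer offsets —
    /// the kind of accumulation `DatadirModification` does between flushes.
    fn extend(&mut self, other: SerializedValueBatch) {
        let base = self.raw.len();
        self.raw.extend_from_slice(&other.raw);
        self.metadata.extend(
            other
                .metadata
                .into_iter()
                .map(|(k, lsn, off, len)| (k, lsn, base + off, len)),
        );
    }

    /// Number of values in the batch.
    fn len(&self) -> usize {
        self.metadata.len()
    }
}
```

The point of the shape: the safekeeper (or decoder) can hand over one contiguous buffer plus offsets, and the pageserver never has to materialize individual `Value` instances on the ingest path.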
## Problem

Decoding and ingestion are still coupled in `pageserver::WalIngest`.

## Summary of changes

A new type, `InterpretedWalRecord`, is added to `wal_decoder::models`. This type contains everything that the pageserver requires in order to ingest a WAL record. The highlights are `metadata_record`, which is an optional special record type to be handled, and `blocks`, which stores key/value pairs to be persisted to storage. This type is produced by `wal_decoder::models::InterpretedWalRecord::from_bytes` from a raw PG WAL record.

The rest of this commit separates decoding and interpretation of the PG WAL record from its application in `WalIngest::ingest_record`.

Related: #9335
Epic: #9329

Checklist before requesting a review
Checklist before merging