WAL filtering in safekeeper for sharding #6345

petuhovskiy · 2024-01-12T11:46:02Z

Arthur's prototype from January: https://github.com/neondatabase/neon/tree/sk-sharding-stream

Precursors

We may start by refactoring pageserver WAL ingest code to make decoding WAL records more independent of Timeline.

Tasks

Give feedback

rfcs: add sharded ingest RFC #8754

t/tech_design_rfc
pageserver: batch InMemoryLayer puts, remove need to sort items by LSN during ingest #8591

a/tech_debt c/storage/pageserver
pageserver: separate metadata and data pages in DatadirModification #8621

a/tech_debt c/storage/pageserver
Defer applying metadata changes until commit() time in DataDirModification (stop carrying a Timeline ref)
Write GC cutoff to S3 from Shard 0, and stop storing SLRU data on other shards.
Options

Implement splitting on safekeeper

Tasks

Give feedback

Transmit ShardIndex to safekeeper when subscribing for WAL
Change WAL send protocol to send a decoded DataDirModification instead of a raw WAL record
Filter modifications on the safekeeper to only include pages relevant to the shard who is subscribed
WAL records that are no-ops cause timelines to remain active #5962

1 of 3

c/storage c/storage/pageserver t/bug triaged
Not strictly required but relevant: Epic: Improve sk<->ps connection observability #7002
Options

Optimizations

Tasks

Give feedback

Implement a per-Timeline cursor to avoid reading from disk & decoding repeatedly for each shard
Options

The text was updated successfully, but these errors were encountered:

…8621) ## Problem Currently, DatadirModification keeps a key-indexed map of all pending writes, even though we (almost) never need to read back dirty pages for anything other than metadata pages (e.g. relation sizes). Related: #6345 ## Summary of changes - commit() modifications before ingesting database creation wal records, so that they are guaranteed to be able to get() everything they need directly from the underlying Timeline. - Split dirty pages in DatadirModification into pending_metadata_pages and pending_data_pages. The data ones don't need to be in a key-addressable format, so they just go in a Vec instead. - Special case handling of zero-page writes in DatadirModification, putting them in a map which is flushed on the end of a WAL record. This handles the case where during ingest, we might first write a zero page, and then ingest a postgres write to that page. We used to do this via the key-indexed map of writes, but in this PR we change the data page write path to not bother indexing these by key. My least favorite thing about this PR is that I needed to change the DatadirModification interface to add the on_record_end call. This is not very invasive because there's really only one place we use it, but it changes the object's behaviour from being clearly an aggregation of many records to having some per-record state. I could avoid this by implicitly doing the work when someone calls set_lsn or commit -- I'm open to opinions on whether that's cleaner or dirtier. ## Performance There may be some efficiency improvement here, but the primary motivation is to enable an earlier stage of ingest to operate without access to a Timeline. The `pending_data_pages` part is the "fast path" bulk write data that can in principle be generated without a Timeline, in parallel with other ingest batches, and ultimately on the safekeeper. `test_bulk_insert` on AX102 shows approximately the same results as in the previous PR #8591: ``` ------------------------------ Benchmark results ------------------------------- test_bulk_insert[neon-release-pg16].insert: 23.577 s test_bulk_insert[neon-release-pg16].pageserver_writes: 5,428 MB test_bulk_insert[neon-release-pg16].peak_mem: 637 MB test_bulk_insert[neon-release-pg16].size: 0 MB test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB test_bulk_insert[neon-release-pg16].num_files_uploaded: 8 test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB test_bulk_insert[neon-release-pg16].wal_recovery: 18.264 s test_bulk_insert[neon-release-pg16].compaction: 0.052 s ```

VladLazar · 2024-10-11T15:37:32Z

(Accidentally) replaced by #9329

petuhovskiy added t/bug Issue Type: Bug c/storage/safekeeper Component: storage: safekeeper labels Jan 12, 2024

jcsp self-assigned this Aug 5, 2024

problame self-assigned this Aug 27, 2024

jcsp mentioned this issue Aug 27, 2024

pageserver: separate metadata and data pages in DatadirModification #8621

Merged

5 tasks

problame mentioned this issue Sep 2, 2024

WAL records that are no-ops cause timelines to remain active #5962

Open

jcsp mentioned this issue Oct 10, 2024

Epic: sharded pageserver ingest #9329

Open

VladLazar closed this as completed Oct 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WAL filtering in safekeeper for sharding #6345

WAL filtering in safekeeper for sharding #6345

petuhovskiy commented Jan 12, 2024 •

edited by jcsp

Loading

Tasks

Tasks

Tasks

VladLazar commented Oct 11, 2024

WAL filtering in safekeeper for sharding #6345

WAL filtering in safekeeper for sharding #6345

Comments

petuhovskiy commented Jan 12, 2024 • edited by jcsp Loading

Precursors

Tasks

Implement splitting on safekeeper

Tasks

Optimizations

Tasks

VladLazar commented Oct 11, 2024

petuhovskiy commented Jan 12, 2024 •

edited by jcsp

Loading