safekeeper: efficient shard catchup for WAL cursor fan-out #9338

erikgrinaker · 2024-10-09T15:29:18Z

After #9337, when shards restart and need to catch up on old WAL, each shard will pull WAL records from S3 and filter them on its own. This results in O(catchup_ranges) work. We should instead do this work once and share it across shards, since we expect many shards to require catchup at roughly the same time.
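As a rough illustration of "do this work once", here is a minimal sketch that reads a record stream a single time and routes each record to the lagging shards that still need it. Everything in it is an assumption made for illustration: `WalRecord`, `shard_for_key`, the modulo routing, and `fan_out` are hypothetical names and are not the safekeeper's actual types or sharding scheme.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified WAL record: an LSN plus the key it touches.
#[derive(Debug, Clone, PartialEq)]
struct WalRecord {
    lsn: u64,
    key: u64,
}

/// Hypothetical routing rule (plain modulo); the real shard mapping differs.
fn shard_for_key(key: u64, shard_count: u64) -> u64 {
    key % shard_count
}

/// Reads `records` exactly once and fans them out per shard.
/// `start_lsns` maps shard number -> first LSN that shard is still missing;
/// shards absent from the map are treated as caught up and receive nothing.
fn fan_out<I>(
    records: I,
    shard_count: u64,
    start_lsns: &HashMap<u64, u64>,
) -> HashMap<u64, Vec<WalRecord>>
where
    I: IntoIterator<Item = WalRecord>,
{
    let mut out: HashMap<u64, Vec<WalRecord>> = HashMap::new();
    for rec in records {
        let shard = shard_for_key(rec.key, shard_count);
        if let Some(&start) = start_lsns.get(&shard) {
            // Skip records the shard has already ingested.
            if rec.lsn >= start {
                out.entry(shard).or_default().push(rec);
            }
        }
    }
    out
}
```

In this sketch the expensive part (fetching and decoding the stream) happens once per range, and only the cheap per-record routing depends on the number of shards.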

TODO: details post-RFC.

Consider gossiping timeline progress between safekeepers to know how many shards are offline/lagging.

Consider memory budgeting. Simple approach: estimate timeline catchup volume from LSNs, acquire from semaphore, block when unavailable. Consider QoS to prioritize "important" tenants.
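The "simple approach" could look roughly like the sketch below: a byte-counting semaphore built from the standard library's `Mutex` and `Condvar`. It is only a sketch under stated assumptions. `CatchupBudget` and its methods are hypothetical names, the estimate uses raw LSN distance (LSNs are byte positions in the WAL, but the in-memory size of decoded records may differ), and real safekeeper code would more likely use an async semaphore than blocking threads. QoS and prioritization are not covered.

```rust
use std::sync::{Condvar, Mutex};

/// Hypothetical byte budget shared by all catchup readers.
struct CatchupBudget {
    available: Mutex<u64>,
    freed: Condvar,
    capacity: u64,
}

impl CatchupBudget {
    fn new(capacity: u64) -> Self {
        CatchupBudget {
            available: Mutex::new(capacity),
            freed: Condvar::new(),
            capacity,
        }
    }

    /// Rough catchup volume estimate: the LSN distance in bytes.
    fn estimate(start_lsn: u64, end_lsn: u64) -> u64 {
        end_lsn.saturating_sub(start_lsn)
    }

    /// Blocks until `bytes` are available, then reserves them.
    /// Requests above capacity are clamped so they cannot block forever.
    /// Returns the amount actually reserved, to be passed to `release`.
    fn acquire(&self, bytes: u64) -> u64 {
        let want = bytes.min(self.capacity);
        let mut avail = self.available.lock().unwrap();
        while *avail < want {
            avail = self.freed.wait(avail).unwrap();
        }
        *avail -= want;
        want
    }

    /// Non-blocking variant: returns `None` when the budget is exhausted.
    fn try_acquire(&self, bytes: u64) -> Option<u64> {
        let want = bytes.min(self.capacity);
        let mut avail = self.available.lock().unwrap();
        if *avail < want {
            return None;
        }
        *avail -= want;
        Some(want)
    }

    /// Returns previously reserved bytes and wakes blocked waiters.
    fn release(&self, bytes: u64) {
        let mut avail = self.available.lock().unwrap();
        *avail = (*avail + bytes).min(self.capacity);
        self.freed.notify_all();
    }
}
```

A catchup task would call `estimate` with its start and end LSNs, `acquire` that amount before reading, and `release` it once the records have been handed to the shards.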

erikgrinaker added the c/storage/safekeeper (Component: storage: safekeeper) and a/scalability (Area: related to scalability) labels on Oct 9, 2024

VladLazar mentioned this issue Oct 11, 2024

Epic: sharded pageserver ingest #9329

Open
