fix(ledger): running out of disk space due to cleanup service not cleaning anything #591

Merged
merged 6 commits into from
Mar 4, 2025

@dnut dnut commented Feb 28, 2025

Fix

The cleanup service is supposed to delete data from the ledger from very old slots. It also refuses to delete data from unrooted slots, since we definitely need those slots.

Since we don't have consensus yet, we have no way to know which slots are rooted. So the cleanup service just assumes no slots are rooted, and as a result it never cleans up anything. That means we will eventually run out of storage on any server.

I added an assumption that anything over 100 slots old is rooted. It's not a perfectly safe assumption because you really need consensus to be sure about this. So we'll fix this when consensus is done. But for now we need some kind of basic assumption here so the validator can function without crashing the entire server.

A slot typically becomes rooted after 32 slots pass. It is extremely unlikely for a slot to go 100 slots without being rooted; that would only happen in the event of a massive consensus failure. This is a reasonable assumption to live with for the time being, until we have consensus.
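The assumption described above can be sketched as follows. This is a minimal illustration in Python, not the code from this PR (which is in Zig); the names `ASSUMED_ROOT_LAG` and `is_assumed_rooted` are hypothetical, and the strict "more than 100 slots old" comparison is my reading of the description.

```python
# Illustrative sketch only; names are hypothetical, not taken from the PR.
ASSUMED_ROOT_LAG = 100  # slots; the threshold described in the PR text


def is_assumed_rooted(slot: int, latest_slot: int, lag: int = ASSUMED_ROOT_LAG) -> bool:
    """Treat a slot as rooted once it is more than `lag` slots behind the latest slot.

    This stands in for real consensus information, which would say exactly
    which slots are rooted.
    """
    return latest_slot - slot > lag


def cleanable_slots(slots: list[int], latest_slot: int) -> list[int]:
    """Return the slots the cleanup service would be allowed to delete."""
    return [s for s in slots if is_assumed_rooted(s, latest_slot)]
```

With `latest_slot = 1000`, slots below 900 count as rooted and are eligible for cleanup, while slots 900 and above are kept regardless of how many shreds are stored.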

I also increased the number of shreds allowed to accumulate before cleanup kicks in. Previously the limit was 1,000, which could represent only a few slots, so this setting basically meant we deleted all rooted slots. I bumped this up to 5 million shreds. This means the ledger will use about 10-20 GB of storage space once it reaches the limit, and it should contain shreds going back for roughly an hour. Later we'll likely need to make this larger so replay can catch up from snapshots more reliably, or configurable so RPC can serve arbitrarily old data. For now, while the shreds are not actually used for anything, I'm keeping it more conservative to minimize hardware requirements.
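As a rough sanity check on those numbers, here is a back-of-the-envelope sketch in Python. The 400 ms slot time is an assumption (the nominal target, not stated in this PR), as is reading "GB" as 10^9 bytes; the function names are hypothetical.

```python
# Back-of-the-envelope arithmetic; inputs other than MAX_SHREDS are assumptions.
MAX_SHREDS = 5_000_000      # new limit from this PR
OLD_MAX_SHREDS = 1_000      # previous limit
SLOT_MS = 400               # assumption: nominal slot time in milliseconds
RETENTION_MS = 3_600_000    # "roughly an hour"
GB = 10**9                  # assumption: decimal gigabytes


def slots_in(window_ms: int, slot_ms: int = SLOT_MS) -> int:
    """Number of whole slots in a time window."""
    return window_ms // slot_ms


def implied_shreds_per_slot(max_shreds: int, window_ms: int) -> float:
    """Average shreds per slot implied by a shred limit and a retention window."""
    return max_shreds / slots_in(window_ms)


def implied_bytes_per_shred(total_bytes: int, max_shreds: int) -> int:
    """Average on-disk bytes per stored shred implied by a total size."""
    return total_bytes // max_shreds


per_slot = implied_shreds_per_slot(MAX_SHREDS, RETENTION_MS)
print(f"slots per hour:        {slots_in(RETENTION_MS)}")
print(f"implied shreds/slot:   {per_slot:.0f}")
print(f"old limit, in slots:   {OLD_MAX_SHREDS / per_slot:.1f}")
print(f"implied bytes/shred:   {implied_bytes_per_shred(10 * GB, MAX_SHREDS)}"
      f" to {implied_bytes_per_shred(20 * GB, MAX_SHREDS)}")
```

Under these assumptions, an hour is 9,000 slots, so the figures in the description imply a few hundred shreds per slot on average, which is consistent with the old limit of 1,000 covering only a few slots.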

Test and RocksDB upgrade

To satisfy test coverage requirements, I needed to add a unit test for cleanBlockstore. This test wouldn't work without flushing the data_shred column family, which required changes to rocksdb-zig to add flush. So those upgrades are also included in this PR.

@dnut dnut requested review from yewman and dadepo February 28, 2025 22:42

codecov bot commented Feb 28, 2025

Codecov Report

Attention: Patch coverage is 96.22642% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/ledger/cleanup_service.zig | 96.66% | 1 Missing ⚠️ |
| src/utils/collections.zig | 94.73% | 1 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| src/ledger/database/hashmap.zig | 91.66% <ø> (ø) |
| src/ledger/database/interface.zig | 99.72% <ø> (ø) |
| src/ledger/database/rocksdb.zig | 95.74% <100.00%> (+0.12%) ⬆️ |
| src/utils/interface.zig | 100.00% <ø> (ø) |
| src/ledger/cleanup_service.zig | 91.78% <96.66%> (+7.57%) ⬆️ |
| src/utils/collections.zig | 93.98% <94.73%> (-0.16%) ⬇️ |

... and 2 files with indirect coverage changes

@dnut dnut added this pull request to the merge queue Mar 4, 2025
Merged via the queue into main with commit 267e61e Mar 4, 2025
17 checks passed
@dnut dnut deleted the dnut/fix/ledger/cleanup-service branch March 4, 2025 14:17