fix(ledger): running out of disk space due to cleanup service not cleaning anything #591
Fix
The cleanup service is supposed to delete data from the ledger from very old slots. It also refuses to delete data from unrooted slots, since we definitely need those slots.
Since we don't have consensus yet, we have no way to know which slots are rooted. So the cleanup service just assumes no slots are rooted, and therefore never cleans up anything. That means we will eventually run out of storage on any server.
I added an assumption that anything over 100 slots old is rooted. It's not a perfectly safe assumption because you really need consensus to be sure about this. So we'll fix this when consensus is done. But for now we need some kind of basic assumption here so the validator can function without crashing the entire server.
A slot typically becomes rooted after 32 slots pass. It is extremely unlikely for a slot to go 100 slots without being rooted; that would only happen in the event of a massive consensus failure. This is a reasonable assumption to live with for the time being, until we have consensus.
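The assumption above can be written down as a small sketch. This is only an illustration in Python, not the code in this PR; the names `ASSUMED_ROOT_LAG` and `is_assumed_rooted` are hypothetical, and only the value 100 comes from the description.

```python
# Hypothetical names for illustration; only the value 100 comes from the PR text.
ASSUMED_ROOT_LAG = 100  # slots that must pass before a slot is treated as rooted


def is_assumed_rooted(slot: int, latest_slot: int, lag: int = ASSUMED_ROOT_LAG) -> bool:
    """Return True if `slot` is more than `lag` slots older than `latest_slot`.

    This stands in for a real root check until consensus can supply one.
    """
    return latest_slot - slot > lag
```

With a latest slot of 101, slot 0 is treated as rooted (101 slots old) while slot 1 is not (exactly 100 slots old). Once consensus exists, a check like this would be replaced by the actual root.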
I also increased the number of shreds to allow before cleaning them up. Previously the limit was 1,000, which could represent only a few slots, so in practice that setting meant we deleted all rooted slots. I bumped this up to 5 million shreds. This means the ledger will use about 10-20 GB of storage space once it reaches the limit, and it should contain shreds going back for roughly an hour. Later we'll likely need to make this larger so replay can catch up from snapshots more reliably, or configurable so RPC can serve arbitrarily old data. For now, while the shreds are not actually used for anything, I'm keeping it more conservative to minimize hardware requirements.
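To show how the shred limit and the rooted assumption interact, here is a rough Python sketch. It is not the cleanup service's actual algorithm: `slots_to_purge` and its parameters are hypothetical, and the per-slot shred counts are supplied by the caller rather than read from the ledger. Only the defaults of 5 million shreds and 100 slots come from the description.

```python
def slots_to_purge(
    shreds_per_slot: dict[int, int],
    latest_slot: int,
    max_shreds: int = 5_000_000,  # limit from the PR description
    root_lag: int = 100,          # slots older than this are assumed rooted
) -> list[int]:
    """Pick the oldest slots to delete until the shred count fits the limit.

    Hypothetical sketch: slots that are not yet assumed rooted are never
    selected, even if the ledger is still over the limit.
    """
    total = sum(shreds_per_slot.values())
    purge = []
    for slot in sorted(shreds_per_slot):
        if total <= max_shreds:
            break  # already within the limit
        if latest_slot - slot <= root_lag:
            break  # this slot and all newer ones are not assumed rooted
        purge.append(slot)
        total -= shreds_per_slot[slot]
    return purge
```

With 300 slots of 10 shreds each, a latest slot of 299, and a limit of 2,000 shreds, this selects slots 0 through 99. With a limit of 1,000 it stops at slot 198, because slot 199 is only 100 slots old and is not yet assumed rooted, so the ledger stays slightly over the limit rather than deleting a slot that might still be needed.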
Test and RocksDB upgrade
To satisfy test coverage requirements I needed to add a unit test for cleanBlockstore. This test wouldn't work without flushing the data_shred column family. This required changes to rocksdb-zig to add flush, so those upgrades are also included in this PR.