Optimize stale slot shrinking for previously cleaned roots #10099
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master   #10099    +/-   ##
=========================================
  Coverage    81.6%    81.7%
=========================================
  Files         296      296
  Lines       69320    69447    +127
=========================================
+ Hits        56615    56751    +136
+ Misses      12705    12696      -9
runtime/src/accounts_db.rs (Outdated)
@@ -233,6 +233,9 @@ pub struct AccountStorageEntry {
    /// status corresponding to the storage, lets us know that
    /// the append_vec, once maxed out, then emptied, can be reclaimed
    count_and_status: RwLock<(usize, AccountStorageStatus)>,

    #[serde(skip)]
    store_count: AtomicUsize,
I wanted `store_count` to be a part of `count_and_status`, but that change would require a lot of changes, and I don't need absolute accuracy for `store_count`. So this might be inconsistent with `count_and_status` due to the lack of atomicity/synchronization with it. Maybe, to indicate that nature, I'll rename this to `approx_store_count` and add a comment?
Yeah, a comment would be nice to indicate how it differs from `count_and_status`.
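A minimal sketch of the pattern discussed here: an exact, lock-protected count alongside an approximate, relaxed atomic counter. The types, field names, and comments are illustrative, not the merged code:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::RwLock;

// Hypothetical, simplified stand-in for AccountStorageEntry.
struct StorageEntry {
    // Accurate alive-account count plus status; updates take the write lock.
    count_and_status: RwLock<(usize, &'static str)>,

    // Total number of accounts ever stored in this entry; never decremented.
    // Only a rough estimate used to bail out of shrinking early, so it is NOT
    // kept strictly in sync with `count_and_status`.
    approx_store_count: AtomicUsize,
}

impl StorageEntry {
    fn add_account(&self) {
        // Relaxed ordering is fine: the value is only advisory.
        self.approx_store_count.fetch_add(1, Ordering::Relaxed);
        let mut guard = self.count_and_status.write().unwrap();
        guard.0 += 1;
    }
}

fn main() {
    let entry = StorageEntry {
        count_and_status: RwLock::new((0, "available")),
        approx_store_count: AtomicUsize::new(0),
    };
    entry.add_account();
    println!(
        "alive = {}, approx stored = {}",
        entry.count_and_status.read().unwrap().0,
        entry.approx_store_count.load(Ordering::Relaxed)
    );
}
```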
@sakridge Could you review this as a possible fix? I've gone in-depth in explaining the problem... Hope this illustrates the current devnet issue! If this solution sounds good, I'll finish up this PR (writing up the PR description, commenting...).
@sakridge Could you review whether this approach makes sense, so that I can go ahead and finish it up? :)
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Yea, this seems fine, let's continue in this direction. Thanks for the work.
if next.is_some() {
    next
} else {
    let mut new_all_slots = self.all_root_slots_in_index();
    let next = new_all_slots.pop();

let mut candidates = self.shrink_candidate_slots.lock().unwrap();
As commented, we need to broaden locking a bit.
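A minimal sketch of what broadening that lock could look like: holding `shrink_candidate_slots` across both the candidate pop and the fallback refill so the two steps can't interleave with a concurrent writer. The surrounding structure here is hypothetical, not the merged code:

```rust
use std::sync::Mutex;

struct Db {
    shrink_candidate_slots: Mutex<Vec<u64>>,
    roots: Mutex<Vec<u64>>,
}

impl Db {
    // Hypothetical stand-in for the real index query.
    fn all_root_slots_in_index(&self) -> Vec<u64> {
        self.roots.lock().unwrap().clone()
    }

    fn next_shrink_slot(&self) -> Option<u64> {
        // Take the candidates lock once, up front, instead of only around
        // the refill at the end.
        let mut candidates = self.shrink_candidate_slots.lock().unwrap();
        let next = candidates.pop();
        if next.is_some() {
            next
        } else {
            let mut new_all_slots = self.all_root_slots_in_index();
            let next = new_all_slots.pop();
            *candidates = new_all_slots;
            next
        }
    }
}

fn main() {
    let db = Db {
        shrink_candidate_slots: Mutex::new(vec![]),
        roots: Mutex::new(vec![1, 2, 3]),
    };
    println!("{:?}", db.next_shrink_slot()); // Some(3)
}
```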
Force-pushed from 7cc1da8 to 706590a
@sakridge I think this is ready for review and merge. :)
Force-pushed from c320f9d to 53006c5
Force-pushed from ce5b125 to c32b97a
@sakridge Thanks for the reviews! I've addressed them all!
lgtm
* Prioritize shrinking of previously cleaned roots
* measure time of stale slot shrinking
* Disable shrink for test
* shrink: budgeting, store count, force for snapshot
* Polish implementation and fix tests
* Fix ci..
* Clean up a bit
* Further polish implementation and fix/add tests
* Rebase fixes
* Remove unneeded Default for AccountStorageEntry
* Address review comments
* More cleanup
* More cleanup

(cherry picked from commit dfe72d5)
I've tested this locally and against devnet as a last-minute test and found no issues with the tip of this branch. I've backported it as well. @mvines Feel free to deploy this on devnet at a convenient time. :) Also, this doesn't introduce any consensus change, so it doesn't need a rolling update.
OK, thanks. When it lands in 1.2 I'll just treat this like any other normal patch that'll get deployed in v1.2.2.
It looks like the snapshot bloat issue is actually fixed by this (included in v1.2.2, f13498b):

$ ./target/release/solana --url http://devnet.solana.com:8899 cluster-version
1.2.2 f13498b4
(status update) Also, I'm aware this has landed in testnet/tds as well. From what I casually observe, it isn't causing any harm there. So I'm queueing up my next accountsdb/bank-related change for next week: #10206.
Problem
There are several problems with eager rent collection and stale slot shrinking...
stale slot shrinking (perf. degradation)
Shrinking is causing trouble under load or while benchmarking.
First, shrinking (reading & writing accounts) a slot full of account updates isn't a lightweight background task in any way, even when the slot isn't stale at all. Yet that's constantly happening, too often, while benchmarking. That's partly because our bench repeatedly updates the same accounts from some limited (but large) set of test accounts, so such a heavy slot wrongly passes the existing staleness check too easily (20% of accounts being outdated, i.e. unused/empty/shrinkable). Yeah, that's a plain bug from when stale slot shrinking was introduced. Sorry...
Also, the bench might not be representative of real-world workloads, but this still bothers us and indeed exposes a potential perf problem in stale slot shrinking. And shrinking will probably hurt performance under a real-world saturating-TPS situation.
eager rent collection (bloated snapshot)
Snapshots are prone to grow too much when there is the theoretical maximum number of alive `AppendVec`s. This is currently happening only on devnet, because there are so many rent-exempt accounts there (1.5 million accounts; I counted from a recent sample snapshot). That huge number of `AppendVec`s is a side effect of eager rent collection spreading those accounts so thinly over 2 days' worth of slots (an epoch on tds/mainnet-beta); well, that's the intended design... When there is such a large number of `AppendVec`s, stale slot shrinking works poorly because the traversal algorithm is simplistic; it was originally designed exactly for stale slots. However, eager rent collection now creates fresh `AppendVec`s on devnet at too fast a pace, making the staleness assumption obsolete.
The algorithm is like this: grab the whole set of rooted alive slots from AccountsIndex at once, loop over that set (without fetching new roots) to its end at a 100ms interval, and repeat the process indefinitely. This means that with 400K slots, newly created slots (which eager rent collection creates constantly) aren't shrunk for about 11 hours.
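A minimal sketch of that traversal pattern (not the exact accounts_db code; `all_root_slots_in_index` and `shrink_stale_slot` stand in for the real helpers):

```rust
use std::thread::sleep;
use std::time::Duration;

// Hypothetical stand-ins for the real AccountsDb helpers.
fn all_root_slots_in_index() -> Vec<u64> {
    (0..400_000u64).collect()
}

fn shrink_stale_slot(_slot: u64) {
    // read & rewrite alive accounts if the slot passes the staleness check
}

fn main() {
    // Background loop; runs forever in the validator.
    loop {
        // The whole rooted-slot set is snapshotted once per pass...
        for slot in all_root_slots_in_index() {
            shrink_stale_slot(slot);
            // ...and walked at a fixed 100ms cadence, so a 400,000-slot pass
            // takes 400_000 * 0.1s = 40,000s, roughly 11 hours, before any
            // slot rooted after the snapshot can even be considered.
            sleep(Duration::from_millis(100));
        }
    }
}
```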
Eager rent collection constantly creates 4MB `AppendVec`s holding only a few rent-collected accounts; they should be shrunk but aren't, causing the bloated snapshot. Those few accounts prevent each `AppendVec` from being reclaimed quickly (on an idling cluster it was usually reclaimed, before eager rent collection existed). The outdated garbage sysvars in each `AppendVec` are then a nightmare for `bzip2`: 130kb of slightly changing bits per `AppendVec`.
Also, stale slot shrinking got a bit obsoleted by the introduction of eager rent collection, because stale slots don't exist anymore; there are only slots (`AppendVec`s) at most a single epoch old now. That's because all accounts get updated over an epoch regardless of rent exemption. (This is like very slowly timed DIMM memory refreshing: we require all account data to be equally hot for a uniform rent fee and incentive structure.)
Summary of Changes
eager rent collection
To fix the above problems, let's make stale slot shrinking prioritize recently rooted slots (to fix the bloated snapshot) and optimize it for an idling cluster (to avoid shrinking while benchmarking).
For that, first create a shortcut codepath from recently rooted slots to the slot shrink.
The reason for picking not the immediate root but a previous parent: immediate roots may be too hot, so let them cool down for a while, something like the snapshot interval. Also, sysvars in the most recent root slot cannot be reclaimed (yet). Also, I anticipate slightly older root slots will more often already be dead without needing a shrink at all, and reclaiming dead slots is a lot faster/simpler. (A sketch of this shortcut appears below.)
I chose this over making AppendVec's default size dynamic because, to compress well, we actually need to shrink the AppendVec to strip the sysvars.
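A minimal sketch of the shortcut idea, under assumed names (`shrink_candidate_slots`, a hypothetical `COOLDOWN_SLOTS` constant); the merged code is organized differently:

```rust
use std::sync::Mutex;

// Hypothetical cooldown: shrink not the just-rooted slot but an older one
// that has had roughly a snapshot interval's worth of slots to cool down.
const COOLDOWN_SLOTS: u64 = 100;

struct Db {
    shrink_candidate_slots: Mutex<Vec<u64>>,
}

impl Db {
    // Shortcut codepath: when `root` becomes rooted, queue an older,
    // already-cooled root for shrinking instead of the hot, newest one.
    fn add_root(&self, root: u64) {
        if let Some(cooled) = root.checked_sub(COOLDOWN_SLOTS) {
            self.shrink_candidate_slots.lock().unwrap().push(cooled);
        }
    }
}

fn main() {
    let db = Db {
        shrink_candidate_slots: Mutex::new(vec![]),
    };
    db.add_root(250);
    // The background shrinker would later pop 150 from this queue.
    println!("{:?}", db.shrink_candidate_slots.lock().unwrap()); // [150]
}
```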
stale slot shrinking
Use `AccountStorageEntry.approx_store_count` for a quicker bailout (well, this could have been done immediately after Introduce background stale AppendVec shrink mechanism #9219).
I found that deciding on a proper fixed or dynamic/adaptive threshold is hard. Specifically: what criteria could count as stale in light of the above problems, considering the variable number of votes per slot (these are almost instantly overwritten, counting as empty in a slot, and so are candidates for shrinking), the unbounded-TPS assumption, the heterogeneous machine specs of our validators, the ideal sensitivity to sudden peak TPS, simplicity of implementation, and the assumed heterogeneous workloads and account sizes in the real world.
For example, changing the shrink threshold from 20% shrinkable to 90% shrinkable doesn't quite work for the idling case, where there are not many accounts to begin with (1 stale eager-rent-collected account + 7 outdated sysvar accounts + 1 outdated vote account on devnet).
Instead, I kept two extreme modes of operation in mind when designing the new shrinking strategy: an idling cluster and a peak-load cluster.
I wanted something like this:
When the cluster is idling, the validator shrinks every slot without delay. When the cluster is under high load, as soon as a big stale slot is shrunk, subsequent stale slot shrinking is paused for some time, proportional to the number of shrunk accounts.
Ultimately, my design turned out like this: pseudo peak-load detection via the actual shrunk-account count.
This limits the upper bound of shrunk accounts to 43,200,000 per epoch (= one eager rent collection cycle): 250 * 3600 * 24 * 2, i.e. roughly 250 accounts per second over a 2-day epoch. That's fine because stale slot shrinking is optional to begin with, and the number is large enough for us for now. When there are more accounts than the limit, the snapshot will again start to bloat, but by then a monolithic single snapshot for everything will have other practical difficulties anyway...
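An illustrative sketch of the budgeting idea, assuming a hypothetical rate of 250 shrunk accounts per second (matching the arithmetic above); the real implementation's constants and structure may differ:

```rust
use std::thread::sleep;
use std::time::Duration;

// Assumed budget matching the arithmetic above: ~250 shrunk accounts per
// second, i.e. 250 * 3600 * 24 * 2 = 43,200,000 per 2-day epoch.
const SHRUNK_ACCOUNTS_PER_SEC: u64 = 250;

// Hypothetical shrink step: shrinks one candidate slot (if any) and returns
// how many accounts it actually rewrote.
fn shrink_next_candidate_slot() -> u64 {
    0
}

fn main() {
    // Background shrinker loop; runs forever in the validator.
    loop {
        let shrunk = shrink_next_candidate_slot();
        if shrunk == 0 {
            // Idling cluster: nothing big was shrunk, keep scanning promptly.
            sleep(Duration::from_millis(100));
        } else {
            // Peak load, detected via the actual shrunk-account count:
            // pause proportionally to the work just done. Shrinking 25,000
            // accounts, for example, pauses further shrinking for ~100s.
            sleep(Duration::from_millis(shrunk * 1000 / SHRUNK_ACCOUNTS_PER_SEC));
        }
    }
}
```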
The downside of this strategy is that the reaction latency is rather long: the back-off only kicks in after a slot at peak-capacity TPS has already been processed.
Also, I didn't lower the background thread's priority, because that too easily causes priority inversion without proper handling...
Follow-up to: #9219, #9527.