
Optimize stale slot shrinking for previously cleaned roots #10099

Merged
merged 13 commits into from
Jun 12, 2020

Conversation

ryoqun
Contributor

@ryoqun ryoqun commented May 18, 2020

Problem

There are several problems related to eager rent collection and stale slot shrinking...

stale slot shrinking (perf. degradation)

Shrinking is causing trouble under load or while benchmarking.

First, shrinking a slot full of account updates (reading and writing those accounts) is not a lightweight background task in any way, even when the slot isn't stale at all. Yet that is constantly happening while benchmarking. That's partly because our bench presumably keeps updating the same accounts drawn from a limited (though large) set of test accounts, so such heavy slots wrongly pass the existing staleness check far too easily (20% of accounts outdated, i.e. unused/empty/shrinkable). Yeah, that was a plain bug when stale slot shrinking was introduced. Sorry...

Also, the bench might not be representative of real-world workloads. But it still bothers us and does expose a potential perf problem in stale slot shrinking. Shrinking will probably hurt performance in a real-world saturating-TPS situation as well.

eager rent collection (bloated snapshot)

Snapshots are prone to grow when there is close to the theoretical maximum number of alive AppendVecs. This is currently happening only on devnet, because devnet has so many rent-exempt accounts (1.5 million; I counted them in a recent sample snapshot). That huge number of AppendVecs is a side-effect of eager rent collection spreading those accounts so thinly over 2 days' worth of slots (an epoch on tds/mainnet-beta) (well, that's the intended design...).

When there is such a large number of AppendVecs, stale slot shrinking works poorly because the traversal algorithm is simplistic. It was originally designed exactly for stale slots. However, eager rent collection now creates fresh AppendVecs on devnet at too fast a pace, making the staleness assumption obsolete.

The algorithm works like this: grab the whole set of rooted alive slots from AccountsIndex at once, loop over that set (without fetching new roots) at a 100ms interval until the end, then repeat the process indefinitely. This means that with 400K slots, newly created slots (which eager rent collection produces constantly) aren't shrunk for about 11 hours.
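
To make the 11-hour figure concrete, here is a minimal, self-contained sketch of the traversal as described above. The `Db` struct and its methods are stand-ins, not the actual AccountsDb code (`all_root_slots_in_index` does appear in this PR's diff; the rest is illustrative):

```rust
use std::{thread, time::Duration};

// Toy stand-in for the accounts store; the real type is AccountsDb.
struct Db;

impl Db {
    fn all_root_slots_in_index(&self) -> Vec<u64> {
        Vec::new() // in reality: a one-time snapshot of every rooted slot in AccountsIndex
    }
    fn do_shrink_stale_slot(&self, _slot: u64) {
        // in reality: skip unless ~20% of the slot's accounts are outdated,
        // otherwise rewrite the live accounts into a smaller AppendVec
    }
}

fn shrink_all_stale_slots_naive(db: &Db) {
    loop {
        // Roots created after this point stay invisible until the whole pass ends.
        let mut slots = db.all_root_slots_in_index();
        while let Some(slot) = slots.pop() {
            db.do_shrink_stale_slot(slot);
            thread::sleep(Duration::from_millis(100));
        }
        // With ~400_000 rooted slots, one pass takes 400_000 * 0.1 s ≈ 40_000 s ≈ 11 hours,
        // so a freshly created slot waits that long before it is even considered.
    }
}
```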

Eager rent collection constantly creates 4MB AppendVecs holding only a few rent-collected accounts. They should be shrunk but aren't, causing the bloated snapshot. Those few accounts prevent the AppendVec from being reclaimed quickly (before eager rent collection, an idling cluster usually reclaimed them). On top of that, the outdated garbage sysvars in each AppendVec are a nightmare for bzip2: 130KB of slightly changing bits per AppendVec.

Stale slot shrinking was made somewhat obsolete by the introduction of eager rent collection, because stale slots don't exist anymore: slots (AppendVecs) are now at most a single epoch old. That's because all accounts are updated over an epoch regardless of rent exemption. (This is like a very long-interval DRAM refresh: we require all account data to be equally hot for a uniform rent fee and incentive structure.)

Summary of Changes

eager rent collection

To fix the above problems, let's make stale slot shrinking prioritize recently rooted slots (to fix the bloated snapshot) and optimize for an idling cluster (to avoid shrinking while benchmarking).

For that, first create a shortcut codepath from the recently-rooted-slot cleaning to the slot shrinking.

The reason for using the previously cleaned root rather than the immediate one: immediate roots may be too hot, so let them cool down for a while, roughly a snapshot interval. Also, sysvars in the most recent root slot cannot be reclaimed (yet). And I anticipate that slightly older root slots will more often be dead already, with no need to shrink them in the first place; reclaiming dead slots is much faster and simpler.

I chose this over making AppendVec's default size dynamic because, to compress well, we actually need to shrink the AppendVec to strip out sysvars.
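
As a rough sketch of that shortcut (assuming a simplified structure: `shrink_candidate_slots` is a field that appears in this PR's diff, while the struct and the hook shown here are hypothetical):

```rust
use std::sync::Mutex;

// Simplified stand-in for AccountsDb; only the queue relevant to the shortcut is shown.
struct Db {
    shrink_candidate_slots: Mutex<Vec<u64>>,
}

impl Db {
    // Hypothetical hook called from the cleaning path. Instead of shrinking the
    // freshly rooted slot (still hot; its sysvars can't be reclaimed yet), queue
    // the previously cleaned root, which has had roughly one snapshot interval to
    // cool down or to become dead entirely (dead slots are simply reclaimed,
    // which is much cheaper than shrinking).
    fn queue_previous_root_for_shrink(&self, previously_cleaned_root: Option<u64>) {
        if let Some(slot) = previously_cleaned_root {
            self.shrink_candidate_slots.lock().unwrap().push(slot);
        }
    }
}
```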

stale slot shrinking

  • Add a simplistic yet adaptive budget scheme for stale slot shrinking: 250 account updates / sec.
  • Introduce AccountStorageEntry.approx_store_count for quicker bailout (well, this could have been done right after Introduce background stale AppendVec shrink mechanism #9219); a rough sketch follows this list.
  • (extra) Fully shrink slots when they are loaded from a snapshot.
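
Here is a rough sketch of the approx_store_count bailout. The field names follow the diff further down in this thread, but the struct layout and the 20% check are simplified assumptions:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Illustrative slice of AccountStorageEntry; only the counters matter here.
struct StorageEntry {
    // Number of currently alive accounts (tracked exactly, behind count_and_status
    // in the real code).
    alive_count: usize,
    // Total stores ever appended; updated without synchronizing against
    // count_and_status, so it is only approximately consistent -- good enough
    // for a heuristic.
    approx_store_count: AtomicUsize,
}

impl StorageEntry {
    // Cheap check that avoids scanning the AppendVec: consider shrinking only if
    // at least ~20% of the stored entries are outdated (the pre-existing threshold).
    fn is_shrink_candidate(&self) -> bool {
        let stored = self.approx_store_count.load(Ordering::Relaxed);
        stored > 0 && self.alive_count * 100 <= stored * 80
    }
}
```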

I found that deciding on a proper fixed or dynamic/adaptive threshold is hard. Specifically, what counts as stale in light of the above problems, considering: the variable number of votes per slot (these are overwritten almost instantly, so they count as empty in a slot and are candidates for shrinking), the unbounded-TPS assumption, the heterogeneous machine specs of our validators, the desired sensitivity to sudden TPS peaks, simplicity of implementation, and the presumably heterogeneous workloads and account sizes in the real world.

For example, changing the shrink threshold from 20% shrinkable to 90% shrinkable doesn't quite work for the idling case, where there aren't many accounts to begin with (1 stale eagerly-rent-collected account + 7 outdated sysvar accounts + 1 outdated vote account on devnet).

Instead, I kept two extreme modes of operation in mind when designing the new shrinking strategy: an idling cluster and a peak-load cluster.

I wanted something like this:

When the cluster is idling, the validator shrinks every slot without delay. When the cluster is under high load, as soon as a big stale slot is shrunk, subsequent stale slot shrinking is paused for a while, proportional to the number of shrunken accounts.

Ultimately, my design turned out like this: pseudo peak-load detection via the actual shrunken account count.

This bounds the number of shrunken accounts to 43,200,000 (250 * 3600 * 24 * 2) per epoch (= one eager rent collection cycle). That's fine because stale slot shrinking is optional to begin with, and the number is large enough for us for now. When there are more accounts than the limit, snapshots will start to bloat again, but by that point a monolithic single snapshot of everything will pose other practical difficulties anyway...
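
In code form, the budgeting boils down to something like the following sketch. The 250/sec constant is from this PR's description; the function shape and where it is called from are assumptions:

```rust
use std::{thread, time::Duration};

// Long-run shrink rate cap: 250 shrunken account rewrites per second.
const SHRUNKEN_ACCOUNTS_PER_SEC: u64 = 250;

// Hypothetical throttle called after each slot shrink: pause in proportion to how
// many accounts were actually rewritten.
fn throttle_after_shrink(shrunken_account_count: u64) {
    if shrunken_account_count > 0 {
        // A big shrink (peak load) earns a long pause; a tiny shrink on an idling
        // cluster (a handful of sysvars and votes) barely pauses at all.
        let pause_ms = shrunken_account_count * 1000 / SHRUNKEN_ACCOUNTS_PER_SEC;
        thread::sleep(Duration::from_millis(pause_ms));
    }
    // Upper bound over one ~2-day epoch: 250 * 3600 * 24 * 2 = 43_200_000 accounts.
}
```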

The downside of this strategy is that the reaction latency is rather long: the throttle only kicks in after a slot at peak-capacity TPS has already been processed.

Also, I didn't lower the background thread's priority, because without proper handling that too easily causes priority inversion...

Follow-up to: #9219, #9527.

@ryoqun ryoqun mentioned this pull request May 18, 2020
5 tasks
@codecov

codecov bot commented May 18, 2020

Codecov Report

Merging #10099 into master will increase coverage by 0.0%.
The diff coverage is 98.8%.

@@           Coverage Diff            @@
##           master   #10099    +/-   ##
========================================
  Coverage    81.6%    81.7%            
========================================
  Files         296      296            
  Lines       69320    69447   +127     
========================================
+ Hits        56615    56751   +136     
+ Misses      12705    12696     -9     

@ryoqun
Contributor Author

ryoqun commented May 19, 2020

It looks like this PR is working as expected.

Before (without this PR): the snapshot steadily grows (perhaps bounded at 2GB):

image

After (with this PR): the snapshot size stays at the bare minimum level (this is also the same level as before eager rent collection):

image

@ryoqun ryoqun changed the title [wip] Prioritize shrinking of previously cleaned roots Optimize stale slot shrinking for previously cleaned roots May 21, 2020
@@ -233,6 +233,9 @@ pub struct AccountStorageEntry {
/// status corresponding to the storage, lets us know that
/// the append_vec, once maxed out, then emptied, can be reclaimed
count_and_status: RwLock<(usize, AccountStorageStatus)>,

#[serde(skip)]
store_count: AtomicUsize,
Contributor Author

@ryoqun ryoqun May 21, 2020


I wanted store_count to be part of count_and_status, but that change would require a lot of churn, and I don't need absolute accuracy for store_count. So this may become inconsistent with count_and_status due to the lack of atomicity/synchronization with it.

Maybe, to indicate that nature, I'll rename this to approx_store_count and add a comment?

Contributor

Yeah, a comment would be nice to indicate how it differs from count_and_status.

@ryoqun
Contributor Author

ryoqun commented May 21, 2020

@sakridge Could you review this as a possible fix? I've gone in-depth in explaining the problem... Hope this illustrates the current devnet issue!

If this solution sounds good, I'll finish up this PR (write-up, comments, ...).

@ryoqun ryoqun requested a review from sakridge May 21, 2020 20:18
@ryoqun
Contributor Author

ryoqun commented May 21, 2020

For illustrative purposes, here is what it looks like under our current nightly performance benchmark:

image

@ryoqun
Contributor Author

ryoqun commented May 26, 2020

@sakridge Could you review whether this approach makes sense, so that I can move forward and finish it up? :)

@stale

stale bot commented Jun 2, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added stale [bot only] Added to stale content; results in auto-close after a week. and removed stale [bot only] Added to stale content; results in auto-close after a week. labels Jun 2, 2020
@sakridge
Contributor

sakridge commented Jun 3, 2020

@sakridge Could you review whether this approach makes sense, so that I can move forward and finish it up? :)

Yea, this seems fine, let's continue in this direction. Thanks for the work.

@ryoqun ryoqun added the v1.2 label Jun 8, 2020

if next.is_some() {
next
} else {
let mut new_all_slots = self.all_root_slots_in_index();
let next = new_all_slots.pop();

let mut candidates = self.shrink_candidate_slots.lock().unwrap();
Contributor Author

As commented, we need to broaden the locking a bit.

@ryoqun ryoqun marked this pull request as ready for review June 8, 2020 14:48
@ryoqun ryoqun force-pushed the prioritized-previously-cleaned-roots branch from 7cc1da8 to 706590a Compare June 8, 2020 15:03
@ryoqun ryoqun requested review from sakridge and removed request for sakridge June 8, 2020 15:09
@ryoqun
Contributor Author

ryoqun commented Jun 8, 2020

@sakridge I think this is ready for review and merging. :)

@ryoqun ryoqun force-pushed the prioritized-previously-cleaned-roots branch 2 times, most recently from c320f9d to 53006c5 Compare June 9, 2020 03:32
@ryoqun ryoqun force-pushed the prioritized-previously-cleaned-roots branch 2 times, most recently from ce5b125 to c32b97a Compare June 9, 2020 03:39
@ryoqun ryoqun added stale [bot only] Added to stale content; results in auto-close after a week. security Pull requests that address a security vulnerability and removed stale [bot only] Added to stale content; results in auto-close after a week. labels Jun 10, 2020
@ryoqun
Contributor Author

ryoqun commented Jun 11, 2020

@sakridge Thanks for the reviews! I've addressed them all!

Contributor

@sakridge sakridge left a comment

lgtm

@ryoqun ryoqun merged commit dfe72d5 into solana-labs:master Jun 12, 2020
mergify bot pushed a commit that referenced this pull request Jun 12, 2020
* Prioritize shrinking of previously cleaned roots

* measure time of stale slot shrinking

* Disable shrink for test

* shrink: budgeting, store count, force for snapshot

* Polish implementation and fix tests

* Fix ci..

* Clean up a bit

* Further polish implementation and fix/add tests

* Rebase fixes

* Remove unneeded Default for AccountStorageEntry

* Address review comments

* More cleanup

* More cleanup

(cherry picked from commit dfe72d5)
@ryoqun
Contributor Author

ryoqun commented Jun 12, 2020

I've tested this locally and against devnet as a last-minute check and found no issues with the tip of this branch. It has been backported as well.

@mvines Feel free to deploy this on devnet at a convenient time. :) Also, this doesn't introduce any consensus change, so it doesn't need a rolling update.

@mvines
Contributor

mvines commented Jun 12, 2020

Ok, thanks. When it lands in 1.2 I'll just treat this like any other normal patch, and it'll get deployed in v1.2.2.

@ryoqun
Contributor Author

ryoqun commented Jun 13, 2020

image

I'll skip the explanation (tell me if anyone is curious!), but budgeting is working as expected, as can be observed from this chart from the nightly perf test.

@ryoqun
Contributor Author

ryoqun commented Jun 15, 2020

It looks like the snapshot bloat issue is actually fixed with this (included in v1.2.2, f13498b):
image

$ ./target/release/solana  --url http://devnet.solana.com:8899 cluster-version
1.2.2 f13498b4

@ryoqun
Contributor Author

ryoqun commented Jun 16, 2020

(Status update) I'm aware this has landed on testnet/tds as well. From what I can casually observe, it isn't causing any harm there. So I'm queueing up my next accountsdb/bank-related change for next week: #10206.
