Optimize stale slot shrinking for previously cleaned roots #10099
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master   #10099    +/-   ##
=========================================
  Coverage    81.6%    81.7%
=========================================
  Files         296      296
  Lines       69320    69447    +127
=========================================
+ Hits        56615    56751    +136
+ Misses      12705    12696      -9
runtime/src/accounts_db.rs (Outdated)
@@ -233,6 +233,9 @@ pub struct AccountStorageEntry {
    /// status corresponding to the storage, lets us know that
    /// the append_vec, once maxed out, then emptied, can be reclaimed
    count_and_status: RwLock<(usize, AccountStorageStatus)>,

    #[serde(skip)]
    store_count: AtomicUsize,
I wanted `store_count` to be a part of `count_and_status`, but that change would require a lot of changes, and I don't need absolute accuracy for `store_count`. So this might be inconsistent with `count_and_status` due to the lack of atomicity/synchronization with it. Maybe, to indicate that nature, I'll rename this to `approx_store_count` and add a comment?
Yeah, a comment would be nice to indicate how it differs from `count_and_status`.
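A minimal sketch of the pattern discussed here: an exact, lock-protected count alongside an approximate, relaxed atomic counter. The types, field names, and comments are illustrative, not the merged code:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::RwLock;

// Hypothetical, simplified stand-in for AccountStorageEntry.
struct StorageEntry {
    // Accurate alive-account count plus status; updates take the write lock.
    count_and_status: RwLock<(usize, &'static str)>,

    // Total number of accounts ever stored in this entry; never decremented.
    // Only a rough estimate used to bail out of shrinking early, so it is NOT
    // kept strictly in sync with `count_and_status`.
    approx_store_count: AtomicUsize,
}

impl StorageEntry {
    fn add_account(&self) {
        // Relaxed ordering is fine: the value is only advisory.
        self.approx_store_count.fetch_add(1, Ordering::Relaxed);
        let mut guard = self.count_and_status.write().unwrap();
        guard.0 += 1;
    }
}

fn main() {
    let entry = StorageEntry {
        count_and_status: RwLock::new((0, "available")),
        approx_store_count: AtomicUsize::new(0),
    };
    entry.add_account();
    println!(
        "alive = {}, approx stored = {}",
        entry.count_and_status.read().unwrap().0,
        entry.approx_store_count.load(Ordering::Relaxed)
    );
}
```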
@sakridge Could you review this as a possible fix? I've gone in-depth in explaining the problem... Hope this illustrates the current devnet issue! If this solution sounds good, I'll finish up this PR (writing up the PR description, commenting...).
@sakridge Could you review whether this approach makes sense, so that I can go ahead and finish it up? :)
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Yea, this seems fine, let's continue in this direction. Thanks for the work.
if next.is_some() {
    next
} else {
    let mut new_all_slots = self.all_root_slots_in_index();
    let next = new_all_slots.pop();

let mut candidates = self.shrink_candidate_slots.lock().unwrap();
As commented, we need to broaden locking a bit.
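A minimal sketch of what broadening that lock could look like: holding `shrink_candidate_slots` across both the candidate pop and the fallback refill so the two steps can't interleave with a concurrent writer. The surrounding structure here is hypothetical, not the merged code:

```rust
use std::sync::Mutex;

struct Db {
    shrink_candidate_slots: Mutex<Vec<u64>>,
    roots: Mutex<Vec<u64>>,
}

impl Db {
    // Hypothetical stand-in for the real index query.
    fn all_root_slots_in_index(&self) -> Vec<u64> {
        self.roots.lock().unwrap().clone()
    }

    fn next_shrink_slot(&self) -> Option<u64> {
        // Take the candidates lock once, up front, instead of only around
        // the refill at the end.
        let mut candidates = self.shrink_candidate_slots.lock().unwrap();
        let next = candidates.pop();
        if next.is_some() {
            next
        } else {
            let mut new_all_slots = self.all_root_slots_in_index();
            let next = new_all_slots.pop();
            *candidates = new_all_slots;
            next
        }
    }
}

fn main() {
    let db = Db {
        shrink_candidate_slots: Mutex::new(vec![]),
        roots: Mutex::new(vec![1, 2, 3]),
    };
    println!("{:?}", db.next_shrink_slot()); // Some(3)
}
```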
Force-pushed from 7cc1da8 to 706590a
@sakridge I think this is ready for review and merge. :)
Force-pushed from c320f9d to 53006c5
Force-pushed from ce5b125 to c32b97a
@sakridge Thanks for the reviews! I've addressed them all!
lgtm
* Prioritize shrinking of previously cleaned roots
* measure time of stale slot shrinking
* Disable shrink for test
* shrink: budgeting, store count, force for snapshot
* Polish implementation and fix tests
* Fix ci..
* Clean up a bit
* Further polish implementation and fix/add tests
* Rebase fixes
* Remove unneeded Default for AccountStorageEntry
* Address review comments
* More cleanup
* More cleanup

(cherry picked from commit dfe72d5)
I've tested this locally and against devnet as a last-minute test and found no issues with the tip of this branch. I've backported it as well. @mvines Feel free to deploy this on devnet at a convenient time. :) Also, this doesn't introduce any consensus change, so it doesn't need a rolling update.
OK, thanks. When it lands in 1.2 I'll just treat this like any other normal patch that'll get deployed in v1.2.2.
It looks like the snapshot bloat issue is actually fixed by this (included in v1.2.2, f13498b):

$ ./target/release/solana --url http://devnet.solana.com:8899 cluster-version
1.2.2 f13498b4
(status update) Also, I'm aware this has landed in testnet/tds as well. From what I casually observe, it isn't causing any harm there. So I'm queueing up my next accountsdb/bank-related change for next week: #10206.
Problem
There are several problems with eager rent collection and stale slot shrinking...
stale slot shrinking (perf. degradation)
Shrinking is causing trouble under load or while benchmarking.
First, shrinking (reading & writing accounts) a slot full of account updates isn't a lightweight background task in any way, even when the slot isn't stale at all. Yet that's constantly happening, too often, while benchmarking. That's partly because our bench repeatedly updates the same accounts from some limited (but large) set of test accounts, so such a heavy slot wrongly passes the existing staleness check too easily (20% of accounts being outdated, i.e. unused/empty/shrinkable). Yeah, that's a plain bug from when stale slot shrinking was introduced. Sorry...
Also, the bench might not be representative of real-world workloads, but this still bothers us and indeed exposes a potential perf problem in stale slot shrinking. And shrinking will probably hurt performance under a real-world saturating-TPS situation.
eager rent collection (bloated snapshot)
Snapshots are prone to grow too much when there is the theoretical maximum number of alive `AppendVec`s. This is currently happening only on devnet, because there are so many rent-exempt accounts there (1.5 million accounts; I counted from a recent sample snapshot). That huge number of `AppendVec`s is a side effect of eager rent collection spreading those accounts so thinly over 2 days' worth of slots (an epoch on tds/mainnet-beta); well, that's the intended design... When there is such a large number of `AppendVec`s, stale slot shrinking works poorly because the traversal algorithm is simplistic; it was originally designed exactly for stale slots. However, eager rent collection now creates fresh `AppendVec`s on devnet at too fast a pace, making the staleness assumption obsolete.
The algorithm is like this: grab the whole set of rooted alive slots from AccountsIndex at once, loop over that set (without fetching new roots) to its end at a 100ms interval, and repeat the process indefinitely. This means that with 400K slots, newly created slots (which eager rent collection creates constantly) aren't shrunk for about 11 hours.
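A minimal sketch of that traversal pattern (not the exact accounts_db code; `all_root_slots_in_index` and `shrink_stale_slot` stand in for the real helpers):

```rust
use std::thread::sleep;
use std::time::Duration;

// Hypothetical stand-ins for the real AccountsDb helpers.
fn all_root_slots_in_index() -> Vec<u64> {
    (0..400_000u64).collect()
}

fn shrink_stale_slot(_slot: u64) {
    // read & rewrite alive accounts if the slot passes the staleness check
}

fn main() {
    // Background loop; runs forever in the validator.
    loop {
        // The whole rooted-slot set is snapshotted once per pass...
        for slot in all_root_slots_in_index() {
            shrink_stale_slot(slot);
            // ...and walked at a fixed 100ms cadence, so a 400,000-slot pass
            // takes 400_000 * 0.1s = 40,000s, roughly 11 hours, before any
            // slot rooted after the snapshot can even be considered.
            sleep(Duration::from_millis(100));
        }
    }
}
```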
Eager rent collection constantly creates 4MB `AppendVec`s holding only a few rent-collected accounts; they should be shrunk but aren't, causing the bloated snapshot. Those few accounts prevent each `AppendVec` from being reclaimed quickly (on an idling cluster it was usually reclaimed, before eager rent collection existed). The outdated garbage sysvars in each `AppendVec` are then a nightmare for `bzip2`: 130kb of slightly changing bits per `AppendVec`.
Also, stale slot shrinking got a bit obsoleted by the introduction of eager rent collection, because stale slots don't exist anymore; there are only slots (`AppendVec`s) at most a single epoch old now. That's because all accounts get updated over an epoch regardless of rent exemption. (This is like very slowly timed DIMM memory refreshing: we require all account data to be equally hot for a uniform rent fee and incentive structure.)
Summary of Changes
eager rent collection
To fix the above problems, let's make stale slot shrinking prioritize recently rooted slots (to fix the bloated snapshot) and optimize it for an idling cluster (to avoid shrinking while benchmarking).
For that, first create a shortcut codepath from recently rooted slots to the slot shrink.
The reason for picking not the immediate root but a previous parent: immediate roots may be too hot, so let them cool down for a while, something like the snapshot interval. Also, sysvars in the most recent root slot cannot be reclaimed (yet). Also, I anticipate slightly older root slots will more often already be dead without needing a shrink at all, and reclaiming dead slots is a lot faster/simpler. (A sketch of this shortcut appears below.)
I chose this over making AppendVec's default size dynamic because, to compress well, we actually need to shrink the AppendVec to strip the sysvars.
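A minimal sketch of the shortcut idea, under assumed names (`shrink_candidate_slots`, a hypothetical `COOLDOWN_SLOTS` constant); the merged code is organized differently:

```rust
use std::sync::Mutex;

// Hypothetical cooldown: shrink not the just-rooted slot but an older one
// that has had roughly a snapshot interval's worth of slots to cool down.
const COOLDOWN_SLOTS: u64 = 100;

struct Db {
    shrink_candidate_slots: Mutex<Vec<u64>>,
}

impl Db {
    // Shortcut codepath: when `root` becomes rooted, queue an older,
    // already-cooled root for shrinking instead of the hot, newest one.
    fn add_root(&self, root: u64) {
        if let Some(cooled) = root.checked_sub(COOLDOWN_SLOTS) {
            self.shrink_candidate_slots.lock().unwrap().push(cooled);
        }
    }
}

fn main() {
    let db = Db {
        shrink_candidate_slots: Mutex::new(vec![]),
    };
    db.add_root(250);
    // The background shrinker would later pop 150 from this queue.
    println!("{:?}", db.shrink_candidate_slots.lock().unwrap()); // [150]
}
```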
stale slot shrinking
Use `AccountStorageEntry.approx_store_count` for a quicker bailout (well, this could have been done immediately after Introduce background stale AppendVec shrink mechanism #9219).
I found that deciding on a proper fixed or dynamic/adaptive threshold is hard. Specifically: what criteria could count as stale in light of the above problems, considering the variable number of votes per slot (these are almost instantly overwritten, counting as empty in a slot, and so are candidates for shrinking), the unbounded-TPS assumption, the heterogeneous machine specs of our validators, the ideal sensitivity to sudden peak TPS, simplicity of implementation, and the assumed heterogeneous workloads and account sizes in the real world.
For example, changing the shrink threshold from 20% shrinkable to 90% shrinkable doesn't quite work for the idling case, where there are not many accounts to begin with (1 stale eager-rent-collected account + 7 outdated sysvar accounts + 1 outdated vote account on devnet).
Instead, I kept two extreme modes of operation in mind when designing the new shrinking strategy: an idling cluster and a peak-load cluster.
I wanted something like this:
When the cluster is idling, the validator shrinks every slot without delay. When the cluster is under high load, as soon as a big stale slot is shrunk, subsequent stale slot shrinking is paused for some time, proportional to the number of shrunk accounts.
Ultimately, my design turned out like this: pseudo peak-load detection via the actual shrunk-account count.
This limits the upper bound of shrunk accounts to 43,200,000 per epoch (= one eager rent collection cycle): 250 * 3600 * 24 * 2, i.e. roughly 250 accounts per second over a 2-day epoch. That's fine because stale slot shrinking is optional to begin with, and the number is large enough for us for now. When there are more accounts than the limit, the snapshot will again start to bloat, but by then a monolithic single snapshot for everything will have other practical difficulties anyway...
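An illustrative sketch of the budgeting idea, assuming a hypothetical rate of 250 shrunk accounts per second (matching the arithmetic above); the real implementation's constants and structure may differ:

```rust
use std::thread::sleep;
use std::time::Duration;

// Assumed budget matching the arithmetic above: ~250 shrunk accounts per
// second, i.e. 250 * 3600 * 24 * 2 = 43,200,000 per 2-day epoch.
const SHRUNK_ACCOUNTS_PER_SEC: u64 = 250;

// Hypothetical shrink step: shrinks one candidate slot (if any) and returns
// how many accounts it actually rewrote.
fn shrink_next_candidate_slot() -> u64 {
    0
}

fn main() {
    // Background shrinker loop; runs forever in the validator.
    loop {
        let shrunk = shrink_next_candidate_slot();
        if shrunk == 0 {
            // Idling cluster: nothing big was shrunk, keep scanning promptly.
            sleep(Duration::from_millis(100));
        } else {
            // Peak load, detected via the actual shrunk-account count:
            // pause proportionally to the work just done. Shrinking 25,000
            // accounts, for example, pauses further shrinking for ~100s.
            sleep(Duration::from_millis(shrunk * 1000 / SHRUNK_ACCOUNTS_PER_SEC));
        }
    }
}
```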
The downside of this strategy is that the reaction latency is rather long: the back-off only kicks in after a slot at peak-capacity TPS has already been processed.
Also, I didn't lower the background thread's priority, because that too easily causes priority inversion without proper handling...
Follow-up to: #9219, #9527.