-
Notifications
You must be signed in to change notification settings - Fork 4.5k
Remove needless last_root for better reclaims #8148
Conversation
Codecov Report
@@ Coverage Diff @@
## master #8148 +/- ##
========================================
- Coverage 82% 82% -0.1%
========================================
Files 249 249
Lines 53530 53503 -27
========================================
- Hits 43913 43887 -26
+ Misses 9617 9616 -1 |
Seems right to me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome detective work as always @ryoqun! Any chance you could write a test for this?
Sad status report: It seems the description is wrong. I can't create a valid test demonstrating the described error condition... And I cannot find another alternative bad scenario at the moment.... This bug is hard. ;) |
4132eb6
to
5cdec64
Compare
5cdec64
to
3120504
Compare
@@ -1156,9 +1153,7 @@ impl AccountsDB { | |||
dead_slots | |||
} | |||
|
|||
fn cleanup_dead_slots(&self, dead_slots: &mut HashSet<Slot>, last_root: u64) { | |||
// a slot is not totally dead until it is older than the root | |||
dead_slots.retain(|slot| *slot < last_root); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized all inputs of this function is ensured to be older or equal to last_root
.
Notice the or equal to
part, this assertion must be *slot <= last_root
for purge_zero_lamport_accounts
. But, we don't need this to begin with, as said above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, why are they guaranteed to be older than last_root?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can explain this by reading the code from here upwards the call graph.
Firstly, this cleanup_dead_slots(dead_slots, last_root)
is only called from handle_reclaims(reclaims)
.
dead_slots
are derived from reclaims
by calling remove_dead_accounts()
inside handle_reclaims()
, which does just simple 1-to-1 storage -> slot mapping. Thus, we just need to explain in terms of reclaims
further on.
handle_reclaims()
are called only from purge_zero_lamport_accounts()
(1) and store_with_hashes()
(2). These are the only 2 code paths reaching to this function.
(1) For the purge_zero_lamport_accounts()
cod epath, reasoning is simple:
The reclaims
is a collection of results of accounts_index.purge(&pubkey)
, which in turn the sum of get_rooted_entries()
for all such zero lamport pubkey
s. So, reclaims
(thus, dead_slots
) are always rooted slots only. This is true for normal operation because the validator calls add_root()
as it roots slots.
Also this is also true for snapshot restoration because generate_index()
is called before purge_zero_lamport_accounts()
and generate_index()
does accounts_index.roots.insert(*slot_id)
.
(2) For the store_with_hashes()
, reasoning is a bit longer:
The reclaims
is ultimately only generated at accounts_index.update()
. There is intermediately fn
s like accounts_db.update_index()
and accounts_index.insert()
, but the destination of this call graph is same.
Next, accounts_index.update(slot, ...)
creates reclaims
by concatenating two sets:
A: can_purge
-able (= slot
< max_root
) entries
B: Old entries in slot
by replacing with updated one
A is obviously fine because max_root
is always member of accounts_index.roots
. B is ensured not to be dead
at remove_dead_accounts()
because the case B is always inserting an updated alive entry into the slot
.
So, as a normal operation, cleanup_dead_slots()
never removes anything such that slot > last_root
.
And leaving this line's retain
causes halfway-deallocated state because cleanup_dead_slots()
is only called once ever for each given slot. That's because later invocations of remove_dead_accounts()
would remove any slots whose storage are already removed.
Thanks for reading a rather long comment. :)
So I think this retain
does add little value and complicates code reading with seemingly overlapping and spread responsibility for ensuring purge
-able slots at caller-site or at callee-site.
As a second thought, if you're a bit concerned for future changes that might break these assumptions, this PR can be changed to assert!
that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Thanks for the description. I think it can be removed.
@@ -161,11 +154,6 @@ impl<T: Clone> AccountsIndex<T> { | |||
} | |||
|
|||
pub fn add_root(&mut self, slot: Slot) { | |||
assert!( | |||
(self.last_root == 0 && slot == 0) || (slot >= self.last_root), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think removing this assertion is ok to remove last_root
, first of all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sakridge is the best person to approve this PR, he's more familiar with the details in this part of the codebase than I
Removing the |
@sakridge Thanks for checking this! I've just merged! |
This got needed in the v0.23 branch for #8436. |
* Restore last_root to fix unintended storage delete * Remove last_root thing altogether * Remove unneeded test... (cherry picked from commit 5872746)
Problem
last_root wasn't correctly restored from snapshot and its related range check is bad for
purge_zero_lamport_account
, which ultimately causes dangling storage entries some times when restoring from snapshot.(a lot of original investigation report for #8130 is moved to new PR: #8176)
Summary of changes
After careful code scrutinization, I'm pretty sure the
last_root
thing can go away.Just remove
last_root
Background
I found this as part of investigation for #8130.
Part of #7167.