v1.16: Shares accounts hash cache data between full and incremental (backport of #33164) #33212
Conversation
```
(cherry picked from commit 6298c6c)

# Conflicts:
#	runtime/src/accounts_db.rs
#	runtime/src/cache_hash_data.rs
```
Branch updated: 790f82f to 946fb0e, then to d542486.
```diff
@@ -2370,7 +2368,7 @@ impl AccountsDb {
             (accounts_hash_cache_path, None)
         } else {
             let temp_dir = TempDir::new().expect("new tempdir");
-            let cache_path = temp_dir.path().join(Self::DEFAULT_ACCOUNTS_HASH_CACHE_DIR);
+            let cache_path = temp_dir.path().to_path_buf();
```
This temp dir is only used by the accounts hash cache. By using it directly, we don't need to create a subdir within it.
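To make that ownership point concrete, here is a small standalone sketch. It is not the accounts-db code itself; the `tempfile` crate usage and the `main` wrapper are assumptions added purely for illustration.

```rust
use tempfile::TempDir;

fn main() {
    // Sketch only: this temp dir exists solely to hold accounts hash cache
    // files, so its root can serve as the cache directory directly -- no
    // extra subdirectory needs to be created inside it.
    let temp_dir = TempDir::new().expect("new tempdir");
    let cache_path = temp_dir.path().to_path_buf();
    println!("accounts hash cache at {}", cache_path.display());
    // `temp_dir` must be kept alive for as long as `cache_path` is in use:
    // dropping the TempDir deletes the directory and everything in it.
}
```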
Codecov Report
```
@@            Coverage Diff            @@
##            v1.16   #33212     +/-   ##
=========================================
- Coverage    81.9%    81.9%    -0.1%
=========================================
  Files         762      762
  Lines      207822   207824       +2
=========================================
  Hits       170355   170355
- Misses      37467    37469       +2
```
@HaoranYi I had to resolve merge conflicts in this backport. Would you be able to review this one again?
lgtm
@t-nelson Requesting your review since you've been fielding some of the validator questions/concerns on this. This will help clean up some files on disk for validators. The subsequent PRs #33181 and #33183 will also make their way to v1.16 and will remove all redundant/slightly-confusing files/directories.
@CriesofCarrots Requesting your review for 🦅 👀
this will orphan any old dirs lying around, no? we should hold this until a pr that cleans them up is ready if so
```rust
    pre_existing_cache_files: Arc<Mutex<HashSet<PathBuf>>>,
    should_delete_old_cache_files_on_drop: bool,
```
the inconsistent terminology here isn't great "old" vs. "pre-existing"
> the inconsistent terminology here isn't great "old" vs. "pre-existing"
Yes, it's non-intuitive.

When an instance of `CacheHashData` is created, we call `get_cache_files()` on it. That method populates `pre_existing_cache_files` with all the cache files it finds on disk. As this instance gets used, it is queried for cache files via `get_file_reference_to_map_later()`. As part of that function, we remove the path from `pre_existing_cache_files`.

So, on drop, the remaining paths left in `pre_existing_cache_files` are all the ones that have not been used. This indicates they are now old, since they were not part of the accounts hash calculation, and can be deleted.

So the tl;dr is that on drop, "pre-existing" means "old", which is why both the method (`delete_old_cache_files()`) and the flag (`should_delete_old_cache_files_on_drop`) have "old" in the name.

We could tweak the names to reflect this, or add documentation, or use a second HashSet/container for the pre-existing vs. the old files. I would imagine that those changes should go to master first.
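For reference, a minimal sketch of that lifecycle is below. The struct fields match the snippet quoted earlier, but the method bodies are illustrative assumptions, not the actual implementation in runtime/src/cache_hash_data.rs (which tracks more state and actually maps the files):

```rust
// Illustrative sketch of the pre-existing/old cache file tracking; not the real code.
use std::collections::HashSet;
use std::fs;
use std::path::{Path, PathBuf};
use std::sync::{Arc, Mutex};

struct CacheHashData {
    cache_dir: PathBuf,
    pre_existing_cache_files: Arc<Mutex<HashSet<PathBuf>>>,
    should_delete_old_cache_files_on_drop: bool,
}

impl CacheHashData {
    fn new(cache_dir: PathBuf, should_delete_old_cache_files_on_drop: bool) -> Self {
        let cache = Self {
            cache_dir,
            pre_existing_cache_files: Arc::new(Mutex::new(HashSet::new())),
            should_delete_old_cache_files_on_drop,
        };
        cache.get_cache_files(); // record everything already on disk
        cache
    }

    // Populate `pre_existing_cache_files` with all cache files found on disk.
    fn get_cache_files(&self) {
        let mut pre_existing = self.pre_existing_cache_files.lock().unwrap();
        if let Ok(entries) = fs::read_dir(&self.cache_dir) {
            for entry in entries.flatten() {
                pre_existing.insert(PathBuf::from(entry.file_name()));
            }
        }
    }

    // When a cache file is requested for this calculation, it is no longer
    // "pre-existing": remove it from the set so it survives the drop below.
    fn get_file_reference_to_map_later(&self, file_name: &Path) {
        self.pre_existing_cache_files
            .lock()
            .unwrap()
            .remove(file_name);
        // ...map the file and hand back a reference (elided)...
    }
}

impl Drop for CacheHashData {
    fn drop(&mut self) {
        if self.should_delete_old_cache_files_on_drop {
            // Anything still in the set was never requested during this
            // accounts hash calculation, i.e. it is now "old", so delete it.
            let old_files = self.pre_existing_cache_files.lock().unwrap();
            for file_name in old_files.iter() {
                let _ = fs::remove_file(self.cache_dir.join(file_name));
            }
        }
    }
}
```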
That v1.16 backport PR is here: #33224. We need to merge this one first, otherwise we'll purge the in-use directory on every restart.

Edit: Ideally we land all these backport PRs in the same release, and then there will be no issue.
@brooksprumo, can you make the actual case for backporting? i.e. why is it important these changes go in v1.16 instead of waiting for v1.17?
Once we enable Incremental Accounts Hash, we'll start doing a different accounts hash for incremental snapshots. This will greatly reduce the amount of work each incremental snapshot needs to do for its accompanying (now-incremental) accounts hash. As part of that, the accounts hash cache files, which are the input to the calculation, are somewhat duplicated. IAH uses a smaller range of slots, so the amount of duplication is smaller. But the two main issues are:

(1) The cache files from IAH are not reused with a normal, full accounts hash. This negatively impacts full accounts hash calculations, as it'll need to redo the work to create the input cache files that IAH already did. Since IAH is not enabled on any cluster, we'll see a perf degradation for full accounts hash calculations once IAH is turned on.

(2) Since the full accounts hash calculation will have to redo the same work as IAH, we'll see an unnecessary increase in disk usage. The amount of extra disk usage should be small (I'm guessing single-digit GBs), but why waste the space unnecessarily?

I don't expect the perf degradation from (1) to be large either, and since the accounts hash calculation is in the background, it likely will not impact much. Both issues (1) and (2) get worse as (a) the number of accounts grows, and (b) the snapshot intervals grow. (b) is unlikely to change, but we will continue to see the number of accounts grow.

Edit: So it would be advantageous to get these fixes in before IAH is turned on, so no issues are even experienced. Additionally, correctness is not an issue, just some cache and its subsequent performance implications. All of this is resolved by changing paths, which feels like a low-risk resolution to me.
@brooksprumo, thanks for the detailed reply! This feels like a bug fix for a v1.16-timeline feature, so I am on board with the backport.
yes this. please be sure that #releng is aware |
This is an automatic backport of pull request #33164 done by Mergify.
Cherry-pick of 6298c6c has failed:
To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/github/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally
Mergify commands and options
More conditions and actions can be found in the documentation.
You can also trigger Mergify actions by commenting on this pull request:
- `@Mergifyio refresh` will re-evaluate the rules
- `@Mergifyio rebase` will rebase this PR on its base branch
- `@Mergifyio update` will merge the base branch into this PR
- `@Mergifyio backport <destination>` will backport this PR on the `<destination>` branch

Additionally, on the Mergify dashboard you can:
Finally, you can contact us on https://mergify.com