
v1.16: Shares accounts hash cache data between full and incremental (backport of #33164) #33212

Merged

merged 2 commits into v1.16 from mergify/bp/v1.16/pr-33164 on Sep 13, 2023

Conversation

mergify[bot]
Contributor

@mergify mergify bot commented Sep 11, 2023

This is an automatic backport of pull request #33164 done by Mergify.
Cherry-pick of 6298c6c has failed:

On branch mergify/bp/v1.16/pr-33164
Your branch is up to date with 'origin/v1.16'.

You are currently cherry-picking commit 6298c6c31e.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   runtime/src/accounts_db.rs
	both modified:   runtime/src/cache_hash_data.rs

no changes added to commit (use "git add" and/or "git commit -a")

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/github/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally


Mergify commands and options

More conditions and actions can be found in the documentation.

You can also trigger Mergify actions by commenting on this pull request:

  • @Mergifyio refresh will re-evaluate the rules
  • @Mergifyio rebase will rebase this PR on its base branch
  • @Mergifyio update will merge the base branch into this PR
  • @Mergifyio backport <destination> will backport this PR on <destination> branch

Additionally, on Mergify dashboard you can:

  • look at your merge queues
  • generate the Mergify configuration with the config editor.

Finally, you can contact us on https://mergify.com

(cherry picked from commit 6298c6c)

# Conflicts:
#	runtime/src/accounts_db.rs
#	runtime/src/cache_hash_data.rs
@mergify mergify bot added the conflicts label Sep 11, 2023
@brooksprumo force-pushed the mergify/bp/v1.16/pr-33164 branch from 790f82f to 946fb0e on September 11, 2023 21:33
@brooksprumo force-pushed the mergify/bp/v1.16/pr-33164 branch from 946fb0e to d542486 on September 11, 2023 21:40
@@ -2370,7 +2368,7 @@ impl AccountsDb {
             (accounts_hash_cache_path, None)
         } else {
             let temp_dir = TempDir::new().expect("new tempdir");
-            let cache_path = temp_dir.path().join(Self::DEFAULT_ACCOUNTS_HASH_CACHE_DIR);
+            let cache_path = temp_dir.path().to_path_buf();
Contributor

This temp dir is only used by the accounts hash cache. By using it directly, we don't need to create a subdir within it.
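For context, a minimal sketch of what the change amounts to (illustrative only; it uses the tempfile crate, and the function name is hypothetical):

use std::path::PathBuf;
use tempfile::TempDir;

// Illustrative only: since this TempDir exists solely for the accounts hash
// cache, its root can serve as the cache path directly, instead of creating a
// dedicated subdirectory inside it.
fn make_transient_cache_path() -> (TempDir, PathBuf) {
    let temp_dir = TempDir::new().expect("new tempdir");
    let cache_path = temp_dir.path().to_path_buf();
    // Keep the TempDir alive alongside the path; the directory on disk is
    // removed when the TempDir is dropped.
    (temp_dir, cache_path)
}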

@codecov

codecov bot commented Sep 11, 2023

Codecov Report

Merging #33212 (d542486) into v1.16 (2cbbf4b) will decrease coverage by 0.1%.
The diff coverage is 85.7%.

@@            Coverage Diff            @@
##            v1.16   #33212     +/-   ##
=========================================
- Coverage    81.9%    81.9%   -0.1%     
=========================================
  Files         762      762             
  Lines      207822   207824      +2     
=========================================
  Hits       170355   170355             
- Misses      37467    37469      +2     

@brooksprumo
Contributor

@HaoranYi I had to resolve merge conflicts in this backport. Would you be able to review this one again?

@brooksprumo requested review from brooksprumo and removed request for brooksprumo on September 11, 2023 23:05
Contributor

@HaoranYi left a comment


lgtm

@brooksprumo
Contributor

brooksprumo commented Sep 12, 2023

@t-nelson Requesting your review since you've been fielding some of the validator questions/concerns on this. This will help clean up some files on disk for validators.

The subsequent PRs #33181 and #33183 will also make their way to v1.16; they will remove all the redundant/slightly-confusing files and directories.

@brooksprumo
Contributor

@CriesofCarrots Requesting your review for 🦅 👀

Contributor

@t-nelson left a comment


this will orphan any old dirs lying around, no? we should hold this until a pr that cleans them up is ready if so

Comment on lines +145 to +146
pre_existing_cache_files: Arc<Mutex<HashSet<PathBuf>>>,
should_delete_old_cache_files_on_drop: bool,
Contributor

the inconsistent terminology here isn't great: "old" vs. "pre-existing"

Contributor

the inconsistent terminology here isn't great: "old" vs. "pre-existing"

Yes, it's non-intuitive.

When an instance of CacheHashData is created, we call get_cache_files() on it. That method populates pre_existing_cache_files with all the cache files it finds on disk. As this instance gets used, it is queried for cache files via get_file_reference_to_map_later(). As part of that function, we remove the path from pre_existing_cache_files.

So, on drop, the remaining paths left in pre_existing_cache_files are all the ones that have not been used. This indicates they are now old, since they were not part of the accounts hash calculation, and can be deleted.

So the tl;dr is that on drop, "pre-existing" means "old", which is why both the method (delete_old_cache_files()) and the flag (should_delete_old_cache_files_on_drop) have "old" in the name.

We could tweak the names to reflect this, or add documentation, or add a second HashSet/container for the pre-existing vs. the old files. I would imagine those changes should go to master first.
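To make the lifecycle concrete, here is a minimal sketch of the pattern described above (simplified, not the actual accounts-db code; method bodies are abbreviated and only std is used):

use std::collections::HashSet;
use std::path::{Path, PathBuf};
use std::sync::{Arc, Mutex};

// Simplified sketch of the "pre-existing" vs. "old" cache file lifecycle.
struct CacheHashData {
    cache_dir: PathBuf,
    pre_existing_cache_files: Arc<Mutex<HashSet<PathBuf>>>,
    should_delete_old_cache_files_on_drop: bool,
}

impl CacheHashData {
    fn new(cache_dir: PathBuf, should_delete_old_cache_files_on_drop: bool) -> Self {
        let this = Self {
            cache_dir,
            pre_existing_cache_files: Arc::new(Mutex::new(HashSet::new())),
            should_delete_old_cache_files_on_drop,
        };
        this.get_cache_files(); // record everything already on disk as "pre-existing"
        this
    }

    // Populate pre_existing_cache_files with all cache files found on disk.
    fn get_cache_files(&self) {
        let mut pre_existing = self.pre_existing_cache_files.lock().unwrap();
        if let Ok(entries) = std::fs::read_dir(&self.cache_dir) {
            for entry in entries.flatten() {
                pre_existing.insert(PathBuf::from(entry.file_name()));
            }
        }
    }

    // A file queried for the current calculation is removed from the set:
    // it is in use, so it must survive the drop.
    fn get_file_reference_to_map_later(&self, file_name: &Path) {
        self.pre_existing_cache_files.lock().unwrap().remove(file_name);
        // ... hand back a reference so the file can be mmapped later ...
    }

    // Whatever is still in the set was not used by this calculation, so it is
    // now "old" and can be deleted.
    fn delete_old_cache_files(&self) {
        let old_files = self.pre_existing_cache_files.lock().unwrap();
        for file_name in old_files.iter() {
            let _ = std::fs::remove_file(self.cache_dir.join(file_name));
        }
    }
}

impl Drop for CacheHashData {
    fn drop(&mut self) {
        if self.should_delete_old_cache_files_on_drop {
            self.delete_old_cache_files();
        }
    }
}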

@brooksprumo
Contributor

brooksprumo commented Sep 12, 2023

this will orphan any old dirs lying around, no? we should hold this until a pr that cleans them up is ready if so

That v1.16 backport PR is here: #33224. We need to merge this one first, otherwise we'll purge the in-use directory on every restart.

Edit: Ideally we land all these backport PRs in the same release, and then there will be no issue.

@CriesofCarrots
Contributor

@brooksprumo, can you make the actual case for backporting? i.e., why is it important that these changes go into v1.16 instead of waiting for v1.17?

@brooksprumo
Contributor

brooksprumo commented Sep 12, 2023

@brooksprumo, can you make the actual case for backporting? i.e., why is it important that these changes go into v1.16 instead of waiting for v1.17?

Once we enable Incremental Accounts Hash, we'll start doing a different accounts hash for incremental snapshots. This will greatly reduce the amount of work each incremental snapshot needs to do for its accompanying (now-incremental) accounts hash.

As part of that, the accounts hash cache files, which are the input to the calculation, are somewhat duplicated between the full and incremental calculations. IAH uses a smaller range of slots, so the amount of duplication is smaller.

But there are two main issues. (1) The cache files from IAH are not reused by a normal, full accounts hash calculation. This negatively impacts full accounts hash calculations, since they will need to redo the work of creating the input cache files that IAH already did. IAH is not yet enabled on any cluster, so this perf degradation for full accounts hash calculations will only show up once IAH is turned on.

And (2), since the full accounts hash calculation will have to redo the same work as IAH, we'll see an unnecessary increase in disk usage. The amount of extra disk usage should be small (I'm guessing single-digit GBs), but why waste the space unnecessarily?

I don't expect the perf degradation from (1) to be large either, and since the accounts hash calculation is in the background, it likely will not impact much. Both issues (1) and (2) get worse as (a) the number of accounts grows, and (b) the snapshot intervals grow. (b) is unlikely to change, but we will continue to see the number of accounts grow.

Edit:

So it would be advantageous to get these fixes in before IAH is turned on, so the issues are never experienced. Additionally, correctness is not a concern here, just the cache and its performance implications. All of this is resolved by changing paths, which feels like a low-risk resolution to me.
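To illustrate the reuse idea in (1), here is a sketch with purely hypothetical naming and lookup (not the actual accounts-db scheme): if full and incremental calculations share one cache directory and key the cache files the same way, a chunk hashed for one calculation can simply be read back by the other instead of being rescanned.

use std::path::{Path, PathBuf};

// Hypothetical illustration of sharing cache files between full and
// incremental accounts hash calculations. The file naming is invented for the
// example; the point is only that a shared directory plus a shared key lets
// one calculation reuse work done by the other.
fn cache_file_for_chunk(cache_dir: &Path, start_slot: u64, end_slot: u64) -> PathBuf {
    cache_dir.join(format!("{start_slot}.{end_slot}"))
}

fn load_or_generate_chunk(cache_dir: &Path, start_slot: u64, end_slot: u64) -> std::io::Result<Vec<u8>> {
    let path = cache_file_for_chunk(cache_dir, start_slot, end_slot);
    match std::fs::read(&path) {
        // Reuse a cache file written by a previous (full or incremental) calculation.
        Ok(bytes) => Ok(bytes),
        // Otherwise do the expensive scan and write the cache file for next time.
        Err(_) => {
            let bytes = scan_and_hash_accounts(start_slot, end_slot);
            std::fs::write(&path, &bytes)?;
            Ok(bytes)
        }
    }
}

// Stand-in for the expensive account scan; also hypothetical.
fn scan_and_hash_accounts(_start_slot: u64, _end_slot: u64) -> Vec<u8> {
    Vec::new()
}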

@CriesofCarrots
Contributor

@brooksprumo , thanks for the detailed reply! This feels like a bug fix for a v1.16-timeline feature, so I am on board with the backport.

@t-nelson
Contributor

Edit: Ideally we land all these backport PRs in the same release, and then there will be no issue.

yes this. please be sure that #releng is aware

@t-nelson merged commit c0dcc77 into v1.16 on Sep 13, 2023
@t-nelson deleted the mergify/bp/v1.16/pr-33164 branch on September 13, 2023 15:03