Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: improve downsampling behavior on KmerMinHash; fix RevIndex::gather bug around scaled. #3342

Merged
merged 25 commits into from
Oct 15, 2024

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Oct 5, 2024

This PR does five things:

First, it swaps the implementation of KmerMinHash::downsample_max_hash with KmerMinHash::downsample_scaled, and the same for KmerMinHashBTree. Previously a call to downsample_scaled calculated the right max_hash from scaled, then called downsample_max_hash, which then converted max_hash back to scaled. This reverses the logic so that (slightly) less work is done and, more importantly, the code is a bit more straightforward.

Second, it changes the downsample_* functions so that they do not downsample when no downsampling is needed. As part of this the method signatures are changed to take an object, rather than a reference. This lets the functions return an unmodified KmerMinHash when no downsampling is needed.

Third, it turns out the downsample_* functions didn't check to make sure that the new scaled value was larger than the old one, i.e. they didn't prevent upsampling. That check was added and a new error, CannotUpsampleScaled, was added to sourmash core.

Fourth, this uncovered a bug in RevIndex::gather where the query was downsampled to the match, even when the match was lower scaled. This PR rejiggers the code so that downsampling is done appropriately in the gather and calculate_gather_stats. Since RevIndex::gather isn't used in the the sourmash CLI, the bug only presented in the test suite and in the branchwater plugin; see sourmash-bio/sourmash_plugin_branchwater#468 and sourmash-bio/sourmash_plugin_branchwater#467, where a fastmultigather test had to be fixed because of the incorrect scaled values output by RevIndex::gather.

Fifth, it includes #3348 from @luizirber, which adds a Signature::try_into() to KmerMinHash to support the elimination of some clones.

Because of the method signature change for the downsample_* functions, the sourmash-core version needs to be bumped to a new major version, 0.16.0.

It's been a fun journey! 😅

Fixes #3343

Some notes on further changes and performance implications:

As a consequence of the RevIndex::gather changes, redundant downsampling has to be done in RevIndex::gather and calculate_gather_stats, unless we want to change the method signature of calculate_gather_stats. I decided the PR was big enough that I didn't want to do that in addition. It should not affect most use cases where scaled is the same, and we will see if it results in any slowdowns over in the branchwater plugin. See #3196 for an issue on all of this.

We could also just insist that the query scaled is the one to pay attention to, per #2951. This would simplify the code in Python-land as well.

Overall, the performance implications of this PR are not clear. Previously downsampling was being done even when it wasn't needed, so this may speed things up quite a lot for our typical use case! On the other hand, redundant downsampling will happen in cases where there are scaled mismatches. We just need to benchmark it, I think.

Some preliminary benchmarking reported in sourmash-bio/sourmash_plugin_branchwater#430 (comment) suggests that fastgather is now much more memory effficient 🎉 so that's good!

TODO:

@ctb ctb changed the title WIP: implement downsample_max_hash in terms of downsample_scaled WIP: implement downsample_max_hash in terms of downsample_scaled Oct 5, 2024
@ctb ctb added the rust label Oct 5, 2024
Copy link

codecov bot commented Oct 5, 2024

Codecov Report

Attention: Patch coverage is 69.23077% with 20 lines in your changes missing coverage. Please review.

Project coverage is 86.42%. Comparing base (1a6312b) to head (80c3404).
Report is 1 commits behind head on latest.

Files with missing lines Patch % Lines
src/core/src/signature.rs 10.00% 9 Missing ⚠️
src/core/src/sketch/minhash.rs 80.00% 9 Missing ⚠️
src/core/src/storage/mod.rs 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           latest    #3342      +/-   ##
==========================================
- Coverage   86.50%   86.42%   -0.09%     
==========================================
  Files         137      137              
  Lines       16046    16069      +23     
  Branches     2211     2211              
==========================================
+ Hits        13881    13888       +7     
- Misses       1858     1874      +16     
  Partials      307      307              
Flag Coverage Δ
hypothesis-py 25.47% <ø> (ø)
python 92.39% <ø> (ø)
rust 62.05% <69.23%> (-0.24%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ctb ctb changed the title WIP: implement downsample_max_hash in terms of downsample_scaled WIP: improve downsampling behavior on KmerMinHash; fix RevIndex::gather bug around scaled. Oct 12, 2024
@ctb ctb changed the title WIP: improve downsampling behavior on KmerMinHash; fix RevIndex::gather bug around scaled. MRG: improve downsampling behavior on KmerMinHash; fix RevIndex::gather bug around scaled. Oct 12, 2024
@ctb
Copy link
Contributor Author

ctb commented Oct 12, 2024

Ready for review & merge @sourmash-bio/devs !!

@ctb
Copy link
Contributor Author

ctb commented Oct 12, 2024

(if & when this is merged, I will cut a new core release too.)

luizirber and others added 3 commits October 13, 2024 04:15
Copy link

codspeed-hq bot commented Oct 13, 2024

CodSpeed Performance Report

Merging #3342 will degrade performances by 16.6%

Comparing refactor_rs_downsample (80c3404) with latest (1a6312b)

Summary

❌ 4 regressions
✅ 17 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark latest refactor_rs_downsample Change
abund0_ani_ci0 560.6 µs 672.2 µs -16.6%
abund0_ani_ci1 581.4 µs 692.9 µs -16.09%
abund1_ani_ci0 787.2 µs 898.7 µs -12.41%
abund1_ani_ci1 809.7 µs 921.2 µs -12.1%

(self, other)
} else {
(other, self)
};
let downsampled_mh = second.downsample_max_hash(first.max_hash)?;
let downsampled_mh = second.clone().downsample_scaled(first.scaled())?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can do (in a future PR, not this one) a new .downsampled_iter(scaled) for operations like count_common, and avoid the conversion.

The downsampled iter would iterate over values that are in the appropriate scaled value, but wouldn't need to create new minhash sketches (can reuse the largest one and stop returning values once they go over max_hash, for example)

@ctb ctb merged commit 7d11173 into latest Oct 15, 2024
40 of 44 checks passed
@ctb ctb deleted the refactor_rs_downsample branch October 15, 2024 17:12
ctb added a commit that referenced this pull request Oct 15, 2024
…ing (#3352)

This PR builds on the refactoring in #3342 to do less downsampling and
also avoids doing intersections twice (per #3196).

Benchmarks in
sourmash-bio/sourmash_plugin_branchwater#471 are
pretty astonishing...

Fixes #3196

---------

Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com>
ctb added a commit that referenced this pull request Oct 15, 2024
## [0.16.0] - 2024-10-15

MSRV: 1.65

Changes/additions:

* refactor `calculate_gather_stats` to disallow repeated downsampling
(#3352)
* improve downsampling behavior on `KmerMinHash`; fix `RevIndex::gather`
bug around `scaled`. (#3342)
* derive Hash for `HashFunctions` (#3344)

Updates:

* Bump web-sys from 0.3.70 to 0.3.72 (#3354)
* Bump tempfile from 3.12.0 to 3.13.0 (#3340)
@ctb ctb mentioned this pull request Dec 5, 2024
ctb added a commit that referenced this pull request Dec 5, 2024
Developer updates:

* build: move ORCID to metadata in pyproject.toml, fix pixi (#3416)
* build: simplify Rust release (#3392)
* fix: Avoid re-calculating md5sum on clone and conversion to
KmerMinHashBTree (#3385)
* r0.15.1 release (#3304)
* update sourmash core to r0.17.0 (#3381)
* Added union method to HLL (#3293)
* Build: upgrade to newer maturin (#3366)
* CI: use supported ubuntu for codspeed (#3350)
* Fix clippy lints from 1.83 beta (#3357)
* Implement resumability for revindex (#3275)
* add `Manifest::intersect_manifest` to Rust core (#3305)
* bump sourmash core to r0.17.2 (#3399)
* change `sig_from_record` to use scaled from `Record` to downsample
(#3387)
* derive Hash for `HashFunctions` (#3344)
* enforce a single scaled on a `CollectionSet` (#3397)
* fix formatting from #3306 (#3307)
* have ruff ignore ipynb so as to avoid triggering an error during CI
(#3325)
* improve downsampling behavior on `KmerMinHash`; fix `RevIndex::gather`
bug around `scaled`. (#3342)
* panic when `FSStorage::load_sig` encounters more than one `Signature`
in a JSON record (#3333)
* propagate error from `RocksDB::open` on bad directory (#3306)
* refactor `calculate_gather_stats` to disallow repeated downsampling
(#3352)
* release core r0.17.1 (#3388)
* release sourmash rust core r0.16.0 (#3356)
* standardize on u32 for scaled, and introduce `ScaledType` (#3364)
* update plugin documentation for users (#3286)
* update sourmash core to r0.15.2 (#3338)
* when lingroups are provided, use them for `csv_summary` (#3311)
* Misc Rust updates to core (#3297)
* Resolve issue for high precision MLE estimation (#3296)

Dependabot and pre-commit CI updates:

* Bump DeterminateSystems/magic-nix-cache-action from 7 to 8 (#3319)
* Bump DeterminateSystems/nix-installer-action from 13 to 14 (#3320)
* Bump DeterminateSystems/nix-installer-action from 14 to 15 (#3374)
* Bump DeterminateSystems/nix-installer-action from 15 to 16 (#3401)
* Bump camino from 1.1.7 to 1.1.9 (#3301)
* Bump codspeed-criterion-compat from 2.6.0 to 2.7.2 (#3324)
* Bump conda-incubator/setup-miniconda from 3.0.4 to 3.1.0 (#3373)
* Bump csv from 1.3.0 to 1.3.1 (#3390)
* Bump getset from 0.1.2 to 0.1.3 (#3328)
* Bump histogram from 0.11.0 to 0.11.1 (#3377)
* Bump js-sys from 0.3.72 to 0.3.74 (#3412)
* Bump memmap2 from 0.9.4 to 0.9.5 (#3326)
* Bump myst-parser from 3.0.1 to 4.0.0 (#3277)
* Bump needletail from 0.5.1 to 0.6.0 (#3376)
* Bump pypa/cibuildwheel from 2.19.2 to 2.20.0 (#3278)
* Bump pypa/cibuildwheel from 2.20.0 to 2.21.1 (#3332)
* Bump pypa/cibuildwheel from 2.21.1 to 2.21.2 (#3345)
* Bump pypa/cibuildwheel from 2.21.2 to 2.21.3 (#3353)
* Bump pypa/cibuildwheel from 2.21.3 to 2.22.0 (#3408)
* Bump roaring from 0.10.6 to 0.10.7 (#3413)
* Bump serde from 1.0.204 to 1.0.207 (#3289)
* Bump serde from 1.0.207 to 1.0.208 (#3298)
* Bump serde from 1.0.208 to 1.0.209 (#3310)
* Bump serde from 1.0.209 to 1.0.210 (#3318)
* Bump serde from 1.0.210 to 1.0.214 (#3368)
* Bump serde from 1.0.214 to 1.0.215 (#3403)
* Bump serde_json from 1.0.120 to 1.0.121 (#3267)
* Bump serde_json from 1.0.121 to 1.0.122 (#3280)
* Bump serde_json from 1.0.122 to 1.0.124 (#3288)
* Bump serde_json from 1.0.124 to 1.0.125 (#3302)
* Bump serde_json from 1.0.125 to 1.0.127 (#3309)
* Bump serde_json from 1.0.127 to 1.0.128 (#3316)
* Bump serde_json from 1.0.128 to 1.0.132 (#3358)
* Bump serde_json from 1.0.132 to 1.0.133 (#3402)
* Bump sphinx-design from 0.5.0 to 0.6.0 (#3268)
* Bump sphinx-design from 0.6.0 to 0.6.1 (#3276)
* Bump tempfile from 3.10.1 to 3.11.0 (#3279)
* Bump tempfile from 3.11.0 to 3.12.0 (#3287)
* Bump tempfile from 3.12.0 to 3.13.0 (#3340)
* Bump tempfile from 3.13.0 to 3.14.0 (#3391)
* Bump thiserror from 1.0.63 to 1.0.64 (#3335)
* Bump thiserror from 1.0.64 to 1.0.65 (#3367)
* Bump thiserror from 1.0.65 to 1.0.68 (#3379)
* Bump thiserror from 1.0.68 to 2.0.3 (#3389)
* Bump web-sys from 0.3.69 to 0.3.70 (#3299)
* Bump web-sys from 0.3.70 to 0.3.72 (#3354)
* Bump web-sys from 0.3.72 to 0.3.74 (#3411)
* Update pytest-cov requirement from <6.0,>=4 to >=4,<7.0 (#3375)
* Update sphinx requirement from <8,>=6 to >=6,<9 (#3269)
* Upgrade rocksdb to 0.22.0, bump MSRV to 1.66  (#3383)
* [pre-commit.ci] pre-commit autoupdate (#3281)
* [pre-commit.ci] pre-commit autoupdate (#3290)
* [pre-commit.ci] pre-commit autoupdate (#3312)
* [pre-commit.ci] pre-commit autoupdate (#3330)
* [pre-commit.ci] pre-commit autoupdate (#3336)
* [pre-commit.ci] pre-commit autoupdate (#3341)
* [pre-commit.ci] pre-commit autoupdate (#3346)
* [pre-commit.ci] pre-commit autoupdate (#3360)
* [pre-commit.ci] pre-commit autoupdate (#3369)
* [pre-commit.ci] pre-commit autoupdate (#3380)
* [pre-commit.ci] pre-commit autoupdate (#3393)
* [pre-commit.ci] pre-commit autoupdate (#3404)
* [pre-commit.ci] pre-commit autoupdate (#3409)
* [pre-commit.ci] pre-commit autoupdate (#3414)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

downsample_* functions in minhash.rs _always_ downsample, even when downsampling is not necessary
2 participants