-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MRG: add generic support for any type of sketch collection as query or database #430
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…water into ctb_misc_cleanup
…water into ctb_misc2
…water into ctb_misc2
…water into ctb_misc_cleanup
4 tasks
#434) * preliminary victory * compiles and mostly runs * cleanup, split to new module * cleanup and comment * more cleanup of diff * cargo fmt * fix fmt * restore n_failed * comment failing test * cleanup and de-vec * create module/submodule structure * comment for later * get rid of vec * beg for help * cleanup and doc
…water into ctb_misc2
This was referenced Oct 12, 2024
ctb
changed the title
WIP: add generic support for any type of sketch collection as query or database
MRG: add generic support for any type of sketch collection as query or database
Oct 13, 2024
…water into ctb_misc2
ctb
added a commit
to sourmash-bio/sourmash
that referenced
this pull request
Oct 15, 2024
…ather` bug around `scaled`. (#3342) This PR does five things: First, it swaps the implementation of `KmerMinHash::downsample_max_hash` with `KmerMinHash::downsample_scaled`, and the same for `KmerMinHashBTree`. Previously a call to `downsample_scaled` calculated the right `max_hash` from `scaled`, then called `downsample_max_hash`, which then converted `max_hash` back to `scaled`. This reverses the logic so that (slightly) less work is done and, more importantly, the code is a bit more straightforward. Second, it changes the `downsample_*` functions so that they do not downsample when no downsampling is needed. As part of this the method signatures are changed to take an object, rather than a reference. This lets the functions return an unmodified `KmerMinHash` when no downsampling is needed. Third, it turns out the `downsample_*` functions didn't check to make sure that the new `scaled` value was larger than the old one, i.e. they didn't prevent upsampling. That check was added and a new error, `CannotUpsampleScaled`, was added to sourmash core. Fourth, this uncovered a bug in `RevIndex::gather` where the query was downsampled to the match, even when the match was lower scaled. This PR rejiggers the code so that downsampling is done appropriately in the `gather` and `calculate_gather_stats`. Since `RevIndex::gather` isn't used in the the sourmash CLI, the bug only presented in the test suite and in the branchwater plugin; see sourmash-bio/sourmash_plugin_branchwater#468 and sourmash-bio/sourmash_plugin_branchwater#467, where a fastmultigather test had to be fixed because of the incorrect scaled values output by `RevIndex::gather`. Fifth, it includes #3348 from @luizirber, which adds a `Signature::try_into()` to `KmerMinHash` to support the elimination of some clones. Because of the method signature change for the `downsample_*` functions, the sourmash-core version needs to be bumped to a new major version, 0.16.0. It's been a fun journey! 😅 Fixes #3343 Some notes on further changes and performance implications: As a consequence of the `RevIndex::gather` changes, redundant downsampling has to be done in `RevIndex::gather` and `calculate_gather_stats`, unless we want to change the method signature of `calculate_gather_stats`. I decided the PR was big enough that I didn't want to do that in addition. It should not affect most use cases where `scaled` is the same, and we will see if it results in any slowdowns over in the branchwater plugin. See #3196 for an issue on all of this. We could also just insist that the query scaled is the one to pay attention to, per #2951. This would simplify the code in Python-land as well. Overall, the performance implications of this PR are not clear. Previously downsampling was being done even when it wasn't needed, so this may speed things up quite a lot for our typical use case! On the other hand, redundant downsampling will happen in cases where there are scaled mismatches. We just need to benchmark it, I think. Some preliminary benchmarking reported in sourmash-bio/sourmash_plugin_branchwater#430 (comment) suggests that fastgather is now much more memory effficient 🎉 so that's good! TODO: - [x] resolve the scaled mismatch stuff. do we return an `Err` or what if the downsampling can't be performed? - [x] update PR description - [x] add more tests for downsampling, and maybe for gather - [x] play with this code over in the branchwater plugin too! sourmash-bio/sourmash_plugin_branchwater#467 --------- Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com>
Ready for review & maybe merge! @bluegenes 🎉 |
bluegenes
approved these changes
Oct 15, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
@bluegenes says "go for it" on slack. On my head be it! 🎲 |
ctb
added a commit
that referenced
this pull request
Oct 15, 2024
…#471) * refactor & rename & consolidate * remove 'lower' * add cargo doc output for private fn * add a few comments/docs * switch to dev version of sourmash * tracking * cleaner * cleanup * load rocksdb natively * foo * cargo fmt * upd * upd * fix fmt * MRG: create `MultiCollection` for collections that span multiple files (#434) * preliminary victory * compiles and mostly runs * cleanup, split to new module * cleanup and comment * more cleanup of diff * cargo fmt * fix fmt * restore n_failed * comment failing test * cleanup and de-vec * create module/submodule structure * comment for later * get rid of vec * beg for help * cleanup and doc * clippy fixes * compiling again * cleanup * bump sourmash to v0.15.1 * check if is rocksdb * weird error * use remove_unwrap branch of sourmash * get index to work with MultiCollection * old bug now fixed * clippy, format, and fix * make names clearer * ditch MultiCollection for index, at least for now * testy testy * getting closer * update sourmash * mark failing tests * upd * cargo fmt * MRG: test exit from `pairwise` and `multisearch` if no loaded sketches (#437) * upd * check for appropriate multisearch error exits * add more tests for pairwise, too * cargo fmt * MRG: switch to more efficient use of `Collection` by removing cloning (#438) * remove unnecessary clones by switch to references in SmallSignature * switch away from references for collections => avoid clones * remove MultiCollection::iter * MRG: add tests for RocksDB/RevIndex, standalone manifests, and flexible pathlists (#436) * test using rocksdb as source of sketches * test file lists of zips * cargo fmt * hackity hack hack a picklist * ok that makes more sense * it works * comments around future par_iter * support loading from a .sig.gz for index * test pairwise loading from rocksdb * add test for queries from Rocksdb * decide not to implement lists of manifests :) * reenable and fix test_fastgather.py::test_indexed_against * impl Deref for MultiCollection * clippy * switch to using load_sketches method * deref doesn't actually make sense for MultiCollection * update to latest sourmash code * update to latest sourmash code * simplify * update to latest sourmash code * remove unnecessary flag * MRG: support & test loading of standalone manifests within pathlists (#450) * use recursion to load paths into a MultiCollection => mf support * MRG: clean up index to use `MultiCollection` (#451) * try making index work with standard code * kinda working * fmt * refactor * clear up the tests * refactor/clean up * cargo fmt * add tests for index warning & error * comment * MRG: documentation updates based on new collection loading (#444) * update docs for #430 * upd documentation * upd * Update src/lib.rs Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com> * switch unwrap to expect * move unwrap to expect * minor cleanup * cargo fmt * provide legacy method to avoid xfail on index loading * switch to using reference * update docs to reflect pathlist behavior * test recursive nature of MultiCollection * re-enable test that is now passing * update to latest sourmash * upd sourmash * update sourmash * mut MultiCollection * cleanup * update after merge of sourmash-bio/sourmash#3305 * fix contains_revindex * add trace commands for tracing loading * use released version of sourmash * add support for ignoring abundance * cargo fmt * avoid downsampling until we know there is overlap * change downsample to true; add panic assertion * move downsampling side guard * eliminate redundant overlap check * move calc_abund_stats * extract abundance code into own function; avoid downsampling if poss * cleanup * fmt * update to next sourmash release * cargo fmt * upd sourmash * correct numbers * upd sourmash * upd sourmash * upd sourmash * upd sourmash * use new try_into() and eliminate several clone()s * refactor a bit more * use new try_into() in manysearch; flag clones * avoid a few more clones * eliminate more clone * fix mismatched clauses * note minhash * fix mastiff_manygather * avoid more clone * resolve comments * microchange * microchange 2 * eliminate more clone: fastgather * avoid more clone: fastmultigather * refactor to avoid more clones * rm one more clone * cleanup * cargo fmt * cargo fmt * deallocate collection? * deallocate collection? * upd sourmash * cargo fmt * fix merge foo * try out new sourmash PR * upd latest sourmash branch * upd sourmash --------- Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>
This was referenced Oct 15, 2024
Merged
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR adds
MultiCollection
, a wrapper around multipleCollection
structs, as an implementation for many important features, including loading of standalone manifests and zip files from pathlists. As part of this it also adds direct sketch loading from RocksDB/RevIndex.This PR:
multisearch
continues past "no query signatures loaded, exiting" message #280It does, however, break some functionality in
index
, becauseRevIndex
es with external storage cannot be created from multipleCollectionSet
s, so we had to use specialized loading code.Note also that there is a bug around loading multisketch files from pathlists in (see #445); this PR does not adjust the code to deal with this.
TODO:
manysearch
performance slowdown between v0.9.5 and v0.9.6 #463, 'unreleased with 430 and 464')Manifest::intersect_manifest
to Rust core sourmash#3305)test_fastgather.py::test_indexed_against
MultiCollection
/SmallSignature
struct to not cloneCollection
s each timeindex
to take more/better inputsFrom #436 -
From #437 -
multisearch
error exit/better reporting, add tests, etc.Punting to issues, to be created before merge:
test_fastgather.py::test_against_multisigfile
(see WIP: debug multisigfile test #445)env_logger
, perhaps via features?Collection
andStorage
: consider how to support more flexibleCollection
inRevIndex
for external storage sourmash#3321