Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: add generic support for any type of sketch collection as query or database #430

Merged
merged 111 commits into from
Oct 15, 2024

Conversation

ctb
Copy link
Collaborator

@ctb ctb commented Aug 19, 2024

This PR adds MultiCollection, a wrapper around multiple Collection structs, as an implementation for many important features, including loading of standalone manifests and zip files from pathlists. As part of this it also adds direct sketch loading from RocksDB/RevIndex.

This PR:

It does, however, break some functionality in index, because RevIndexes with external storage cannot be created from multiple CollectionSets, so we had to use specialized loading code.

Note also that there is a bug around loading multisketch files from pathlists in (see #445); this PR does not adjust the code to deal with this.

TODO:

From #436 -

  • add tests for standalone manifests containing zip files
  • add tests for pathlists containing zip files
  • add tests for fastgather loading a query sketch from RocksDB
  • add tests for fastgather loading against sketches from RocksDB
  • add tests for multisearch loading query & against sketches from RocksDB
  • add tests for pairwise loading sketches from RocksDB
  • add check for warning about loading all sketches from a RocksDB

From #437 -

  • regularize the code for multisearch error exit/better reporting, add tests, etc.

Punting to issues, to be created before merge:

@ctb ctb changed the base branch from main to ctb_misc_cleanup August 19, 2024 23:00
ctb added 10 commits August 20, 2024 10:32
#434)

* preliminary victory

* compiles and mostly runs

* cleanup, split to new module

* cleanup and comment

* more cleanup of diff

* cargo fmt

* fix fmt

* restore n_failed

* comment failing test

* cleanup and de-vec

* create module/submodule structure

* comment for later

* get rid of vec

* beg for help

* cleanup and doc
@ctb ctb changed the title WIP: add generic support for any type of sketch collection as query or database MRG: add generic support for any type of sketch collection as query or database Oct 13, 2024
ctb added a commit to sourmash-bio/sourmash that referenced this pull request Oct 15, 2024
…ather` bug around `scaled`. (#3342)

This PR does five things:

First, it swaps the implementation of `KmerMinHash::downsample_max_hash`
with `KmerMinHash::downsample_scaled`, and the same for
`KmerMinHashBTree`. Previously a call to `downsample_scaled` calculated
the right `max_hash` from `scaled`, then called `downsample_max_hash`,
which then converted `max_hash` back to `scaled`. This reverses the
logic so that (slightly) less work is done and, more importantly, the
code is a bit more straightforward.

Second, it changes the `downsample_*` functions so that they do not
downsample when no downsampling is needed. As part of this the method
signatures are changed to take an object, rather than a reference. This
lets the functions return an unmodified `KmerMinHash` when no
downsampling is needed.

Third, it turns out the `downsample_*` functions didn't check to make
sure that the new `scaled` value was larger than the old one, i.e. they
didn't prevent upsampling. That check was added and a new error,
`CannotUpsampleScaled`, was added to sourmash core.

Fourth, this uncovered a bug in `RevIndex::gather` where the query was
downsampled to the match, even when the match was lower scaled. This PR
rejiggers the code so that downsampling is done appropriately in the
`gather` and `calculate_gather_stats`. Since `RevIndex::gather` isn't
used in the the sourmash CLI, the bug only presented in the test suite
and in the branchwater plugin; see
sourmash-bio/sourmash_plugin_branchwater#468
and
sourmash-bio/sourmash_plugin_branchwater#467,
where a fastmultigather test had to be fixed because of the incorrect
scaled values output by `RevIndex::gather`.

Fifth, it includes #3348
from @luizirber, which adds a `Signature::try_into()` to `KmerMinHash`
to support the elimination of some clones.

Because of the method signature change for the `downsample_*` functions,
the sourmash-core version needs to be bumped to a new major version,
0.16.0.

It's been a fun journey! 😅 

Fixes #3343

Some notes on further changes and performance implications:

As a consequence of the `RevIndex::gather` changes, redundant
downsampling has to be done in `RevIndex::gather` and
`calculate_gather_stats`, unless we want to change the method signature
of `calculate_gather_stats`. I decided the PR was big enough that I
didn't want to do that in addition. It should not affect most use cases
where `scaled` is the same, and we will see if it results in any
slowdowns over in the branchwater plugin. See
#3196 for an issue on all
of this.

We could also just insist that the query scaled is the one to pay
attention to, per #2951.
This would simplify the code in Python-land as well.

Overall, the performance implications of this PR are not clear.
Previously downsampling was being done even when it wasn't needed, so
this may speed things up quite a lot for our typical use case! On the
other hand, redundant downsampling will happen in cases where there are
scaled mismatches. We just need to benchmark it, I think.

Some preliminary benchmarking reported in
sourmash-bio/sourmash_plugin_branchwater#430 (comment)
suggests that fastgather is now much more memory effficient 🎉 so that's
good!

TODO:
- [x] resolve the scaled mismatch stuff. do we return an `Err` or what
if the downsampling can't be performed?
- [x] update PR description
- [x] add more tests for downsampling, and maybe for gather
- [x] play with this code over in the branchwater plugin too!
sourmash-bio/sourmash_plugin_branchwater#467

---------

Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com>
@ctb
Copy link
Collaborator Author

ctb commented Oct 15, 2024

Ready for review & maybe merge! @bluegenes 🎉

Copy link
Contributor

@bluegenes bluegenes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@ctb
Copy link
Collaborator Author

ctb commented Oct 15, 2024

@bluegenes says "go for it" on slack. On my head be it! 🎲

@ctb ctb merged commit 02fcbc5 into main Oct 15, 2024
1 check passed
@ctb ctb deleted the ctb_misc2 branch October 15, 2024 21:50
ctb added a commit that referenced this pull request Oct 15, 2024
…#471)

* refactor & rename & consolidate

* remove 'lower'

* add cargo doc output for private fn

* add a few comments/docs

* switch to dev version of sourmash

* tracking

* cleaner

* cleanup

* load rocksdb natively

* foo

* cargo fmt

* upd

* upd

* fix fmt

* MRG: create `MultiCollection` for collections that span multiple files (#434)

* preliminary victory

* compiles and mostly runs

* cleanup, split to new module

* cleanup and comment

* more cleanup of diff

* cargo fmt

* fix fmt

* restore n_failed

* comment failing test

* cleanup and de-vec

* create module/submodule structure

* comment for later

* get rid of vec

* beg for help

* cleanup and doc

* clippy fixes

* compiling again

* cleanup

* bump sourmash to v0.15.1

* check if is rocksdb

* weird error

* use remove_unwrap branch of sourmash

* get index to work with MultiCollection

* old bug now fixed

* clippy, format, and fix

* make names clearer

* ditch MultiCollection for index, at least for now

* testy testy

* getting closer

* update sourmash

* mark failing tests

* upd

* cargo fmt

* MRG: test exit from `pairwise` and `multisearch` if no loaded sketches (#437)

* upd

* check for appropriate multisearch error exits

* add more tests for pairwise, too

* cargo fmt

* MRG: switch to more efficient use of `Collection` by removing cloning (#438)

* remove unnecessary clones by switch to references in SmallSignature

* switch away from references for collections => avoid clones

* remove MultiCollection::iter

* MRG: add tests for RocksDB/RevIndex, standalone manifests, and flexible pathlists (#436)

* test using rocksdb as source of sketches

* test file lists of zips

* cargo fmt

* hackity hack hack a picklist

* ok that makes more sense

* it works

* comments around future par_iter

* support loading from a .sig.gz for index

* test pairwise loading from rocksdb

* add test for queries from Rocksdb

* decide not to implement lists of manifests :)

* reenable and fix test_fastgather.py::test_indexed_against

* impl Deref for MultiCollection

* clippy

* switch to using load_sketches method

* deref doesn't actually make sense for MultiCollection

* update to latest sourmash code

* update to latest sourmash code

* simplify

* update to latest sourmash code

* remove unnecessary flag

* MRG: support & test loading of standalone manifests within pathlists (#450)

* use recursion to load paths into a MultiCollection => mf support

* MRG: clean up index to use `MultiCollection` (#451)

* try making index work with standard code

* kinda working

* fmt

* refactor

* clear up the tests

* refactor/clean up

* cargo fmt

* add tests for index warning & error

* comment

* MRG: documentation updates based on new collection loading (#444)

* update docs for #430

* upd documentation

* upd

* Update src/lib.rs

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

* switch unwrap to expect

* move unwrap to expect

* minor cleanup

* cargo fmt

* provide legacy method to avoid xfail on index loading

* switch to using reference

* update docs to reflect pathlist behavior

* test recursive nature of MultiCollection

* re-enable test that is now passing

* update to latest sourmash

* upd sourmash

* update sourmash

* mut MultiCollection

* cleanup

* update after merge of sourmash-bio/sourmash#3305

* fix contains_revindex

* add trace commands for tracing loading

* use released version of sourmash

* add support for ignoring abundance

* cargo fmt

* avoid downsampling until we know there is overlap

* change downsample to true; add panic assertion

* move downsampling side guard

* eliminate redundant overlap check

* move calc_abund_stats

* extract abundance code into own function; avoid downsampling if poss

* cleanup

* fmt

* update to next sourmash release

* cargo fmt

* upd sourmash

* correct numbers

* upd sourmash

* upd sourmash

* upd sourmash

* upd sourmash

* use new try_into() and eliminate several clone()s

* refactor a bit more

* use new try_into() in manysearch; flag clones

* avoid a few more clones

* eliminate more clone

* fix mismatched clauses

* note minhash

* fix mastiff_manygather

* avoid more clone

* resolve comments

* microchange

* microchange 2

* eliminate more clone: fastgather

* avoid more clone: fastmultigather

* refactor to avoid more clones

* rm one more clone

* cleanup

* cargo fmt

* cargo fmt

* deallocate collection?

* deallocate collection?

* upd sourmash

* cargo fmt

* fix merge foo

* try out new sourmash PR

* upd latest sourmash branch

* upd sourmash

---------

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants