Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: support & test loading of standalone manifests within pathlists #450

Merged
merged 3 commits into from
Sep 9, 2024

Conversation

ctb
Copy link
Collaborator

@ctb ctb commented Sep 8, 2024

NOTE: PR into #430

Includes #451

This PR adjusts MultiCollection::load_set_of_paths to return a MultiCollection itself, so that we can do recursive loading of multi-collections from within multi-collections, e.g. manifests from within pathlists (and maybe vice versa?). It also includes modifications from #450 that let index use anything loadable from MultiCollection with internal storage, by loading it into memory and then saving it internally.

Also adds a few tests.

The code ends up being ~simpler too, which is nice.

ctb added 2 commits September 8, 2024 17:27
* try making index work with standard code

* kinda working

* fmt

* refactor

* clear up the tests

* refactor/clean up

* cargo fmt

* add tests for index warning & error
@ctb ctb changed the title WIP: support & test loading of standalone manifests within pathlists MRG: support & test loading of standalone manifests within pathlists Sep 9, 2024
@ctb ctb merged commit d120547 into ctb_misc2 Sep 9, 2024
1 check passed
@ctb ctb deleted the multi_tests branch September 9, 2024 06:11
ctb added a commit that referenced this pull request Oct 15, 2024
…r database (#430)

* refactor & rename & consolidate

* remove 'lower'

* add cargo doc output for private fn

* add a few comments/docs

* switch to dev version of sourmash

* tracking

* cleaner

* cleanup

* load rocksdb natively

* foo

* cargo fmt

* upd

* upd

* fix fmt

* MRG: create `MultiCollection` for collections that span multiple files (#434)

* preliminary victory

* compiles and mostly runs

* cleanup, split to new module

* cleanup and comment

* more cleanup of diff

* cargo fmt

* fix fmt

* restore n_failed

* comment failing test

* cleanup and de-vec

* create module/submodule structure

* comment for later

* get rid of vec

* beg for help

* cleanup and doc

* clippy fixes

* compiling again

* cleanup

* bump sourmash to v0.15.1

* check if is rocksdb

* weird error

* use remove_unwrap branch of sourmash

* get index to work with MultiCollection

* old bug now fixed

* clippy, format, and fix

* make names clearer

* ditch MultiCollection for index, at least for now

* testy testy

* getting closer

* update sourmash

* mark failing tests

* upd

* cargo fmt

* MRG: test exit from `pairwise` and `multisearch` if no loaded sketches (#437)

* upd

* check for appropriate multisearch error exits

* add more tests for pairwise, too

* cargo fmt

* MRG: switch to more efficient use of `Collection` by removing cloning (#438)

* remove unnecessary clones by switch to references in SmallSignature

* switch away from references for collections => avoid clones

* remove MultiCollection::iter

* MRG: add tests for RocksDB/RevIndex, standalone manifests, and flexible pathlists (#436)

* test using rocksdb as source of sketches

* test file lists of zips

* cargo fmt

* hackity hack hack a picklist

* ok that makes more sense

* it works

* comments around future par_iter

* support loading from a .sig.gz for index

* test pairwise loading from rocksdb

* add test for queries from Rocksdb

* decide not to implement lists of manifests :)

* reenable and fix test_fastgather.py::test_indexed_against

* impl Deref for MultiCollection

* clippy

* switch to using load_sketches method

* deref doesn't actually make sense for MultiCollection

* update to latest sourmash code

* update to latest sourmash code

* simplify

* update to latest sourmash code

* remove unnecessary flag

* MRG: support & test loading of standalone manifests within pathlists (#450)

* use recursion to load paths into a MultiCollection => mf support

* MRG: clean up index to use `MultiCollection` (#451)

* try making index work with standard code

* kinda working

* fmt

* refactor

* clear up the tests

* refactor/clean up

* cargo fmt

* add tests for index warning & error

* comment

* MRG: documentation updates based on new collection loading (#444)

* update docs for #430

* upd documentation

* upd

* Update src/lib.rs

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

* switch unwrap to expect

* move unwrap to expect

* minor cleanup

* cargo fmt

* provide legacy method to avoid xfail on index loading

* switch to using reference

* update docs to reflect pathlist behavior

* test recursive nature of MultiCollection

* re-enable test that is now passing

* update to latest sourmash

* upd sourmash

* update sourmash

* mut MultiCollection

* cleanup

* update after merge of sourmash-bio/sourmash#3305

* fix contains_revindex

* add trace commands for tracing loading

* use released version of sourmash

* add support for ignoring abundance

* cargo fmt

* avoid downsampling until we know there is overlap

* change downsample to true; add panic assertion

* move downsampling side guard

* eliminate redundant overlap check

* move calc_abund_stats

* extract abundance code into own function; avoid downsampling if poss

* cleanup

* fmt

* update to next sourmash release

* cargo fmt

* upd sourmash

* correct numbers

* upd sourmash

* upd sourmash

* upd sourmash

* upd sourmash

* use new try_into() and eliminate several clone()s

* refactor a bit more

* deallocate collection?

* upd sourmash

* cargo fmt

* fix merge foo

---------

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>
ctb added a commit that referenced this pull request Oct 15, 2024
…#471)

* refactor & rename & consolidate

* remove 'lower'

* add cargo doc output for private fn

* add a few comments/docs

* switch to dev version of sourmash

* tracking

* cleaner

* cleanup

* load rocksdb natively

* foo

* cargo fmt

* upd

* upd

* fix fmt

* MRG: create `MultiCollection` for collections that span multiple files (#434)

* preliminary victory

* compiles and mostly runs

* cleanup, split to new module

* cleanup and comment

* more cleanup of diff

* cargo fmt

* fix fmt

* restore n_failed

* comment failing test

* cleanup and de-vec

* create module/submodule structure

* comment for later

* get rid of vec

* beg for help

* cleanup and doc

* clippy fixes

* compiling again

* cleanup

* bump sourmash to v0.15.1

* check if is rocksdb

* weird error

* use remove_unwrap branch of sourmash

* get index to work with MultiCollection

* old bug now fixed

* clippy, format, and fix

* make names clearer

* ditch MultiCollection for index, at least for now

* testy testy

* getting closer

* update sourmash

* mark failing tests

* upd

* cargo fmt

* MRG: test exit from `pairwise` and `multisearch` if no loaded sketches (#437)

* upd

* check for appropriate multisearch error exits

* add more tests for pairwise, too

* cargo fmt

* MRG: switch to more efficient use of `Collection` by removing cloning (#438)

* remove unnecessary clones by switch to references in SmallSignature

* switch away from references for collections => avoid clones

* remove MultiCollection::iter

* MRG: add tests for RocksDB/RevIndex, standalone manifests, and flexible pathlists (#436)

* test using rocksdb as source of sketches

* test file lists of zips

* cargo fmt

* hackity hack hack a picklist

* ok that makes more sense

* it works

* comments around future par_iter

* support loading from a .sig.gz for index

* test pairwise loading from rocksdb

* add test for queries from Rocksdb

* decide not to implement lists of manifests :)

* reenable and fix test_fastgather.py::test_indexed_against

* impl Deref for MultiCollection

* clippy

* switch to using load_sketches method

* deref doesn't actually make sense for MultiCollection

* update to latest sourmash code

* update to latest sourmash code

* simplify

* update to latest sourmash code

* remove unnecessary flag

* MRG: support & test loading of standalone manifests within pathlists (#450)

* use recursion to load paths into a MultiCollection => mf support

* MRG: clean up index to use `MultiCollection` (#451)

* try making index work with standard code

* kinda working

* fmt

* refactor

* clear up the tests

* refactor/clean up

* cargo fmt

* add tests for index warning & error

* comment

* MRG: documentation updates based on new collection loading (#444)

* update docs for #430

* upd documentation

* upd

* Update src/lib.rs

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

* switch unwrap to expect

* move unwrap to expect

* minor cleanup

* cargo fmt

* provide legacy method to avoid xfail on index loading

* switch to using reference

* update docs to reflect pathlist behavior

* test recursive nature of MultiCollection

* re-enable test that is now passing

* update to latest sourmash

* upd sourmash

* update sourmash

* mut MultiCollection

* cleanup

* update after merge of sourmash-bio/sourmash#3305

* fix contains_revindex

* add trace commands for tracing loading

* use released version of sourmash

* add support for ignoring abundance

* cargo fmt

* avoid downsampling until we know there is overlap

* change downsample to true; add panic assertion

* move downsampling side guard

* eliminate redundant overlap check

* move calc_abund_stats

* extract abundance code into own function; avoid downsampling if poss

* cleanup

* fmt

* update to next sourmash release

* cargo fmt

* upd sourmash

* correct numbers

* upd sourmash

* upd sourmash

* upd sourmash

* upd sourmash

* use new try_into() and eliminate several clone()s

* refactor a bit more

* use new try_into() in manysearch; flag clones

* avoid a few more clones

* eliminate more clone

* fix mismatched clauses

* note minhash

* fix mastiff_manygather

* avoid more clone

* resolve comments

* microchange

* microchange 2

* eliminate more clone: fastgather

* avoid more clone: fastmultigather

* refactor to avoid more clones

* rm one more clone

* cleanup

* cargo fmt

* cargo fmt

* deallocate collection?

* deallocate collection?

* upd sourmash

* cargo fmt

* fix merge foo

* try out new sourmash PR

* upd latest sourmash branch

* upd sourmash

---------

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant