-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] expand signature selection and compatibility checking in database loading code #1420
[MRG] expand signature selection and compatibility checking in database loading code #1420
Conversation
This comment has been minimized.
This comment has been minimized.
Ready for review and merge @luizirber @bluegenes! |
return False | ||
|
||
# 'scaled' and 'num' are incompatible | ||
if scaled: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reasoning for not checking the value of scaled
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooh, great question! I have two answers :). One is pragmatic, and one is conceptual.
The pragmatic answer is, we didn't need it in order for all the tests to pass 😆
The conceptual answer is, this selects signatures that could be converted, but doesn't actually do the conversion. This is a tension that I haven't figured out how to resolve - should selectors return appropriately downsampled signatures, or should they just select signatures that could be downsampled?
I think the right thing to do is to punt this to a new issue for discussion, since I don't think we have any situations in the codebase that depend on answering it right now (see first point, "all tests pass").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I will take care of punting to the new issue, but would appreciate pushback and/or your thoughts!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The conceptual answer is, this selects signatures that could be converted, but doesn't actually do the conversion. This is a tension that I haven't figured out how to resolve - should selectors return appropriately downsampled signatures, or should they just select signatures that could be downsampled?
I like the approach of selecting sigs that could be converted! ...but not all scaled sigs can be downsampled (desired scaled < sig scaled), right?
I think the right thing to do is to punt this to a new issue for discussion, since I don't think we have any situations in the codebase that depend on answering it right now (see first point, "all tests pass").
😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The conceptual answer is, this selects signatures that could be converted, but doesn't actually do the conversion. This is a tension that I haven't figured out how to resolve - should selectors return appropriately downsampled signatures, or should they just select signatures that could be downsampled?
I like the approach of selecting sigs that could be converted! ...but not all scaled sigs can be downsampled (desired scaled < sig scaled), right?
yep. I think this is a great idea for a targeted test, I'll see what I can do!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I prefer returning without explicit downsampling, and maybe make another function that uses select_signatures
to do the actual downsampling.
(depending on the application that can be a lot of downsampling without really needing it...)
# check ksize. | ||
if ksize is not None and db_mh.ksize != ksize: | ||
raise ValueError(f"search ksize {ksize} is different from database ksize {db_mh.ksize}") | ||
|
||
# check moltype. | ||
if moltype is not None and db_mh.moltype != moltype: | ||
raise ValueError(f"search moltype {moltype} is different from database moltype {db_mh.moltype}") | ||
|
||
# containment requires 'scaled'. | ||
if containment: | ||
if not scaled: | ||
raise ValueError("'containment' requires 'scaled' in SBT.select'") | ||
if not db_mh.scaled: | ||
raise ValueError("cannot search this SBT for containment; signatures are not calculated with scaled") | ||
|
||
# 'num' and 'scaled' do not mix. | ||
if num: | ||
if not db_mh.num: | ||
raise ValueError(f"this database was created with 'scaled' MinHash sketches, not 'num'") | ||
if num != db_mh.num: | ||
raise ValueError(f"num mismatch for SBT: num={num}, {db_mh.num}") | ||
|
||
if scaled: | ||
if not db_mh.scaled: | ||
raise ValueError(f"this database was created with 'num' MinHash sketches, not 'scaled'") | ||
|
||
# we can downsample SBTs for containment operations. | ||
if scaled > db_mh.scaled and not containment: | ||
raise ValueError(f"search scaled value {scaled} is less than database scaled value of {db_mh.scaled}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks a lot like select_signature
, are you duplicating to have better error messages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
precisely - better error messages! ( it's not actually identical, I don't think, and I shied away from forcing it into the same code - if, after a lot of testing, it turns out to be identical, we can refactor.)
CI thoughts: only 4 checks were executed, because this is a PR into a branch that is not |
whoops, didn't catch that this was being reviewed before #1406 :). Not sure it matters one way or another? |
(in the sense that presumably this code would be checked thoroughly before merge into latest) |
…1406) * add an IndexOfIndexes class * rename to MultiIndex * switch to using MultiIndex for loading from a directory * some more MultiIndex tests * add test of MultiIndex.signatures * add docstring for MultiIndex * stop special-casing SIGLISTs * fix test to match more informative error message * switch to using LinearIndex.load for stdin, too * add __len__ to MultiIndex * add check_csv to check for appropriate filename loading info * add comment * fix databases load * more tests needed * add tests for incompatible signatures * add filter to LinearIndex and MultiIndex * clean up sourmash_args some more * shift loading over to Index classes * refactor, fix tests * switch to a list of loader functions * comments, docstrings, and tests passing * update to use f strings throughout sourmash_args.py * add docstrings * update comments * remove unnecessary changes * revert to original test * remove unneeded comment * clean up a bit * debugging update * better exception raising and capture for signature parsing * more specific error message * revert change in favor of creating new issue * add commentary => TODO * add tests for MultiIndex.load_from_directory; fix traverse code * switch lca summarize over to usig MultiIndex * switch to using MultiIndex in categorize * remove LoadSingleSignatures * test errors in lca database loading * remove unneeded categorize code * add testme info * verified that this was tested * remove testme comments * add tests for MultiIndex.load_from_file_list * Expand signature selection and compatibility checking in database loading code (#1420) * refactor select, add scaled/num/abund * fix scaled check for LCA database * add debug_literal * fix scaled check for SBT * fix LCA database ksize message & test * add 'containment' to 'select' * added 'is_database' flag for nicer UX * remove overly broad exception catching * document downsampling foo * fix file_list -> pathlist * fix typo
This is a PR into #1406.
Building off of the refactored database loading in #1406, this PR introduces expanded signature selection and compatibility checking for
Index
classes.This results in much simpler code with much less special-casing. 🎉
Fixes #1376
Addresses #1072
Some searches may now work where they didn't before - for example,
test_search_traverse_incompatible
will now succeed, because it now selects the compatible signatures and ignores the incompatible ones.TODO:
select
interface for keyIndex
subclasses:LinearIndex
,MultiIndex
,SBT
,LCA_Database
- punted to write more comprehensiveIndex.select(...)
tests #1427Index.select(...)
tests #1427LCA_Database.select(...)
logicselect(...)
ref write up downsampling details #407Checklist
make test
Did it pass the tests?make coverage
Is the new code covered?without a major version increment. Changing file formats also requires a
major version number increment.
changes were made?