multi-scaled queries with gather #928
Comments
This is certainly doable (and easy) at the implementation level! A few thoughts:

scaled=200 signatures would be really large for metagenomes. Is it really that much of a saving in space or time at that scaled value? You're going to have an incredibly large signature file in JSON format. Since we're looking at k-mers and not reads, it might still be quite a bit smaller than the original metagenome file, but even at scaled=1000 metagenome signatures are rather unwieldy, and this is 5x bigger!

My immediate intuitive concern is for reporting/summarizing, as you say. How would we compare a match at a scaled of 200 against a match at a scaled of 1000? But the answer is that these are estimates, and it's not clear to me that there's any particular problem with it; you simply report the match.

I think the bigger problem is subtraction, though. If you have a database D1 with a scaled of 1000 and a database D2 with a scaled of 100, then when you subtract a D1 match from the query at a scaled of 1000, you fail to remove the 90% of that match's hashes that would be present at a scaled of 100. This would skew all future gather results. And I think this is why we downsample the signature to the largest scaled of all the databases.

The only solution I see right now is to include the full scaled=100 signature in database D1, so that when you find a match at scaled=1000, you can subtract it at scaled=100. From a compute-efficiency perspective this is doable (you could build the internal indices for both SBT and LCA databases at a higher scaled than the leaf node signatures, while retaining the full leaf node information), but it has big drawbacks in terms of size...
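To make the subtraction concern concrete, here is a minimal sketch in plain Python. It does not use the sourmash API; it assumes a scaled sketch simply keeps every hash below `2**64 // scaled`, and the genome, the helper `downsample`, and all variable names are hypothetical. It subtracts a scaled=1000 match from a scaled=100 query, then repeats the subtraction with the full scaled=100 leaf signature.

```python
import random

MAX_HASH_SPACE = 2**64  # assumed 64-bit hash space


def downsample(hashes, scaled):
    """Keep only the hashes a sketch at this scaled value would retain."""
    limit = MAX_HASH_SPACE // scaled
    return {h for h in hashes if h < limit}


rng = random.Random(42)
# Hypothetical genome: 200,000 k-mer hashes spread uniformly over hash space.
genome_hashes = {rng.randrange(MAX_HASH_SPACE) for _ in range(200_000)}

query_100 = downsample(genome_hashes, 100)     # query kept at scaled=100
match_1000 = downsample(genome_hashes, 1000)   # D1 match stored at scaled=1000
match_100 = downsample(genome_hashes, 100)     # same match with full leaf info

# Subtracting the scaled=1000 match leaves most of the genome in the query.
remaining = query_100 - match_1000
fraction_left = len(remaining) / len(query_100)
print(f"left after scaled=1000 subtraction: {fraction_left:.2f}")

# Subtracting the full scaled=100 leaf signature removes the genome entirely.
remaining_full = query_100 - match_100
print(f"left after scaled=100 subtraction: {len(remaining_full)}")
```

Under these assumptions roughly nine tenths of the matched genome's hashes survive the coarse subtraction, and those leftovers are what would skew later gather rounds.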
#407 is relevant.
I'm going to close this as impractical, for now; @luizirber feel free to reopen if you want to discuss more. Linked it back to #407 more clearly for the general topic.
In #538 (comment), I said:
Another alternative: what if we allow multi-scaled queries with gather?
Let's say we have one index built with scaled=2000 for bacteria and another built with scaled=200 for viruses. At the moment we downsample the query to something both indices support (`2000`, in this case) or any other value specified with `--scaled` (if the downsampling is possible). But if we have a query built with scaled=200, we could downsample it to `2000` to search the bacterial index, and use the original query for the viral index.

The drawback is that the query signature will have to be as large as one built at the smallest `scaled` value used, and summarizing the results will also be more challenging.
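The per-index downsampling proposed here could look roughly like the sketch below. It is plain Python, not sourmash code: it assumes a scaled sketch keeps every hash below `2**64 // scaled`, and the helpers `downsample` and `per_index_queries`, the index names, and the random query are all hypothetical.

```python
import random

MAX_HASH_SPACE = 2**64  # assumed 64-bit hash space


def downsample(hashes, from_scaled, to_scaled):
    """Keep only the hashes that a sketch at `to_scaled` would retain."""
    if to_scaled < from_scaled:
        # A sketch cannot be "upsampled": the discarded hashes are gone.
        raise ValueError(
            f"cannot go from scaled={from_scaled} to scaled={to_scaled}"
        )
    limit = MAX_HASH_SPACE // to_scaled
    return {h for h in hashes if h < limit}


def per_index_queries(query_hashes, query_scaled, index_scaled):
    """Return one view of the query per index, at that index's scaled value."""
    return {
        name: downsample(query_hashes, query_scaled, scaled)
        for name, scaled in index_scaled.items()
    }


rng = random.Random(0)
query_scaled = 200
# Hypothetical query sketch: random hashes below the scaled=200 cutoff.
query = {rng.randrange(MAX_HASH_SPACE // query_scaled) for _ in range(5_000)}

views = per_index_queries(
    query, query_scaled, {"bacteria": 2000, "viruses": 200}
)
for name, view in views.items():
    print(name, len(view))
```

The viral index sees the query unchanged, while the bacterial index sees about a tenth of it. The `ValueError` branch reflects the drawback noted above: the query has to be built at the smallest scaled value of any index it will search.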