
multi-scaled queries with gather #928

Closed
luizirber opened this issue Apr 3, 2020 · 3 comments
@luizirber (Member) commented Apr 3, 2020

In #538 (comment), I said:

This would also allow a mash screen-like index, for datasets that are too small for scaled (think: viruses)

Another alternative: what if we allow multi-scaled queries with gather?

Let's say we have one index built with scaled=2000 for bacteria, and another with scaled=200 for viruses. At the moment we downsample the query to a value that both indices support (2000, in this case), or to any other value specified with --scaled (when that downsampling is possible). But if we have a query built with scaled=200, we could downsample it to 2000 to search the bacterial index, and use the original query for the viral index.
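For concreteness, here is a minimal sketch of that per-index downsampling. It models a scaled sketch simply as "the set of 64-bit hashes below 2**64 / scaled"; the `sketch`/`downsample` helpers are illustrative, not sourmash's actual MinHash API:

```python
import random

H = 2**64

def sketch(hashes, scaled):
    """Model a scaled sketch: keep only hashes below 2**64 / scaled."""
    return {h for h in hashes if h < H // scaled}

def downsample(sk, new_scaled):
    """Downsampling just re-filters with a smaller cutoff, so it only
    works toward *larger* scaled values."""
    return {h for h in sk if h < H // new_scaled}

# a metagenome query sketched once, at the smallest scaled value (200):
kmer_hashes = {random.randrange(H) for _ in range(1_000_000)}
query = sketch(kmer_hashes, scaled=200)

query_for_viruses = query                     # scaled=200 index: use as-is
query_for_bacteria = downsample(query, 2000)  # scaled=2000 index: downsample

assert query_for_bacteria <= query_for_viruses
```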

The drawback is that the query signature will have to be built at the smallest scaled value used (and so will be correspondingly large), and summarizing the results will also be more challenging.

@ctb (Contributor) commented Apr 4, 2020

This is certainly do-able (and easy) at the implementation level! A few thoughts --

scaled=200 signatures would be really large for metagenomes. Is it really that much of a saving in space or time at that scaled value? You're going to have an incredibly large signature file in JSON format. Since we're looking at k-mers and not reads, it might still be quite a bit smaller than the original metagenome file - but even at scaled=1000, metagenome signatures are rather unwieldy, and this is 5x bigger!
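Rough numbers on that (the distinct k-mer count below is an assumed round figure for a large metagenome, not a measurement):

```python
# Expected sketch size is ~ (distinct k-mers) / scaled.
n_distinct_kmers = 1_000_000_000  # assumed

for scaled in (2000, 1000, 200):
    n_hashes = n_distinct_kmers // scaled
    raw_mb = n_hashes * 8 / 1e6  # 8 bytes per 64-bit hash, before JSON overhead
    print(f"scaled={scaled:>4}: ~{n_hashes:>9,} hashes, ~{raw_mb:,.0f} MB raw")
```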

My immediate intuitive concern is for reporting/summarizing, as you say. How would we compare a match at a scaled of 200 against a match at a scaled of 1000? But the answer is that these are estimates and it's not clear to me that there's any particular problem with it - you simply report the match.

I think the bigger problem is subtraction, though. If you have a database D1 with a scaled of 1000 and a database D2 with a scaled of 100, then when you subtract a D1 match from the query at scaled=1000, you fail to remove the ~90% of that match's hashes that are present only at scaled=100. This would skew all future gather results. And I think this is why we downsample the signature to the largest scaled of all the databases.
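The skew is easy to see with the same "hashes below 2**64 / scaled" model as above:

```python
import random

H = 2**64
genome = {random.randrange(H) for _ in range(2_000_000)}  # one reference genome

query = {h for h in genome if h < H // 100}      # metagenome query, scaled=100
d1_match = {h for h in genome if h < H // 1000}  # D1 match found at scaled=1000

leftover = query - d1_match
print(f"{len(leftover) / len(query):.0%} of the match's hashes survive")
# -> ~90%: every hash between the scaled=1000 and scaled=100 cutoffs
#    stays in the query and inflates later gather rounds.
```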

The only solution I see right now is to include the full scaled=100 signature in database D1, so that when you find a match at scaled=1000, you can subtract it at scaled=100. From a compute-efficiency perspective this is do-able - you could build the internal indices for both SBT and LCA databases at a higher scaled than the leaf node signatures, while retaining the full leaf node information - but it has big drawbacks in terms of size...
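Here is a sketch of that layout, with illustrative names rather than sourmash's actual index structures: search runs against a downsampled view of each leaf, but subtraction uses the leaf's full-resolution hashes.

```python
import random

H = 2**64
genome = {random.randrange(H) for _ in range(2_000_000)}

# D1 stores each leaf at full scaled=100 resolution ...
leaf_full = {h for h in genome if h < H // 100}
# ... but the internal SBT/LCA index only sees a scaled=1000 view of it.
leaf_indexed = {h for h in leaf_full if h < H // 1000}

query = {h for h in genome if h < H // 100}  # scaled=100 metagenome query

# search against the coarse internal index ...
if leaf_indexed & {h for h in query if h < H // 1000}:
    # ... then subtract the *full* leaf, at scaled=100:
    query -= leaf_full

print(len(query))  # 0: nothing left to skew the next gather round
```

The size cost is the drawback noted above: the leaves themselves are stored at scaled=100, so the database grows roughly 10x even though the internal index stays small.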

@ctb (Contributor) commented Apr 4, 2020

#407 is relevant.

@ctb (Contributor) commented May 3, 2020

I'm going to close this as impractical, for now; @luizirber, feel free to reopen if you want to discuss more. I've linked it back to #407, which covers the general topic.

ctb closed this as completed May 3, 2020