
multi-scaled queries with gather #928

Closed
luizirber opened this issue Apr 3, 2020 · 3 comments
@luizirber (Member) commented Apr 3, 2020

In #538 (comment), I said:

This would also allow a mash screen-like index, for datasets that are too small for scaled (think: viruses)

Another alternative: what if we allow multi-scaled queries with gather?

Let's say we have one index built with scaled=2000 for bacteria, and another with scaled=200 for viruses. At the moment we downsample the query to a value that both indices support (2000, in this case), or to any other value specified with --scaled (when that downsampling is possible). But if we have a query built with scaled=200, we could downsample it to 2000 to search the bacterial index, and use the original query for the viral index.
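For concreteness, here is a minimal sketch of that per-index downsampling. It models a scaled sketch simply as "the set of 64-bit hashes below 2**64 / scaled"; the `sketch`/`downsample` helpers are illustrative, not sourmash's actual MinHash API:

```python
import random

H = 2**64

def sketch(hashes, scaled):
    """Model a scaled sketch: keep only hashes below 2**64 / scaled."""
    return {h for h in hashes if h < H // scaled}

def downsample(sk, new_scaled):
    """Downsampling just re-filters with a smaller cutoff, so it only
    works toward *larger* scaled values."""
    return {h for h in sk if h < H // new_scaled}

# a metagenome query sketched once, at the smallest scaled value (200):
kmer_hashes = {random.randrange(H) for _ in range(1_000_000)}
query = sketch(kmer_hashes, scaled=200)

query_for_viruses = query                     # scaled=200 index: use as-is
query_for_bacteria = downsample(query, 2000)  # scaled=2000 index: downsample

assert query_for_bacteria <= query_for_viruses
```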

The drawback is that the query signature will have to be built at the smallest scaled value used (and so will be correspondingly large), and summarizing the results will also be more challenging.

@ctb (Contributor) commented Apr 4, 2020

This is certainly do-able (and easy) at the implementation level! A few thoughts --

scaled=200 signatures would be really large for metagenomes. Is it really that much of a saving in space or time at that scaled value? You're going to have an incredibly large signature file in JSON format. Since we're looking at k-mers and not reads, it might still be quite a bit smaller than the original metagenome file - but even at scaled=1000, metagenome signatures are rather unwieldy, and this is 5x bigger!
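Rough numbers on that (the distinct k-mer count below is an assumed round figure for a large metagenome, not a measurement):

```python
# Expected sketch size is ~ (distinct k-mers) / scaled.
n_distinct_kmers = 1_000_000_000  # assumed

for scaled in (2000, 1000, 200):
    n_hashes = n_distinct_kmers // scaled
    raw_mb = n_hashes * 8 / 1e6  # 8 bytes per 64-bit hash, before JSON overhead
    print(f"scaled={scaled:>4}: ~{n_hashes:>9,} hashes, ~{raw_mb:,.0f} MB raw")
```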

My immediate intuitive concern is for reporting/summarizing, as you say. How would we compare a match at a scaled of 200 against a match at a scaled of 1000? But the answer is that these are estimates and it's not clear to me that there's any particular problem with it - you simply report the match.

I think the bigger problem is subtraction, though. If you have a database D1 with a scaled of 1000 and a database D2 with a scaled of 100, then when you subtract a D1 match from the query at scaled=1000, you fail to remove the ~90% of that match's hashes that are present only at scaled=100. This would skew all future gather results. And I think this is why we downsample the signature to the largest scaled of all the databases.
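The skew is easy to see with the same "hashes below 2**64 / scaled" model as above:

```python
import random

H = 2**64
genome = {random.randrange(H) for _ in range(2_000_000)}  # one reference genome

query = {h for h in genome if h < H // 100}      # metagenome query, scaled=100
d1_match = {h for h in genome if h < H // 1000}  # D1 match found at scaled=1000

leftover = query - d1_match
print(f"{len(leftover) / len(query):.0%} of the match's hashes survive")
# -> ~90%: every hash between the scaled=1000 and scaled=100 cutoffs
#    stays in the query and inflates later gather rounds.
```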

The only solution I see right now is to include the full scaled=100 signature in database D1, so that when you find a match at scaled=1000, you can subtract it at scaled=100. From a compute-efficiency perspective this is do-able - you could build the internal indices for both SBT and LCA databases at a higher scaled than the leaf node signatures, while retaining the full leaf node information - but it has big drawbacks in terms of size...
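Here is a sketch of that layout, with illustrative names rather than sourmash's actual index structures: search runs against a downsampled view of each leaf, but subtraction uses the leaf's full-resolution hashes.

```python
import random

H = 2**64
genome = {random.randrange(H) for _ in range(2_000_000)}

# D1 stores each leaf at full scaled=100 resolution ...
leaf_full = {h for h in genome if h < H // 100}
# ... but the internal SBT/LCA index only sees a scaled=1000 view of it.
leaf_indexed = {h for h in leaf_full if h < H // 1000}

query = {h for h in genome if h < H // 100}  # scaled=100 metagenome query

# search against the coarse internal index ...
if leaf_indexed & {h for h in query if h < H // 1000}:
    # ... then subtract the *full* leaf, at scaled=100:
    query -= leaf_full

print(len(query))  # 0: nothing left to skew the next gather round
```

The size cost is the drawback noted above: the leaves themselves are stored at scaled=100, so the database grows roughly 10x even though the internal index stays small.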

@ctb (Contributor) commented Apr 4, 2020

#407 is relevant.

@ctb (Contributor) commented May 3, 2020

I'm going to close this as impractical, for now; @luizirber, feel free to reopen if you want to discuss more. I've linked it back to #407, which covers the general topic.

ctb closed this as completed May 3, 2020