Extracting reads based on Taxonomy assignments in Sourmash, similar to Kraken #2824

wyanren · 2023-10-27T03:06:21Z

Hello! Thank you for your dedicated work on sourmash!

I've been closely following sourmash for taxonomy assignments and I'm keen on extracting specific sequences from my metagenomic datasets based on their assigned taxonomies (allowing me to extract reads in batch). I noticed how Kraken provides a mechanism for this using extract_kraken_reads.py (KrakenTools).

Having gone through the discussions in issues #2566 and #1237, I understand there were some relevant points mentioned, but I'm still unclear on a few matters:

Given the taxonomy from sourmash gather and the table with md5 hash numbers, is there a potential for an integrated tool in sourmash that would allow for the extraction of reads based on their taxonomy assignments?

Your insights would be greatly beneficial :)

ctb · 2023-10-28T21:51:38Z

hi @wyanren thanks for asking!

Since sourmash tax is a taxonomic profiler and operates only on a "sketch" of the original data, sourmash itself doesn't have access to the reads. (This is one reason why it can be reasonably fast and low in memory use.) And, for any of our recommended parameter settings, the k-mers from the reads are sampled so sparsely that most reads aren't directly classifiable by sourmash.
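
To put a rough number on the sampling point, here is a back-of-the-envelope sketch. It assumes 150 bp reads, k=31, and scaled=1000 (illustrative values I'm picking here), and it treats each k-mer as being kept independently with probability 1/scaled, which is a simplification of how the sketch is actually built.

```python
def fraction_of_reads_with_sampled_kmer(read_len, ksize, scaled):
    """Estimate the chance that a read contains at least one sketched k-mer.

    Assumption: each k-mer is kept independently with probability 1/scaled.
    """
    # Number of k-mers in a read of this length (zero if the read is too short).
    n_kmers = max(read_len - ksize + 1, 0)
    # Probability that none of them is kept, then take the complement.
    return 1.0 - (1.0 - 1.0 / scaled) ** n_kmers


if __name__ == "__main__":
    # Illustrative values only: 150 bp reads, k=31, scaled=1000.
    print(fraction_of_reads_with_sampled_kmer(150, 31, 1000))
```

Under those assumptions only a bit more than one read in ten carries any sketched k-mer at all, so the large majority of reads have nothing for sourmash to match on.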

Note also that sourmash gather profiles data sets by matches to their genomes, and only then uses the genomes to get to taxonomy. This is important because I don't think of reads themselves having taxonomy - they come from genomes that belong to a particular taxon.

That all having been said, there are two practical ways to go about getting your hands on reads that belong to a particular taxon.

First, you can map your metagenome data set to all of the matching genomes, and then select those genome(s) that belong within your taxon of choice, and then extract those reads. This is annoying to do by hand but genome-grist will automate everything but the last step for you. If you end up trying it out and have feedback I can perhaps help improve genome-grist to do something better or more easily, too.
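
The "select those genome(s)" step can be sketched roughly as below. This is a minimal sketch, not part of sourmash or genome-grist: it assumes you have a gather results CSV that has been annotated with lineages, with a `name` column for the matched genome and a semicolon-separated `lineage` column. Those column names, the `select_genomes` helper, and the example rows are all assumptions for illustration, so check them against your own file.

```python
import csv
import io


def select_genomes(csv_text, taxon, name_col="name", lineage_col="lineage"):
    """Return the names of matched genomes whose lineage includes `taxon`.

    Assumption: `lineage_col` holds a semicolon-separated lineage string.
    """
    selected = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Compare whole rank entries, not substrings, to avoid partial matches.
        ranks = [r.strip() for r in row[lineage_col].split(";")]
        if taxon in ranks:
            selected.append(row[name_col])
    return selected


# Made-up rows, for illustration only.
example = """name,lineage
genomeA,d__Bacteria;p__Proteobacteria;g__Escherichia;s__Escherichia coli
genomeB,d__Bacteria;p__Bacteroidota;g__Bacteroides;s__Bacteroides fragilis
genomeC,d__Bacteria;p__Proteobacteria;g__Salmonella;s__Salmonella enterica
"""

if __name__ == "__main__":
    print(select_genomes(example, "p__Proteobacteria"))
```

The resulting list of genome names is what you would then use to decide which per-genome mapping results to pull reads from.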

Second, you can use spacegraphcats to pull out all the reads that match to the k-mers in taxon-specific genomes. We do this in situations where we think the metagenome doesn't exactly match to the genomes in the reference database, and we're looking at strain variation.
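
To illustrate only the general idea of "reads that match to the k-mers in taxon-specific genomes", here is a toy sketch in plain Python. It is not spacegraphcats' method or API; the function names, the small k, and the sequences are made up for illustration, and a real data set would need something far more memory-efficient.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Return the set of canonical k-mers (min of k-mer and its reverse complement)."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        revcomp = kmer.translate(_COMPLEMENT)[::-1]
        kmers.add(min(kmer, revcomp))
    return kmers


def reads_sharing_kmers(reads, genome_seqs, k=31, min_shared=1):
    """Return names of reads sharing at least `min_shared` k-mers with the genomes.

    `reads` is an iterable of (name, sequence) pairs.
    """
    genome_kmers = set()
    for genome in genome_seqs:
        genome_kmers |= canonical_kmers(genome, k)
    return [
        name
        for name, seq in reads
        if len(canonical_kmers(seq, k) & genome_kmers) >= min_shared
    ]


if __name__ == "__main__":
    # Toy data with a small k so the example is easy to follow.
    genome = "ACGTACGGTCAATGC"
    reads = [("r1", "CGGTCAA"), ("r2", "ATTGACC"), ("r3", "TTTTTTTT")]
    print(reads_sharing_kmers(reads, [genome], k=5))
```

Here `r2` is the reverse complement of part of the toy genome, which is why the k-mers are canonicalized before comparing.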

I would recommend doing the mapping-based approach to start with, tho, since spacegraphcats is not particularly fast or frugal in memory...

I would be interested in hearing more about your end goal, too! What do you want to do with the reads? There may be a faster/easier way to get there that I can recommend to you.
