Extracting reads based on Taxonomy assignments in Sourmash, similar to Kraken #2824

Open
wyanren opened this issue Oct 27, 2023 · 1 comment

Comments


wyanren commented Oct 27, 2023

Hello! Thank you for your dedicated work on sourmash!

I've been closely following sourmash for taxonomy assignments, and I'm keen on extracting specific sequences from my metagenomic datasets based on their assigned taxonomies, ideally in batch. I noticed that Kraken provides a mechanism for this via extract_kraken_reads.py (KrakenTools).

Having gone through the discussions in issues #2566 and #1237, I understand there were some relevant points mentioned, but I'm still unclear on a few matters:

Given the taxonomy from sourmash gather and the table with md5 hash numbers, is there a potential for an integrated tool in sourmash that would allow for the extraction of reads based on their taxonomy assignments?

Your insights would be greatly beneficial :)

Contributor

ctb commented Oct 28, 2023

hi @wyanren thanks for asking!

Since sourmash tax is a taxonomic profiler and operates only on a "sketch" of the original data, sourmash itself doesn't have access to the reads. (This is one reason why it can be reasonably fast and low-memory.) And, for any of our recommended parameter settings, the k-mers from the reads are sampled so sparsely that most reads aren't directly classifiable by sourmash.
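To put a rough number on the sampling-density point, here is a back-of-the-envelope sketch. The values scaled=1000, k=31 and 150 bp reads are illustrative assumptions, not settings stated in this thread, and the calculation treats each k-mer as independently retained with probability 1/scaled, which is a simplification.

```python
def fraction_of_reads_with_sampled_kmer(read_len: int, ksize: int, scaled: int) -> float:
    """Approximate chance that a read contains at least one k-mer kept in the sketch.

    Assumes each k-mer is kept independently with probability 1/scaled.
    """
    n_kmers = max(read_len - ksize + 1, 0)  # number of k-mers in one read
    return 1.0 - (1.0 - 1.0 / scaled) ** n_kmers


# Illustrative (assumed) parameters: 150 bp reads, k=31, scaled=1000.
p = fraction_of_reads_with_sampled_kmer(read_len=150, ksize=31, scaled=1000)
print(f"{p:.1%} of reads are expected to contain at least one sketched k-mer")
```

Under these assumptions only about one read in nine contains any sketched k-mer, so the large majority of reads carry no information the sketch could use to classify them.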

Note also that sourmash gather profiles data sets by matches to their genomes, and only then uses the genomes to get to taxonomy. This is important because I don't think of reads themselves as having taxonomy - they come from genomes that belong to a particular taxon.

That all having been said, there are two practical ways to go about getting your hands on reads that belong to a particular taxon.

First, you can map your metagenome data set to all of the matching genomes, and then select those genome(s) that belong within your taxon of choice, and then extract those reads. This is annoying to do by hand but genome-grist will automate everything but the last step for you. If you end up trying it out and have feedback I can perhaps help improve genome-grist to do something better or more easily, too.
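The last step of the mapping-based approach (pick the genomes in a taxon, then pull the reads that mapped to them) could be sketched roughly as follows. This is a minimal illustration, not genome-grist's implementation: it assumes a lineage table in CSV form with an `ident` column plus one column per rank, and a SAM-format alignment of the reads against the matching genomes. The `contig_to_genome` mapping and both function names are hypothetical helpers introduced for this example.

```python
import csv
import io


def genomes_in_taxon(lineage_csv, rank, name):
    """Return identifiers of genomes whose lineage has `name` at `rank`.

    Assumes a CSV with an 'ident' column and one column per taxonomic rank.
    """
    return {row["ident"] for row in csv.DictReader(lineage_csv) if row.get(rank) == name}


def reads_mapped_to(sam_lines, contig_to_genome, wanted_genomes):
    """Yield names of mapped reads whose reference contig belongs to a wanted genome."""
    seen = set()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        qname, flag, rname = fields[0], int(fields[1]), fields[2]
        if flag & 0x4:  # SAM flag bit 0x4: read is unmapped
            continue
        if contig_to_genome.get(rname) in wanted_genomes and qname not in seen:
            seen.add(qname)
            yield qname


# Tiny made-up inputs, purely for illustration.
lineages = io.StringIO(
    "ident,superkingdom,phylum,class,order,family,genus,species\n"
    "GCF_1,d__Bacteria,p__A,c__A,o__A,f__A,g__Escherichia,s__Escherichia coli\n"
    "GCF_2,d__Bacteria,p__B,c__B,o__B,f__B,g__Bacteroides,s__Bacteroides fragilis\n"
)
sam = [
    "@HD\tVN:1.6",
    "read1\t0\tcontigA\t1\t60\t150M\t*\t0\t0\t*\t*",
    "read2\t0\tcontigB\t1\t60\t150M\t*\t0\t0\t*\t*",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
contig_to_genome = {"contigA": "GCF_1", "contigB": "GCF_2"}  # hypothetical mapping

wanted = genomes_in_taxon(lineages, "genus", "g__Escherichia")
hits = list(reads_mapped_to(sam, contig_to_genome, wanted))
print(hits)
```

The resulting list of read names can then be handed to whatever FASTQ-subsetting tool you prefer to pull the actual sequences out of the original files.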

Second, you can use spacegraphcats to pull out all the reads that match to the k-mers in taxon-specific genomes. We do this in situations where we think the metagenome doesn't exactly match to the genomes in the reference database, and we're looking at strain variation.
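As a much simpler stand-in for the k-mer-based idea, the sketch below keeps every read that shares at least one canonical k-mer with a set of taxon-specific genome sequences. To be clear, this is plain exact k-mer matching and not what spacegraphcats does; spacegraphcats works on graph neighborhoods and can recover reads that this naive approach would miss. The function names, the tiny k-mer size, and the example sequences are all made up for illustration.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, ksize):
    """Return the set of canonical k-mers (min of k-mer and its reverse complement)."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - ksize + 1):
        kmer = seq[i:i + ksize]
        revcomp = kmer.translate(_COMPLEMENT)[::-1]
        kmers.add(min(kmer, revcomp))
    return kmers


def reads_sharing_kmers(reads, genome_seqs, ksize=31, min_shared=1):
    """Return names of reads sharing at least `min_shared` k-mers with the genomes."""
    bait = set()
    for genome in genome_seqs:
        bait |= canonical_kmers(genome, ksize)
    return [
        name
        for name, seq in reads
        if len(canonical_kmers(seq, ksize) & bait) >= min_shared
    ]


# Toy data with a tiny k so the example fits on screen.
genome = "ACGTTGCAAGGCTTACGATCGGATCCA"
reads = [
    ("r1", "TTGCAAGG"),    # forward-strand substring of the genome
    ("r2", "CGTAAGCC"),    # reverse complement of GGCTTACG
    ("r3", "TTTTTTTTTT"),  # unrelated sequence
]
matched = reads_sharing_kmers(reads, [genome], ksize=5)
print(matched)
```

Holding every genome k-mer in a Python set is only workable for small inputs, which is part of why a dedicated tool is used in practice.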

I would recommend doing the mapping-based approach to start with, tho, since spacegraphcats is not particularly fast or frugal in memory...

I would be interested in hearing more about your end goal, too! What do you want to do with the reads? There may be a faster/easier way to get there that I can recommend to you.
