the best way to de-replicate denovo genome set #3531
Comments
Hi @yuzie0314 this should all work, to my mind. The default parameters should be fine, I think - k=31 in particular - and we have quite a few results suggesting that sourmash ANI estimates are reasonably accurate (i.e. compare well with ANIm/ANIb). Where are you seeing problems, or mismatches with what you expect? The only potential issue that I see is with clustering at the family level. There is relatively little k-mer overlap between genomes at the family level. So the clustering there might not work well. However, you should be able to use clustering on the whole set and get good ~genus-level clusters without any prior taxonomic information.
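To illustrate why k-mer overlap drops off so quickly for distantly related genomes, here is a rough sketch. It assumes a simplified model in which every base mutates independently, so a k-mer survives intact with probability ANI^k. The function names are hypothetical and this is not sourmash's actual estimator, which may differ in detail.

```python
def expected_kmer_containment(ani: float, k: int) -> float:
    """Expected fraction of k-mers left unchanged at a given ANI.

    Assumption: each base mutates independently, so a k-mer is
    unmutated with probability ani ** k.
    """
    if not 0.0 <= ani <= 1.0:
        raise ValueError("ani must be between 0 and 1")
    return ani ** k


def ani_from_containment(containment: float, k: int) -> float:
    """Invert the same simple model: ANI is roughly containment ** (1 / k)."""
    if not 0.0 <= containment <= 1.0:
        raise ValueError("containment must be between 0 and 1")
    return containment ** (1.0 / k)


if __name__ == "__main__":
    # Compare how much k-mer overlap is expected at several ANI levels.
    for ani in (0.99, 0.98, 0.95, 0.90, 0.80):
        row = ", ".join(
            f"k={k}: {expected_kmer_containment(ani, k):.4f}"
            for k in (21, 31, 51)
        )
        print(f"ANI {ani:.2f} -> {row}")
```

Under this model the expected overlap shrinks geometrically with k, which is consistent with larger k being more specific at close range while leaving very little signal between distant genomes.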
Hi @ctb, thanks for your quick reply! We wanted to group genomes by their known taxonomic labels before generating the pairwise ANI matrix in order to reduce the resources needed to run branchwater and the greedy clustering (a custom script). So, following your suggestion, maybe we can try running on the full set of genomes instead of on subsets?
I would expect it to work fine with ~60,000 genomes or so; we've been able to run But, I wanted to come back to this:
I'm curious - what didn't work as expected? I'm happy to help but would need some specifics...
pairwise benchmarks: sourmash-bio/sourmash_plugin_branchwater#247 (comment) 2.5 hrs, 4.5 GB of RAM, 16 threads for
Dear authors,
Our team wants to use sourmash to down-select a genome set such as UHGG or GTDB-Tk.
However, we found that sourmash did not work as we expected.
Here are more details about the project.
Based on the known taxonomic information for each genome, we group the genomes at the family level. Within each family, we use sourmash to generate a pairwise ANI matrix that shows how similar or different the genomes are. Then we use our custom scripts to cluster the genomes at an ANI threshold of 95 or 98. Finally, we use completeness and contamination to pick a representative from each genome cluster.
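For concreteness, the clustering and representative-selection steps described above could look roughly like the sketch below. This is not our actual script. It assumes the pairwise ANI values are already loaded into a dictionary keyed by genome pairs, and the quality score (completeness minus five times contamination) is an assumed ranking heuristic. All names here are hypothetical.

```python
from typing import Dict, List, Tuple


def quality_score(completeness: float, contamination: float) -> float:
    """Assumed ranking heuristic: reward completeness, penalize contamination."""
    return completeness - 5.0 * contamination


def greedy_cluster(
    quality: Dict[str, Tuple[float, float]],
    ani: Dict[Tuple[str, str], float],
    threshold: float = 0.95,
) -> Dict[str, List[str]]:
    """Greedy clustering: visit genomes from best to worst quality.

    A genome joins the most similar existing representative if their ANI
    is at least `threshold`; otherwise it becomes a new representative.
    Missing pairs are treated as ANI 0 (no detectable similarity).
    Returns a mapping of representative -> cluster members.
    """

    def pair_ani(a: str, b: str) -> float:
        return ani.get((a, b), ani.get((b, a), 0.0))

    # Best quality first; genome name breaks ties so the order is deterministic.
    ranked = sorted(quality, key=lambda g: (-quality_score(*quality[g]), g))

    clusters: Dict[str, List[str]] = {}
    for genome in ranked:
        best_rep, best_ani = None, threshold
        for rep in clusters:
            value = pair_ani(genome, rep)
            if value >= best_ani:
                best_rep, best_ani = rep, value
        if best_rep is None:
            clusters[genome] = [genome]
        else:
            clusters[best_rep].append(genome)
    return clusters


if __name__ == "__main__":
    # Toy data: (completeness %, contamination %) per genome.
    quality = {"g1": (99.0, 1.0), "g2": (95.0, 2.0), "g3": (90.0, 0.5)}
    ani = {("g1", "g2"): 0.97, ("g1", "g3"): 0.85, ("g2", "g3"): 0.84}
    for cutoff in (0.95, 0.98):
        print(cutoff, greedy_cluster(quality, ani, threshold=cutoff))
```

Because the genomes are visited in quality order, each cluster's representative is automatically the highest-scoring member under the assumed score.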
The related command we used:
💡 Which parameters should we adjust to increase specificity (true negatives), e.g. k-mer size, scaled, or bp to match?
💡 We checked issue #3070 and think we may have similar concerns about sourmash.
💡 Should we change the k-mer size depending on the target ANI in this case, say k=51 for strain-level and k=31 for species-level resolution?
Thanks for your contributions to sourmash and the branchwater plugin.