the best way to de-replicate denovo genome set #3531

Open
yuzie0314 opened this issue Feb 12, 2025 · 4 comments


@yuzie0314

yuzie0314 commented Feb 12, 2025

Dear authors,

Our team wants to use sourmash to down-select (de-replicate) a genome set such as UHGG or GTDB-Tk.
However, we found that sourmash didn't work as we expected.
Here are the details of the project.
Using the known taxonomic information for each genome, we first group the genomes at the family level. Within each family, we use sourmash to generate a pairwise ANI matrix that shows how similar or different the genomes are. We then use our custom scripts to cluster the genomes at an ANI threshold of 95% or 98%. Finally, we use completeness and contamination to pick the representative set from those genome clusters.
The commands we used:

  1. Compute the signature for each genome:
    sourmash sketch dna -p scaled=1000,k=21,k=31,k=51,abund --name ${name} ${fasta} -o ./${name}.sig
  2. Generate one large pairwise ANI matrix:
    sourmash scripts pairwise --cores ${task.cpus} --ksize ${params.kmer} --output ani_pairwise.csv --ani --write-all sig.path # we set kmer to 31
    sourmash scripts pairwise_to_matrix ani_pairwise.csv -o ani_matrix.numpy -u average_containment_ani
    python ${params.bin}/adjust_ani_matrix.py --input ani_matrix.numpy --output ani_matrix_adjusted.numpy # convert the pairwise result into a matrix
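
For readers who want to see the clustering and representative-selection steps described above in concrete form, here is a minimal Python sketch. It is not the custom script used in this project. It swaps in plain single-linkage clustering (union-find) rather than a greedy scheme, it assumes the pairwise CSV has `query_name`, `match_name` and `average_containment_ani` columns with ANI expressed as a fraction between 0 and 1, and it ranks genomes with the heuristic `completeness - 5 * contamination`, which is an assumption for illustration only. The genome names and quality values are made up.

```python
import csv
import io
from collections import defaultdict


def cluster_by_ani(rows, threshold, ani_column="average_containment_ani"):
    """Single-linkage clustering: link two genomes when their ANI >= threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for row in rows:
        a, b = row["query_name"], row["match_name"]
        ra, rb = find(a), find(b)  # registers both genomes even if they never link
        ani = float(row[ani_column] or 0)  # treat an empty cell as "no similarity"
        if a != b and ani >= threshold and ra != rb:
            parent[ra] = rb

    clusters = defaultdict(list)
    for name in list(parent):
        clusters[find(name)].append(name)
    return [sorted(members) for members in clusters.values()]


def pick_representative(members, quality):
    """Pick the best genome; quality maps name -> (completeness, contamination) in percent."""

    def score(name):
        completeness, contamination = quality[name]
        return completeness - 5 * contamination  # assumed heuristic, adjust as needed

    return max(sorted(members), key=score)


# Hypothetical stand-in for ani_pairwise.csv (only the columns used here).
demo_csv = """query_name,match_name,average_containment_ani
gA,gA,1.0
gA,gB,0.991
gA,gC,0.962
gB,gC,0.958
gC,gD,0.90
"""
# Hypothetical completeness / contamination values.
demo_quality = {"gA": (98.0, 2.0), "gB": (99.0, 0.5), "gC": (95.0, 1.0), "gD": (90.0, 0.0)}

rows = list(csv.DictReader(io.StringIO(demo_csv)))
for threshold in (0.95, 0.98):
    clusters = cluster_by_ani(rows, threshold)
    reps = sorted(pick_representative(c, demo_quality) for c in clusters)
    print(threshold, len(clusters), reps)
```

Single linkage can chain loosely related genomes into one cluster, so a greedy or complete-linkage variant may give tighter clusters at the same threshold.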

💡 Which parameters should we adjust to increase specificity (the true-negative rate), e.g. kmer, scaled, or bp to match?
💡 We looked at issue #3070 and think we may have similar doubts about sourmash.
💡 Should we change kmer depending on the target ANI in this case, say kmer=51 for strain-level and kmer=31 for species-level resolution?
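
To make the k-mer size question more concrete, the sketch below shows how the expected fraction of shared k-mers changes with k at the two ANI thresholds mentioned here. It assumes a simple model in which every base mutates independently, so a k-mer is shared with probability ANI^k. This is an illustration under that assumption, not a description of how sourmash computes its ANI estimate, and the function names are hypothetical.

```python
def expected_containment(ani, ksize):
    """Expected fraction of shared k-mers under a simple independent-mutation model."""
    return ani ** ksize


def ani_from_containment(containment, ksize):
    """Invert the model above: a point estimate of ANI from k-mer containment."""
    return containment ** (1.0 / ksize)


for ksize in (21, 31, 51):
    c95 = expected_containment(0.95, ksize)
    c98 = expected_containment(0.98, ksize)
    print(f"k={ksize}: containment at 95% ANI = {c95:.3f}, at 98% ANI = {c98:.3f}")
```

Under this model, a larger k leaves fewer shared k-mers at the same ANI, so with a fixed `scaled` value there are fewer shared hashes from which to estimate the similarity of distant pairs.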

Thanks for your contributions to sourmash and the branchwater plugin.

@ctb
Contributor

ctb commented Feb 12, 2025

Hi @yuzie0314 this should all work, to my mind. The default parameters should be fine, I think - k=31 in particular - and we have quite a few results suggesting that sourmash ANI estimates are reasonably accurate (i.e. compare well with ANIm/ANIb).

Where are you seeing problems, or mismatches with what you expect?

The only potential issue that I see is with clustering at the family level. There is relatively little k-mer overlap between genomes at the family level. So the clustering there might not work well. However, you should be able to use clustering on the whole set and get good ~genus-level clusters without any prior taxonomic information.

@yuzie0314
Author

Hi @ctb, thanks for your quick reply!

We group genomes by their known taxonomic labels before generating the pairwise ANI matrix because we want to reduce the resources needed to run branchwater and the greedy clustering (a custom script). So, following your suggestion, maybe we can try running on the full set of genomes instead of on subsets?

@ctb
Contributor

ctb commented Feb 13, 2025

I would expect it to work fine with ~60,000 genomes or so; we've been able to run pairwise on that many genomes with relatively low resources.

But, I wanted to come back to this:

> However, we found that sourmash didn't work as we expected.

I'm curious - what didn't work as expected? I'm happy to help but would need some specifics...

@ctb
Contributor

ctb commented Feb 13, 2025

pairwise benchmarks: sourmash-bio/sourmash_plugin_branchwater#247 (comment)

2.5 hrs, 4.5 GB of RAM, 16 threads for pairwise on 60,000 genomes.
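
As a rough back-of-the-envelope reading of these figures, the sketch below counts the comparisons involved, assuming pairwise evaluates every unordered pair of genomes exactly once. The three family sizes are hypothetical and only illustrate how much an up-front split by family would reduce the number of pairs.

```python
def n_pairs(n_genomes):
    """Number of unordered genome pairs in an all-vs-all comparison."""
    return n_genomes * (n_genomes - 1) // 2


genomes = 60_000  # from the benchmark above
hours = 2.5
threads = 16

pairs = n_pairs(genomes)
per_second = pairs / (hours * 3600)
print(f"{pairs:,} pairs, ~{per_second:,.0f} comparisons/s overall, "
      f"~{per_second / threads:,.0f} per thread")

# Hypothetical split of the same 60,000 genomes into three families.
family_sizes = [30_000, 20_000, 10_000]
within_family = sum(n_pairs(size) for size in family_sizes)
print(f"within-family pairs: {within_family:,} ({within_family / pairs:.0%} of the full set)")
```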
