the best way to de-replicate denovo genome set #3531

yuzie0314 · 2025-02-12T04:07:02Z

Dear authors,

Our team wants to use sourmash to down-select a genome set such as UHGG or gtdbtk.
However, we found that sourmash didn't work as we expected.
Here is some more detail about the project.
Based on the known taxonomic information for each genome, we group the genomes at the family level. Within each family, we use sourmash to generate a pairwise ANI matrix that shows how similar or different the genomes are. We then use our custom scripts to cluster the genomes at an ANI threshold of 95% or 98%. Finally, we use completeness and contamination to pick a representative from each genome cluster.
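
For readers following along, the workflow described above can be sketched roughly as follows. This is only an illustration, not the poster's actual script: the function name `greedy_cluster`, the toy matrix, and the quality score (completeness minus 5 × contamination, a commonly used heuristic) are all assumptions.

```python
import numpy as np

def greedy_cluster(ani, quality, threshold=0.95):
    """Greedy representative-based clustering (illustrative sketch).

    ani: square symmetric matrix of ANI values in [0, 1]
    quality: one score per genome, higher is better
    Returns (list of representative indices, representative index per genome).
    """
    ani = np.asarray(ani, dtype=float)
    # Visit genomes from best to worst quality, so the best genome
    # in each cluster becomes its representative.
    order = np.argsort(-np.asarray(quality, dtype=float), kind="stable")
    reps = []
    rep_of = [-1] * len(order)
    for i in order:
        i = int(i)
        for r in reps:
            # Join the first existing representative that is similar enough.
            if ani[i, r] >= threshold:
                rep_of[i] = r
                break
        else:
            # No representative is close enough: start a new cluster.
            reps.append(i)
            rep_of[i] = i
    return reps, rep_of

# Toy data (hypothetical): genomes 0 and 1 are near-identical, genome 2 is distinct.
ani = [[1.00, 0.97, 0.80],
       [0.97, 1.00, 0.81],
       [0.80, 0.81, 1.00]]
completeness = [98.0, 99.5, 90.0]
contamination = [1.0, 0.5, 2.0]
# Assumed quality heuristic: completeness - 5 * contamination.
quality = [c - 5 * x for c, x in zip(completeness, contamination)]

reps, rep_of = greedy_cluster(ani, quality, threshold=0.95)
print(reps, rep_of)
```

Because genomes are visited in order of decreasing quality, each cluster's representative is automatically its highest-scoring member under the chosen score.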
The commands we used are below.

How we compute the signature for each genome:

    sourmash sketch dna -p scaled=1000,k=21,k=31,k=51,abund --name ${name} ${fasta} -o ./${name}.sig

How we generate a large pairwise ANI matrix:

    # we set the k-mer size (params.kmer) to 31
    sourmash scripts pairwise --cores ${task.cpus} --ksize ${params.kmer} --output ani_pairwise.csv --ani --write-all sig.path
    sourmash scripts pairwise_to_matrix ani_pairwise.csv -o ani_matrix.numpy -u average_containment_ani
    # post-process the matrix built from the pairwise results
    python ${params.bin}/adjust_ani_matrix.py --input ani_matrix.numpy --output ani_matrix_adjusted.numpy
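
As a rough illustration of what turning the pairwise CSV into a matrix involves, here is a minimal Python sketch. It is not the plugin's implementation. The column names `query_name` and `match_name` are assumptions about the CSV layout (only `average_containment_ani` appears in the commands above), and the inline CSV is made-up data.

```python
import csv
import io
import numpy as np

def pairwise_csv_to_matrix(handle, value_column="average_containment_ani",
                           a_column="query_name", b_column="match_name"):
    """Build a symmetric matrix from a long-format pairwise CSV.

    Column names are assumptions; adjust them to match the real file.
    Pairs that are absent from the CSV are left at 0.0.
    """
    rows = list(csv.DictReader(handle))
    names = sorted({r[a_column] for r in rows} | {r[b_column] for r in rows})
    index = {name: i for i, name in enumerate(names)}
    mat = np.zeros((len(names), len(names)))
    np.fill_diagonal(mat, 1.0)  # every genome is identical to itself
    for r in rows:
        i, j = index[r[a_column]], index[r[b_column]]
        # Treat an empty field as 0.0 (no detectable similarity).
        mat[i, j] = mat[j, i] = float(r[value_column] or 0.0)
    return names, mat

# Made-up example input standing in for ani_pairwise.csv.
example = io.StringIO(
    "query_name,match_name,average_containment_ani\n"
    "gA,gB,0.97\n"
    "gA,gC,0.0\n"
)
names, mat = pairwise_csv_to_matrix(example)
print(names)
print(mat)
```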

💡 Which parameters should we adjust to increase specificity (the true negative rate), e.g. k-mer size, scaled, or the bp needed to match?
💡 We looked at issue #3070 and think we may have similar doubts about sourmash.
💡 Should we change the k-mer size depending on the target ANI for this case? For example, k=51 for strain-level and k=31 for species-level resolution?

Thanks for your contributions to sourmash and the branchwater plugin.


ctb · 2025-02-12T13:50:43Z

Hi @yuzie0314 this should all work, to my mind. The default parameters should be fine, I think - k=31 in particular - and we have quite a few results suggesting that sourmash ANI estimates are reasonably accurate (i.e. compare well with ANIm/ANIb).

Where are you seeing problems, or mismatches with what you expect?

The only potential issue that I see is with clustering at the family level. There is relatively little k-mer overlap between genomes at the family level. So the clustering there might not work well. However, you should be able to use clustering on the whole set and get good ~genus-level clusters without any prior taxonomic information.

yuzie0314 · 2025-02-13T03:38:54Z

Hi @ctb, thanks for your quick reply!

We group genomes by their known taxonomic labels before generating the pairwise ANI matrix because we want to reduce the resources needed to run branchwater and the greedy clustering step (a custom script). So, following your suggestion, maybe we can try running on the full set of genomes instead of on subsets?

ctb · 2025-02-13T14:30:23Z

I would expect it to work fine with ~60,000 genomes or so; we've been able to run pairwise on that many genomes with relatively low resources.

But, I wanted to come back to this:

> However, we found that sourmash didn't work as we expected.

I'm curious - what didn't work as expected? I'm happy to help but would need some specifics...

ctb · 2025-02-13T14:34:00Z

pairwise benchmarks: sourmash-bio/sourmash_plugin_branchwater#247 (comment)

2.5 hrs, 4.5 GB of RAM, 16 threads for pairwise on 60,000 genomes.
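
To put the benchmark in perspective, an all-vs-all comparison of n genomes involves n(n-1)/2 pairs. The short sketch below works that out for 60,000 genomes and, purely as a hypothetical, for the same genomes split into 100 equal-sized groups of 600 (the group sizes are an assumption, not something from this thread).

```python
def n_pairs(n):
    """Number of unordered pairs among n genomes: n * (n - 1) / 2."""
    return n * (n - 1) // 2

# All-vs-all on the full set.
total = n_pairs(60_000)

# Hypothetical split: 100 groups of 600 genomes, compared only within groups.
split = 100 * n_pairs(600)

print(f"full set: {total:,} pairs")
print(f"split:    {split:,} pairs")
```

Splitting cuts the number of comparisons by roughly the number of groups, but genomes placed in different groups are never compared with each other.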
