scaffolding and reference selection based on ANI #528

Merged
merged 22 commits into from
Mar 25, 2024

Conversation

dpark01
Member

@dpark01 dpark01 commented Mar 19, 2024

This PR introduces ANI-based mechanisms for reference selection using skani. This:

  • updates viral-assemble to 2.3.1.1
  • adds a task select_references to tasks_assembly.wdl that selects a subset of reference genomes based on ANI similarity to a given set of contigs/MAGs. The task also clusters similar reference genomes by ANI and keeps only the top reference per cluster.
  • changes the behavior of task scaffold in tasks_assembly.wdl to choose a reference genome based on ANI when multiple reference genomes are provided.
  • changes the behavior of scaffold_and_refine_multitaxa to:
    • select a subset of reference genomes to scaffold against using ANI similarity to the contigs instead of a kraken-based reference selection
    • call the scaffolding and refine steps on a per-reference-cluster basis instead of per-reference-taxon, forcing a choice of the most appropriate taxon among highly related taxa
    • no longer fall back to reference-based assembly, since de novo scaffolding should always work if an ANI hit was found
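
To illustrate the "cluster the references, then keep the top one per cluster" idea described above, here is a minimal Python sketch. It is not the select_references implementation and does not call skani: the function name `select_top_references`, the 90% clustering threshold, the single-linkage clustering rule, and the reference labels are all assumptions made for illustration, and the ANI values are assumed to be precomputed.

```python
from collections import defaultdict


def select_top_references(contig_hits, ref_pairs, cluster_ani=90.0):
    """Group references into ANI clusters and rank each cluster by contig ANI.

    contig_hits: {reference name: ANI (%) of that reference to the contigs}
    ref_pairs:   iterable of (ref_a, ref_b, ANI %) between pairs of references
    cluster_ani: hypothetical threshold above which two references are linked

    Returns a list of clusters; each cluster is a list of reference names with
    the best contig hit first, and clusters are ordered by their best hit.
    """
    # Union-find over the references that had any ANI hit to the contigs.
    parent = {ref: ref for ref in contig_hits}

    def find(ref):
        while parent[ref] != ref:
            parent[ref] = parent[parent[ref]]  # path halving
            ref = parent[ref]
        return ref

    # Single-linkage clustering: link any two references above the threshold.
    for ref_a, ref_b, ani in ref_pairs:
        if ref_a in parent and ref_b in parent and ani >= cluster_ani:
            parent[find(ref_a)] = find(ref_b)

    clusters = defaultdict(list)
    for ref in contig_hits:
        clusters[find(ref)].append(ref)

    # Best contig hit first within each cluster, then best cluster first.
    ranked = [
        sorted(members, key=lambda r: (-contig_hits[r], r))
        for members in clusters.values()
    ]
    ranked.sort(key=lambda members: (-contig_hits[members[0]], members[0]))
    return ranked


# Made-up example values: two closely related references and one unrelated one.
hits = {"virusA_ref1": 97.0, "virusA_ref2": 95.0, "virusB_ref1": 99.0}
pairs = [("virusA_ref1", "virusA_ref2", 93.0), ("virusA_ref1", "virusB_ref1", 60.0)]
clusters = select_top_references(hits, pairs)
top_per_cluster = [members[0] for members in clusters]
print(top_per_cluster)
```

In this toy input the two virusA references fall into one cluster and only the better-matching one is kept, while the unrelated virusB reference survives as its own cluster, which is the property that still allows coinfections of unrelated taxa.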

This updated version of scaffold_and_refine_multitaxa should be far more efficient at metagenomic reference selection, produce less noisy output (fewer secondary taxon hits), still allow for diverse coinfections of unrelated taxa, and perform well with a very large reference genome database as input.

This PR also introduces an unrelated change to modify the workflow terra_tsv_to_table to accept and concatenate multiple input tsv files.
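
As a rough illustration of what concatenating several input tsv files can look like, here is a small Python sketch. It is not the terra_tsv_to_table code: the function name `concat_tsvs` is hypothetical, and taking the union of columns while leaving missing cells empty is an assumption about how files with differing headers might be handled.

```python
import csv
import io


def concat_tsvs(tsv_texts):
    """Concatenate TSV documents (given as strings) into one TSV string.

    Assumption: the output header is the union of all input headers in
    first-seen order, and cells missing from an input are left empty.
    """
    header = []
    rows = []
    for text in tsv_texts:
        reader = csv.DictReader(io.StringIO(text), delimiter="\t")
        for column in reader.fieldnames or []:
            if column not in header:
                header.append(column)
        rows.extend(reader)

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=header, delimiter="\t", restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Made-up example inputs with partially overlapping columns.
first = "sample\treads\ns1\t10\n"
second = "sample\ttaxon\ns2\tx\n"
merged = concat_tsvs([first, second])
```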

@dpark01 dpark01 self-assigned this Mar 19, 2024
@dpark01 dpark01 changed the title from "WiP: scaffolding multitaxa improvements [not ready]" to "scaffolding and reference selection based on ANI" Mar 19, 2024
@dpark01 dpark01 marked this pull request as ready for review March 19, 2024 20:33
@dpark01
Member Author

dpark01 commented Mar 25, 2024

After a few runs on production data, the general methodological changes here look good. Some parameter tuning probably remains to improve the clustering/matching between various rhino C species, but that can be explored separately and introduced in a later PR.

@dpark01 dpark01 merged commit 6a8f9c1 into master Mar 25, 2024
12 checks passed
@dpark01 dpark01 deleted the dp-scaffold branch March 25, 2024 12:33