more scaffolding updates #511

Merged
merged 24 commits into from
Feb 17, 2024

@dpark01 dpark01 commented Feb 13, 2024

scaffold_and_refine_multitaxa:

  • more resilience to empty outputs or partial recovery of multi-segment genomes
  • update input reference genome data structure to tsv instead of json string
  • auto-filter input reference genome list (optionally) based on kraken hits
  • run polish only if scaffolding succeeds; if it fails, fall back to reference-based assembly as a last-resort ("hail mary") attempt
  • more top-level (terra table) outputs
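
The optional kraken-based filtering of the reference list could be sketched roughly as follows in Python. This is only an illustration: the TSV columns (`taxid`, `tax_name`, `accessions`), the placeholder values, and the function name `filter_references` are all assumptions, not the workflow's actual schema or code.

```python
import csv
import io

def filter_references(ref_tsv_text, kraken_taxids=None):
    """Return reference rows, optionally restricted to taxids with classifier hits.

    ref_tsv_text  -- TSV text with a header row (column names are hypothetical)
    kraken_taxids -- set of taxid strings that had hits, or None to skip filtering
    """
    rows = list(csv.DictReader(io.StringIO(ref_tsv_text), delimiter="\t"))
    if kraken_taxids is None:
        # Filtering is optional: with no hit list, keep every reference.
        return rows
    return [row for row in rows if row["taxid"] in kraken_taxids]

# Placeholder data, not real taxids or accessions.
example_tsv = (
    "taxid\ttax_name\taccessions\n"
    "11111\tvirus A\tACC1:ACC2\n"
    "22222\tvirus B\tACC3\n"
)
kept = filter_references(example_tsv, kraken_taxids={"22222"})
```

An empty hit set yields an empty list here, which is the case the downstream steps would need to tolerate.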

classify_single:

  • more top-level workflow outputs summarizing primary viral taxa found, and how much

containers:

  • bump viral-classify 2.2.3.0 to 2.2.4.0

build:

  • update cromwell/womtool version

@@ -79,7 +70,7 @@ workflow scaffold_and_refine_multitaxa {
 "assembly_length_unambiguous" : refine.assembly_length_unambiguous,
 "reads_aligned" : refine.align_to_self_merged_reads_aligned,
 "mean_coverage" : refine.align_to_self_merged_mean_coverage,
-"percent_reference_covered" : 1.0 * refine.assembly_length_unambiguous / refine.reference_genome_length,
+"percent_reference_covered" : select_first([percent_reference_covered, 0.0]),
Member


It may be nice to break out tax_name and percent_reference_covered for the "top" viral assembly into separate workflow outputs, for easier search and filtering on Terra (where "top" could be defined as the most complete assembly, or the most abundant taxon in terms of # of reads or # of matching distinct k-mers).

Member Author


Added to the TODO comments at the bottom of the WDL. I think this will require a small bespoke TSV-parsing task. It will also need to be resilient to the empty-output scenario (i.e., there is no top assembly because none were attempted or none succeeded).
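
The logic such a task would need might look like the Python sketch below. It is a sketch under assumptions, not the task that was eventually written: the function name `top_assembly`, the TSV layout, and the choice of ranking column are hypothetical, with only `tax_name`, `percent_reference_covered`, and `reads_aligned` taken from the names used in this thread.

```python
import csv
import io

def _to_float(row, key):
    """Parse row[key] as a float, treating missing or malformed values as 0.0."""
    try:
        return float(row.get(key) or 0.0)
    except ValueError:
        return 0.0

def top_assembly(stats_tsv_text, rank_key="percent_reference_covered"):
    """Pick the 'top' assembly from a per-taxon stats TSV (hypothetical layout).

    Returns empty defaults when no assembly was attempted or none succeeded,
    so the caller always gets a well-formed result.
    """
    rows = list(csv.DictReader(io.StringIO(stats_tsv_text), delimiter="\t"))
    # Keep only rows with a positive score in the ranking column.
    candidates = [row for row in rows if _to_float(row, rank_key) > 0.0]
    if not candidates:
        return {"tax_name": "", "percent_reference_covered": 0.0}
    best = max(candidates, key=lambda row: _to_float(row, rank_key))
    return {
        "tax_name": best.get("tax_name") or "",
        "percent_reference_covered": _to_float(best, "percent_reference_covered"),
    }

# Placeholder data for illustration only.
example_stats = (
    "taxid\ttax_name\tpercent_reference_covered\treads_aligned\n"
    "1\tvirus A\t0.42\t100\n"
    "2\tvirus B\t0.97\t50\n"
)
by_completeness = top_assembly(example_stats)
by_reads = top_assembly(example_stats, rank_key="reads_aligned")
nothing_assembled = top_assembly("")
```

Passing a different `rank_key` switches the definition of "top" between the most complete assembly and the most abundant taxon, which are the alternatives raised in the review comment.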


Int num_read_groups = refine.num_read_groups[0]
Int num_libraries = refine.num_libraries[0]
Array[Map[String,String]] assembly_stats_by_taxon = stats_by_taxon
Member

@tomkinsc tomkinsc Feb 15, 2024


Any reason we can't make this type Map[ String, Map[String,String] ], where the outer map String keys are the taxid or tax_name values? (for picking out values for a given taxon in downstream analyses)

Member Author


Mostly just because of how we construct it (see the scatter in the WDL above), and that WDL 1.0 lacks a lot of the basic methods for navigating Maps and converting back and forth with Arrays.
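
While the output stays an array of maps, the re-keying can be done downstream of the workflow. A minimal Python sketch, assuming each map carries a `taxid` (or `tax_name`) entry; the function name `index_by_taxon` and the sample values are hypothetical:

```python
def index_by_taxon(stats_by_taxon, key="taxid"):
    """Convert a list of per-taxon dicts into a dict keyed by the given field.

    Entries lacking the key are skipped; if a key repeats, the later entry wins.
    """
    indexed = {}
    for entry in stats_by_taxon:
        value = entry.get(key)
        if not value:
            continue  # no usable key, so the entry cannot be indexed
        indexed[value] = dict(entry)  # copy so the input is left untouched
    return indexed

# Placeholder records mirroring the Array[Map[String,String]] shape.
assembly_stats = [
    {"taxid": "1", "tax_name": "virus A", "mean_coverage": "12.5"},
    {"taxid": "2", "tax_name": "virus B", "mean_coverage": "80.0"},
]
by_taxid = index_by_taxon(assembly_stats)
by_name = index_by_taxon(assembly_stats, key="tax_name")
```

An empty input list simply produces an empty dict, so the empty-output case needs no special handling.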

@dpark01 dpark01 marked this pull request as ready for review February 16, 2024 23:08
@dpark01 dpark01 merged commit 847d661 into master Feb 17, 2024
12 checks passed
@dpark01 dpark01 deleted the dp-scaffold branch March 1, 2024 13:53