more scaffolding updates #511

Merged
merged 24 commits into from
Feb 17, 2024
Commits
7064708
defend against rather common empty output scenario
dpark01 Feb 5, 2024
0593083
more compliant wdl
dpark01 Feb 5, 2024
0bedc96
add new wdl task report_primary_kraken_taxa
dpark01 Feb 7, 2024
98f9bbd
add report_primary_kraken_taxa wdl task and add to classify_single
dpark01 Feb 7, 2024
e39919b
add a few more outputs
dpark01 Feb 7, 2024
a85d7c9
Merge remote-tracking branch 'origin/master' into dp-scaffold
dpark01 Feb 8, 2024
9e12088
try wdl 1.1 and see what happens
dpark01 Feb 8, 2024
02cf671
try wdl development and see what happens
dpark01 Feb 8, 2024
d824518
update to take tsv instead of json input for reference/tax map
dpark01 Feb 8, 2024
fa07252
attempt to not fail in scaffolding when some but not all segments of …
dpark01 Feb 13, 2024
031a294
forgot $
dpark01 Feb 13, 2024
8a9b26f
remove random empty newline introduced in this branch
dpark01 Feb 13, 2024
165eb66
fix bash logical construction
dpark01 Feb 14, 2024
8c898c9
Merge remote-tracking branch 'origin/master' into dp-scaffold
dpark01 Feb 14, 2024
1080d49
initial draft of task for filtering reference list
dpark01 Feb 14, 2024
1a77bf7
pre-extract taxdump tarball
dpark01 Feb 14, 2024
d31c14a
add optional kraken-based reference selection to multitaxa
dpark01 Feb 15, 2024
526cece
why cromwell do you behave poorly on edge cases
dpark01 Feb 16, 2024
f02a58b
more stats and outputs, revert to refbased if cant denovo, dont polis…
dpark01 Feb 16, 2024
ca24b2d
Merge remote-tracking branch 'origin/master' into dp-scaffold
dpark01 Feb 16, 2024
bc6bee7
simplify cromwell fix
dpark01 Feb 16, 2024
6a71e1a
Merge branch 'master' into dp-scaffold
dpark01 Feb 16, 2024
93d455f
bump viral-classify 2.2.3.0 to 2.2.4.1
dpark01 Feb 16, 2024
88ca4d1
revert version
dpark01 Feb 16, 2024
31 changes: 20 additions & 11 deletions pipes/WDL/workflows/scaffold_and_refine_multitaxa.wdl
@@ -1,6 +1,7 @@
version 1.0

import "../tasks/tasks_assembly.wdl" as assembly
+import "../tasks/tasks_metagenomics.wdl" as metagenomics
import "../tasks/tasks_ncbi.wdl" as ncbi
import "../tasks/tasks_utils.wdl" as utils
import "assemble_refbased.wdl" as assemble_refbased
@@ -18,11 +19,23 @@ workflow scaffold_and_refine_multitaxa {
File reads_unmapped_bam

File taxid_to_ref_accessions_tsv
+File? focal_report_tsv
+File? ncbi_taxdump_tgz

# Float min_pct_reference_covered = 0.1
}

-Array[Array[String]] taxid_to_ref_accessions = read_tsv(taxid_to_ref_accessions_tsv)
+# if kraken reports are available, filter scaffold list to observed hits (output might be empty!)
+if(defined(focal_report_tsv) && defined(ncbi_taxdump_tgz)) {
+call metagenomics.filter_refs_to_found_taxa {
+input:
+taxid_to_ref_accessions_tsv = taxid_to_ref_accessions_tsv,
+taxdump_tgz = select_first([ncbi_taxdump_tgz]),
+focal_report_tsv = select_first([focal_report_tsv])
+}
+}
+
+Array[Array[String]] taxid_to_ref_accessions = read_tsv(select_first([filter_refs_to_found_taxa.filtered_taxid_to_ref_accessions_tsv, taxid_to_ref_accessions_tsv]))
Array[String] assembly_header = ["sample_id", "taxid", "tax_name", "assembly_fasta", "aligned_only_reads_bam", "coverage_plot", "assembly_length", "assembly_length_unambiguous", "reads_aligned", "mean_coverage", "percent_reference_covered", "intermediate_gapfill_fasta", "assembly_preimpute_length_unambiguous", "replicate_concordant_sites", "replicate_discordant_snps", "replicate_discordant_indels", "replicate_discordant_vcf", "isnvsFile", "aligned_bam", "coverage_tsv", "read_pairs_aligned", "bases_aligned"]

scatter(taxon in taxid_to_ref_accessions) {
@@ -51,8 +64,8 @@ workflow scaffold_and_refine_multitaxa {
reference_fasta = scaffold.scaffold_fasta,
sample_name = sample_id
}
-# to do: if percent_reference_covered > some threshold, run ncbi.rename_fasta_header and ncbi.align_and_annot_transfer_single
-# to do: if biosample attributes file provided, run ncbi.biosample_to_genbank
+# TO DO: if percent_reference_covered > some threshold, run ncbi.rename_fasta_header and ncbi.align_and_annot_transfer_single
+# TO DO: if biosample attributes file provided, run ncbi.biosample_to_genbank

if (refine.reference_genome_length > 0) {
Float percent_reference_covered = 1.0 * refine.assembly_length_unambiguous / refine.reference_genome_length
@@ -100,14 +113,10 @@ workflow scaffold_and_refine_multitaxa {
}

output {
-Array[Map[String,String]] assembly_stats_by_taxon = stats_by_taxon
-File assembly_stats_by_taxon_tsv = concatenate.combined
-
-Int num_read_groups = refine.num_read_groups[0]
-Int num_libraries = refine.num_libraries[0]
+Array[Map[String,String]] assembly_stats_by_taxon = stats_by_taxon
@tomkinsc (Member), Feb 15, 2024

Any reason we can't make this type Map[ String, Map[String,String] ], where the outer map String keys are the taxid or tax_name values? (for picking out values for a given taxon in downstream analyses)

Reply from the PR author (Member):

Mostly just because of how we construct it (see the scatter in the WDL above), and that WDL 1.0 lacks a lot of the basic methods for navigating Maps and converting back and forth with Arrays.
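
For the downstream use case raised in this thread, one option is to build the keyed lookup outside the workflow. The following is a minimal Python sketch, not part of this PR. It assumes that `assembly_stats_by_taxon_tsv` is tab-separated with a header row whose column names match `assembly_header` (including `taxid`) and that each taxid appears at most once. The function name `index_stats_by_taxid` and the example rows are hypothetical.

```python
import csv
import io


def index_stats_by_taxid(tsv_text):
    """Return {taxid: row_dict} from a tab-separated table with a header row.

    Assumes a 'taxid' column exists and that taxids are unique.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    by_taxid = {}
    for row in reader:
        taxid = row["taxid"]
        if taxid in by_taxid:
            # Fail loudly rather than silently overwrite an earlier row.
            raise ValueError(f"duplicate taxid: {taxid}")
        by_taxid[taxid] = row
    return by_taxid


# Hypothetical rows using a few of the assembly_header columns;
# the numeric values are made up for illustration.
example_tsv = (
    "sample_id\ttaxid\ttax_name\tassembly_length\n"
    "S1\t11320\tInfluenza A virus\t13000\n"
    "S1\t2697049\tSARS-CoV-2\t29800\n"
)

stats = index_stats_by_taxid(example_tsv)
print(stats["2697049"]["tax_name"])
```

This gives the same shape as the suggested `Map[String, Map[String,String]]` (outer key is the taxid, inner keys are column names) while leaving the workflow output type unchanged.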

+File assembly_stats_by_taxon_tsv = concatenate.combined
+String assembly_method = "viral-ngs/scaffold_and_refine_multitaxa"

-String assembly_method = "viral-ngs/scaffold_and_refine_multitaxa"
-String scaffold_viral_assemble_version = scaffold.viralngs_version[0]
-String refine_viral_assemble_version = refine.viral_assemble_version[0]
+# TO DO: some summary stats on stats_by_taxon: how many rows, numbers from the best row, etc
}
}