add new workflow scaffold_and_refine_multitaxa #506
String sample_id
File reads_unmapped_bam

Array[Pair[Int,Array[String]+]] taxid_to_ref_accessions = [
What about HAdV?
There are quite a few, but if we do want to include them:
# HAdV reference genomes, via https://www.ncbi.nlm.nih.gov/genomes/GenomesGroup.cgi?taxid=10509&host=human
(129875, ["NC_001460.1"]), # Human mastadenovirus A strain:Huie; serotype:Human adenovirus 12; culture-collection:ATCC:VR-863
(108098, ["NC_011203.1"]), # Human adenovirus B1
(108098, ["NC_011202.1"]), # Human adenovirus B2
(129951, ["NC_001405.1"]), # Human mastadenovirus C serotype:Human adenovirus 2
(130310, ["NC_010956.1"]), # Human mastadenovirus D strain:Hicks; NIAID V-209-003-014; serotype:Human adenovirus 9
(130308, ["NC_003266.2"]), # Human mastadenovirus E strain:vaccine (CL 68578); serotype:human adenovirus 4
(130309, ["NC_001454.1"]), # Human mastadenovirus F strain:Dugan; serotype:Human adenovirus 40
(310540, ["NC_006879.1"]), # Simian adenovirus 1 strain:ATCC VR-195
(1123958, ["NC_017825.1"]), # Chimpanzee adenovirus Y25
I'll get some proposed edits to this default list from Jillian soon
reference_fasta = scaffold.scaffold_fasta,
sample_name = sample_id
}
# to do: if pre-impute unambig length > some fraction of ref genome, run ncbi.rename_fasta_header and ncbi.align_and_annot_transfer_single
Will "some fraction of ref genome" be a workflow parameter?
Also, maybe we try Liftoff for annotation transfer (no chain file required—it does the alignment via minimap2).
Yeah I think that's right... this is all obviously to do at some future stage.
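As a rough illustration of the threshold discussed above, here is a minimal shell sketch. It assumes the check compares the count of unambiguous (A/C/G/T) bases in the assembly against a fraction of the reference length. The function names and the `min_fraction` argument are hypothetical and not part of this PR.

```shell
# Hypothetical helpers; none of these names exist in the PR.

# Count unambiguous (A/C/G/T) bases in a FASTA, skipping header lines.
count_unambig() {
  awk '!/^>/' "$1" | tr -cd 'ACGTacgt' | wc -c | tr -d ' '
}

# Count all sequence characters in a FASTA, skipping header lines.
fasta_length() {
  awk '!/^>/' "$1" | tr -d '\n\r' | wc -c | tr -d ' '
}

# Succeed (exit 0) only if the assembly's unambiguous length is at least
# min_fraction of the reference length. Empty inputs fail the check.
passes_min_fraction() {
  local assembly="$1" reference="$2" min_fraction="$3"
  local unambig ref_len
  unambig=$(count_unambig "$assembly")
  ref_len=$(fasta_length "$reference")
  awk -v u="$unambig" -v r="$ref_len" -v f="$min_fraction" \
    'BEGIN { exit !(r > 0 && u >= f * r) }'
}
```

In a workflow, the fraction would presumably be exposed as an input with a default, and the annotation steps would run only when the check succeeds.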
-f "~{reference_fasta}" "~{reads_aligned_bam}" \
| bcftools call \
-P 0 -m --ploidy 1 \
--threads $(nproc) \
Maybe reserve a core for the system, writes, etc. ($(nproc --ignore=1))?
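A small sketch of that suggestion, assuming GNU coreutils `nproc` is available in the task's container. The `threads` variable is illustrative only.

```shell
# Leave one core free for the OS and I/O.
# GNU nproc never reports fewer than 1, even on a single-core machine.
threads=$(nproc --ignore=1)
echo "would pass: --threads ${threads}"
```

On a single-core instance this still yields 1, so the command does not end up with zero threads.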
This PR adds a new workflow called scaffold_and_refine_multitaxa, which runs scaffold_and_refine on one input sample (contigs + reads) against many reference genomes from different taxa of interest. It is designed to attempt assembly of every taxon of interest for every sample, and it will produce partial or empty outputs for all unsuccessful sample x taxon combinations. It is intended for high-throughput metagenomic analyses.

This includes a few updates to tasks to make them more resilient to empty fasta inputs/outputs:
scaffold
run_discordance
alignment_metrics
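
To illustrate what tolerating empty fasta outputs can look like, here is a minimal shell sketch. It is not taken from the tasks listed above. It assumes each taxon yields one FASTA path that may be zero-length or header-only, and both function names are hypothetical.

```shell
# Hypothetical helpers; not the PR's actual task code.

# Succeed only if the FASTA contains at least one non-empty sequence line.
fasta_has_sequence() {
  awk '!/^>/ && NF { found = 1; exit } END { exit !found }' "$1"
}

# For each "taxid:path" argument, report whether an assembly was produced,
# instead of failing when a taxon's output is empty.
summarize_assemblies() {
  local pair taxid path
  for pair in "$@"; do
    taxid=${pair%%:*}
    path=${pair#*:}
    if [ -s "$path" ] && fasta_has_sequence "$path"; then
      printf '%s\tassembled\n' "$taxid"
    else
      printf '%s\tempty\n' "$taxid"
    fi
  done
}
```

The point of the design is that an empty result for one sample x taxon combination is reported as a row rather than aborting the remaining combinations.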