Refactor codebase - Part I #896
Conversation
I really like these changes to the subworkflows. Here are some comments that I think could be useful for importing the subworkflows into nf-core.
// Convert all sample vcfs into a genomicsdb workspace using genomicsdbimport
GATK4_GENOMICSDBIMPORT(gendb_input, false, false, false)

genotype_input = GATK4_GENOMICSDBIMPORT.out.genomicsdb.map{ meta, genomicsdb -> [ meta, genomicsdb, [], [], [] ] }
Wouldn't adding the intervals here also help to speed up the genotyping a little bit?
No idea there, I just refactored as it was. But it's worth checking it out.
I need to check, but the intervals are part of the channel, so it should be fine and they are used.
The intervals aren't part of the input of GenotypeGVCFs, are they? Or am I overlooking something here? :)
I thought your comment was aimed at genomicsdbimport. IIRC genotypegvcfs is then run on the subset of the genomicsdb, however you are right that the intervals file is not provided. The GVCF files are merged afterwards though. @nickhsmith do you remember if it maybe wasn't necessary to provide the interval file to the tool?
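For illustration, passing the intervals through could look roughly like the sketch below. It assumes an `intervals` channel of `[ meta, intervals_file ]` pairs keyed the same way as `gendb_input`, and that the two trailing empty lists in the tuple are the intervals and intervals-index slots of the genotyping module's input; both are assumptions to verify against the actual module signature.

```nextflow
// Hypothetical sketch: join a [ meta, intervals ] channel onto the
// genomicsdb output so GenotypeGVCFs receives the intervals file.
// The exact input tuple layout of the genotyping module is assumed here.
genotype_input = GATK4_GENOMICSDBIMPORT.out.genomicsdb
    .join(intervals)                              // -> [ meta, genomicsdb, intervals ]
    .map{ meta, genomicsdb, intervals ->
        [ meta, genomicsdb, [], intervals, [] ]   // fill the intervals slot instead of []
    }
```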
GATK4_GENOTYPEGVCFS(genotype_input, fasta, fai, dict, dbsnp, dbsnp_tbi)

BCFTOOLS_SORT(GATK4_GENOTYPEGVCFS.out.vcf)
vcfs_sorted_input = BCFTOOLS_SORT.out.vcf.branch{
    intervals:    it[0].num_intervals > 1
    no_intervals: it[0].num_intervals <= 1
}

vcfs_sorted_input_no_intervals = vcfs_sorted_input.no_intervals.map{ meta, vcf ->
    [[
        id:            "joint_variant_calling",
        num_intervals: meta.num_intervals,
        patient:       "all_samples",
        variantcaller: "haplotypecaller"
    ], vcf ]
}

// Index vcf files if no scatter/gather by intervals
TABIX(vcfs_sorted_input_no_intervals)

// Merge scatter/gather vcfs & index
// Rework meta for variantscalled.csv and annotation tools
MERGE_GENOTYPEGVCFS(vcfs_sorted_input.intervals.map{ meta, vcf ->
    [[
        id:            "joint_variant_calling",
        num_intervals: meta.num_intervals,
        patient:       "all_samples",
        variantcaller: "haplotypecaller"
    ], vcf ]
}.groupTuple(),
dict.map{ it -> [ [ id:it[0].baseName ], it ] })

vqsr_input = Channel.empty().mix(
    MERGE_GENOTYPEGVCFS.out.vcf.join(MERGE_GENOTYPEGVCFS.out.tbi),
    vcfs_sorted_input_no_intervals.join(TABIX.out.tbi)
)
gvcf_to_merge = BCFTOOLS_SORT.out.vcf.map{ meta, vcf -> [ meta.subMap('num_intervals') + [ id:'joint_variant_calling', patient:'all_samples', variantcaller:'haplotypecaller' ], vcf ] }.groupTuple()

// Group resource labels for SNP and INDEL
snp_resource_labels   = Channel.empty().mix(known_snps_vqsr, dbsnp_vqsr).collect()
indel_resource_labels = Channel.empty().mix(known_indels_vqsr, dbsnp_vqsr).collect()
// Merge scatter/gather vcfs & index
// Rework meta for variantscalled.csv and annotation tools
MERGE_GENOTYPEGVCFS(gvcf_to_merge, dict.map{ it -> [ [ id:'dict' ], it ] })
If you were to step over to bcftools concat, you could use the vcf_gather_bcftools subworkflow here (it does exactly the same thing, i.e. sorting and merging), but with bcftools concat instead of merging with GATK.
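As a rough illustration of that swap, the GATK merge step could be replaced by a bcftools concat module roughly as below. The module name and its `[ meta, vcfs, tbis ]` input shape are assumptions here; the actual vcf_gather_bcftools subworkflow signature should be checked before use.

```nextflow
// Hypothetical sketch: concatenate the grouped per-interval VCFs with
// bcftools concat instead of MERGE_GENOTYPEGVCFS. The input shape of
// the concat module is assumed to be [ meta, vcfs, tbis ].
BCFTOOLS_CONCAT(
    gvcf_to_merge.map{ meta, vcfs -> [ meta, vcfs, [] ] }
)
concat_vcf = BCFTOOLS_CONCAT.out.vcf
```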
And you said that bcftools concat was faster, right?
Yes, I've never had to wait long for it to complete, but I've never done a proper benchmark comparing it with the GATK merge tool.
On what input data have you tested it?
Mainly on our institutional data, but as I said, we only tested bcftools concat as it was already very fast.
ch_versions = ch_versions.mix(VCF_ANNOTATE_MERGE.out.versions.first())
reports  = reports.mix(VCF_ANNOTATE_MERGE.out.reports)
vcf_ann  = vcf_ann.mix(VCF_ANNOTATE_MERGE.out.vcf_tbi)
versions = versions.mix(VCF_ANNOTATE_MERGE.out.versions)
}

if (tools.split(',').contains('vep')) {
    VCF_ANNOTATE_ENSEMBLVEP(vcf, fasta, vep_genome, vep_species, vep_cache_version, vep_cache, vep_extra_files)
We've been running VEP per chromosome lately to speed up the annotation with a scatter/gather. We could maybe also add this functionality to the nf-core subworkflow? https://github.com/CenterForMedicalGeneticsGhent/nf-cmgg-germline/blob/ede21872cb7e446d8184b94b6adc16931d7eac6e/subworkflows/local/annotation.nf#L32-L89
Would love that, but here we're just calling the nf-core modules for VEP and snpEff. We're getting closer, though.
Alright, hit me up when I can help!
How much gain is there? I'm starting to be more wary of the costs of splitting vs. the speed gain.
It depends on how big the VCF is. For the smaller ones it doesn't make that much of a difference, but it does for the bigger VCFs (e.g. for whole genomes).
I also haven't done any proper tests with timestamps etc., but it has made a difference for those big files.
I expected as much.
Using a scatter/gather strategy was always something I considered for the annotation.
VEP is slow because of the huge DB for human, so splitting it up makes sense.
I'm just afraid that we might need to use different intervals than the ones we use for variant calling.
Yes, we currently split on the contigs available in the fasta index. VEP has an option to specify the chromosome(s) that it should annotate.
Also, it's what the ENSEMBL team is recommending: https://github.com/Ensembl/ensembl-vep/tree/main/nextflow
Good to know :) The only difference is that we don't split the VCF but use the --chr argument (but it should be the same result :) )
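A rough sketch of that per-chromosome scatter in Nextflow: contig names are taken from the fasta index, and each contig is combined with the input VCF so a per-contig VEP run can restrict itself via `--chr`. The channel shapes and the `VEP_PER_CONTIG` process are hypothetical here.

```nextflow
// Hypothetical sketch: scatter VEP by contig using the fasta index.
// Read contig names (first column of the .fai) into a channel.
contigs = Channel.fromPath(params.fai)
    .splitCsv(sep: '\t')
    .map{ row -> row[0] }

// Combine each contig with the VCF; a (hypothetical) process then
// annotates only that contig, e.g. `vep --chr ${contig} -i ${vcf_file} ...`,
// and the per-contig outputs are gathered afterwards.
vep_input = vcf.combine(contigs)
    .map{ meta, vcf_file, contig -> [ meta + [ contig: contig ], vcf_file, contig ] }

VEP_PER_CONTIG(vep_input)
```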
Managed to fix the joint_germline tests, I'll try to fix the rest (umi and gatk4_spark).
cnvkit is failing on conda, but I'm merging anyway.
PR checklist
- Make sure your code lints (nf-core lint).
- Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
- Usage documentation (docs/usage.md) is updated.
- Output documentation (docs/output.md) is updated.
- CHANGELOG.md is updated.
- README.md is updated (including new tool citations and authors/contributors).