
Refactor codebase - Part I #896

Merged
merged 88 commits on Jan 25, 2023

Conversation

maxulysse
Member

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs?
  • If necessary, also make a PR on the nf-core/sarek branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

github-actions bot commented Dec 16, 2022

nf-core lint overall result: Passed ✅

Posted for pipeline commit d3823f2

✅ 152 tests passed
❔   8 tests were ignored


Run details

  • nf-core/tools version 2.7.2
  • Run at 2023-01-24 13:31:42

@maxulysse changed the title from "Refactor codebase" to "Refactor codebase - Part I" on Jan 5, 2023
@maxulysse marked this pull request as ready for review on January 5, 2023 10:37
@nvnieuwk left a comment


I really like these changes to the subworkflows. Here are some comments that I think could be useful for the import of the subworkflows to nf-core.

// Convert all sample vcfs into a genomicsdb workspace using genomicsdbimport
GATK4_GENOMICSDBIMPORT(gendb_input, false, false, false)

genotype_input = GATK4_GENOMICSDBIMPORT.out.genomicsdb.map{ meta, genomicsdb -> [ meta, genomicsdb, [], [], [] ] }


Wouldn't adding the intervals here also help to speed up the genotyping a little bit?

Member Author


No idea, I just refactored it as it was. But it's worth checking out.

Contributor


No need to, the intervals are part of the channel?! So it should be fine and they are used.


Intervals aren't part of the input of GenotypeGVCFs? Or am I overlooking something here? :)

Contributor


I thought your comment was aimed at genomicsdbimport. IIRC genotypegvcfs is then run on the subsets of genomicsdb; however, you are right that the intervals file is not provided. The gvcf files are merged afterwards, though. @nickhsmith do you remember if it maybe wasn't necessary to provide the interval file to the tool?
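
For context, a rough sketch of the change being discussed: passing an intervals file into the GenotypeGVCFs input tuple instead of the empty placeholders shown above. This is a hypothetical illustration, assuming a ch_intervals channel emitting [ meta, intervals ] pairs keyed like the genomicsdb channel; it is not the pipeline's actual code.

// Hypothetical sketch: join a per-sample intervals channel (ch_intervals,
// assumed to emit [ meta, intervals ] pairs) and fill the intervals slot
// of the input tuple instead of leaving it as an empty list.
genotype_input = GATK4_GENOMICSDBIMPORT.out.genomicsdb
    .join(ch_intervals)
    .map{ meta, genomicsdb, intervals ->
        [ meta, genomicsdb, [], intervals, [] ]
    }

GATK4_GENOTYPEGVCFS(genotype_input, fasta, fai, dict, dbsnp, dbsnp_tbi)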

Comment on lines +49 to +56
GATK4_GENOTYPEGVCFS(genotype_input, fasta, fai, dict, dbsnp, dbsnp_tbi)

BCFTOOLS_SORT(GATK4_GENOTYPEGVCFS.out.vcf)
vcfs_sorted_input = BCFTOOLS_SORT.out.vcf.branch{
    intervals:    it[0].num_intervals > 1
    no_intervals: it[0].num_intervals <= 1
}

vcfs_sorted_input_no_intervals = vcfs_sorted_input.no_intervals.map{ meta, vcf ->
    [ [
        id:            "joint_variant_calling",
        num_intervals: meta.num_intervals,
        patient:       "all_samples",
        variantcaller: "haplotypecaller"
    ], vcf ]
}

// Index vcf files if no scatter/gather by intervals
TABIX(vcfs_sorted_input_no_intervals)

// Merge scatter/gather vcfs & index
// Rework meta for variantscalled.csv and annotation tools
MERGE_GENOTYPEGVCFS(vcfs_sorted_input.intervals.map{ meta, vcf ->
    [ [
        id:            "joint_variant_calling",
        num_intervals: meta.num_intervals,
        patient:       "all_samples",
        variantcaller: "haplotypecaller"
    ], vcf ]
}.groupTuple(),
dict.map{ it -> [ [ id:it[0].baseName ], it ] })

vqsr_input = Channel.empty().mix(
    MERGE_GENOTYPEGVCFS.out.vcf.join(MERGE_GENOTYPEGVCFS.out.tbi),
    vcfs_sorted_input_no_intervals.join(TABIX.out.tbi)
)
gvcf_to_merge = BCFTOOLS_SORT.out.vcf.map{ meta, vcf -> [ meta.subMap('num_intervals') + [ id:'joint_variant_calling', patient:'all_samples', variantcaller:'haplotypecaller' ], vcf ] }.groupTuple()

// Group resource labels for SNP and INDEL
snp_resource_labels   = Channel.empty().mix(known_snps_vqsr, dbsnp_vqsr).collect()
indel_resource_labels = Channel.empty().mix(known_indels_vqsr, dbsnp_vqsr).collect()
// Merge scatter/gather vcfs & index
// Rework meta for variantscalled.csv and annotation tools
MERGE_GENOTYPEGVCFS(gvcf_to_merge, dict.map{ it -> [ [ id:'dict' ], it ] })


If you were to step over to bcftools concat, you could use the vcf_gather_bcftools subworkflow here (it does exactly the same thing, i.e. sorting and merging), but with bcftools concat instead of merging with GATK.

Member Author


And you said that bcftools concat was faster, right?


Yes, I've never had to wait long for it to complete. But I've never done a proper benchmark to compare it with the GATK merge tool.

Contributor


On what input data have you tested it?


Mainly on our institutional data, but as I said, we only tested bcftools concat since it was already very fast.
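
To make the suggestion above concrete, a hypothetical sketch of gathering the sorted per-interval VCFs with bcftools concat instead of GATK merging. It assumes an nf-core-style BCFTOOLS_CONCAT module taking [ meta, vcfs, tbis ] and reuses the grouped gvcf_to_merge channel from the snippet above; channel names and wiring are illustrative assumptions, not the pipeline's actual code.

// Hypothetical gather with bcftools concat. Reuses the grouped
// gvcf_to_merge channel and passes an empty list for the indices, assuming
// the module tolerates that (or that per-interval indices are joined in
// beforehand).
BCFTOOLS_CONCAT(gvcf_to_merge.map{ meta, vcfs -> [ meta, vcfs, [] ] })

vqsr_input = BCFTOOLS_CONCAT.out.vcf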

    ch_versions = ch_versions.mix(VCF_ANNOTATE_MERGE.out.versions.first())
    reports     = reports.mix(VCF_ANNOTATE_MERGE.out.reports)
    vcf_ann     = vcf_ann.mix(VCF_ANNOTATE_MERGE.out.vcf_tbi)
    versions    = versions.mix(VCF_ANNOTATE_MERGE.out.versions)
}

if (tools.split(',').contains('vep')) {
    VCF_ANNOTATE_ENSEMBLVEP(vcf, fasta, vep_genome, vep_species, vep_cache_version, vep_cache, vep_extra_files)


We've been running VEP per chromosome lately to speed up the annotation with a scatter/gather. We could maybe also add this functionality to the nf-core subworkflow? https://github.com/CenterForMedicalGeneticsGhent/nf-cmgg-germline/blob/ede21872cb7e446d8184b94b6adc16931d7eac6e/subworkflows/local/annotation.nf#L32-L89

Member Author


Would love that, but here we're just calling the nf-core modules for vep and snpeff. We're getting closer, though.


Alright, hit me up when I can help!

Contributor


How much gain is there? I'm starting to be more wary of the cost of splitting versus the speed gain.


It depends on how big the VCF is. For the smaller ones it doesn't make that much of a difference, but it does for the bigger VCFs (e.g. for whole genomes).


Also, I haven't done any proper tests with timestamps etc., but it has made a difference for those big files.

Member Author


I expected as much.
Using a scatter/gather strategy was always something I considered for the annotation.
VEP is slow because of the huge DB for human, so splitting it up makes sense.
I'm just afraid that we might need to use different intervals than the ones we use for variant calling.


Yes, we currently split on the contigs available in the fasta index. VEP has an option to specify the chromosome(s) it should annotate.

Member Author


Also, it's what the ENSEMBL team recommends: https://github.com/Ensembl/ensembl-vep/tree/main/nextflow


Good to know :) The only difference is that we don't split the VCF but use the --chr argument (it should give the same result :) )
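
To illustrate the per-chromosome approach described in this thread, a hypothetical sketch of scattering VEP over the contigs of the fasta index using the --chr option rather than physically splitting the VCF. The contig parsing and the way --chr would be wired into the task are assumptions for illustration, not the pipeline's actual code.

// Hypothetical scatter: derive contig names from the first column of the
// .fai file and pair each contig with the VCF to annotate.
contigs = fasta_fai
    .splitCsv(sep: '\t')
    .map{ row -> row[0] }

vep_input = vcf
    .combine(contigs)
    .map{ meta, vcf_file, contig -> [ meta + [ contig: contig ], vcf_file ] }

// Each task would then run VEP with something like:
//   vep --chr ${meta.contig} -i ${vcf_file} ...
// and the per-contig outputs would be gathered afterwards, e.g. with
// bcftools concat as discussed earlier.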

@maxulysse
Member Author

Managed to fix the joint_germline tests, I'll try to fix the rest (umi and gatk4_spark).

@maxulysse
Member Author

cnvkit is failing on conda, but I'm merging anyway.
