
Refactor codebase - Part I #896

Merged
merged 88 commits on Jan 25, 2023

Conversation

maxulysse
Member

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs?
  • If necessary, also make a PR on the nf-core/sarek branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

github-actions bot commented Dec 16, 2022

nf-core lint overall result: Passed ✅

Posted for pipeline commit d3823f2

✅ 152 tests passed
❔   8 tests were ignored


Run details

  • nf-core/tools version 2.7.2
  • Run at 2023-01-24 13:31:42

@maxulysse changed the title from "Refactor codebase" to "Refactor codebase - Part I" on Jan 5, 2023
@maxulysse marked this pull request as ready for review on January 5, 2023 10:37
@nvnieuwk left a comment


I really like these changes to the subworkflows. Here are some comments that I think could be useful for the import of the subworkflows to nf-core.

// Convert all sample vcfs into a genomicsdb workspace using genomicsdbimport
GATK4_GENOMICSDBIMPORT(gendb_input, false, false, false)

genotype_input = GATK4_GENOMICSDBIMPORT.out.genomicsdb.map{ meta, genomicsdb -> [ meta, genomicsdb, [], [], [] ] }


Wouldn't adding the intervals here also help to speed up the genotyping a little bit?

Member Author


No idea, I just refactored it as it was. But it's worth checking out.

Contributor


No need to, the intervals are part of the channel?! So it should be fine and they are used.


Intervals aren't part of the input of GenotypeGVCFs? Or am I overlooking something here? :)

Contributor


I thought your comment was aimed at genomicsdbimport. IIRC genotypegvcfs is then run on the subsets of genomicsdb; however, you are right that the intervals file is not provided. The gvcf files are merged afterwards, though. @nickhsmith do you remember if it maybe wasn't necessary to provide the interval file to the tool?
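
For context, a rough sketch of the change being discussed: passing an intervals file into the GenotypeGVCFs input tuple instead of the empty placeholders shown above. This is a hypothetical illustration, assuming a ch_intervals channel emitting [ meta, intervals ] pairs keyed like the genomicsdb channel; it is not the pipeline's actual code.

// Hypothetical sketch: join a per-sample intervals channel (ch_intervals,
// assumed to emit [ meta, intervals ] pairs) and fill the intervals slot
// of the input tuple instead of leaving it as an empty list.
genotype_input = GATK4_GENOMICSDBIMPORT.out.genomicsdb
    .join(ch_intervals)
    .map{ meta, genomicsdb, intervals ->
        [ meta, genomicsdb, [], intervals, [] ]
    }

GATK4_GENOTYPEGVCFS(genotype_input, fasta, fai, dict, dbsnp, dbsnp_tbi)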

Comment on lines +49 to +56
GATK4_GENOTYPEGVCFS(genotype_input, fasta, fai, dict, dbsnp, dbsnp_tbi)

BCFTOOLS_SORT(GATK4_GENOTYPEGVCFS.out.vcf)
vcfs_sorted_input = BCFTOOLS_SORT.out.vcf.branch{
    intervals:    it[0].num_intervals > 1
    no_intervals: it[0].num_intervals <= 1
}

vcfs_sorted_input_no_intervals = vcfs_sorted_input.no_intervals.map{ meta, vcf ->
    [ [
        id:            "joint_variant_calling",
        num_intervals: meta.num_intervals,
        patient:       "all_samples",
        variantcaller: "haplotypecaller"
    ], vcf ]
}

// Index vcf files if no scatter/gather by intervals
TABIX(vcfs_sorted_input_no_intervals)

// Merge scatter/gather vcfs & index
// Rework meta for variantscalled.csv and annotation tools
MERGE_GENOTYPEGVCFS(vcfs_sorted_input.intervals.map{ meta, vcf ->
    [ [
        id:            "joint_variant_calling",
        num_intervals: meta.num_intervals,
        patient:       "all_samples",
        variantcaller: "haplotypecaller"
    ], vcf ]
}.groupTuple(),
dict.map{ it -> [ [ id:it[0].baseName ], it ] })

vqsr_input = Channel.empty().mix(
    MERGE_GENOTYPEGVCFS.out.vcf.join(MERGE_GENOTYPEGVCFS.out.tbi),
    vcfs_sorted_input_no_intervals.join(TABIX.out.tbi)
)
gvcf_to_merge = BCFTOOLS_SORT.out.vcf.map{ meta, vcf -> [ meta.subMap('num_intervals') + [ id:'joint_variant_calling', patient:'all_samples', variantcaller:'haplotypecaller' ], vcf ] }.groupTuple()

// Group resource labels for SNP and INDEL
snp_resource_labels   = Channel.empty().mix(known_snps_vqsr, dbsnp_vqsr).collect()
indel_resource_labels = Channel.empty().mix(known_indels_vqsr, dbsnp_vqsr).collect()
// Merge scatter/gather vcfs & index
// Rework meta for variantscalled.csv and annotation tools
MERGE_GENOTYPEGVCFS(gvcf_to_merge, dict.map{ it -> [ [ id:'dict' ], it ] })


If you were to step over to bcftools concat, you could use the vcf_gather_bcftools subworkflow here (it does exactly the same thing, i.e. sorting and merging), but with bcftools concat instead of merging with GATK.

Member Author


And you said that bcftools concat was faster, right?


Yes, I've never had to wait long for it to complete. But I've never done a proper benchmark to compare it with the GATK merge tool.

Contributor


On what input data have you tested it?


Mainly on our institutional data, but as I said, we only tested bcftools concat since it was already very fast.
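
To make the suggestion above concrete, a hypothetical sketch of gathering the sorted per-interval VCFs with bcftools concat instead of GATK merging. It assumes an nf-core-style BCFTOOLS_CONCAT module taking [ meta, vcfs, tbis ] and reuses the grouped gvcf_to_merge channel from the snippet above; channel names and wiring are illustrative assumptions, not the pipeline's actual code.

// Hypothetical gather with bcftools concat. Reuses the grouped
// gvcf_to_merge channel and passes an empty list for the indices, assuming
// the module tolerates that (or that per-interval indices are joined in
// beforehand).
BCFTOOLS_CONCAT(gvcf_to_merge.map{ meta, vcfs -> [ meta, vcfs, [] ] })

vqsr_input = BCFTOOLS_CONCAT.out.vcf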

    ch_versions = ch_versions.mix(VCF_ANNOTATE_MERGE.out.versions.first())
    reports     = reports.mix(VCF_ANNOTATE_MERGE.out.reports)
    vcf_ann     = vcf_ann.mix(VCF_ANNOTATE_MERGE.out.vcf_tbi)
    versions    = versions.mix(VCF_ANNOTATE_MERGE.out.versions)
}

if (tools.split(',').contains('vep')) {
    VCF_ANNOTATE_ENSEMBLVEP(vcf, fasta, vep_genome, vep_species, vep_cache_version, vep_cache, vep_extra_files)


We've been running VEP per chromosome lately to speed up the annotation with a scatter/gather. We could maybe also add this functionality to the nf-core subworkflow? https://github.com/CenterForMedicalGeneticsGhent/nf-cmgg-germline/blob/ede21872cb7e446d8184b94b6adc16931d7eac6e/subworkflows/local/annotation.nf#L32-L89

Member Author


Would love that, but here we're just calling the nf-core modules for vep and snpeff. We're getting closer, though.


Alright, hit me up when I can help!

Contributor


How much gain is there? I'm starting to be more wary of the cost of splitting versus the speed gain.


It depends on how big the VCF is. For the smaller ones it doesn't make that much of a difference, but it does for the bigger VCFs (e.g. for whole genomes).


Also, I haven't done any proper tests with timestamps etc., but it has made a difference for those big files.

Member Author


I expected as much.
Using a scatter/gather strategy was always something I considered for the annotation.
VEP is slow because of the huge DB for human, so splitting it up makes sense.
I'm just afraid that we might need to use different intervals than the ones we use for variant calling.


Yes, we currently split on the contigs available in the fasta index. VEP has an option to specify the chromosome(s) it should annotate.

Member Author


Also, it's what the ENSEMBL team recommends: https://github.com/Ensembl/ensembl-vep/tree/main/nextflow


Good to know :) The only difference is that we don't split the VCF but use the --chr argument (it should give the same result :) )
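
To illustrate the per-chromosome approach described in this thread, a hypothetical sketch of scattering VEP over the contigs of the fasta index using the --chr option rather than physically splitting the VCF. The contig parsing and the way --chr would be wired into the task are assumptions for illustration, not the pipeline's actual code.

// Hypothetical scatter: derive contig names from the first column of the
// .fai file and pair each contig with the VCF to annotate.
contigs = fasta_fai
    .splitCsv(sep: '\t')
    .map{ row -> row[0] }

vep_input = vcf
    .combine(contigs)
    .map{ meta, vcf_file, contig -> [ meta + [ contig: contig ], vcf_file ] }

// Each task would then run VEP with something like:
//   vep --chr ${meta.contig} -i ${vcf_file} ...
// and the per-contig outputs would be gathered afterwards, e.g. with
// bcftools concat as discussed earlier.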

@maxulysse
Member Author

Managed to fix the joint_germline tests, I'll try to fix the rest (umi and gatk4_spark).

@maxulysse
Member Author

cnvkit is failing on conda, but I'm merging anyway.
