The new GATK-based pipeline for wild isolate Caenorhabditis strains
To run the pipeline:
nextflow main.nf --help
nextflow main.nf --debug
nextflow main.nf --sample_sheet=/path/sample_sheet.txt --species c_elegans --bam_location=/path/to/bams --gvcf_location=/path/to/gvcfs
parameters              description                            Set/Default
==========              ===========                            ========================
--debug                 Use --debug to indicate debug mode     false
--species               Species to call variants from          null
--sample_sheet          Sample sheet                           null
--bam_location          Directory of BAM files                 {dataDir}/{species}/WI/alignments
--gvcf_location         Directory of gVCF files                {dataDir}/{species}/WI/gVCFs
--mito_name             Contig not to polarize hetero sites    MtDNA
--partition             Partition size in bp for subsetting    1000000
--gvcf_only             Create sample gVCFs and stop           false
--split_samples         Create individual sample vcfs          false
--username                                                     {user}

Reference Genome
---------------
--reference             The fa.gz reference file to use        {dataDir}/{species}/genomes/{project}/{ws_build}/{species}.{project}.{ws_build}.genome.fa.gz

Variant Filters
---------------
--min_depth             Minimum variant depth                  3
--qual                  Variant QUAL score                     30.0
--strand_odds_ratio     SOR_strand_odds_ratio                  5.0
--quality_by_depth      QD_quality_by_depth                    20.0
--fisherstrand          FS_fisher_strand                       100.0
--high_missing          Max % missing genotypes                0.95
--high_heterozygosity   Max % max heterozygosity               0.10
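
The two site-level thresholds above (--high_missing and --high_heterozygosity) can be illustrated with a minimal sketch. This is not the pipeline's own filtering code: the function name, the flag labels, and the choice to divide by the total number of samples are assumptions made for illustration only.

```python
def site_flags(genotypes, high_missing=0.95, high_heterozygosity=0.10):
    """Return the set of (hypothetical) filter flags for one variant site.

    genotypes: list of VCF-style genotype strings such as '0/0', '0/1', './.'.
    Fractions are computed over all samples, which is an assumption here.
    """
    total = len(genotypes)
    if total == 0:
        return set()

    missing = 0
    heterozygous = 0
    for genotype in genotypes:
        if "." in genotype:
            missing += 1  # treat any missing allele as a missing genotype
            continue
        alleles = genotype.replace("|", "/").split("/")
        if len(set(alleles)) > 1:
            heterozygous += 1

    flags = set()
    if missing / total > high_missing:
        flags.add("high_missing")
    if heterozygous / total > high_heterozygosity:
        flags.add("high_heterozygosity")
    return flags


# Two of ten samples are heterozygous, which exceeds the 0.10 default.
print(site_flags(["0/0"] * 8 + ["0/1"] * 2))
```

The defaults mirror the values in the table; how the real workflow counts missing or heterozygous genotypes may differ.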
- The latest update requires Nextflow version 24.10.0+. On Rockfish, you can access this version by loading the nf24_env conda environment prior to running the pipeline command:
ml anaconda
conda activate /data/eande106/software/conda_envs/nf24_env
nextflow run -latest andersenlab/wi-gatk --debug
nextflow run -latest andersenlab/wi-gatk --sample_sheet=/path/sample_sheet_GATK.txt --species c_elegans --bam_location=/path/to/bams --gvcf_location=/path/to/gvcfs
-profile
default = rockfish
There are three configuration profiles for this pipeline.
- rockfish: Used for running on Rockfish
- quest: Used for running on Quest
- local: Used for local development
--sample_sheet
The sample sheet is a list of strains to jointly call variants from, one strain per line:
AB1
AB4
BRC20067
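
Because the sample sheet is just one strain name per line, it is easy to check before launching a run. The sketch below is a hypothetical helper, not part of the pipeline; skipping blank lines and rejecting duplicates are assumptions about what a sensible check looks like.

```python
from pathlib import Path


def read_sample_sheet(path):
    """Return the strain names from a one-strain-per-line sample sheet."""
    strains = []
    seen = set()
    lines = Path(path).read_text().splitlines()
    for line_number, raw in enumerate(lines, start=1):
        strain = raw.strip()
        if not strain:
            continue  # ignore blank lines (an assumption, not pipeline behaviour)
        if strain in seen:
            raise ValueError(f"line {line_number}: duplicate strain {strain!r}")
        seen.add(strain)
        strains.append(strain)
    return strains
```

Calling `read_sample_sheet("/path/sample_sheet.txt")` on the three-line example above would return the strain names in file order.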
--bam_location
Path to the directory holding the alignment files for all strains in the analysis. Defaults to /vast/eande106/data/{species}/WI/alignments/
--gvcf_location
Path to the directory holding the gVCF files for all strains. If one or more strains lack a gVCF file (for example, a new strain), one is created. Defaults to /vast/eande106/data/{species}/WI/gVCFs/
--species
default = null
Options: c_elegans, c_briggsae, or c_tropicalis (these have default reference, BAM, and gVCF paths), or any other species, provided the reference, BAM directory, and gVCF directory are specified
--project
default = PRJNA13758
WormBase project ID for the selected species. Choose from some examples here
--ws_build
default = WS283
WormBase version to use for the reference genome.
--reference
A FASTA reference indexed with BWA. On Rockfish, the reference is available here:
/vast/eande106/data/c_elegans/genomes/PRJNA13758/WS283/c_elegans.PRJNA13758.WS283.genome.fa.gz
Note
If running on Rockfish, instead of changing the reference parameter, change the species, project, and ws_build parameters to select another reference build (the reference path then changes automatically).
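
To make the relationship between species, project, ws_build and the default paths concrete, here is a small sketch that fills in the templates shown in the help text. The function name is hypothetical, and the data directory /vast/eande106/data is taken from the Rockfish defaults quoted above; other profiles may use a different base directory.

```python
def default_paths(species, project, ws_build, data_dir="/vast/eande106/data"):
    """Fill in the default path templates from the help text.

    data_dir defaults to the Rockfish location mentioned in this document.
    """
    base = f"{data_dir}/{species}"
    return {
        "bam_location": f"{base}/WI/alignments",
        "gvcf_location": f"{base}/WI/gVCFs",
        "reference": (
            f"{base}/genomes/{project}/{ws_build}/"
            f"{species}.{project}.{ws_build}.genome.fa.gz"
        ),
    }


paths = default_paths("c_elegans", "PRJNA13758", "WS283")
print(paths["reference"])
# /vast/eande106/data/c_elegans/genomes/PRJNA13758/WS283/c_elegans.PRJNA13758.WS283.genome.fa.gz
```

Changing only `ws_build` or `project` in the call changes the reference path, which is why the note recommends adjusting those parameters rather than the reference itself.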
--mito_name
default = MtDNA
Name of the contig to skip during het polarization. This may need to change for species other than c_elegans if the mitochondrial contig is named differently.
--partition
default = 1000000
Size in bp of the partitions into which the genome is split for the genotyping and filtering subprocesses
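
As an illustration of what a 1,000,000 bp partition size means, the sketch below splits a contig into consecutive intervals. It is not the pipeline's own code: the function name, the 1-based inclusive coordinates, and the example contig length are assumptions.

```python
def partition_contig(contig, length, partition=1_000_000):
    """Yield (contig, start, end) intervals, 1-based and inclusive."""
    if partition <= 0:
        raise ValueError("partition must be a positive number of bp")
    for start in range(1, length + 1, partition):
        end = min(start + partition - 1, length)
        yield contig, start, end


# Hypothetical contig "I" of 2,500,000 bp
for interval in partition_contig("I", 2_500_000):
    print(interval)
# ('I', 1, 1000000)
# ('I', 1000001, 2000000)
# ('I', 2000001, 2500000)
```

A smaller partition size yields more, shorter intervals, and therefore more (but lighter) subprocesses.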
--gvcf_only
default = false
Run only the gVCF creation steps of the workflow
--split_samples
default = false
Create individual VCF files, one per sample, from the hard-filtered VCF file
-output-dir
default = WI-{today's date} where the date is formatted as YYYYMMDD
A directory in which to output results
Note
This option is a Nextflow parameter, so it is specified with a single dash
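
The default output directory name can be reproduced as follows, for instance to predict where results will land. This is a sketch of the naming convention only; the function name is hypothetical.

```python
from datetime import date


def default_output_dir(today=None):
    """Return the default results directory name, WI-YYYYMMDD."""
    today = today or date.today()
    return f"WI-{today:%Y%m%d}"


print(default_output_dir(date(2025, 1, 7)))
# WI-20250107
```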
The final output directory looks like this:
├── variation
│   ├── *.hard-filter.vcf.gz
│   ├── *.hard-filter.vcf.tbi
│   ├── *.hard-filter.stats.txt
│   ├── *.hard-filter.filter_stats.txt
│   ├── *.soft-filter.vcf.gz
│   ├── *.soft-filter.vcf.tbi
│   ├── *.soft-filter.stats.txt
│   ├── *.soft-filter.filter_stats.txt
│   └── strain_vcf
│       ├── *.vcf.gz
│       └── *.vcf.gz.tbi
└── report
    ├── multiqc.html
    └── multiqc_data
        └── multiqc_*.json
- quay.io-biocontainers-samtools-1.21--h50ea8bc_0: Docker image maintained by biocontainers for samtools
- quay.io-biocontainers-bcftools-1.16--hfe4b78e_1: Docker image maintained by biocontainers for bcftools
- quay.io-biocontainers-gatk4-4.6.1.0--py310hdfd78af_0: Docker image maintained by biocontainers for gatk
- quay.io-biocontainers-multiqc-1.8--py_1: Docker image maintained by biocontainers for multiqc
- andersenlab-hetpolarization-1.10: Docker image is created within this pipeline using GitHub actions. Whenever a change is made to env/hetpolarization.Dockerfile or .github/workflows/build_docker.yml, GitHub actions will create a new docker image and push it if successful
- andersenlab-r_packages-v0.7: Docker image is created manually; the code can be found in the dockerfile repo.
Make sure that you have followed the Nextflow configuration described in the dry-guide prior to running the workflow.