VariantMerging is a workflow for combining variant calls from SNV analyses done with different callers (such as muTect2 and strelka2). The workflow pre-processes input vcf files by removing non-canonical contigs, fixing fields and inferring missing values from available data. It combines calls, annotating them with caller-specific tags, which allows identification of consensus variants. The workflow also uses GATK to produce merged results; in this case all calls appear as-is, so the output is essentially a simple concatenation of the inputs.
The script used at this step performs the following tasks:
- removes non-canonical contigs
- adds GT and AD fields (set to a dot, or calculated from NT and SGT if available)
- removes tool-specific header lines
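
The contig-filtering task above can be illustrated with a minimal sketch. It is not the code of the workflow's script: the function name `keep_canonical` is hypothetical, and the definition of "canonical" (chromosomes 1-22, X, Y and M, with or without a `chr` prefix) is an assumption made for this example.

```python
import re

# Assumption: "canonical" means chr1-22, chrX, chrY, chrM (with or without "chr").
CANONICAL = re.compile(r"^(chr)?([1-9]|1[0-9]|2[0-2]|X|Y|M|MT)$")


def keep_canonical(vcf_lines):
    """Yield VCF lines, dropping records and ##contig headers on non-canonical contigs."""
    for line in vcf_lines:
        if line.startswith("##contig="):
            # Header line: inspect the contig ID declared in it
            match = re.search(r"ID=([^,>]+)", line)
            if match and not CANONICAL.match(match.group(1)):
                continue
        elif not line.startswith("#"):
            # Record line: the contig is the first tab-separated column
            contig = line.split("\t", 1)[0]
            if not CANONICAL.match(contig):
                continue
        yield line


example = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=chr1,length=248956422>",
    "##contig=<ID=chrUn_KI270742v1,length=186739>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.",
    "chrUn_KI270742v1\t5\t.\tC\tT\t.\tPASS\t.",
]
filtered = list(keep_canonical(example))
```

In this example only the `chr1` header and record survive, along with the lines that are not tied to a contig.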
java -jar cromwell.jar run variantMerging.wdl --inputs inputs.json
Parameter | Value | Description |
---|---|---|
reference | String | Reference assembly id, passed by the respective olive
inputVcfs | Array[Pair[File,String]] | Pairs of vcf files (SNV calls from different callers) and metadata string (producer of calls).
tumorName | String | Tumor id to use in vcf headers
outputFileNamePrefix | String | Prefix for output file names.
Parameter | Value | Default | Description |
---|---|---|---|
normalName | String? | None | Normal id to use in vcf headers, Optional
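
As an illustration of how the inputs listed above might be assembled, the sketch below builds an `inputs.json` body in Python. The `variantMerging.` key prefix is assumed from the workflow file name, the `left`/`right` keys follow Cromwell's JSON convention for `Pair` values, and every path and id is a placeholder.

```python
import json

# Assumption: input keys are prefixed with the workflow name "variantMerging".
# All paths and ids below are placeholders, not real data.
inputs = {
    "variantMerging.reference": "hg38",
    "variantMerging.inputVcfs": [
        # Each Pair[File,String] is written as {"left": file, "right": caller name}
        {"left": "/path/to/sample.mutect2.vcf.gz", "right": "mutect2"},
        {"left": "/path/to/sample.strelka2.vcf.gz", "right": "strelka2"},
    ],
    "variantMerging.tumorName": "TUMOR_ID",
    "variantMerging.outputFileNamePrefix": "sample",
    # Optional input; omit the key if no normal id is available
    "variantMerging.normalName": "NORMAL_ID",
}

inputs_json = json.dumps(inputs, indent=2)
print(inputs_json)
```

Saving the printed text as `inputs.json` gives a file that could be passed to the `--inputs` option of the Cromwell command.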
Parameter | Value | Default | Description |
---|---|---|---|
preprocessVcf.preprocessScript | String | "$VARMERGE_SCRIPTS_ROOT/bin/vcfVetting.py" | path to preprocessing script
preprocessVcf.jobMemory | Int | 12 | memory allocated to preprocessing, in gigabytes
preprocessVcf.timeout | Int | 10 | timeout in hours
mergeVcfs.timeout | Int | 20 | timeout in hours
mergeVcfs.jobMemory | Int | 12 | Allocated memory, in GB
combineVariants.combiningScript | String | "$VARMERGE_SCRIPTS_ROOT/bin/vcfCombine.py" | Path to combining script
combineVariants.jobMemory | Int | 12 | memory allocated to combining, in GB
combineVariants.timeout | Int | 20 | timeout in hours
postprocessMerged.postprocessScript | String | "$VARMERGE_SCRIPTS_ROOT/bin/vcfVetting.py" | path to postprocessing script, this is the same script we use for pre-processing
postprocessMerged.jobMemory | Int | 12 | memory allocated to postprocessing, in gigabytes
postprocessMerged.timeout | Int | 10 | timeout in hours
postprocessCombined.postprocessScript | String | "$VARMERGE_SCRIPTS_ROOT/bin/vcfVetting.py" | path to postprocessing script, this is the same script we use for pre-processing
postprocessCombined.jobMemory | Int | 12 | memory allocated to postprocessing, in gigabytes
postprocessCombined.timeout | Int | 10 | timeout in hours
Output | Type | Description | Labels |
---|---|---|---|
mergedVcf | File | vcf file containing all variant calls | vidarr_label: mergedVcf
mergedIndex | File | tabix index of the vcf file containing all variant calls | vidarr_label: mergedIndex
combinedVcf | File | combined vcf file containing all variant calls | vidarr_label: combinedVcf
combinedIndex | File | index of combined vcf file containing all variant calls | vidarr_label: combinedIndex
This section lists the command(s) run by the variantMerging workflow.
Detect NORMAL/TUMOR swap and impute missing fields (e.g. for callers such as strelka).
python3 PREPROCESSING_SCRIPT VCF_FILE -o VCF_FILE_BASENAME_tmp.vcf -r REFERENCE_ID
bgzip -c VCF_FILE_BASENAME_tmp.vcf > VCF_FILE_BASENAME_tmp.vcf.gz
gatk SortVcf -I VCF_FILE_BASENAME_tmp.vcf.gz \
             -R REF_FASTA \
             -O VCF_FILE_BASENAME_processed.vcf.gz
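
To make the idea of imputing a missing field concrete, here is a minimal sketch of deriving GT values from an SGT annotation. It assumes SGT has the form `normal->tumor` with two-base genotypes for SNVs (for example `AA->AG`); the function names are hypothetical and the actual logic of the preprocessing script may differ.

```python
def gt_from_genotype(bases, ref, alt):
    """Translate a two-base genotype such as 'AG' into a VCF GT string such as '0/1'."""
    alleles = [ref] + alt.split(",")
    try:
        indices = sorted(alleles.index(base) for base in bases)
    except ValueError:
        # A base that is neither REF nor ALT: leave the field as missing (dot)
        return "."
    return "/".join(str(index) for index in indices)


def gts_from_sgt(sgt, ref, alt):
    """Split an SGT value like 'AA->AG' into (normal GT, tumor GT)."""
    normal, tumor = sgt.split("->")
    return gt_from_genotype(normal, ref, alt), gt_from_genotype(tumor, ref, alt)


normal_gt, tumor_gt = gts_from_sgt("AA->AG", ref="A", alt="G")
print(normal_gt, tumor_gt)
```

Values that cannot be mapped to REF/ALT alleles, such as the symbolic `ref->het` form, fall back to a dot in this sketch.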
This is a simple concatenation of input vcfs; there may be duplicate entries for the same call if multiple callers discover the same variant.
gatk MergeVcfs -I INPUT_VCFS -O PREFIX_mergedVcfs.vcf.gz
A more complex merging: matching fields will be annotated by caller.
set -euxo pipefail
python3 <<CODE
import sys
v = "~{sep=' ' inputVcfs}"
vcfFiles = v.split()
with open("vcf_list", 'w') as l:
    for v in vcfFiles:
        l.write(v + "\n")
CODE
python3 COMBINING_SCRIPT vcf_list -c OUTPUT_PREFIX_tmp.vcf -n ~{sep=',' inputNames}
gatk SortVcf -I OUTPUT_PREFIX_tmp.vcf -R REFERENCE_FASTA -O OUTPUT_PREFIX_combined.vcf.gz
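
The idea behind caller-specific annotation can be sketched as follows. This is an illustration only, not the combining script: the function `tag_by_caller` and the representation of a call as a `(chrom, pos, ref, alt)` tuple are assumptions made for the example.

```python
from collections import defaultdict


def tag_by_caller(calls_by_caller):
    """Map each site (chrom, pos, ref, alt) to the sorted list of callers reporting it."""
    callers_at_site = defaultdict(set)
    for caller, sites in calls_by_caller.items():
        for site in sites:
            callers_at_site[site].add(caller)
    return {site: sorted(callers) for site, callers in callers_at_site.items()}


# Hypothetical calls from two callers
calls = {
    "mutect2": [("chr1", 100, "A", "G"), ("chr2", 250, "C", "T")],
    "strelka2": [("chr1", 100, "A", "G"), ("chr3", 75, "G", "A")],
}
tags = tag_by_caller(calls)

# Consensus variants are the sites reported by every caller
consensus = [site for site, callers in tags.items() if len(callers) == len(calls)]
print(consensus)
```

Each site appears once in the result, which is what distinguishes this kind of output from a plain concatenation where the shared call would be listed twice.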
The same script used for preprocessing injects sample names into the header if the corresponding arguments are passed.
set -euxo pipefail
python3 ~{postprocessScript} ~{vcfFile} -o ~{basename(vcfFile, '.vcf.gz')}_tmp.vcf -r ~{referenceId} -t ~{tumorName} ~{"-n " + normalName}
bgzip -c ~{basename(vcfFile, '.vcf.gz')}_tmp.vcf > ~{basename(vcfFile, '.vcf.gz')}.vcf.gz
tabix -p vcf ~{basename(vcfFile, '.vcf.gz')}.vcf.gz
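
A minimal sketch of what injecting sample names into the header could look like is given below. It assumes the sample columns of the `#CHROM` line are named `TUMOR` and `NORMAL` before postprocessing; the function `rename_samples` is hypothetical and not taken from the script.

```python
def rename_samples(header_line, tumor_name, normal_name=None):
    """Replace generic TUMOR/NORMAL sample columns in a #CHROM line with real ids."""
    fields = header_line.rstrip("\n").split("\t")
    # The first nine columns (CHROM..FORMAT) are fixed; samples follow
    fixed, samples = fields[:9], fields[9:]
    renamed = []
    for sample in samples:
        if sample == "TUMOR":
            renamed.append(tumor_name)
        elif sample == "NORMAL" and normal_name is not None:
            renamed.append(normal_name)
        else:
            # Leave the column unchanged if no replacement name was given
            renamed.append(sample)
    return "\t".join(fixed + renamed)


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tTUMOR"
print(rename_samples(header, tumor_name="TUMOR_ID", normal_name="NORMAL_ID"))
```

When `normal_name` is omitted, the `NORMAL` column keeps its generic name, mirroring the optional `-n` argument in the command above.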
For support, please file an issue on the GitHub project or send an email to gsi@oicr.on.ca.
Generated with generate-markdown-readme (https://github.com/oicr-gsi/gsi-wdl-tools/)