Tip: To import the workflow into your Terra workspace, click on the above Dockstore badge, and select 'Terra' from the 'Launch with' widget on the Dockstore workflow page.
This repository contains a WDL (Workflow Description Language) workflow for extracting information from a set of imputed VCF files using a list of query variants or sample IDs.
The workflow extracts the following information:
- Chromosome
- Position
- Reference allele
- Alternate allele
- Allele frequency (AF)
- Minor allele frequency (MAF)
- Imputation accuracy (R2)
- Empirical R-square (ER2)
- Genotype (GT)
- Estimated Alternate Allele Dosage (DS)
- Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1 (GP)
The output is a set of files containing the extracted information.
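Two of the fields listed above can be derived from others for a biallelic variant: MAF is the smaller of AF and 1 - AF, and the alternate allele dosage is the expected alternate allele count under the genotype posterior probabilities. The sketch below only illustrates these relationships. The function names and input values are made up for the example, and the workflow itself reads these values from the VCF rather than recomputing them, so reported values may differ slightly because of rounding.

```python
def minor_allele_frequency(af):
    """Frequency of the less common allele, given the ALT allele frequency."""
    return min(af, 1.0 - af)


def dosage_from_gp(gp):
    """Expected ALT allele count from (P(0/0), P(0/1), P(1/1))."""
    p_ref_ref, p_ref_alt, p_alt_alt = gp
    # One ALT copy in a heterozygote, two in an ALT homozygote.
    return p_ref_alt + 2.0 * p_alt_alt


# Illustrative values only; not taken from any real dataset.
print(minor_allele_frequency(0.8))
print(dosage_from_gp((0.1, 0.3, 0.6)))
```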
- query_variants: A tab-delimited file with a list of query variants. Each line should be formatted as: Chromosome, Pos, ID, Ref, Alt. Each field should be separated by a tab. The Chromosome field should have a "chr" prefix (e.g., chr1, chr2, etc.). (required)
- query_samples: A file with a list of sample IDs. Each line should contain one sample ID. (optional)
- imputed_vcf: Array of imputed VCF files and their indices. VCF files should be in .vcf.gz format and indices in CSI or TBI format. (required)
- prefix: Prefix for the output files. (required)
- extract_item: A string specifying the information to extract from the FORMAT field of the VCF file. The available choices are GT, DS, and GP. Please provide as a comma-separated string. Example: GT,DS (required)
- use_GT_from_PED: A boolean flag indicating whether to source the genotype encoding from a PED file generated by Plink2 software. If set to true, the genotype encoding will be sourced from the PED file. If not specified or set to false, the genotype encoding will be extracted from the VCF file. (optional)
- match_pos_only: A boolean flag indicating whether to match the variants based on position only. If set to true, the variants will be matched based on chromosome and position only. If not specified or set to false, the variants will be matched based on chromosome, position, reference allele, and alternate allele from the query_variants file. (optional)
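The query_variants format and the effect of match_pos_only can be sketched as follows. This illustrates the rules described above and is not the workflow's actual implementation. The helper names parse_query_line and match_key are hypothetical, and the sample rows are invented.

```python
def parse_query_line(line):
    """Split one query_variants line into its five tab-separated fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    chrom, pos, var_id, ref, alt = fields
    if not chrom.startswith("chr"):
        raise ValueError(f"chromosome {chrom!r} lacks the 'chr' prefix")
    return chrom, int(pos), var_id, ref, alt


def match_key(chrom, pos, ref, alt, match_pos_only=False):
    """Build the key used to compare a query variant with a VCF record."""
    if match_pos_only:
        return (chrom, pos)
    return (chrom, pos, ref, alt)


# Illustrative rows only; not taken from any real dataset.
chrom, pos, _, ref, alt = parse_query_line("chr1\t12345\trs0000001\tA\tG\n")
vcf_record = ("chr1", 12345, "A", "T")  # same position, different ALT allele

for flag in (False, True):
    query_key = match_key(chrom, pos, ref, alt, match_pos_only=flag)
    record_key = match_key(*vcf_record, match_pos_only=flag)
    print(f"match_pos_only={flag}: matched={query_key == record_key}")
```

With match_pos_only set to true, the record above would be picked up even though its alternate allele differs from the query, which is worth keeping in mind at multi-allelic sites.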
- SNP_INFO: *_extracted_SNP_INFO.tsv file contains the following columns:
  - CHROM:POS:REF:ALT: A combination of chromosome, position, reference allele, and alternate allele
  - CHROM: Chromosome
  - POS: Position
  - REF: Reference allele
  - ALT: Alternate allele
  - AF: Allele frequency
  - MAF: Minor allele frequency
  - R2: Imputation accuracy
  - ER2: Empirical R-square
  - INFO: Additional information indicating if the variant was imputed, typed, or typed only
- genotype_info: *_extracted_GT.csv file contains the following columns (only generated if GT is specified as input in the extract_item parameter of the workflow):
  - IID: Sample ID
  - CHROM:POS:REF:ALT: A combination of chromosome, position, reference allele, and alternate allele. The values correspond to the genotype for each sample, following a custom order: both 0/1 and 1/0 are represented as Ref/Alt, 0/0 is represented as Ref/Ref, and 1/1 is represented as Alt/Alt. If the use_GT_from_PED flag is set to true, the genotype encoding will be sourced from a PED file generated by Plink2 software.
- dosage_info: *_extracted_DS.csv file contains the following columns (only generated if DS is specified as input in the extract_item parameter of the workflow):
  - IID: Sample ID
  - CHROM:POS:REF:ALT: A combination of chromosome, position, reference allele, and alternate allele, with the values corresponding to the estimated alternate allele dosage for each sample.
- geno_prob_info: *_extracted_GP.csv file contains the following columns (only generated if GP is specified as input in the extract_item parameter of the workflow):
  - IID: Sample ID
  - CHROM:POS:REF:ALT: A combination of chromosome, position, reference allele, and alternate allele, with the values corresponding to the estimated posterior probabilities for genotypes 0/0, 0/1, and 1/1 for each sample.
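The custom genotype order described for genotype_info can be illustrated with a short sketch. It assumes that "Ref/Alt" means the actual allele strings joined by a slash, that phased and unphased calls are treated alike, and that the variant is biallelic and diploid. None of these details is confirmed by the text above, and encode_genotype is a hypothetical helper, not part of the workflow.

```python
def encode_genotype(gt, ref, alt):
    """Map a biallelic, diploid VCF GT string to the Ref/Alt style encoding."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2:
        raise ValueError(f"expected a diploid genotype, got {gt!r}")
    if "." in alleles:
        # Missing call; how the workflow reports this is not documented here.
        return None
    if any(a not in ("0", "1") for a in alleles):
        raise ValueError(f"expected a biallelic genotype, got {gt!r}")
    n_alt = alleles.count("1")
    # 0/1 and 1/0 both collapse to Ref/Alt, regardless of allele order.
    return {0: f"{ref}/{ref}", 1: f"{ref}/{alt}", 2: f"{alt}/{alt}"}[n_alt]


# Illustrative alleles only.
for gt in ("0/0", "0/1", "1|0", "1/1"):
    print(gt, "->", encode_genotype(gt, "A", "G"))
```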