Tool for calculating genotypic diagnostic values for discrete traits or classes from phenotypic and variant calling data to aid in marker development
Get predictive value (PV), true positive rate (TPR), false positive rate (FPR) and positive likelihood ratio (LR) of genotypes (GTs) as potential markers for discrete traits. Takes flapjack formatted genotype, phenotype and map files.
- For a given marker site, the PV (a.k.a precision) of a GT for a particular trait is: PV=n(GT|trait)/n(GT) =TP/(TP+FP) where a positive test represents a known GT call - i.e. the proportion of all samples with a particular GT that actually have the trait.
- The TPR (=TP/(TP+FN)) can also be considered the reverse PV - i.e. the proportion of all samples having the trait actually positive for the GT (a.k.a. recall or sensitivity).
- FPR=FP/(FP+TN), which is the proportion of FPs out of all samples without the trait.
- False discovery rate (FDR), false negative rate (FNR) and true negative rate (TNR) are reciprocals of PV, TPR and FPR (FDR=1-PV, FNR=1-TPR, TNR=1-FPR).
- LR=TPR/FPR and is used for assessing the value of a positive GT call in usefully changing the odds that the trait exists in a test sample.
- An ideal marker should have both PV and TPR close to 1, with a LR of 10 or more (assuming a sufficient number and diversity of input samples).
Genotype file must first be transposed with transpose.awk script. Alternate orders of alleles (i.e. A/T or T/A) are treated as distinct. Nucleotide groupings are also tested where applicable, e.g. "hasC" comprises all GTs containing "C" (C, C/T, T/C) for a given site. CalcGDV Does not adjust for population stratification and/or linkage disequilibrium (LD) which the user should factor into their marker/sample selection beforehand. Writes to STDOUT.
Here we will use a toy multisample VCF file (variants.vcf.gz) containing ~300 variant calls for 182 samples. A phenotype matrix (phenos.tsv) is also provided containing sample rows and two trait columns labelled "symptoms" and "sex".
variants.vcf.gz has already been quality filtered but we will run additional selection using BCFtools to inlcude only biallelic SNPs with no more than 10% missing calls.
bcftools view -m2 -M2 -v snps -i 'F_MISSING < 0.1' --output-type z -o variants.qc.vcf.gz variants.vcf.gz
new vcf is written to variants.qc.vcf.gz
We first need to convert the vcf to a flapjack formatted genotype file. In the absence of flapjack software, open source tools such as NGSEPcore can be used. A map file is also generated containing marker position information.
java -jar NGSEPcore_5.0.0.jar VCFConverter -flapjack -i variants.qc.vcf.gz -o genos
outputs written to genos_fj.gen and genos_fj.map. NGSEPcore automatically generates marker names for variants
calcGDV requires the .gen file to be transposed (markers as rows instead of columns). We can do this using the provided transpose.awk script to generate genos_fj.tr.gen
./transpose.awk genos_fj.gen > genos_fj.tr.gen
Now we can run calcGDV.py, trying out different parameters. Use python calcGDV.py -h
to see a full list of options.
i) run with minimum arguments - this will output values for every genotype in genos_fj.tr.gen using the "symptoms" column of phenos.tsv
python calcGDV.py -g genos_fj.tr.gen -t phenos.tsv,symptoms > gdv_out.tsv
output for a single marker (i.e. variant site) might look something like:
...
#markerID=Marker143,GTavail=182,GTmissing=0
#trait=immune,GTavail=91,GTmissing=0
#trait=mild,GTavail=47,GTmissing=0
#trait=susceptible,GTavail=44,GTmissing=0
#<ID> <GT> <n> <trait> <PV> <TPR> <FPR> <LR>
Marker143 G 97 immune 0.701 0.747 0.319 2.345
Marker143 G 97 mild 0.237 0.489 0.548 0.893
Marker143 G 97 susceptible 0.062 0.136 0.659 0.207
Marker143 T 19 mild 0.053 0.021 0.133 0.16
Marker143 T 19 susceptible 0.947 0.409 0.007 56.455
Marker143 G/T 66 immune 0.348 0.253 0.473 0.535
Marker143 G/T 66 mild 0.348 0.489 0.319 1.536
Marker143 G/T 66 susceptible 0.303 0.455 0.333 1.364
Marker143 hasG 163 immune 0.558 1.0 0.791 1.264
Marker143 hasG 163 mild 0.282 0.979 0.867 1.129
Marker143 hasG 163 susceptible 0.16 0.591 0.993 0.595
Marker143 hasT 85 immune 0.271 0.253 0.681 0.371
Marker143 hasT 85 mild 0.282 0.511 0.452 1.13
Marker143 hasT 85 susceptible 0.447 0.864 0.341 2.536
...
in this example, Marker143 comprises the genotypes 'G', 'T' and 'G/T'. Lines with 'hasG' give the predictiveness of either 'G' or 'G/T' genotypes (likewise for 'hasT'). Values for each of the three symptom traits are shown for every genotype. We can see that genotype 'T' has good precision and specificity (high PV and low FPR) for the "susceptible" trait, but has poor sensitivity (low TPR). Thus, with a LR of 56.455, a positive test for 'T' may be useful in a diagnostic context (when symptoms are already present) but not ideal for screening tests (missing true cases).
ii) run on selection of markers - supply a comma separated list of marker IDs to -s
option
python calcGDV.py -g genos_fj.tr.gen -t phenos.tsv,symptoms -s Marker142,Marker143 > gdv_out.tsv
this will output only results for Marker142 and Marker143
alternatively, a text file containing a single column list of marker IDs can be provided to -s
python calcGDV.py -g genos_fj.tr.gen -t phenos.tsv,symptoms -s markers.txt > gdv_out.tsv
iii) filtering output with cutoffs - output only the most significant genotypes by applying thresholds to PV, TPR, FPR and LR
python calcGDV.py -g genos_fj.tr.gen -t phenos.tsv,symptoms -s markers.txt -n 10 -lr 10 -tpr 0.6 > gdv_out.tsv
here
-n 10
indicates the min occurences of a genotype required to be reported,-lr 10
specifies a LR of 10 or greater,-tpr 0.6
specifies a TPR of 0.6 or greater
this now reports just a single genotype satisfying all cutoffs:
#<ID> <GT> <n> <trait> <PV> <TPR> <FPR> <LR>
Marker142 T 32 susceptible 0.938 0.682 0.014 47.045
iv) disregard traits or missing pheno data - label for missing or undesired trait can be added the -t
string to omit samples with that label
python calcGDV.py -g genos_fj.tr.gen -t phenos.tsv,symptoms,mild -s markers.txt -lr 10 -n 10 -tpr 0.6 > gdv_out.tsv
samples with the "mild" symptom are not included in calculating diagnostic values
using the same cutoffs as before, the output has now changed:
#<ID> <GT> <n> <trait> <PV> <TPR> <FPR> <LR>
Marker142 A 65 immune 0.985 0.703 0.023 30.945
Marker142 T 30 susceptible 1.0 0.682 0.0 nan
omitting "mild" samples results in zero false positives for 'T', making the LR undefined (treated as infinite)
v) run on markers within specific genomic ranges - as an alternative to -s
selection, a referece sequence range can be given with option -r
(requires map file)
python calcGDV.py -g genos_fj.tr.gen -t phenos.tsv,symptoms -r genos_fj.map,chr1_A:4022300-4022500 -lr 10 -n 10 -tpr 0.6 > gdv_out.tsv
first argument of
-r
is the flapjack map file genos_fj.map followed by single genomic range in the format seqid:start-end
output lines contain additional seqid and position fields:
#<ID> <GT> <n> <trait> <PV> <TPR> <FPR> <LR> <seq> <pos>
Marker142 T 32 susceptible 0.938 0.682 0.014 47.045 chr1_A 4022348
...
Marker143 T 19 susceptible 0.947 0.409 0.007 56.455 chr1_A 4022497
instead of a single range, a BED file containing multiple ranges can be supplied in place of seqid:start-end coordinate
python calcGDV.py -g genos_fj.tr.gen -t phenos.tsv,symptoms -r genos_fj.map,regions.bed > gdv_out.tsv
vi) range and marker selection can be combined - even if their respective regions don't overlap
egrep "Marker142|Marker143" genos_fj.map
--> Marker142 chr1_A 4022348
Marker143 chr1_A 4022497
python calcGDV.py -g genos_fj.tr.gen -t phenos.tsv,symptoms -s Marker142 -r genos_fj.map,chr1_A:4022400-4022500 -pv 0.9 > gdv_out.tsv
vii) sample selection can be applied using label from another trait column - diagnostic values are derived from only samples of a particular group specified in -f
python calcGDV.py -g genos_fj.tr.gen -t phenos.tsv,symptoms -lr 50 -n 10 -f sex,male > gdv_out.tsv
here
-f sex,male
tells calcGDV to only consider samples labelled "male" in the "sex" column