📊Evaluating, filtering, comparing, and visualising genomic variants
conda install -c vladsaveliev -c conda-forge -c bioconda vcf_stuff
If conda is not installed on your computer, run the following first:
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh # Linux
# wget https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -O miniconda.sh # macOS
bash miniconda.sh -b -p ./miniconda && rm miniconda.sh
. miniconda/etc/profile.d/conda.sh
Clone source and install on top of conda:
git clone https://github.com/umccr/vcf_stuff
cd vcf_stuff
pip install -e .
To explore the Jupyter notebooks, also install:
conda install -y jupyter matplotlib matplotlib-venn
nosetests -s tests/test.py
eval_vcf
is a tool that compares somatic VCFs to a gold standard and reports metrics such as precision and recall (sensitivity). The following command compares test-ensemble.vcf.gz
and test-vardict.vcf.gz
to a gold standard data/test-benchmark.vcf.gz
:
cd tests
eval_vcf data/test-benchmark.vcf.gz \
data/test-ensemble.vcf.gz \
data/test-vardict.vcf.gz \
-g GRCh37 \
-o results
The tool will normalize all input VCFs (see norm_vcf
below), overlap calls in test-ensemble.vcf.gz
and test-vardict.vcf.gz
with the reference VCF test-benchmark.vcf.gz
, evaluate statistics and print them to stdout:
Sample SNP INDEL
TP FP FN Prec Recall TP FP FN Prec Recall
test-ensemble 1 0 0 100.00% 100.00% 0 0 0 0.00% 0.00%
test-vardict 1 5 0 16.67% 100.00% 0 0 0 0.00% 0.00%
Truth 1 0 1 100.00% 100.00% 0 0 0 100.00% 100.00%
It also saves them into results/report.tsv
in a parsable TSV format.
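As a rough illustration of how the Prec and Recall columns relate to the TP/FP/FN counts, here is a minimal Python sketch. The function names are hypothetical and not part of vcf_stuff, and returning 0 for an empty denominator is an assumption chosen to match the sample rows in the table above.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) as fractions.

    Assumption: a zero denominator yields 0.0, as in the sample rows above.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def as_percent(value: float) -> str:
    """Format a fraction the way the report prints it, e.g. 16.67%."""
    return f"{value:.2%}"


# SNP counts for test-vardict from the table above: TP=1, FP=5, FN=0
prec, rec = precision_recall(tp=1, fp=5, fn=0)
print(as_percent(prec), as_percent(rec))  # 16.67% 100.00%
```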
Internally, the tool uses bcftools isec
to overlap VCFs. Intermediate results are saved into results/eval/{sample}_bcftools_isec/
for further analysis if needed:
eval/test-ensemble_bcftools_isec/0000.vcf # FP (records private to test-ensemble.vcf.gz)
eval/test-ensemble_bcftools_isec/0001.vcf # FN (records private to test-benchmark.vcf.gz)
eval/test-ensemble_bcftools_isec/0002.vcf # TP (records from test-ensemble.vcf.gz shared by both)
eval/test-ensemble_bcftools_isec/0003.vcf # TP (records from test-benchmark.vcf.gz shared by both)
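If you want to recount the categories from those intermediate files yourself, something like the following Python sketch could work. It assumes uncompressed VCFs laid out as in the listing above; the helper names are hypothetical and this is not the code eval_vcf runs.

```python
from pathlib import Path

# File-to-category mapping taken from the listing above.
# 0003.vcf holds the same shared sites as 0002.vcf, so it is not counted again.
ISEC_FILES = {"FP": "0000.vcf", "FN": "0001.vcf", "TP": "0002.vcf"}


def count_records(vcf_path) -> int:
    """Count non-empty, non-header lines in an uncompressed VCF."""
    with open(vcf_path) as handle:
        return sum(1 for line in handle if line.strip() and not line.startswith("#"))


def isec_counts(isec_dir) -> dict:
    """Return {"FP": n, "FN": n, "TP": n} for one bcftools isec output directory."""
    isec_dir = Path(isec_dir)
    return {label: count_records(isec_dir / name) for label, name in ISEC_FILES.items()}
```

For example, `isec_counts("results/eval/test-ensemble_bcftools_isec")` would return the three counts for the test-ensemble sample.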
Optionally, a BED file can be specified with -r
:
eval_vcf benchmark.vcf sample.vcf -g GRCh37 -r callable_regions.bed
If provided, the evaluation is restricted to those regions: both the benchmark and the target VCFs are subset to the regions after normalization, but before overlapping.
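To make the region restriction concrete, here is a minimal Python sketch of testing a variant position against BED intervals. It is an illustration only, not how eval_vcf subsets the files; the function names are hypothetical, and it checks just the POS of a record rather than the full span of an indel. The main point is the coordinate convention: BED intervals are 0-based and half-open, while VCF POS is 1-based.

```python
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into {chrom: [(start, end), ...]} (0-based, half-open)."""
    regions = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))
    return regions


def in_regions(regions, chrom, pos):
    """True if the 1-based VCF position falls inside any BED interval."""
    return any(start < pos <= end for start, end in regions.get(chrom, []))


regions = load_bed(["1\t9\t20"])       # covers 1-based positions 10..20
print(in_regions(regions, "1", 10))    # True
print(in_regions(regions, "1", 21))    # False
```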
The tool finds the required reference fasta file automatically on Spartan and Raijin, or if umccrise is available in PATH. Otherwise, you can specify the reference fasta explicitly with --ref-fasta
, e.g.
eval_vcf benchmark.vcf sample.vcf --ref-fasta /genomes/seq/hg38.fa -o results
On Spartan and Raijin, instead of feeding the truth VCF directly, one can use presets, e.g. mb
stands for the ICGC medulloblastoma study T/N somatic variant calling benchmark:
eval_vcf mb data/test-ensemble.vcf.gz data/test-vardict.vcf.gz -g GRCh37 -o results
See the available options at https://github.com/umccr/reference_data/blob/master/reference_data/paths.yml#L132-L138, e.g. mb
, colo
(COLO829 metastatic melanoma cell line), giab
(GiaB NA12878 germline variants), dream
(DREAM synthetic challenge 3). For giab
and dream
, truth regions BED files are additionally applied; these are merged automatically with the -r
regions if those are also provided.
Similar to the VCF evaluation, you can compare CNV calls to the truth sets. The following callers are supported:
- cnvkit
- facets
- purple
- manta
The following truth sets are supported:
- HCC2218 exome
- COLO829 WGS (Hartwig's)
- COLO829 WGS (Craig's study)
Usage:
eval_cnv -g GRCh37 -o results_eval_cnv \
data/cnv/hcc2218/HCC2218_truthset_cnv_bcbio.tsv \
data/cnv/hcc2218/HCC2218_cnvkit-call.cns \
data/cnv/hcc2218/HCC2218_purple.cnv.tsv \
data/cnv/hcc2218/HCC2218_facets_cncf.tsv
The tool produces 3 tables with standard performance statistics (true positives, false positives, false negatives, and recall and precision rates):
Gene level comparison
Sample TP FP FN Recall Prec
0 HCC2218_cnvkit-call 6399 2631 13 99.80% 70.86%
1 HCC2218_facets_cncf 6399 5280 13 99.80% 54.79%
2 HCC2218_manta 606 89 5806 9.45% 87.19%
3 HCC2218_purple.cnv 4602 5924 1810 71.77% 43.72%
Event level comparison (Amp, Del)
Sample TP FP FN Recall Prec
0 HCC2218_cnvkit-call 6396 2638 20 99.69% 70.80%
1 HCC2218_facets_cncf 6398 5289 18 99.72% 54.74%
2 HCC2218_manta 317 378 6099 4.94% 45.61%
3 HCC2218_purple.cnv 4022 6508 2394 62.69% 38.20%
CN level comparison
Sample TP FP FN Recall Prec
0 HCC2218_cnvkit-call 5950 3125 494 92.33% 65.56%
1 HCC2218_facets_cncf 5821 5876 623 90.33% 49.76%
2 HCC2218_manta 0 695 6444 0.00% 0.00%
3 HCC2218_purple.cnv 3111 7423 3333 48.28% 29.53%
Each table represents a different level of comparison:
- Gene level comparison compares the sets of genes in which any event occurred. E.g. if the truth set has a deletion in EGFR and the sample has any other event in EGFR (e.g. an amplification), it counts as a true positive.
- Event level comparison (Amp, Del) also requires the types of events per gene to be the same. It supports 2 types of events: Amp and Del. DUP for certain callers is automatically translated into Amp, and DEL into Del. For callers that do not report event types but report CN values, CN>2 translates into Amp, CN<2 translates into Del, and CN=2 is ignored (we don't support copy-neutral LOHs).
- CN level comparison also requires the integer copy number estimates to be the same. It is only generated for callers and truth sets that contain CN values.
For example, for WGS COLO829, the output would look like the following:
Gene level comparison
Sample TP FP FN Recall Prec
COLO_TGEN_bwa-cnvkit-call 6716 4680 6 99.91% 58.93%
Event level comparison (Amp, Del)
Sample TP FP FN Recall Prec
COLO_TGEN_bwa-cnvkit-call 6714 4700 8 99.88% 58.82%
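The three comparison levels and the event translation rules described above can be sketched in Python roughly as follows. This is an illustration under assumptions, not the eval_cnv implementation: the function names and the (gene, event, cn) tuple layout are hypothetical.

```python
from typing import Optional


def to_event(sv_type: Optional[str] = None, cn: Optional[int] = None) -> Optional[str]:
    """Translate a caller-specific event into Amp/Del, or None if it is ignored."""
    if sv_type is not None:
        # DUP -> Amp, DEL -> Del; other types are not handled in this sketch
        return {"DUP": "Amp", "DEL": "Del"}.get(sv_type.upper())
    if cn is not None:
        if cn > 2:
            return "Amp"
        if cn < 2:
            return "Del"
    return None  # CN=2 (copy-neutral) is ignored


def matches(truth, call, level: str) -> bool:
    """Compare (gene, event, cn) tuples at the 'gene', 'event' or 'cn' level."""
    if truth[0] != call[0]:
        return False
    if level == "gene":
        return True
    if truth[1] != call[1]:
        return False
    if level == "event":
        return True
    return truth[2] == call[2]


truth = ("EGFR", "Del", 1)
call = ("EGFR", to_event(cn=4), 4)      # an amplification in the same gene
print(matches(truth, call, "gene"))     # True
print(matches(truth, call, "event"))    # False
```

Each stricter level only adds a condition, which is why the TP counts in the tables can only stay the same or drop from the gene level to the CN level.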
In addition to the overall stats table, the tool produces a per-gene table for detailed exploration, like the following:
truth HCC2218_cnvkit-call HCC2218_facets_cncf HCC2218_manta HCC2218_purple.cnv
1 AADACL3 Del:1 Del:1 Del:1 Del:1
1 AADACL4 Del:1 Del:1 Del:1 Del:1
1 ABCB10 Amp:6 Amp:6 Amp:6 Amp:4
1 ABL2 Amp:3 Amp:3 Amp:3
1 AC004824.2 Del:1 Del:1 Del:1 Del:1
1 AC092782.1 Amp:4 Amp:4 Amp:4 Amp:3
1 AC092811.1 Amp:5 Amp:5 Amp:5 Amp:4
...
The tool will write the report into results_eval_cnv/report.tsv
, and the per-gene table into results_eval_cnv/table.tsv
.
To consistently determine the genes affected by events, the tool re-annotates all events with the bed_annotation package that assigns gene names to Ensembl genomic regions.
Removing variants detected as germline in a set of unrelated normal tissue samples helps to reduce the FP rate when false positives are caused by unbalanced coverage of the matching regions in the normals.
The stats below are for the evaluation of the ICGC MB T/N variant calling with 300x tumor coverage and 50x normal coverage. The number in vardict_n1
means how many hits in the PoN we allow before we filter out the variant.
Annotate a VCF against a panel of normals:
pon_anno data/test-ensemble.vcf.gz -g GRCh37 -o test-ensemble.pon.vcf
This adds PoN_CNT
INFO
flags indicating the number of hits in the panel; writes output VCF to test-ensemble.pon.vcf
Annotate and soft-filter variants with at least 2 hits (adds PoN
flag into FILTER
):
pon_anno data/test-ensemble.vcf.gz -h 2 -g GRCh37 -o test-ensemble.pon.vcf
By default, pon_anno
only checks the positions of the variants. If you want to compare exact alleles, use the --check-allele
flag.
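The difference between position-only and allele-aware matching, and the effect of the hit threshold, can be illustrated with a small Python sketch. It is not the pon_anno implementation: the data layout and function names are hypothetical, and counting each normal sample at most once per site is an assumption.

```python
from collections import Counter


def build_panel(normals, check_allele=False):
    """Count in how many normal samples each site (or exact allele) was seen.

    normals: list of per-sample lists of (chrom, pos, ref, alt) tuples.
    """
    hits = Counter()
    for sample in normals:
        # A set, so that one sample contributes at most one hit per key
        keys = {v if check_allele else v[:2] for v in sample}
        hits.update(keys)
    return hits


def pon_annotate(variant, panel, check_allele=False, min_hits=None):
    """Return (PoN_CNT, FILTER) for a (chrom, pos, ref, alt) variant."""
    key = variant if check_allele else variant[:2]
    count = panel[key]
    flagged = min_hits is not None and count >= min_hits
    return count, ("PoN" if flagged else "PASS")


normals = [[("1", 10, "A", "T")], [("1", 10, "A", "C")], [("2", 5, "G", "A")]]
variant = ("1", 10, "A", "T")

by_pos = build_panel(normals)
by_allele = build_panel(normals, check_allele=True)
print(pon_annotate(variant, by_pos, min_hits=2))                        # (2, 'PoN')
print(pon_annotate(variant, by_allele, check_allele=True, min_hits=2))  # (1, 'PASS')
```

Position-only matching is the more aggressive choice: the same variant is filtered at 2 hits by position, but kept when exact alleles must match.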
To process multiple samples with multiple hit thresholds, you can use the pon_pipeline
script:
pon_pipeline data/test-vardict.vcf.gz data/test-strelka.vcf.gz -o results_pon -h1,2,3 -g GRCh37
The filtered VCF files will be written to results_pon/pon_filter/
.
The scripts only know the panel location on Spartan and Raijin, so to work elsewhere, provide the path to the panel with --pon-dir
.
The script is also used within a bigger anno_somatic_vcf
, which
annotates against various sources that can then be used by filter_somatic_vcf
to filter a somatic VCF. Both scripts are used in Umccrise.
On Spartan:
cd /data/cephfs/punim0010/extras/panel_of_normals
snakemake -s prep_normals.smk -p -j30
Normalise VCF file:
norm_vcf data/test-ensemble.vcf.gz -g GRCh37 > test-ensemble.norm.vcf
This script does the following steps:
- Split multi-allelic variants into single-allele records.
For instance, split one record
#CHROM POS ID REF ALT
1 10 . A T,C
Into 2 separate records
#CHROM POS ID REF ALT
1 10 . A T
1 10 . A C
For that, we are using vt tools:
vt decompose -s vcf_file
- Decompose biallelic block substitutions.
For instance, split the following record:
#CHROM POS ID REF ALT
1 20 . AG CT
into 2 separate ones:
#CHROM POS ID REF ALT
1 20 . A C
1 21 . G T
For that, we use vcflib's vcfallelicprimitives
:
vcfallelicprimitives -t DECOMPOSED --keep-geno vcf_file
- Left-align and normalize indels, check if REF alleles match the reference.
For instance, given that the reference chromosome 1 starts with GCTCCG
, split the following record
#CHROM POS ID REF ALT
1 2 . CTCC CCC,C,CCCC
into the following 3:
#CHROM POS ID REF ALT
1 1 . GCTC G
1 2 . CT C
1 3 . T C
These steps are applied to each input VCF for the eval_vcf
pipeline above.
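The three normalization steps above can be mimicked on bare (POS, REF, ALT) tuples with the following Python sketch. It is an illustration only: norm_vcf relies on the external tools named above, the function names here are hypothetical, genotype and INFO fields are ignored, only equal-length block substitutions are decomposed, and REF is not validated against the reference. The left-alignment loop follows the commonly used trim-right, extend-left, trim-left procedure.

```python
def split_multiallelic(pos, ref, alts):
    """Step 1: one record per ALT allele (POS and REF are kept as-is)."""
    return [(pos, ref, alt) for alt in alts.split(",")]


def decompose_block(pos, ref, alt):
    """Step 2: split an equal-length block substitution into single-base changes."""
    if len(ref) == len(alt) and len(ref) > 1:
        return [(pos + i, r, a) for i, (r, a) in enumerate(zip(ref, alt)) if r != a]
    return [(pos, ref, alt)]


def left_align(ref_seq, pos, ref, alt):
    """Step 3: trim shared bases and shift the variant left (pos is 1-based)."""
    while True:
        changed = False
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]      # drop a shared trailing base
            changed = True
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend left of the first reference base")
            pos -= 1
            base = ref_seq[pos - 1]            # prepend the preceding reference base
            ref, alt = base + ref, base + alt
            changed = True
        if not changed:
            break
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1  # drop a shared leading base
    return pos, ref, alt


def normalize(ref_seq, pos, ref, alts):
    """Apply the three steps in order and return sorted (POS, REF, ALT) tuples."""
    out = []
    for p, r, a in split_multiallelic(pos, ref, alts):
        for p2, r2, a2 in decompose_block(p, r, a):
            out.append(left_align(ref_seq, p2, r2, a2))
    return sorted(out)


# The left-alignment example above: chromosome 1 starts with GCTCCG
print(normalize("GCTCCG", 2, "CTCC", "CCC,C,CCCC"))
# [(1, 'GCTC', 'G'), (2, 'CT', 'C'), (3, 'T', 'C')]
```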
VCFs coming from bcbio-nextgen are called with different callers, each using its own way to report quality, depth and allelic frequencies (if at all). To facilitate processing and reporting VCFs in PCGR, we prepared a script that calculates and populates TUMOR_AF
, NORMAL_AF
, TUMOR_DP
, NORMAL_DP
fields, thereby standardizing output from Strelka2, Mutect2, Freebayes, GATK Haplotype Caller, and VarDict, and consequently, Ensemble calls.
pcgr_prep data/test-ensemble.vcf.gz -g GRCh37 > test-ensemble.pcgr_prep.vcf
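To give an idea of what populating these fields involves, here is a minimal Python sketch that derives them from per-sample allelic depths. It assumes a caller that reports (ref depth, alt depth) pairs, for example in a FORMAT/AD field. The real script has to handle each caller's own fields, which is not shown here, and the function names are hypothetical.

```python
def allele_frequency(ref_depth: int, alt_depth: int) -> float:
    """Fraction of reads supporting the ALT allele; 0.0 when there is no coverage."""
    total = ref_depth + alt_depth
    return alt_depth / total if total else 0.0


def standardized_fields(tumor_ad, normal_ad) -> dict:
    """Build the four fields from (ref_depth, alt_depth) pairs.

    Assumption: depth is taken as ref + alt reads, which can differ from
    the DP value a caller reports.
    """
    return {
        "TUMOR_AF": round(allele_frequency(*tumor_ad), 4),
        "NORMAL_AF": round(allele_frequency(*normal_ad), 4),
        "TUMOR_DP": sum(tumor_ad),
        "NORMAL_DP": sum(normal_ad),
    }


print(standardized_fields(tumor_ad=(30, 10), normal_ad=(40, 0)))
# {'TUMOR_AF': 0.25, 'NORMAL_AF': 0.0, 'TUMOR_DP': 40, 'NORMAL_DP': 40}
```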
This repository provides description, code and results for the approaches to somatic variant filtering in UMCCR.
We summarize the filtering ideas in a spreadsheet, where for each problem and idea we provide the corresponding solution (if available) used in CaVEMan, bcbio, and the AZ VarDict pipeline. The basic factors and ideas are from Matt Eldridge's slides.
Commands to filter VarDict VCF files.
Moves main sample AF and DP fields into INFO, for PCGR post-processing:
proc_vardict_vcf fix_fields vardict.vcf.gz > out.vcf
Applies special AF threshold filtering to homopolymers based on MSI
INFO
fields generated by VarDict. Writes MSI_FILTER
into the FILTER
field.
proc_vardict_vcf filter_low_af_msi vardict.vcf.gz > out.vcf