Author: William Rowell wrowell@pacificbiosciences.com
In this case study we describe applying DeepVariant to PacBio HiFi reads to call variants. We will call small variants from a publicly available whole genome HiFi dataset from PacBio.
Starting in v1.4.0, PacBio calling uses one-step variant calling. If you're looking for documentation for the two-step process, please look at v1.3.0.
Singularity will be used to run DeepVariant and hap.py, and we'll use miniconda and a conda environment to handle the other dependencies for the case study and samtools.
- singularity (must be installed by
root
user; outside of the scope of this case study) - samtools
# add channels to conda configuration
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
# create the environment and install dependencies
conda create -y -n deepvariant_env
conda activate deepvariant_env
conda install -y samtools==1.10
We will be using GRCh38 for this case study.
mkdir -p reference
# download and decompress
curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | gunzip > reference/GRCh38_no_alt_analysis_set.fasta
# index reference
samtools faidx reference/GRCh38_no_alt_analysis_set.fasta
We will benchmark our variant calls against v4.2.1 of the Genome in a Bottle small variant benchmarks for HG003.
mkdir -p benchmark
FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG003_NA24149_father/NISTv4.2.1/GRCh38
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
curl ${FTPDIR}/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
We'll use HG003 chr20 HiFi reads publicly available from the PrecisionFDA Truth v2 Challenge.
mkdir -p input
HTTPDIR=https://downloads.pacbcloud.com/public/dataset/HG003/deepvariant-case-study
curl ${HTTPDIR}/HG003.GRCh38.chr20.pFDA_truthv2.bam > input/HG003.GRCh38.chr20.pFDA_truthv2.bam
curl ${HTTPDIR}/HG003.GRCh38.chr20.pFDA_truthv2.bam.bai > input/HG003.GRCh38.chr20.pFDA_truthv2.bam.bai
ulimit -u 10000 # https://stackoverflow.com/questions/52026652/openblas-blas-thread-init-pthread-create-resource-temporarily-unavailable/54746150#54746150
BIN_VERSION="1.4.0"
mkdir -p deepvariant_output
singularity exec --bind /usr/lib/locale/ \
docker://google/deepvariant:${BIN_VERSION} \
/opt/deepvariant/bin/run_deepvariant \
--model_type PACBIO \
--ref reference/GRCh38_no_alt_analysis_set.fasta \
--reads input/HG003.GRCh38.chr20.pFDA_truthv2.bam \
--output_vcf deepvariant_output/output.vcf.gz \
--num_shards $(nproc) \
--regions chr20
NOTE: If you want to run each of the steps separately, add --dry_run=true
to the command above to figure out what flags you need in each step. Based on
the different model types, different flags are needed in the make_examples
step.
mkdir -p happy
singularity exec docker://jmcdani20/hap.py:v0.3.12 \
/opt/hap.py/bin/hap.py \
--threads $(nproc) \
-r reference/GRCh38_no_alt_analysis_set.fasta \
-f benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
-o happy/giab-comparison.v4.2.first_pass \
--engine=vcfeval \
--pass-only \
-l chr20 \
benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
deepvariant_output/output.vcf.gz
Output:
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 10628 10562 66 23213 71 12137 41 28 0.993790 0.993590 0.522854 0.993690 NaN NaN 1.748961 2.266841
INDEL PASS 10628 10562 66 23213 71 12137 41 28 0.993790 0.993590 0.522854 0.993690 NaN NaN 1.748961 2.266841
SNP ALL 70166 70137 29 109249 11 39034 4 1 0.999587 0.999843 0.357294 0.999715 2.296566 1.674805 1.883951 1.836323
SNP PASS 70166 70137 29 109249 11 39034 4 1 0.999587 0.999843 0.357294 0.999715 2.296566 1.674805 1.883951 1.836323