add BcfHeader::getFormatType
Zilong-Li committed Oct 11, 2023
1 parent caafbd7 commit 98519d2
Showing 10 changed files with 131 additions and 173 deletions.
74 changes: 37 additions & 37 deletions doc/paper.org
@@ -20,15 +20,15 @@ Email: zilong.dk@gamil.com.}}
C++ API of htslib in a single file, providing an intuitive interface
to manipulate VCF/BCF files rapidly and safely, as well as being
portable. Additionally, this work introduces the vcfppR package to
demonstrate the usage of vcfpp API in writing R script and packages
esaily for analyzing as well as visualizing genomic variations in
R. In the Benchmarking, the R script using vcfpp API shows faster
speed than the python script using cyvcf2 API when streaming variant
analysis. And in a two-step setting, the vcfppR package shows faster
speed of both loading VCF contents and processing genotypes than the
vcfR package. Finally, some useful command line tools using vcfpp
are available to demonstrate the easy-to-use of vcfpp in writing a
C++ program.
demonstrate the usage of developing R packages using vcfpp
seamlessly and esaily for analyzing and visualizing genomic
variations in R. In the Benchmarking, the vcfppR shows the same
performance as the C++ version, and shows faster speed than cyvcf2
when streaming variant analysis. Also, in a two-step setting, the
vcfppR shows faster speed of both loading VCF contents and
processing genotypes than the vcfR package. Finally, some useful
command line tools using vcfpp are available to demonstrate the
easy-to-use of vcfpp in writing a C++ program.
\end{abstract}
\end{titlepage}
#+end_export
@@ -201,38 +201,38 @@ In addition to simplicity and portability, I showcase the
performance of vcfpp and vcfppR. In the benchmarking, I performed
the same task as the cyvcf2 paper, that is counting heterozygous
genotypes per sample using code in Listing 1. As shown in Table
[[tb:counthets]], the "test-cyvcf2.py" is $1.3\times$ slower than the
"test-vcfpp-1.R", and both use little RAM in a streaming
strategy. As R packages usually load data into tables first and
perform analysis later, I also write a function to load whole VCF
content in R for such two-step comparison although it is not
preferable for this task. Notably, the "test-vcfpp-2.R" script is
only $1.9\times$ slower compared to $70\times$, $85\times$ slower
for "test-vcfR.R" and "test-fread.R" respectively. This is because
genotypes made by both vcfR and data.table::fread are characters,
which are very slow to parse in R despite a faster string library
was used. In contrast, with vcfpp, integer vectors of genotypes
can be returned to R directly for computation. Moreover, regarding
the best practice of data analysis in R, we usually want to
inspect part of the data table first to make further
decisions. And vcfpp has the full functionalities of htslib that
is supporting reading BCF, selecting samples and regions. A rapid
query of VCF content can be achieved by passing a region
parameter.
[[tb:counthets]], the vcfppR has the same performance as the vcfpp C++
API, while the cyvcf2 script is $1.4\times$ slower than the vcfppR
script. In the streaming setting, all three scripts use little RAM
for only loading one variant into memory. However, R packages like
vcfR and data.table usually load all tabular data into memory
first and perform analysis later. Additionaly, I develop /vcftable/
function in vcfppR to load whole VCF content in R for such
two-step comparison. Notably, the vcfppR is only $1.8\times$
slower compared to $70\times$, $85\times$ slower for vcfR and
data.table respectively. This is because genotype values returned
by both vcfR and data.table::fread are characters, which are very
slow to parse. In contrast, with vcfppR, integer matrix of
genotypes can be returned to R directly for fast
computation. Importantly, vcfpp and vcfppR offer the users the
full functionalities of htslib, such as supporting BCF, selecting
samples, regions and variant types. A rapid query of VCF content
in vcfppR can be achieved by passing a region parameter to
/vcftable/.

#+caption: Performance of counting heterozygous genotypes per sample in the 1000 Genome Project for chromosome 21. (*) used by loading data in two-step strategy.
#+name: tb:counthets
#+attr_latex: :align lllllll :placement [H]
|-------------+------------+-------+----------+-----------|
| API/Package | Time (s) | Ratio | RAM (Gb) | Strategy |
|-------------+------------+-------+----------+-----------|
| vcfpp | 118 | 1.0 | 0.006 | streaming |
| vcfppR | 119 | 1.0 | 0.07 | streaming |
| cyvcf2 | 159 | 1.3 | 0.04 | streaming |
| vcfppR | 164*+196 | 1.8 | 73 | two-step |
| vcfR | 651*+1382 | 9.9 | 105 | two-step |
| data.table | 272*+10275 | 85 | 77 | two-step |
|-------------+------------+-------+----------+-----------|
|-------------+------------+-------+----------+---------+-----------|
| API/Package | Time (s) | Ratio | RAM (Gb) | CPU (%) | Strategy |
|-------------+------------+-------+----------+---------+-----------|
| vcfpp | 118 | 1.0 | 0.006 | 99 | streaming |
| vcfppR | 118 | 1.0 | 0.076 | 101 | streaming |
| cyvcf2 | 165 | 1.4 | 0.038 | 101 | streaming |
| vcfppR | 205*+10 | 1.8 | 64.7 | 100 | two-step |
| vcfR | 678*+1147 | 15.5 | 97.5 | 100 | two-step |
| data.table | 263*+11243 | 97.5 | 77.3 | 200 | two-step |
|-------------+------------+-------+----------+---------+-----------|
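
The Ratio column of the updated table appears to be each row's total time (load plus compute for the two-step rows) divided by the 118 s vcfpp baseline. A minimal C++ sketch of that arithmetic, assuming rounding to one decimal place; the `ratio` helper is a hypothetical name used for illustration and is not part of this commit.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, not part of the commit: ratio of a row's total time
// (load + compute) to the baseline time, rounded to one decimal place.
double ratio(double load_s, double compute_s, double baseline_s)
{
    assert(baseline_s > 0.0); // guard against division by zero
    return std::round((load_s + compute_s) / baseline_s * 10.0) / 10.0;
}
```

With the times from the new table, `ratio(205, 10, 118)`, `ratio(678, 1147, 118)` and `ratio(263, 11243, 118)` give 1.8, 15.5 and 97.5. The prose in the same hunk still quotes $70\times$ and $85\times$ for vcfR and data.table, which the new table's times do not reproduce.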

* Discussion

17 changes: 9 additions & 8 deletions scripts/bench.sh
@@ -1,28 +1,29 @@
#!/bin/bash

# benchmarking performance of vcfpp-r against vcfR and fread
# benchmarking performance of vcfppR against cyvcf2, vcfR and fread

# use gnu time to record the RAM and TIME
gtime="/usr/bin/time"

# download vcf file from 1000 genome project
vcffile="1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"
if [ ! -f $vcffile ];then
wget -N -r --no-parent --no-directories https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz
wget -N -r --no-parent --no-directories https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz.tbi
fi

wget -N -r --no-parent --no-directories https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz
wget -N -r --no-parent --no-directories https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz.tbi


# first compile a C++ script
x86_64-conda-linux-gnu-c++ test-vcfpp.cpp -o test-vcfpp -std=c++11 -O3 -Wall -I/home/rlk420/mambaforge/envs/R/include -lhts

# big vcf file with very long INFO field
vcffile="1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"

# to be fair, make sure cyvcf2 and vcfpp are compiled against same version of htslib.
# do the benchmarking in same conda env

$gtime -vvv Rscript test-fread.R $vcffile &> test-fread.llog.1 &
$gtime -vvv Rscript test-vcfR.R $vcffile 2 &> test-vcfR.llog.2 &
$gtime -vvv Rscript test-vcfpp-1.R $vcffile &> test-vcfpp.llog.1 &
$gtime -vvv Rscript test-vcfpp-2.R $vcffile &> test-vcfpp.llog.2 &
$gtime -vvv Rscript test-vcfppR.R $vcffile 1 &> test-vcfppR.llog.1 &
$gtime -vvv Rscript test-vcfppR.R $vcffile 2 &> test-vcfppR.llog.2 &
$gtime -vvv python test-cyvcf2.py $vcffile &> test-cyvcf2.llog &
$gtime -vvv ./test-vcfpp -i $vcffile &> test-vcfpp.llog &
wait
11 changes: 4 additions & 7 deletions scripts/test-vcfR.R
@@ -1,14 +1,11 @@
library(stringr)
library(vcfR)

run <- 1
args <- commandArgs(trailingOnly = TRUE)
vcffile <- args[1]
run <- as.integer(args[2])

system.time(vcf <- read.vcfR(vcffile))
print(system.time(vcf <- read.vcfR(vcffile)))

gt <- extract.gt(vcf[is.biallelic(vcf),], element = 'GT', as.numeric = TRUE)
print(system.time(hets <- colSums(gt==1, na.rm = TRUE)))

if(run == 2) {
gt <- extract.gt(vcf[is.biallelic(vcf),], element = 'GT', as.numeric = TRUE)
system.time(hets <- apply(gt, 2, function(g) sum(g==1)))
}
8 changes: 0 additions & 8 deletions scripts/test-vcfpp-1.R

This file was deleted.

15 changes: 0 additions & 15 deletions scripts/test-vcfpp-2.R

This file was deleted.

56 changes: 9 additions & 47 deletions scripts/test-vcfpp-r.cpp
@@ -4,56 +4,18 @@ using namespace Rcpp;
using namespace std;

// [[Rcpp::export]]
int getRegionIndex(const string & vcffile, const std::string & region)
IntegerVector heterozygosity(std::string vcffile, std::string region = "", std::string samples = "-")
{
vcfpp::BcfReader vcf(vcffile);
return vcf.getRegionIndex(region);
}

// [[Rcpp::export]]
List readtable(const string & vcffile, const std::string & region)
{
vcfpp::BcfReader vcf(vcffile);
vcfpp::BcfRecord var(vcf.header);
int nsnps = vcf.getRegionIndex(region);
CharacterVector chr(nsnps), ref(nsnps), alt(nsnps), id(nsnps), filter(nsnps), info(nsnps);
IntegerVector pos(nsnps);
NumericVector qual(nsnps);
vector<vector<bool>> GT(nsnps);
vector<bool> gt;
for(int i = 0; i < nsnps; i++)
{
vcf.getNextVariant(var);
var.getGenotypes(gt);
GT[i] = gt;
pos(i) = var.POS();
qual(i) = var.QUAL();
chr(i) = var.CHROM();
id(i) = var.ID();
ref(i) = var.REF();
alt(i) = var.ALT();
filter(i) = var.FILTER();
info(i) = var.INFO();
}
return List::create(Named("chr") = chr, Named("pos") = pos, Named("id") = id, Named("ref") = ref,
Named("alt") = alt, Named("qual") = qual, Named("filter") = filter,
Named("info") = info, Named("gt") = GT);
}

// [[Rcpp::export]]
IntegerVector hetrate(const string & vcffile)
{
vcfpp::BcfReader vcf(vcffile);
vcfpp::BcfReader vcf(vcffile, region, samples);
vcfpp::BcfRecord var(vcf.header);
vector<char> gt;
vector<int> hetsum(vcf.nsamples, 0); // store the het counts
while(vcf.getNextVariant(var))
{
vector<int> gt;
vector<int> hetsum(vcf.nsamples, 0); // store the het counts
while (vcf.getNextVariant(var)) {
var.getGenotypes(gt);
// analyze SNP variant with no genotype missingness
if(!var.isSNP() || !var.isNoneMissing()) continue;
assert(var.ploidy() == 2); // make sure it is diploidy
for(int i = 0; i < gt.size() / 2; i++) hetsum[i] += abs(gt[2 * i + 0] - gt[2 * i + 1]);
if (!var.isSNP()) continue; // analyze SNPs only
assert(var.ploidy() == 2); // make sure it is diploidy
for (int i = 0; i < gt.size() / 2; i++)
hetsum[i] += abs(gt[2 * i + 0] - gt[2 * i + 1]) == 1;
}
return wrap(hetsum);
}
7 changes: 1 addition & 6 deletions scripts/test-vcfpp.R
@@ -2,13 +2,8 @@ library(Rcpp)
Sys.setenv("PKG_LIBS"="-I/home/rlk420/mambaforge/envs/R/include -lhts")
sourceCpp("test-vcfpp-r.cpp", verbose=TRUE, rebuild=TRUE)

run <- 1
args <- commandArgs(trailingOnly = TRUE)
vcffile <- args[1]
run <- as.integer(args[2])

system.time(vcf <- readtable(vcffile, "chr21"))

if(run == 2) {
system.time(hets <- hetrate(vcffile))
}
system.time(hets <- heterozygosity(vcffile))
15 changes: 8 additions & 7 deletions scripts/test-vcfpp.cpp
@@ -1,6 +1,6 @@
// -*- compile-command: "x86_64-conda-linux-gnu-c++ test-vcfpp.cpp -o test-vcfpp -std=c++11 -O3 -Wall -I/home/rlk420/mambaforge/envs/R/include -lhts" -*-
// -*- compile-command: "x86_64-conda-linux-gnu-c++ test-vcfpp.cpp -o test-vcfpp -std=c++11 -O3 -Wall -I../ -I/home/rlk420/mambaforge/envs/R/include -lhts" -*-

#include <vcfpp.h>
#include "vcfpp.h"
using namespace std;
using namespace vcfpp;

@@ -37,16 +37,17 @@ int main(int argc, char * argv[])
// ========= core calculation part ===========================================
BcfReader vcf(vcffile, region, samples);
BcfRecord var(vcf.header); // construct a variant record
vector<char> gt; // genotype can be bool, char or int type
vector<int> gt; // genotype can be bool, char or int type
vector<int> hetsum(vcf.nsamples, 0);
while(vcf.getNextVariant(var))
{
var.getGenotypes(gt);
// analyze SNP variant with no genotype missingness
if(!var.isSNP() || !var.isNoneMissing()) continue;
assert(var.ploidy() == 2); // make sure it is diploidy
for(int i = 0; i < gt.size() / 2; i++) hetsum[i] += abs(gt[2 * i + 0] - gt[2 * i + 1]);
if (!var.isSNP()) continue; // analyze SNPs only
assert(var.ploidy() == 2); // make sure it is diploidy
for (int i = 0; i < gt.size() / 2; i++)
hetsum[i] += abs(gt[2 * i + 0] - gt[2 * i + 1]) == 1;
}
for(auto i : hetsum) cout << i << endl;
// for(auto i : hetsum) cout << i << endl;
return 0;
}
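
The changed inner loop can be read in isolation from htslib. Below is a minimal sketch, assuming the genotype vector is flattened as two allele codes per sample, as the `gt[2 * i + 0]` / `gt[2 * i + 1]` indexing in the diff suggests; `addHets` is a hypothetical helper written for illustration and is not part of vcfpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical helper, not part of vcfpp: add one variant's heterozygous
// calls to the per-sample totals. `gt` holds two allele codes per sample,
// laid out as [s0_a0, s0_a1, s1_a0, s1_a1, ...].
void addHets(const std::vector<int>& gt, std::vector<int>& hetsum)
{
    assert(hetsum.size() >= gt.size() / 2); // one counter per sample
    for (std::size_t i = 0; i < gt.size() / 2; i++) {
        // The comparison yields 0 or 1, so a sample gains at most one count
        // per variant, and only when its two allele codes differ by exactly 1.
        hetsum[i] += std::abs(gt[2 * i] - gt[2 * i + 1]) == 1;
    }
}
```

For a variant with allele pairs (0,1), (1,1) and (0,2), this adds 1, 0 and 0 to the three samples. The earlier expression in this diff, which summed the absolute difference directly, would have added 2 for the (0,2) pair.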
17 changes: 10 additions & 7 deletions scripts/test-vcfppR.R
@@ -2,17 +2,20 @@

library(vcfppR)

run <- 1
args <- commandArgs(trailingOnly = TRUE)
vcffile <- args[1]
run <- as.integer(args[2])

system.time(vcf <- tableGT(vcffile, "chr21"))
if(run == 1) {
print(paste("run", run))
print(system.time(res1 <- heterozygosity(vcffile)))
q(save="no")
}

if(run == 2) {
res <- sapply(vcf[["gt"]], function(a) {
n=length(a)
abs(a[seq(1,n,2)]-a[seq(2,n,2)])
})
hets<-rowSums(res)
print(paste("run", run))
print(system.time(vcf <- vcftable(vcffile, vartype = "snps")))
print(system.time(res2 <- colSums(vcf[["gt"]]==1, na.rm = TRUE)))
q(save="no")
}
