add BcfHeader::getFormatType

Zilong-Li · Oct 11, 2023 · 98519d2
1 parent caafbd7
Showing 10 changed files with 131 additions and 173 deletions.
diff --git a/doc/paper.org b/doc/paper.org
@@ -20,15 +20,15 @@ Email: zilong.dk@gamil.com.}}
   C++ API of htslib in a single file, providing an intuitive interface
   to manipulate VCF/BCF files rapidly and safely, as well as being
   portable. Additionally, this work introduces the vcfppR package to
-  demonstrate the usage of vcfpp API in writing R script and packages
-  esaily for analyzing as well as visualizing genomic variations in
-  R. In the Benchmarking, the R script using vcfpp API shows faster
-  speed than the python script using cyvcf2 API when streaming variant
-  analysis. And in a two-step setting, the vcfppR package shows faster
-  speed of both loading VCF contents and processing genotypes than the
-  vcfR package. Finally, some useful command line tools using vcfpp
-  are available to demonstrate the easy-to-use of vcfpp in writing a
-  C++ program.
+  demonstrate the usage of developing R packages using vcfpp
+  seamlessly and easily for analyzing and visualizing genomic
+  variations in R. In the Benchmarking, the vcfppR shows the same
+  performance as the C++ version, and shows faster speed than cyvcf2
+  when streaming variant analysis. Also, in a two-step setting, the
+  vcfppR shows faster speed of both loading VCF contents and
+  processing genotypes than the vcfR package. Finally, some useful
+  command line tools using vcfpp are available to demonstrate the
+  easy-to-use of vcfpp in writing a C++ program.
   \end{abstract}
 \end{titlepage}
 #+end_export
@@ -201,38 +201,38 @@ In addition to simplicity and portability, I showcase the
 performance of vcfpp and vcfppR. In the benchmarking, I performed
 the same task as the cyvcf2 paper, that is counting heterozygous
 genotypes per sample using code in Listing 1. As shown in Table
-[[tb:counthets]], the "test-cyvcf2.py" is $1.3\times$ slower than the
-"test-vcfpp-1.R", and both use little RAM in a streaming
-strategy. As R packages usually load data into tables first and
-perform analysis later, I also write a function to load whole VCF
-content in R for such two-step comparison although it is not
-preferable for this task. Notably, the "test-vcfpp-2.R" script is
-only $1.9\times$ slower compared to $70\times$, $85\times$ slower
-for "test-vcfR.R" and "test-fread.R" respectively. This is because
-genotypes made by both vcfR and data.table::fread are characters,
-which are very slow to parse in R despite a faster string library
-was used. In contrast, with vcfpp, integer vectors of genotypes
-can be returned to R directly for computation. Moreover, regarding
-the best practice of data analysis in R, we usually want to
-inspect part of the data table first to make further
-decisions. And vcfpp has the full functionalities of htslib that
-is supporting reading BCF, selecting samples and regions. A rapid
-query of VCF content can be achieved by passing a region
-parameter.
+[[tb:counthets]], the vcfppR has the same performance as the vcfpp C++
+API, while the cyvcf2 script is $1.4\times$ slower than the vcfppR
+script. In the streaming setting, all three scripts use little RAM
+for only loading one variant into memory. However, R packages like
+vcfR and data.table usually load all tabular data into memory
+first and perform analysis later. Additionally, I develop /vcftable/
+function in vcfppR to load whole VCF content in R for such
+two-step comparison. Notably, the vcfppR is only $1.8\times$
+slower compared to $70\times$, $85\times$ slower for vcfR and
+data.table respectively. This is because genotype values returned
+by both vcfR and data.table::fread are characters, which are very
+slow to parse. In contrast, with vcfppR, integer matrix of
+genotypes can be returned to R directly for fast
+computation. Importantly, vcfpp and vcfppR offer the users the
+full functionalities of htslib, such as supporting BCF, selecting
+samples, regions and variant types. A rapid query of VCF content
+in vcfppR can be achieved by passing a region parameter to
+/vcftable/.
 
 #+caption: Performance of counting heterozygous genotypes per sample in the 1000 Genome Project for chromosome 21. (*) used by loading data in two-step strategy.
 #+name: tb:counthets
 #+attr_latex: :align lllllll :placement [H]
-|-------------+------------+-------+----------+-----------|
-| API/Package | Time (s)   | Ratio | RAM (Gb) | Strategy  |
-|-------------+------------+-------+----------+-----------|
-| vcfpp       | 118        |   1.0 |    0.006 | streaming |
-| vcfppR      | 119        |   1.0 |     0.07 | streaming |
-| cyvcf2      | 159        |   1.3 |     0.04 | streaming |
-| vcfppR      | 164*+196   |   1.8 |       73 | two-step  |
-| vcfR        | 651*+1382  |   9.9 |      105 | two-step  |
-| data.table  | 272*+10275 |    85 |       77 | two-step  |
-|-------------+------------+-------+----------+-----------|
+|-------------+------------+-------+----------+---------+-----------|
+| API/Package | Time (s)   | Ratio | RAM (Gb) | CPU (%) | Strategy  |
+|-------------+------------+-------+----------+---------+-----------|
+| vcfpp       | 118        |   1.0 |    0.006 |      99 | streaming |
+| vcfppR      | 118        |   1.0 |    0.076 |     101 | streaming |
+| cyvcf2      | 165        |   1.4 |    0.038 |     101 | streaming |
+| vcfppR      | 205*+10    |   1.8 |     64.7 |     100 | two-step  |
+| vcfR        | 678*+1147  |  15.5 |     97.5 |     100 | two-step  |
+| data.table  | 263*+11243 |  97.5 |     77.3 |     200 | two-step  |
+|-------------+------------+-------+----------+---------+-----------|
 
 * Discussion
 

diff --git a/scripts/bench.sh b/scripts/bench.sh
@@ -1,28 +1,29 @@
 #!/bin/bash
 
-# benchmarking performance of vcfpp-r against vcfR and fread
+# benchmarking performance of vcfppR against cyvcf2, vcfR and fread
 
 # use gnu time to record the RAM and TIME
 gtime="/usr/bin/time"
 
 # download vcf file from 1000 genome project
+vcffile="1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"
+if [ ! -f $vcffile ];then
+   wget -N -r --no-parent --no-directories https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz
+   wget -N -r --no-parent --no-directories https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz.tbi
+fi
 
-wget -N -r --no-parent --no-directories https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz
-wget -N -r --no-parent --no-directories https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz.tbi
-
+
 # first compile a C++ script
 x86_64-conda-linux-gnu-c++ test-vcfpp.cpp -o test-vcfpp -std=c++11 -O3 -Wall -I/home/rlk420/mambaforge/envs/R/include -lhts
 
-# big vcf file with very long INFO field
-vcffile="1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"
 
 # to be fair, make sure cyvcf2 and vcfpp are compiled against same version of htslib.
 # do the benchmarking in same conda env
 
 $gtime -vvv Rscript test-fread.R $vcffile &> test-fread.llog.1 &
 $gtime -vvv Rscript test-vcfR.R $vcffile 2 &> test-vcfR.llog.2 &
-$gtime -vvv Rscript test-vcfpp-1.R $vcffile &> test-vcfpp.llog.1 &
-$gtime -vvv Rscript test-vcfpp-2.R $vcffile &> test-vcfpp.llog.2 &
+$gtime -vvv Rscript test-vcfppR.R $vcffile 1 &> test-vcfppR.llog.1 &
+$gtime -vvv Rscript test-vcfppR.R $vcffile 2 &> test-vcfppR.llog.2 &
 $gtime -vvv python test-cyvcf2.py $vcffile &> test-cyvcf2.llog &
 $gtime -vvv ./test-vcfpp -i $vcffile &> test-vcfpp.llog &
 wait

diff --git a/scripts/test-vcfR.R b/scripts/test-vcfR.R
@@ -1,14 +1,11 @@
-library(stringr)
 library(vcfR)
 
 run <- 1
 args <- commandArgs(trailingOnly = TRUE)
 vcffile <- args[1]
-run <- as.integer(args[2])
 
-system.time(vcf <- read.vcfR(vcffile))
+print(system.time(vcf <- read.vcfR(vcffile)))
+
+gt <- extract.gt(vcf[is.biallelic(vcf),], element = 'GT', as.numeric = TRUE)
+print(system.time(hets <- colSums(gt==1, na.rm = TRUE)))
 
-if(run == 2) {
-  gt <- extract.gt(vcf[is.biallelic(vcf),], element = 'GT', as.numeric = TRUE)
-  system.time(hets <- apply(gt, 2, function(g) sum(g==1)))
-}
diff --git a/scripts/test-vcfpp-1.R b/scripts/test-vcfpp-1.R
diff --git a/scripts/test-vcfpp-2.R b/scripts/test-vcfpp-2.R
diff --git a/scripts/test-vcfpp-r.cpp b/scripts/test-vcfpp-r.cpp
@@ -4,56 +4,18 @@ using namespace Rcpp;
 using namespace std;
 
 // [[Rcpp::export]]
-int getRegionIndex(const string & vcffile, const std::string & region)
+IntegerVector heterozygosity(std::string vcffile, std::string region = "", std::string samples = "-")
 {
-    vcfpp::BcfReader vcf(vcffile);
-    return vcf.getRegionIndex(region);
-}
-
-// [[Rcpp::export]]
-List readtable(const string & vcffile, const std::string & region)
-{
-    vcfpp::BcfReader vcf(vcffile);
-    vcfpp::BcfRecord var(vcf.header);
-    int nsnps = vcf.getRegionIndex(region);
-    CharacterVector chr(nsnps), ref(nsnps), alt(nsnps), id(nsnps), filter(nsnps), info(nsnps);
-    IntegerVector pos(nsnps);
-    NumericVector qual(nsnps);
-    vector<vector<bool>> GT(nsnps);
-    vector<bool> gt;
-    for(int i = 0; i < nsnps; i++)
-    {
-        vcf.getNextVariant(var);
-        var.getGenotypes(gt);
-        GT[i] = gt;
-        pos(i) = var.POS();
-        qual(i) = var.QUAL();
-        chr(i) = var.CHROM();
-        id(i) = var.ID();
-        ref(i) = var.REF();
-        alt(i) = var.ALT();
-        filter(i) = var.FILTER();
-        info(i) = var.INFO();
-    }
-    return List::create(Named("chr") = chr, Named("pos") = pos, Named("id") = id, Named("ref") = ref,
-                        Named("alt") = alt, Named("qual") = qual, Named("filter") = filter,
-                        Named("info") = info, Named("gt") = GT);
-}
-
-// [[Rcpp::export]]
-IntegerVector hetrate(const string & vcffile)
-{
-    vcfpp::BcfReader vcf(vcffile);
+    vcfpp::BcfReader vcf(vcffile, region, samples);
     vcfpp::BcfRecord var(vcf.header);
-    vector<char> gt;
-    vector<int> hetsum(vcf.nsamples, 0); // store the het counts
-    while(vcf.getNextVariant(var))
-    {
+    vector<int> gt;
+    vector<int> hetsum(vcf.nsamples, 0);  // store the het counts
+    while (vcf.getNextVariant(var)) {
         var.getGenotypes(gt);
-        // analyze SNP variant with no genotype missingness
-        if(!var.isSNP() || !var.isNoneMissing()) continue;
-        assert(var.ploidy() == 2); // make sure it is diploidy
-        for(int i = 0; i < gt.size() / 2; i++) hetsum[i] += abs(gt[2 * i + 0] - gt[2 * i + 1]);
+        if (!var.isSNP()) continue; // analyze SNPs only
+        assert(var.ploidy() == 2);  // make sure it is diploidy
+        for (int i = 0; i < gt.size() / 2; i++)
+            hetsum[i] += abs(gt[2 * i + 0] - gt[2 * i + 1]) == 1;
     }
     return wrap(hetsum);
 }
diff --git a/scripts/test-vcfpp.R b/scripts/test-vcfpp.R
@@ -2,13 +2,8 @@ library(Rcpp)
 Sys.setenv("PKG_LIBS"="-I/home/rlk420/mambaforge/envs/R/include -lhts")
 sourceCpp("test-vcfpp-r.cpp", verbose=TRUE, rebuild=TRUE)
 
-run <- 1
 args <- commandArgs(trailingOnly = TRUE)
 vcffile <- args[1]
-run <- as.integer(args[2])
 
-system.time(vcf <- readtable(vcffile, "chr21"))
 
-if(run == 2) {
-  system.time(hets <- hetrate(vcffile))
-}
+system.time(hets <- heterozygosity(vcffile))
diff --git a/scripts/test-vcfpp.cpp b/scripts/test-vcfpp.cpp
@@ -1,6 +1,6 @@
-// -*- compile-command: "x86_64-conda-linux-gnu-c++ test-vcfpp.cpp -o test-vcfpp -std=c++11 -O3 -Wall -I/home/rlk420/mambaforge/envs/R/include -lhts" -*-
+// -*- compile-command: "x86_64-conda-linux-gnu-c++ test-vcfpp.cpp -o test-vcfpp -std=c++11 -O3 -Wall -I../ -I/home/rlk420/mambaforge/envs/R/include -lhts" -*-
 
-#include <vcfpp.h>
+#include "vcfpp.h"
 using namespace std;
 using namespace vcfpp;
 
@@ -37,16 +37,17 @@ int main(int argc, char * argv[])
     // ========= core calculation part ===========================================
     BcfReader vcf(vcffile, region, samples);
     BcfRecord var(vcf.header); // construct a variant record
-    vector<char> gt; // genotype can be bool, char or int type
+    vector<int> gt; // genotype can be bool, char or int type
     vector<int> hetsum(vcf.nsamples, 0);
     while(vcf.getNextVariant(var))
     {
         var.getGenotypes(gt);
         // analyze SNP variant with no genotype missingness
-        if(!var.isSNP() || !var.isNoneMissing()) continue;
-        assert(var.ploidy() == 2); // make sure it is diploidy
-        for(int i = 0; i < gt.size() / 2; i++) hetsum[i] += abs(gt[2 * i + 0] - gt[2 * i + 1]);
+        if (!var.isSNP()) continue; // analyze SNPs only
+        assert(var.ploidy() == 2);  // make sure it is diploidy
+        for (int i = 0; i < gt.size() / 2; i++)
+            hetsum[i] += abs(gt[2 * i + 0] - gt[2 * i + 1]) == 1;
     }
-    for(auto i : hetsum) cout << i << endl;
+    // for(auto i : hetsum) cout << i << endl;
     return 0;
 }
diff --git a/scripts/test-vcfppR.R b/scripts/test-vcfppR.R
@@ -2,17 +2,20 @@
 
 library(vcfppR)
 
-run <- 1
 args <- commandArgs(trailingOnly = TRUE)
 vcffile <- args[1]
 run <- as.integer(args[2])
 
-system.time(vcf <- tableGT(vcffile, "chr21"))
+if(run == 1) {
+  print(paste("run", run))
+  print(system.time(res1 <- heterozygosity(vcffile)))
+  q(save="no")
+}
 
 if(run == 2) {
-  res <- sapply(vcf[["gt"]], function(a) {
-    n=length(a)
-    abs(a[seq(1,n,2)]-a[seq(2,n,2)])
-  })
-  hets<-rowSums(res)
+  print(paste("run", run))
+  print(system.time(vcf <- vcftable(vcffile, vartype = "snps")))
+  print(system.time(res2 <- colSums(vcf[["gt"]]==1, na.rm = TRUE)))
+  q(save="no")
 }
+
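
Note on the counting change: both C++ loops in this commit now add `abs(gt[2 * i + 0] - gt[2 * i + 1]) == 1` to `hetsum[i]`, where the previous code added the absolute difference itself. For allele pairs such as 0|1 the two expressions agree, but for a pair whose indices differ by more than one (for example 0|2) the old expression adds that difference while the new one adds 0. The sketch below reproduces only this arithmetic on plain vectors. It assumes a flat layout of two allele indices per sample, as the loops above index it; `addHets` is a hypothetical helper for illustration, not part of vcfpp, and no htslib is required.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical helper (not part of vcfpp): accumulate heterozygous calls
// for one diploid variant. gt stores two allele indices per sample,
// i.e. sample i owns gt[2 * i] and gt[2 * i + 1].
void addHets(const std::vector<int> & gt, std::vector<int> & hetsum)
{
    // assumed precondition: exactly two alleles per sample (diploid)
    assert(gt.size() == 2 * hetsum.size());
    for (std::size_t i = 0; i < hetsum.size(); i++) {
        // the comparison yields a bool, so each sample gains 0 or 1 per variant
        hetsum[i] += std::abs(gt[2 * i] - gt[2 * i + 1]) == 1;
    }
}
```

Called with `hetsum` initialised to three zeros and the two variants `{0,1, 1,1, 0,0}` and `{1,0, 0,1, 0,2}`, the per-sample counts become 2, 1 and 0: the 0|2 pair in the last sample is not counted, whereas the earlier expression would have added 2 for it.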