A curated list of bioinformatics benchmarking papers and resources.
The credit for this format goes to Sean Davis for his awesome-single-cell repository and Ming Tang for his ChIP-seq-analysis repository.
If you have a benchmarking study that is not yet included on this list, please make a Pull Request.
- Awesome Bioinformatics Benchmarks
- Benchmarking Theory
- Tool/Method Sections
- DNase, ATAC, and ChIP-seq
- RNA-seq
- CRISPR Screens
- DNA Methylation
- Variant Callers
- Single Cell
- scRNA Transformation/Normalization
- Gene Signature Scoring
- scRNA Sequencing Protocols
- scRNA Analysis Pipelines
- scRNA Imputation Methods
- scRNA Differential Gene Expression
- Trajectory Inference
- Gene Regulatory Network Inference
- Integration/Batch Correction
- Dimensionality Reduction
- Cell Annotation/Inference
- Variant Calling
- ATAC-seq
- Statistics
- Microbiome
- Hi-C/Hi-ChIP
- Contributors
- Papers must be objective comparisons of 3 or more tools/methods.
- Papers must be awesome. This list isn't meant to chronicle every benchmarking study ever performed, only those that are particularly expansive, well done, and/or provide unique insights.
- Papers should generally not be from authors showing why their tool/method is better than others.
- Benchmarking data should be publicly available or simulation code/methods must be well-documented and reproducible.
Additional guidelines/rules may be added as necessary.
Please include the following information when adding papers.
Title:
Authors:
Journal Info:
Description:
Tools/methods compared:
Recommendation(s):
Additional links (optional):
Papers within each section should be ordered by publication date, with more recent papers listed first.
Title: Essential guidelines for computational method benchmarking
Authors: Lukas Weber, et al.
Journal Info: Genome Biology, June 2019
Description: This paper presents 10 main guidelines for conducting and writing benchmark papers covering necessary data, methods and metric choices, reproducible research, and documentation.
Title: Systematic benchmarking of omics computational tools
Authors: Sergei Mangul, et al.
Journal Info: Nature Communications, March 2019
Description: A survey of 25 benchmarking studies published between 2011 and 2017 in terms of design, methods, and information types. Discusses overfitting, sharing, incentives.
Additional sections/sub-sections can be added as needed.
Title: Features that define the best ChIP-seq peak calling algorithms
Authors: Reuben Thomas, et al.
Journal Info: Briefings in Bioinformatics, May 2017
Description: This paper compared six peak calling methods on 300 simulated and three real ChIP-seq data sets across a range of significance values. Methods were scored by sensitivity, precision, and F-score.
Tools/methods compared: GEM, MACS2, MUSIC, BCP, Threshold-based method (TM), ZINBA.
Recommendation(s): Varies. BCP and MACS2 performed the best across all metrics on the simulated data. For Tbx5 ChIP-seq, GEM performed the best, with BCP also scoring highly. For histone H3K36me3 and H3K4me3 data, all methods performed relatively comparably with the exception of ZINBA, which the authors could not get to run properly. MUSIC and BCP had a slight edge over the others for the histone data.
More generally, they found that methods that use variable window sizes and a Poisson test to rank peaks are more powerful than those that use a Binomial test.
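As a rough illustration of the scoring used in benchmarks like this one, the sketch below computes sensitivity, precision, and F-score from peak counts. The function name `peak_call_scores` and the example counts are hypothetical, and the rule for matching a called peak to a true peak is assumed to be applied upstream; the paper's exact matching criteria are not reproduced here.

```python
def peak_call_scores(true_positives: int, false_positives: int, false_negatives: int) -> dict:
    """Return sensitivity (recall), precision, and F-score from peak counts.

    Hypothetical helper: counts are assumed to come from an upstream step
    that has already matched called peaks against the known (true) peaks.
    """
    called = true_positives + false_positives   # all peaks the caller reported
    actual = true_positives + false_negatives   # all peaks that truly exist
    sensitivity = true_positives / actual if actual else 0.0
    precision = true_positives / called if called else 0.0
    denom = precision + sensitivity
    # F-score as the harmonic mean of precision and sensitivity
    f_score = 2 * precision * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "precision": precision, "f_score": f_score}


# Toy example (made-up counts, not taken from the paper)
scores = peak_call_scores(true_positives=80, false_positives=20, false_negatives=20)
print(scores)
```

Computing all three from the same counts makes it easy to compare callers at several significance thresholds, which is how the study reports its results.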
Title: A Comparison of Peak Callers Used for DNase-Seq Data
Authors: Hashem Koohy, et al.
Journal Info: PLoS ONE, May 2014
Description: This paper compares the specificity and sensitivity of four peak callers on DNase-seq data from two publications covering three cell types, using ENCODE data for the same cell types as a benchmark. The authors tested multiple parameters for each caller to determine its best settings for DNase-seq data.
Tools/methods compared: F-seq, Hotspot, MACS2, ZINBA.
Recommendation(s): F-seq was the most sensitive, though MACS2 and Hotspot both performed competitively as well. ZINBA was the least performant by a massive margin, requiring much more time to run, and was also the least sensitive.
Authors: Jake J. Reske, et al.
Journal Info: Epigenetics & Chromatin, April 2020
Description: The study examines how different ATAC-seq data normalization methods impact the analysis and interpretation of differential chromatin accessibility (DA). Using both in vivo and published yeast ATAC-seq datasets, the authors demonstrate that the choice of normalization method can significantly alter the identification of differentially accessible regions in the genome.
Tools/methods compared: MACS2, DiffBind, csaw, voom, DESeq2, edgeR, limma.
Recommendation(s): The authors recommend systematic comparison of multiple normalization methods before proceeding with differential accessibility analysis. They stress the importance of understanding the potential biases and assumptions of each method. They also propose a generalized workflow for differential ATAC-seq data analysis, emphasizing the need for standardizing molecular complexity before quantification. This workflow is based on the standardized ENCODE pipeline with modifications for ATAC-seq.
The authors' generalized workflow for differential accessibility analysis can be found on GitHub.
Additional links: For ATAC-seq data analysis, see also the paper "From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis".
Title: Alignment and mapping methodology influence transcript abundance estimation
Authors: Avi Srivastava* & Laraib Malik*, et al.
Journal Info: bioRxiv, October 2019
Description: This paper compares the influence of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential expression analysis.
Tools/methods compared: Bowtie2, STAR, quasi-mapping, Selective Alignment, RSEM, Salmon.
Recommendation(s): The choice of approach depends on the time-accuracy tradeoff the user is willing to make. In terms of speed, quasi-mapping is the fastest approach, followed by Selective Alignment (SA), then STAR; Bowtie2 was considerably slower than all three. In terms of accuracy, SA yielded the best results, followed by alignment to the genome (with subsequent transcriptomic projection) using STAR and SA (using carefully selected decoy sequences). Bowtie2 generally performed similarly to SA but, without the benefit of decoy sequences, seemed to admit more spurious mappings. Finally, lightweight mapping of sequencing reads to the transcriptome showed the lowest overall consistency with quantifications derived from the oracle alignments. Note: Both Selective Alignment and quasi-mapping are part of the Salmon codebase.
Authors: Shanrong Zhao* & Baohong Zhang
Journal Info: BMC Genomics, February 2015
Description: This paper compares the effect of different gene annotations in the context of RNA-seq mapping and gene quantification using data from the Human Body Map 2.0 Project.
Tools/methods compared: Ensembl, RefSeq, UCSC.
Recommendation(s): Though the authors warn there is no "best" set of annotations to use, they do emphasize the impact that annotation choice can have on downstream analyses such as differential gene expression, as genes with identical gene symbols can map to completely different regions in different annotations. Though Ensembl annotations are much more comprehensive than the others, the authors recommend a less complex genome annotation, such as the RefSeq annotation, if the RNA-seq is being used as a replacement for microarrays. Conversely, the Ensembl annotations are preferable if non-coding RNAs are of particular interest.
Title: Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions
Authors: Ciaran Evans, et al.
Journal Info: Briefings in Bioinformatics, February 2017
Description: This study underscores the importance of selecting appropriate normalization methods in RNA-Seq studies based on their underlying assumptions. It highlights how normalization choices impact gene behavior analysis under various biological conditions. The paper emphasizes the critical role of assumptions in normalization methods and their effects on the accuracy of downstream analyses, such as detecting differential expression. The study utilized simulations to evaluate normalization methods under varying conditions of mRNA/cell levels and asymmetry in gene expression, assessing error in fold change estimates and empirical error rates in detecting differential expression.
Tools/methods compared: The paper provides a general discussion on RNA-Seq normalization methods, focusing on the types (e.g., normalization by library size, normalization by distribution, normalization by testing) rather than specific tools. The simulations used various methods, including DEGES, DESeq, Oracle, PoissonSeq, TMM, and Total Count.
Recommendation(s): The authors suggest selecting normalization methods based on the specific conditions of the biological experiment. Library size normalization is effective when total mRNA/cell is consistent across conditions, whereas normalization by distribution/testing is better suited for conditions with symmetry in differential expression, regardless of mRNA/cell differences. In scenarios with both asymmetry and varying mRNA/cell levels, both methods are expected to perform poorly.
Additional links: The authors' simulation code is available on GitHub.
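A minimal sketch of library size (Total Count) normalization, the simplest family discussed in this entry, assuming counts are held in a plain dictionary mapping sample name to per-gene counts. The function name `total_count_normalize` and the toy data are hypothetical, and this is not the authors' simulation code. Each sample is rescaled so that all samples share the mean library size.

```python
def total_count_normalize(samples: dict) -> dict:
    """Scale each sample's counts so every sample has the same total count.

    Hypothetical helper. Assumes every sample has a non-zero library size.
    """
    lib_sizes = {name: sum(counts) for name, counts in samples.items()}
    mean_lib = sum(lib_sizes.values()) / len(lib_sizes)
    return {
        name: [c * mean_lib / lib_sizes[name] for c in counts]
        for name, counts in samples.items()
    }


# Toy example: "treated" was sequenced twice as deeply as "control"
samples = {"control": [10, 30, 60], "treated": [20, 60, 120]}
normalized = total_count_normalize(samples)
print(normalized)
```

In this toy case the two samples become identical after scaling, which matches the assumption the entry highlights: library size normalization is only appropriate when total mRNA per cell is the same across conditions.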
Title: Statistical models for RNA-seq data derived from a two-condition 48-replicate experiment
Authors: Marek Gierliński*, Christian Cole*, Pietá Schofield*, Nicholas J. Schurch*, et al.
Journal Info: Bioinformatics, November 2015
Description: This paper compares the effect of normal, log-normal, and negative binomial distribution assumptions on RNA-seq gene read-counts from 48 RNA-seq replicates.
Tools/methods compared: normal, log-normal, negative binomial.
Recommendation(s): Assuming a normal distribution leads to a large number of false positives during differential gene expression. A log-normal distribution model works well unless a sample contains zero counts. Use tools that assume a negative binomial distribution (edgeR, DESeq, DESeq2, etc.).
Authors: Marie-Agnès Dillies, et al.
Journal Info: Briefings in Bioinformatics, November 2013
Description: This paper compared seven RNA-seq normalization methods in the context of differential expression analysis on four real datasets and thousands of simulations.
Tools/methods compared: Total Count (TC), Upper Quartile (UQ), Median (Med), DESeq, edgeR, Quantile (Q), RPKM.
Recommendation(s): The authors recommend DESeq (now deprecated and replaced by DESeq2) or edgeR, as those methods are robust to the presence of different library sizes and compositions, whereas the (still common) Total Count and RPKM methods are ineffective and should be abandoned.
Authors: Kimon Froussios*, Nick J Schurch*, et al.
Journal Info: Bioinformatics, February 2019
Description: This paper compared nine differential gene expression tools (and their underlying model distributions) in 17 RNA-seq replicates of Arabidopsis thaliana. Handling of inter-replicate variability and false positive fraction were the benchmarking metrics used.
Tools/methods compared: baySeq, DEGseq, DESeq, DESeq2, EBSeq, edgeR, limma, Poisson-Seq, SAM-Seq.
Recommendation(s): Six of the tools that utilize negative binomial or log-normal distributions (edgeR, DESeq2, DESeq (now deprecated and replaced by DESeq2), baySeq, limma, and EBSeq) control their identification of false positives well.
Additional links: The authors released their benchmarking scripts on GitHub.
Authors: Nicholas J. Schurch*, Pietá Schofield*, Marek Gierliński*, Christian Cole*, Alexander Sherstnev*, et al.
Journal Info: RNA, March 2016
Description: This paper compared 11 differential expression tools on varying numbers of RNA-seq biological replicates (3-42) between two conditions. Each tool was compared against itself as a standard (using all replicates) and against the other tools.
Tools/methods compared: baySeq, cuffdiff, DEGSeq, DESeq, DESeq2, EBSeq, edgeR (exact and glm modes), limma, NOISeq, PoissonSeq, SAMSeq.
Recommendation(s): With fewer than 12 biological replicates, edgeR and DESeq2 were the top performers. As replicates increased, DESeq (now deprecated and replaced by DESeq2) did a better job minimizing false positives than other tools.
Additionally, the authors recommend using at least six biological replicates, rising to at least 12 if users want to identify all significantly differentially expressed genes no matter the fold change magnitude.
Title: Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data
Authors: Franck Rapaport, et al.
Journal Info: Genome Biology, September 2013
Description: This paper compared six differential expression methods on three cell line data sets from ENCODE (GM12878, H1-hESC, and MCF-7) and two samples from the SEQC study, which had a large fraction of differentially expressed genes validated by qRT-PCR. Specificity, sensitivity, and false positive rate were the main benchmarking metrics used.
Tools/methods compared: Cuffdiff, edgeR, DESeq, PoissonSeq, baySeq, limma.
Recommendation(s): Though no method emerged as favorable in all conditions, those that used negative binomial modeling (DESeq (now deprecated and replaced by DESeq2), edgeR, baySeq) generally performed best.
The more replicates, the better. Replicate numbers (both biological and technical) have a greater impact on differential detection accuracy than sequencing depth.
Title: Toward a gold standard for benchmarking gene set enrichment analysis
Authors: Ludwig Geistlinger, et al.
Journal Info: Briefings in Bioinformatics, February 2020
Description: This paper developed a Bioconductor package for reproducible GSEA benchmarking, and used the package to assess 10 widely used enrichment methods with regard to runtime and applicability to RNA-seq data, fraction of enriched gene sets depending on the null hypothesis tested, and recovery of biologically relevant gene sets. The framework can be extended to additional methods, datasets, and benchmark criteria, and should serve as a gold standard for future GSEA benchmarking studies.
Tools/methods compared: The paper quantitatively assesses the performance of 10 enrichment methods (ORA, GSEA, GSA, PADOG, SAFE, CAMERA, ROAST, GSVA, GLOBALTEST, SAMGS). The paper also compares 10 frequently used enrichment tools implementing these methods (DAVID, ENRICHR, CLUSTER-PROFILER, GOSTATS, WEBGESTALT, G:PROFILER, GENETRAIL, GORILLA, TOPPGENE, PANTHER).
Recommendation(s):
- ORA for the exploratory analysis of simple gene lists,
- pre-ranked GSEA or pre-ranked CAMERA for the analysis of pre-ranked gene lists accompanied by gene scores such as fold changes,
- For enrichment analysis on the full expression matrix (genes x samples), the paper recommends providing normalized log2 intensities for microarray data and logTPMs (or logRPKMs/logFPKMs) for RNA-seq data; when given raw read counts, the paper recommends applying a variance-stabilizing transformation such as voom to arrive at library-size normalized logCPMs,
- ROAST (sample group comparisons) or GSVA (single sample) if the question of interest is to test for association of any gene in the set with the phenotype (self-contained null hypothesis),
- PADOG (simple experimental designs) or SAFE (extended experimental designs) if the question of interest is to test for excess of differential expression in a gene set relative to genes outside the set (competitive null hypothesis).
Additional links: http://bioconductor.org/packages/GSEABenchmarkeR
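To make the "library-size normalized logCPMs" mentioned in this entry concrete, here is a minimal sketch of the log-CPM transform for a single sample. The offsets (0.5 added to each count, 1 added to the library size) follow the transform that limma's voom applies before estimating precision weights; the weighting step itself is not shown, the raw library size is used without normalization factors, and the function name `log_cpm` is hypothetical.

```python
import math


def log_cpm(counts, prior_count=0.5):
    """Return log2 counts-per-million for one sample's raw read counts.

    Hypothetical helper. The small offsets keep zero counts finite after
    the log transform.
    """
    lib_size = sum(counts)
    return [
        math.log2((c + prior_count) / (lib_size + 1.0) * 1e6)
        for c in counts
    ]


# Toy example: four genes in one sample (made-up counts)
values = log_cpm([0, 10, 90, 900])
print(values)
```

This only illustrates the scaling idea; for real analyses the Bioconductor implementations named in the paper are the appropriate tools.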
Title: A survey of software for genome-wide discovery of differential splicing in RNA-Seq data
Authors: Joan E Hooper
Journal Info: Human Genomics, January 2014
Description: This paper compares the methodologies, advantages, and disadvantages of eight differential splicing analysis tools, detailing use-cases and features for each.
Tools/methods compared: Cuffdiff2, MISO, DEXSeq, DSGseq, MATS, DiffSplice, Splicing compass, AltAnalyze.
Recommendation(s): This is a true breakdown of each tool's advantages and disadvantages.
The author makes no recommendation because performance depends on experimental setup, data type (e.g. AltAnalyze works best on junction + exon microarrays), and user objectives.
Table 1 provides a good comparison of the features and methodology of each method.
Authors: Gabriela A Merino
Journal Info: Briefings in Bioinformatics, March 2019
Description: This paper compares the nine most commonly used workflows to detect differential isoform expression and splicing.
Tools/methods compared: EBSeq, DESeq2, NOISeq, Limma, LimmaDS, DEXSeq, Cufflinks, CufflinksDS, SplicingCompass.
Recommendation(s): DESeq2, Limma, and NOISeq for differential isoform expression (DIE) analysis, and DEXSeq and LimmaDS for differential splicing (DS) testing.
Authors: Katharina E. Hayer, et al.
Journal Info: Bioinformatics, December 2015
Description: This paper compared both guided and de novo transcript reconstruction algorithms using simulated and in vitro transcription (IVT) generated libraries. Precision/recall metrics were obtained by comparing the reconstructed transcripts to their true models.
Tools/methods compared: Cufflinks, CLASS, FlipFlop, IReckon, IsoLasso, MiTie, StringTie, StringTie-SR, AUGUSTUS, Trinity, SOAP, Trans-ABySS.
Recommendation(s): All tools measured produced less than ideal precision-recall (both <90%) when using imperfect simulated or IVT data and genes producing multiple isoforms. Cufflinks and StringTie are among the best performers.
Authors: Martin Holzer & Manja Marz
Journal Info: GigaScience, May 2019
Description: This paper compares 10 de novo assembly tools across 9 RNA-seq datasets spanning multiple species and kingdoms for 20 biological-based and reference-free metrics.
Tools/methods compared: Trinity, Oases, Trans-ABySS, SOAPdenovo-Trans, IDBA-Tran, Bridger, BinPacker, Shannon, SPAdes-sc, SPAdes-rna.
Recommendation(s): The authors found that no tool's performance was dominant for all data sets, but Trinity, SPAdes, and Trans-ABySS were typically among the best. For assembly evaluation, the authors recommend a hybrid approach combining both biological-based (BUSCO, number of full length transcripts) and reference-free metrics (e.g. TransRate, DETONATE).
Additional links: The authors provide a comprehensive electronic supplement website containing all metrics and assembly commands in addition to many supplementary figures.
Title: Comprehensive evaluation of transcriptome-based cell-type quantification methods for immuno-oncology
Authors: Markus List*, Tatsiana Aneichyk*, et al.
Journal Info: Bioinformatics, July 2019
Description: This paper benchmarks and compares seven methods for computational deconvolution of cell-type abundance in bulk RNA-seq samples. Each method was tested on both simulated and true bulk RNA-seq samples validated by FACS.
Tools/methods compared: quanTIseq, TIMER, CIBERSORT, CIBERSORT abs. mode, MCPCounter, xCell, EPIC.
Recommendation(s): Varies. In general, the authors recommend EPIC and quanTIseq due to their overall robustness and absolute (rather than relative) scoring, though xCell is recommended for binary presence/absence of cell types and MCPcounter was their recommended relative scoring method.
Additional links: The authors created an R package called immunedeconv for easy installation and use of all these methods. For developers, they have made their benchmarking pipeline available so that others can reproduce/extend it to test their own tools/methods.
Title: Comprehensive benchmarking of computational deconvolution of transcriptomics data
Authors: Francisco Avila Cobos, et al.
Journal Info: bioRxiv, January 2020
Description: This paper compared the effects of transformation, scaling/normalization, marker selection, cell type composition, and deconvolution methods on computing cell type proportions in mixed bulk RNA-seq samples. Performance is assessed by means of Pearson correlation and root-mean-square-error (RMSE) between the cell type proportions computed by the different deconvolution methods and known compositions of 1000 pseudo-bulk mixtures from 4 different single cell RNA-seq datasets with varying numbers of cells.
Tools/methods compared:
- Transformation methods: linear (none), log, VST (DESeq2), sqrt.
- Scaling/normalization methods (bulk): column-wise, min-max, z-score, QN, UQ, row-wise, global min-max, global z-score, TPM, TMM, median ratios, LogNormalize.
- Scaling/normalization methods (single cell): RNBR, scran, scater, Linnorm.
- Deconvolution methods (bulk): OLS, NNLS, FARDEEP, RLR, lasso, ridge, DCQ, elastic net, DSA, EPIC, CIBERSORT, dtangle, ssFrobenius, ssKL, DeconRNASeq.
- Deconvolution methods (scRNA-seq reference): BisqueRNA, deconvSeq, DWLS, MuSiC, SCDC.
Recommendation(s): The authors strongly recommend keeping data in the linear scale for deconvolution; avoiding column min-max, column z-score, and QN for normalization/scaling of bulk RNA-seq data; avoiding row-normalization, column min-max, and TPM for normalization/scaling of single cell RNA-seq data; using all possible cell markers; ensuring that all possible cell types are represented in the reference matrix; and using one of the top performing deconvolution methods: OLS, NNLS, RLR, FARDEEP, CIBERSORT, DWLS, MuSiC, or SCDC.
Additional links: The authors provide their benchmarking code on GitHub.
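The two performance measures named in this entry, Pearson correlation and RMSE between estimated and known cell type proportions, can be sketched as follows. This is a plain-Python illustration with made-up proportions for a single mixture; the function names are hypothetical and this is not the authors' benchmarking code.

```python
import math


def rmse(estimated, known):
    """Root-mean-square error between estimated and known proportions."""
    squared = [(e - k) ** 2 for e, k in zip(estimated, known)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(estimated, known):
    """Pearson correlation coefficient between two equal-length lists.

    Assumes neither list is constant (non-zero variance).
    """
    n = len(known)
    mean_e = sum(estimated) / n
    mean_k = sum(known) / n
    cov = sum((e - mean_e) * (k - mean_k) for e, k in zip(estimated, known))
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    var_k = sum((k - mean_k) ** 2 for k in known)
    return cov / math.sqrt(var_e * var_k)


# Toy example: three cell types in one pseudo-bulk mixture (made-up values)
known = [0.50, 0.30, 0.20]
estimated = [0.45, 0.35, 0.20]
print(rmse(estimated, known), pearson(estimated, known))
```

Reporting both is useful because correlation rewards getting the ranking of cell types right, while RMSE penalizes absolute errors in the proportions themselves.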
Title: A benchmark of algorithms for the analysis of pooled CRISPR screens
Authors: Sunil Bodapati*, Timothy P. Daley*, et al.
Journal Info: Genome Biology, March 2020
Description: This study evaluates and compares various algorithms used for analyzing data from pooled CRISPR screens, using a comprehensive simulation framework and real datasets. The algorithms were benchmarked for their ability to accurately identify essential genes in CRISPR knockout (CRISPRko), CRISPR interference (CRISPRi), and CRISPR activation (CRISPRa) screens. Key parameters such as the number of guides per gene, guide binding efficiency, sequencing depth, and control guides were varied to assess the robustness of these algorithms under different conditions.
Tools/methods compared: The study compared several algorithms, including Redundant siRNA Activity (RSA), MAGeCK Robust Ranking Algorithm (RRA), HiTSelect, MAGeCK Maximum Likelihood Estimation (MLE), CRISPhieRmix, CERES, JACKS, and a standard t test. BAGEL was also discussed, but not included in the testing due to its requirement of a priori knowledge.
Recommendation(s):
- MAGeCK RRA is recommended for general use due to its robustness and consistent performance across various conditions.
- CRISPhieRmix is suggested for screens with highly variable guide efficiency, such as CRISPRi or CRISPRa screens.
- For multiple screens across different cell types or lines, MAGeCK MLE, JACKS, and CERES are good options.
- The study also highlights that a simple t test can be effective when suitable control guides are used.
- The number of guides per gene is crucial, with four being the minimum recommended for reliable results. More guides may be necessary for low-signal phenotypes.
- Sequencing depth is less critical than previously thought, with the performance of most algorithms plateauing at 25 reads per guide.
Additional links: The simulation framework and scripts used in the study are available at GitHub - CRISPR Benchmarking Algorithms.
Authors: Li Zhou, et al.
Journal Info: Scientific Reports, 2019
Description: This study focuses on improving strategies for whole genome bisulfite sequencing (WGBS) used in large-scale epidemiological studies. It systematically compares three WGBS library preparation methods (Swift Biosciences Accel-NGS, Illumina TruSeq, and QIAGEN QIAseq) across two Illumina sequencing platforms (NovaSeq and HiSeq X). The study also examines the concordance between WGBS and methylation array data. The study assayed quality metrics (Q20 and Q30 fractions), insert size, adaptor contamination, overlapping bases, read duplication rate, alignment rate, coverage depth, and genome coverage uniformity across the platforms.
Tools/methods compared:
- Library Preparation Methods: Swift Biosciences Accel-NGS, Illumina TruSeq, QIAGEN QIAseq.
- Sequencing Platforms: Illumina NovaSeq, HiSeq X.
Recommendation(s):
- Swift Biosciences Accel-NGS achieved the highest proportion of CpG sites assayed and effective coverage, making it the most recommended method.
- Illumina TruSeq had a high proportion of PCR duplicates, while QIAGEN QIAseq generally underperformed across all quality metrics.
- NovaSeq and HiSeq X platforms showed similar performance, except for a higher read duplication rate in NovaSeq.
- WGBS was less precise than methylation arrays, requiring a minimum coverage of 100x for comparable precision.
- Swift outperformed other methods in terms of uniform genome coverage and CpG site coverage at lower depths.
- No significant differences in nucleotide amplification bias were observed between NovaSeq and HiSeq X.
- For quantifying DNA methylation, WGBS with Swift on HiSeq X outperformed methylation arrays in CpG coverage.
Authors: Miljana Tanić, et al.
Journal Info: Nature Biotechnology, October 2022.
Description: This study benchmarks five commercial Targeted Bisulfite Sequencing (TBS) platforms for analyzing human DNA methylomes at base-pair resolution. The platforms include three hybridization capture-based (Agilent, Roche, and Illumina) and two reduced-representation-based (Diagenode and NuGen). Eleven sample types were analyzed, including cell lines and DNA methylation standards. Two samples were also compared with whole-genome DNA methylation sequencing using Illumina and Oxford Nanopore platforms. Key assessment parameters included workflow complexity, on/off-target performance, coverage accuracy, and reproducibility.
Tools/methods compared:
- Hybridization capture-based platforms: Agilent SureSelect Methyl-Seq, Roche NimbleGen SeqCap EpiGiant, and Illumina TruSeq Methyl Capture EPIC.
- Reduced-representation-based platforms: Diagenode Premium RRBS and NuGen Ovation RRBS Methyl-Seq.
Recommendation(s):
- Each platform exhibited strengths and limitations in coverage, reproducibility, and concordance of DNA methylation levels.
- RRBS platforms (Diagenode and NuGen) showed more uniform CpG coverage compared to hybridization capture-based methods (Agilent, Roche, Illumina).
- Illumina’s platform had the highest on-target capture efficiency (~90.6%), followed by Agilent's (~78.2%) and Roche's (~61.5%).
- Differences in CpG coverage between platforms indicated varying suitability for specific genomic features. For instance, RRBS platforms had higher coverage in CpG islands and shores, while Roche excelled in covering open-sea regions and Illumina in enhancer regions.
- High intra-platform and inter-platform correlation in DNA methylation levels were observed, except for Diagenode, which slightly underperformed.
- All TBS platforms showed a strong correlation to whole-genome bisulfite sequencing (WGBS) and Oxford Nanopore data, suggesting their reliability for DNA methylome analysis.
- The study suggests considering specific genomic feature coverage and desired analysis depth when choosing a TBS platform.
Additional links: The authors provide their analysis code on GitHub.
Title: Systematic benchmarking of tools for CpG methylation detection from nanopore sequencing
Authors: Zaka Wing-Sze Yuen, et al.
Journal Info: Nature Communications, 2021
Description: This study systematically benchmarks six tools for detecting 5-methylcytosine (5mC) from nanopore sequencing using individual reads, controlled methylation mixtures, Cas9-targeted sequencing, and whole-genome bisulfite sequencing (WGBS). The research highlights a trade-off between true positives and false positives among these tools and a general high dispersion in predicting methylation frequencies. Metrics include accuracy (true positive rate & true negative rate) at the individual read level and per controlled mixture. The authors also tested a consensus approach combining the results of pairs of callers.
Tools/methods compared: Nanopolish, Megalodon, DeepSignal, Guppy, Tombo, and DeepMod.
Recommendation(s):
- No single method accurately predicts across all methylation frequency ranges.
- Guppy excels in identifying unmethylated sites but fails at fully methylated sites.
- Nanopolish and Tombo accurately recover fully methylated sites but have high false positives at unmethylated sites.
- Megalodon showed the best overall performance but requires GPU support.
- The consensus approach METEORE, combining predictions from two or more tools (specifically Megalodon and DeepSignal), improved accuracy over individual methods. It balances accuracy and running times, although it also requires a GPU for efficiency.
- On a CPU, the combination of Nanopolish and DeepSignal can match Megalodon's accuracy and be time-competitive.
- Reassessing score cutoffs for individual reads and removing sites with uncertain methylation status can further enhance accuracy.
- These methods showed good consistency with WGBS data, suggesting potential for sensitive diagnostic and forensic tests without high coverage.
- The study also notes that the highest discrepancy with WGBS occurred at CG sites in AT-rich sequences, particularly for DeepMod and DeepSignal.
Additional links: The authors provide their method for consensus calling (METEORE) on GitHub.
Title: Benchmarking variant callers in next-generation and third-generation sequencing analysis
Authors: Surui Pei, et al.
Journal Info: Briefings in Bioinformatics, July 2020
Description: This paper evaluated 11 modes among 6 variant callers on 12 NGS and TGS datasets for germline and somatic variant calling.
Tools/methods compared: Sentieon (TNscope, TNseq, DNAseq), DeepVariant (WGS), GATK (HC & MuTect2), NeuSomatic, VarScan2, Strelka2.
Recommendation(s): All four germline callers had comparable performance on NGS data. For TGS data, all three callers had similar performance in SNP calling, while DeepVariant outperformed the others in InDel calling. For somatic variant calling on NGS, Sentieon TNscope and GATK Mutect2 outperformed the other callers.
Authors: Jiayun Chen, et al.
Journal Info: Scientific Reports, June 2019
Description: This paper compared three variant callers for WGS and WES samples from NA12878 across five next-gen sequencing platforms.
Tools/methods compared: GATK, Strelka2, Samtools-Varscan2.
Recommendation(s): Though all methods tested generally scored well, Strelka2 had the highest F-scores for both SNP and indel calling in addition to being the most computationally performant.
Title: Comparison of three variant callers for human whole genome sequencing
Authors: Anna Supernat, et al.
Journal Info: Scientific Reports, December 2018
Description: The paper compared three variant callers for WGS samples from NA12878 at 10x, 15x, and 30x coverage.
Tools/methods compared: DeepVariant, GATK, SpeedSeq.
Recommendation(s): All methods had similar F-scores, precision, and recall for SNP calling, but DeepVariant scored higher across all metrics for indels at all coverages.
Title: A Comparison of Variant Calling Pipelines Using Genome in a Bottle as a Reference
Authors: Adam Cornish, et al.
Journal Info: BioMed Research International, October 2015
Description: This paper compared 30 variant calling pipelines composed of five different variant callers and six different aligners on NA12878 WES data from the "Genome in a Bottle" consortium.
Tools/methods compared:
- Variant callers:
FreeBayes
,GATK-HaplotypeCaller
,GATK-UnifiedGenotyper
,SAMtools mpileup
,SNPSVM
- Aligners:
bowtie2
,BWA-mem
,BWA-sampe
,CUSHAW3
,MOSAIK
,Novoalign
.
Recommendation(s): Novoalign with GATK-UnifiedGenotyper exhibited the highest sensitivity while producing few false positives.
In general, BWA-mem was the most consistent aligner, and GATK-UnifiedGenotyper
performed well across the top aligners (BWA, bowtie2, and Novoalign).
Authors: Anne Bruun Krøigård, et al.
Journal Info: PLoS ONE, March 2016
Description: This paper performed comparisons between nine somatic variant callers on five paired tumor-normal samples from breast cancer patients subjected to WES and targeted deep sequencing.
Tools/methods compared: EBCall, Mutect, Seurat, Shimmer, Indelocator, SomaticSniper, Strelka, VarScan2, Virmid.
Recommendation(s): EBCall, Mutect, Virmid, and Strelka (now Strelka2) were most reliable for both WES and targeted deep sequencing. EBCall was superior for indel calling due to high sensitivity and robustness to changes in sequencing depths.
Title: Comparison of somatic mutation calling methods in amplicon and whole exome sequence data
Authors: Huilei Xu, et al.
Journal Info: BMC Genomics, March 2014
Description: Using the "Genome in a Bottle" gold standard variant set, this paper compared five somatic mutation calling methods on matched tumor-normal amplicon and WES data.
Tools/methods compared: GATK-UnifiedGenotyper followed by subtraction, MuTect, Strelka, SomaticSniper, VarScan2.
Recommendation(s): MuTect and Strelka (now Strelka2) had the highest sensitivity, particularly at low frequency alleles, in addition to the highest specificity.
Title: Detection of germline CNVs from gene panel data: benchmarking the state of the art
Authors: Elisabet Munté, Carla Roca, et al.
Journal Info: Briefings in Bioinformatics, December 2024.
Description: This work evaluated 12 germline copy number variation callers against four real-validated datasets using their default parameters, assessed the impact of modifying 107 tool parameters, and analyzed 66 tool pair combinations to produce better meta-callers. Sensitivity, specificity, positive predictive value, negative predictive value, F1 score, and various correlation coefficients were used as benchmarking metrics.
Tools/methods compared: Atlas-CNV, ClearCNV, ClinCNV, CNVkit, Cobalt, CODEX2, CoNVaDING, DECoN, ExomeDepth, GATK-gCNV, panelcn.MOPS, VisCap
Recommendation(s): Results indicated that in terms of F1 score, ClinCNV and GATK-gCNV were the best CNV callers. Regarding sensitivity, GATK-gCNV also exhibited particularly high performance.
Additional links: The authors published their benchmarking framework, CNVbenchmarkeR2, so that other users can benchmark tools on their own data.
Title: Benchmark of tools for CNV detection from NGS panel data in a genetic diagnostics context
Authors: José Marcos Moreno-Cabrera, et al.
Journal Info: bioRxiv, November 2019.
Description: This paper compared five germline copy number variation callers against four genetic diagnostics datasets (495 samples, 231 CNVs validated by MLPA) using both default and optimized parameters. Sensitivity, specificity, positive predictive value, negative predictive value, F1 score, and various correlation coefficients were used as benchmarking metrics.
Tools/methods compared: DECoN, CoNVaDING, panelcn.MOPS, ExomeDepth, CODEX2.
Recommendation(s): Most tools performed well, but performance varied between datasets. The authors felt DECoN and panelcn.MOPS with optimized parameters were sensitive enough to be used as screening methods in genetic diagnostics.
Additional links: The authors have made their benchmarking code (CNVbenchmarkeR) available, which can be run to determine optimal parameters for each algorithm for a given user's data.
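The CNV benchmarks above report sensitivity, specificity, positive and negative predictive value, and F1 score. As a rough illustration of how these follow from a single confusion matrix, here is a minimal sketch. The function name `confusion_metrics` and the counts are hypothetical, and the Matthews correlation coefficient is included only as an assumed example of a correlation-type metric, not necessarily one the papers used.

```python
from math import sqrt


def confusion_metrics(tp, fp, tn, fn):
    """Derive common benchmarking metrics from confusion-matrix counts."""

    def ratio(num, den):
        # Return NaN instead of raising when a denominator is zero.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # recall / true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "ppv": ratio(tp, tp + fp),          # precision
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        # Assumed example of a correlation coefficient (Matthews).
        "mcc": ratio(
            tp * tn - fp * fn,
            sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        ),
    }


# Hypothetical counts, e.g. regions of interest with and without a validated CNV.
metrics = confusion_metrics(tp=90, fp=10, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because validated negatives usually far outnumber positives in panel data, specificity and NPV tend to look high for every tool, which is one reason benchmarks like these lean on F1 and sensitivity when ranking callers.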
Title: An evaluation of copy number variation detection tools for cancer using whole exome sequencing data
Authors: Fatima Zare, et al.
Journal Info: BMC Bioinformatics, May 2017
Description: This paper compared six copy number variation callers on ten TCGA breast cancer tumor-normal pair WES datasets in addition to simulated datasets from VarSimLab. Sensitivity, specificity, and false-discovery rate were used as the benchmarking metrics.
Tools/methods compared: ADTEx, CONTRA, cn.MOPS, ExomeCNV, VarScan2, CoNVEX.
Recommendation(s): All tools suffered from high FDRs (~30-60%), but ExomeCNV (a now defunct R package) had the highest overall sensitivity. VarScan2 had moderate sensitivity and specificity for both amplifications and deletions.
Authors: Daniel L. Cameron, et al.
Journal Info: Nature Communications, July 2019
Description: This paper compared 10 structural variant callers on four cell line WGS datasets (NA12878, HG002, CHM1, and CHM13) with orthogonal validation data. Precision and recall were the benchmarking metrics used.
Tools/methods compared: BreakDancer, cortex, CREST, DELLY, GRIDSS, Hydra, LUMPY, manta, Pindel, SOCRATES.
Recommendation(s): The authors found GRIDSS and manta consistently performed well, but also provide more general guidelines for both users and developers.
- Use a caller that utilizes multiple sources of evidence and assembly.
- Use a caller that can call all events you care about.
- Ensemble calling is not a cure-all and generally doesn't outperform the best individual callers (at least on these datasets).
- Do not use callers that rely only on paired-end data.
- Calls with high read counts are typically artefacts.
- Simulations aren't real - benchmarking solely on simulations is a bad idea.
- Developers - be wary of incomplete trust sets and the potential for overfitting. Test tools on multiple datasets.
- Developers - make your tool easy to use with basic sanity checks to protect against invalid inputs. Use standard file formats.
- Developers - use all available evidence and produce meaningful quality scores.
Title: Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing
Authors: Shunichi Kosugi, et al.
Journal Info: Genome Biology, June 2019
Description: This study compared 69 structural variation callers on simulated and real (NA12878, HG002, and HG00514) datasets. F-scores, precision, and recall were the main benchmarking metrics.
Tools/methods compared: 1-2-3-SV, AS-GENESENG, BASIL-ANISE, BatVI, BICseq2, BreakDancer, BreakSeek, BreakSeq2, Breakway, CLEVER, CNVnator, Control-FREEC, CREST, DELLY, DINUMT, ERDS, FermiKit, forestSV, GASVPro, GenomeSTRiP, GRIDSS, HGT-ID, Hydra-sv, iCopyDAV, inGAP-sv, ITIS, laSV, Lumpy, Manta, MATCHCLIP, Meerkat, MELT, MELT-numt, MetaSV, MindTheGap, Mobster, Mobster-numt, Mobster-vei, OncoSNP-SEQ, Pamir, PBHoney, PBHoney-NGM, pbsv, PennCNV-Seq, Pindel, PopIns, PRISM, RAPTR, readDepth, RetroSeq, Sniffles, Socrates, SoftSearch, SoftSV, SoloDel, Sprites, SvABA, SVDetect, Svelter, SVfinder, SVseq2, Tangram, Tangram-numt, Tangram-vei, Tea, TEMP, TIDDIT, Ulysses, VariationHunter, VirusFinder, VirusSeq, Wham.
Recommendation(s): Varies greatly depending on type and size of the structural variant in addition to read length. GRIDSS, Lumpy, SVseq2, SoftSV, and Manta performed well calling deletions of diverse sizes. TIDDIT, forestSV, ERDS, and CNVnator called large deletions well, while pbsv, Sniffles, and PBHoney were the best performers for small deletions. For duplications, good choices included Wham, SoftSV, MATCHCLIP, and GRIDSS, while CNVnator, ERDS, and iCopyDAV excelled calling large duplications. For insertions, MELT, Mobster, inGAP-sv, and methods using long read data were most effective.
Title: Evaluating nanopore sequencing data processing pipelines for structural variation identification
Authors: Anbo Zhou, et al.
Journal Info: Genome Biology, November 2019
Description: This paper evaluated four alignment tools and three SV detection tools on four nanopore datasets (both simulated and real).
Tools/methods compared: Aligners - minimap2, NGMLR, GraphMap, LAST. SV callers - Sniffles, NanoSV, Picky.
Recommendation(s): The authors recommend using the minimap2 aligner in combination with the SV caller Sniffles because of their speed and relatively balanced performance.
Additional links: The authors provide all code used in the study as well as a singularity package containing pre-installed programs and all seven pipelines.
Title: Comparison of transformations for single-cell RNA-seq data
Authors: Constantin Ahlmann-Eltze & Wolfgang Huber
Journal Info: Nature Methods, May 2023
Description: This study evaluates 22 transformations for preprocessing single-cell RNA-sequencing data. These transformations are aimed at adjusting counts for variable sampling efficiency and making variance similar across the dynamic range. The research focuses on understanding cell types and states in terms of lower-dimensional mathematical structures, with benchmarks designed to assess the ability of transformations to recover latent cell structures.
Tools/methods compared:
- Delta method-based variance-stabilizing transformations: acosh transformation, shifted logarithm (with pseudo-count y0 = 1 or y0 = 1/(4α)), shifted logarithm with CPM, and additional variations including HVG selection, z scoring, and output rescaling.
- Residuals-based variance-stabilizing transformations: clipped and unclipped Pearson residuals (implemented by sctransform and transformGamPoi), randomized quantile residuals, clipped Pearson residuals with HVG selection, z scoring, and an analytical approximation to the Pearson residuals.
- Latent gene expression-based transformations (Lat Expr): Sanity Distance, Sanity MAP, Dino, Normalisr.
- Count-based factor analysis models (Count): GLM PCA, NewWave.
- Negative controls: raw untransformed counts (y) and raw counts scaled by the size factor (y/s).
Recommendation(s):
- The shifted logarithm transformation (log(y/s + y0) with y0 = 1) followed by PCA performed well, often better than more sophisticated alternatives.
- The study advises against using CPM as input for the shifted logarithm because it implies an unrealistically large overdispersion.
- Pearson residuals-based transformation has good theoretical properties and performed similarly to the shifted logarithm. However, it has limitations in handling genes with large dynamic range across cells.
- Latent expression state transformations and count-based latent factor models did not outperform the shifted logarithm and were computationally expensive.
- The delta method-based transformation produced more consistent results on the 10x datasets.
- Overall, despite extensive research in preprocessing methods for single-cell RNA-seq data, the shifted logarithm still ranks among the best, underlining the utility of lower-dimensional embeddings of the transformed count matrix for noise reduction and fidelity increase.
Additional links: An interactive website with all results for all tested parameter combinations is provided.
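The shifted logarithm recommended above, log(y/s + y0), is simple enough to sketch in a few lines. The example below is an illustration only and not the authors' implementation: it assumes size factors are each cell's total count divided by the mean total count across cells, that every cell has at least one count, and uses hypothetical names (`size_factors`, `shifted_log`) and toy counts.

```python
import math


def size_factors(counts):
    """Per-cell size factors: total counts divided by the mean total (assumed definition)."""
    totals = [sum(cell) for cell in counts]
    mean_total = sum(totals) / len(totals)
    return [total / mean_total for total in totals]


def shifted_log(counts, pseudo_count=1.0):
    """Apply log(y / s + y0) to a cells-by-genes list of raw counts."""
    factors = size_factors(counts)
    return [
        [math.log(y / s + pseudo_count) for y in cell]
        for cell, s in zip(counts, factors)
    ]


# Hypothetical toy matrix: 3 cells x 3 genes.
# Cell 2 has the same composition as cell 1 but twice the sequencing depth.
counts = [
    [0, 2, 8],
    [0, 4, 16],
    [5, 5, 0],
]

transformed = shifted_log(counts)
for row in transformed:
    print([round(value, 3) for value in row])
```

In this toy matrix the first two cells end up with matching profiles after the transformation, since they differ only in depth. Passing `pseudo_count=1 / (4 * alpha)` for an assumed overdispersion `alpha` gives the other pseudo-count variant listed above.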
Authors: Nighat Noureen, et al.
Journal Info: eLife, February 2022
Description: This study benchmarks five signature-scoring methods in the context of cancer single-cell RNA sequencing data. The authors highlight that methods developed for bulk sample analysis, specifically single sample gene set enrichment analysis (ssGSEA) and Gene Set Variation Analysis (GSVA), show biases and inaccuracies when applied to single-cell data. This is attributed to the higher gene counts in cancer cells compared to normal cells, which affects the performance of ssGSEA and GSVA. The study emphasizes the importance of considering cellular context in signature scoring, especially the effect of dropouts in single-cell data.
Tools/methods compared: ssGSEA, GSVA, AUCell, Single Cell Signature Explorer (SCSE), Jointly Assessing Signature Mean and Inferring Enrichment (JASMINE)
Recommendation(s): The study recommends caution when using bulk-sample-based methods like ssGSEA and GSVA for single-cell RNA sequencing data due to their susceptibility to biases caused by high gene counts and dropouts in cancer cells. Single-cell-based methods, particularly JASMINE and SCSE, are more robust in this context. JASMINE, a new method developed in this study, showed particular effectiveness in accounting for dropouts and evaluating average expression levels of expressed signature genes. Typically, I'd avoid adding a paper in which the authors are touting their new tool, but I feel the contextual information therein is invaluable in this case.
Additional links: The GitHub repository for JASMINE is available here.
Title: Benchmarking single-cell RNA-sequencing protocols for cell atlas projects
Authors: Elisabetta Mereu*, Atefeh Lafzi*, et al.
Journal Info: Nature Biotechnology, April 2020
Description: This paper evaluated 13 single cell/nuclei RNA-seq protocols for their suitability in cell atlas-like projects. Using a single cell, multi-species mixture, the authors measured each protocol's ability to capture cell markers, gene detection power, clusterability (with and without integration with other protocols), mappability, and mixability.
Tools/methods compared: Quartz-seq2, Chromium, Smart-seq2, CEL-seq2, C1HT-medium, C1HT-small, ddSEQ, Chromium (single nuclei), Drop-seq, inDrop, ICELL8, MARS-seq, gmcSCRB-seq.
Recommendation(s): See figure 6 for a summary of benchmarking results for each method. Quartz-seq2 was the overall best performing, yielding superior results for gene detection and marker expression over other methods, though Chromium, Smart-seq2, and CEL-seq2 were also strong performers.
Additional links: The authors provide benchmarking code and analysis code in two different Github repositories - here and here.
Title: Systematic comparison of single-cell and single-nucleus RNA-sequencing methods
Authors: Jiarui Ding, et al.
Journal Info: Nature Biotechnology, April 2020
Description: This study evaluated seven methods for single-cell and/or single-nucleus RNA-sequencing on three types of samples: cell lines, PBMCs, and brain tissue. Evaluation metrics included the structure and alignment of reads, number of multiplets and detection sensitivity, and ability to recover known biological information.
Tools/methods compared: Smart-seq2, CEL-Seq2, 3' 10X Chromium, Drop-Seq, Seq-Well, inDrops, sci-RNA-seq.
Recommendation(s): Overall, the authors found 3' 10X Chromium to have the strongest consistent performance among the high-throughput methods, yielding the highest sensitivity, though it did not perform any better for cell type classification. When greater sensitivity is required, the authors recommend Smart-seq2 or CEL-Seq2, which both performed similarly. Supplementary table 7 includes an overview of each method's relative merits.
Additional links: The authors made their unified analysis pipeline (scumi) available as a python package, the repo of which also includes their R scripts used for cell filtering and cell type assignment.
Title: Comparison of visualization tools for single-cell RNAseq data
Authors: Batuhan Cakir, et al.
Journal Info: NAR Genomics and Bioinformatics, September 2020
Description: This study evaluated 13 scRNA-seq visualization platforms based on their features, performance, cloud and web support, containerization, and inter-operability between analysis platforms and data formats (loom, h5ad, SingleCellExperiment, Seurat, raw txt/csv.) with varying numbers of cells (5k to 2 million).
Tools/methods compared: ASAP, Bbrowser, cellxgene, Granatum, iSEE, Loom viewer, Loupe Cell Browser, SCope, scSVA, scVI, Single Cell Explorer, SPRING, and UCSC Cell Browser.
Recommendation(s): Table 1 provides an overview of the tool capabilities and current support. The authors further compared tools that could be used for web sharing - cellxgene, iSEE with SCE files, iSEE with loom files, loom-viewer, SCope, Single Cell Explorer, scSVA, and UCSC Cell Browser - for preprocessing memory and time requirements with varying numbers of cells. iSEE-loom, SCope, scSVA, and loom-viewer all enable efficient integration with the hierarchical data format (HDF5) from which loom and h5ad formats are derived, and as such, have the lowest preprocessing time requirements. iSEE-SCE performed poorly with large numbers of cells (>50k) with the default number of panels (8) but performed better with a more limited set of visualizations. Some of the tools had unexplained resource spikes with high numbers of cells, e.g. loom-viewer dramatically increasing in both time and memory requirements with 2M cells versus 1.5M. cellxgene was generally recommended due to its ease of use, expansive community support and active maintenance, and ability to handle large datasets. Single Cell Explorer and UCSC Cell Browser were both among the poorest performers resource-wise, with their usage increasing linearly with cell number. iSEE was generally commended for its flexibility and support of custom visualizations.
Additional links: The authors made their package (sceasy) to convert between Seurat, SCE, Loom, and AnnData objects available via GitHub.
Title: Comparison of high-throughput single-cell RNA sequencing data processing pipelines
Authors: Mingxuan Gao, et al.
Journal Info: Briefings in Bioinformatics, July 2020
Description: This study evaluated 7 scRNA-seq pipelines on 8 data sets.
Tools/methods compared: Drop-seq-tools version-2.3.0, Cell Ranger version-3.0.2, scPipe version-1.4.1, zUMIs version-2.4.5b, UMI-tools version-1.0.0, umis version-1.0.3, dropEst version-0.8.6
Recommendation(s): Cell Ranger shows the highest algorithm complexity and parallelization, whereas scPipe, umis and zUMIs have lower complexity that is suitable for large-scale scRNA-seq integration analysis. UMI-tools shows the highest transcript quantification accuracy on ERCC datasets from three scRNA-seq platforms. Integration of expression matrices from different pipelines will introduce confounding factors akin to batch effect. zUMIs and dropEst have higher sensitivity to detect more genes for single cells, which may also bring unwanted factors. For most downstream analysis, Drop-seq-tools, Cell Ranger and UMI-tools show high consistency, whereas umis and zUMIs show inconsistent results compared with the other pipelines.
Title: A systematic evaluation of single cell RNA-seq analysis pipelines
Authors: Beate Vieth, et al.
Journal Info: Nature Communications, October 2019
Description: This study evaluated ~3000 pipeline combinations based on three mapping, three annotation, four imputation, seven normalization, and four differential expression testing approaches with five scRNA-seq library protocols on simulated data.
Tools/methods compared: scRNA-seq library prep protocols - SCRB-seq, Smart-seq2, CEL-seq2, Drop-seq, 10X Genomics. Mapping - bwa, STAR, kallisto. Annotation - gencode, refseq, vega. Imputation - filtering, DrImpute, scone, SAVER. Normalization - scran, SCnorm, Linnorm, Census, MR, TMM. Differential testing - edgeR-zingeR, limma, MAST, T-test.
Recommendation(s): Figure 5F contains a flowchart with the authors' recommendations. For alignment, STAR with Gencode annotations generally had the highest mapping and assignment rates. All mappers performed best with Gencode annotations. For normalization, scran was found to best handle potential asymmetric differential expression and large numbers of differentially expressed genes. They also note that normalization is overall the most influential step, particularly if asymmetric DE is present (Figure 5). For Smart-seq2 data without spike-ins, the authors suggest Census may be the best choice. The authors found little benefit to imputation in most scenarios, particularly if one of the better normalization methods (e.g. scran) was used. The authors found library prep and normalization strategies to have a stronger effect on pipeline performance than the choice of differential expression tool, but generally found limma-trend to have the most robust performance.
Additional links: The authors made their simulation tool (powsimR) available on Github along with their pipeline scripts to reproduce their analyses.
Title: A Systematic Evaluation of Single-cell RNA-sequencing Imputation Methods
Authors: Wenpin Hou, et al.
Journal Info: bioRxiv, January 2020
Description: This paper evaluated 18 scRNA-seq imputation methods using seven datasets containing cell line and tissue data from several experimental protocols. The authors assessed the similarity of imputed cell profiles to bulk samples and investigated whether imputation improves signal recovery or introduces noise in three downstream applications - differential expression, unsupervised clustering, and trajectory inference.
Tools/methods compared: scVI, DCA, MAGIC, scImpute, kNN-smoothing, mcImpute, SAUCIE, DrImpute, PBLR, SAVER, VIPER, SAVERX, DeepImpute, scRecover, ALRA, bayNorm, AutoImpute, scScope.
Recommendation(s): Figure 6 provides a performance summary of the tested methods. In general, the authors recommend caution using any of these methods, as they can introduce significant variability and noise into downstream analyses. Of the methods tested, MAGIC, kNN-smoothing, and SAVER outperformed the other methods most consistently, though this varied widely across evaluation criteria, protocols, datasets, and downstream analysis. Many methods show no clear improvement over no imputation, and in some cases, perform significantly worse in downstream analyses.
Additional links: The authors placed all of their benchmarking code on Github.
Title: Bias, robustness and scalability in single-cell differential expression analysis
Authors: Charlotte Soneson* & Mark D Robinson*
Journal Info: Nature Methods, February 2018
Description: This paper evaluated 36 approaches for determining differential gene expression from both synthetic and 36 real scRNA-seq datasets. The authors assess type I error control, FDR control and power, computational efficiency, and consistency.
Tools/methods compared: edgeRQLFDetRate, MASTcpmDetRate, limmatrend, MASTtpmDetRate, edgeRQLF, ttest, voomlimma, Wilcoxon, MASTcpm, MASTtpm, SAMseq, D3E, edgeRLRT, metagenomeSeq, edgeRLRTcensus, edgeRLRTdeconv, monoclecensus, ROTStpm, ROTSvoom, DESeq2betapFALSE, edgeRLRTrobust, monoclecount, DESeq2, DESeq2nofilt, ROTScpm, SeuratTobit, NODES, DESeq2census, scDD, BPSC, SCDE, DEsingle, monocle, SeuratBimodnofilt, SeuratBimodlsExpr2, SeuratBimod.
Recommendation(s): In general, the authors found that gene prefiltering was essential for good, robust performance from many methods. They note high variability between methods and summarize general performance across all metrics in Figure 5. They do not make recommendations as to a specific method/tool. Of note is that Seurat switched to using the Wilcoxon test by default after this study was released, as it performed much better than their previously available methods.
Additional links: The authors make their benchmarking pipeline, conquer, available on Github.
Title: A comparison of single-cell trajectory inference methods
Authors: Wouter Saelens*, Robrecht Cannoodt*, et al.
Journal Info: Nature Biotechnology, April 2019
Description: A comprehensive evaluation of 45 trajectory inference methods, this paper provides an unmatched comparison of the rapidly evolving field of single-cell trajectory inference. Each method was scored on accuracy, scalability, stability, and usability. Should be considered a gold-standard for other benchmarking studies.
Tools/methods compared: PAGA, RaceID/StemID, SLICER, Slingshot, PAGA Tree, MST, pCreode, SCUBA, Monocle DDRTree, Monocle ICA, cellTree maptpx, SLICE, cellTree VEM, EIPiGraph, Sincell, URD, CellTrails, Mpath, CellRouter, STEMNET, FateID, MFA, GPfates, DPT, Wishbone, SCORPIUS, Component 1, Embeddr, MATCHER, TSCAN, Wanderlust, PhenoPath, topslam, Waterfall, EIPiGraph linear, ouijaflow, FORKS, Angle, EIPiGraph cycle, reCAT.
Recommendation(s): Varies depending on dataset and expected trajectory type, though PAGA, PAGA Tree, SCORPIUS, and Slingshot all scored highly across all metrics.
Authors wrote an interactive Shiny app to help users choose the best methods for their data.
Additional links: The dynverse site contains numerous packages for users to run and compare results from different trajectory methods on their own data without installing each individually by using Docker. Additionally, they provide several tools for developers to wrap and benchmark their own method against those included in the study.
Title: Benchmarking algorithms for gene regulatory network inference from single-cell transcriptomic data
Authors: Aditya Pratapa, et al.
Journal Info: Nature Methods, January 2020
Description: The authors compared 12 gene regulatory network (GRN) inference techniques to assess the accuracy, robustness, and efficiency of each method on simulated data from synthetic networks, simulated data from curated models, and real scRNA-seq datasets.
Tools/methods compared: GENIE3, PPCOR, LEAP, SCODE, PIDC, SINCERITIES, SCNS, GRNVBEM, SCRIBE, GRNBoost2, GRISLI, SINGE.
Recommendation(s): In general, the authors recommend PIDC, GENIE3, or GRNBoost2, as they were the leading and consistent performers for both curated models and experimental datasets in terms of accuracy as well as having multithreaded implementations. Both the GENIE3 and GRNBoost2 methods can be used from the SCENIC workflows in either R or the (much faster) Python implementation.
Additional links: The authors provide their benchmarking framework, BEELINE on Github, which also provides an easy-to-use and uniform interface to each method in the form of a Docker image.
Authors: Shounan Chen* & Jessica C. Mar*
Journal Info: BMC Bioinformatics, June 2018
Description: This study compared 8 gene regulatory network inference methods (5 bulk RNA-seq, 3 specific to scRNA-seq) for scRNA-seq data for precision and recall.
Tools/methods compared: Partial correlation (Pcorr), Bayesian networks, GENIE3, ARACNE, CLR, SCENIC, SCODE, PIDC.
Recommendation(s): Generally, the authors found relatively poor performance across all methods both for simulated and real data. The results from each method had few similarities with other methods and high false positive rates, and the authors recommend caution when interpreting the networks reconstructed with these methods. The authors showed that many of these methods were dramatically affected by dropout events.
Title: Benchmarking multi-omics integration algorithms across single-cell RNA and ATAC data
Authors: Chuxi Xiao, et al.
Journal Info: bioRxiv, November 2023
Description: This study benchmarks 12 multi-omics integration methods across three integration tasks, assessing their performance in combining single-cell RNA (scRNA-seq) and ATAC (scATAC-seq) data. The evaluation considers aspects such as the extent of mixing between different omics, cell type conservation, single-cell level alignment accuracy, trajectory preservation, scalability, and ease of use.
Tools/methods compared: scMVP, MOFA+, MultiVI, Cobolt, scDART, UnionCom, MMD-MA, scJoint, Harmony, Seurat (v4.3), LIGER, GLUE.
Recommendation(s): The study recommends different methods based on dataset type and size. For unpaired datasets, GLUE is preferred. In paired tasks, GLUE and MultiVI are top choices, with the latter excelling in trajectory conservation. For omics mixing, scDART, LIGER, and Seurat are recommended. For cell type conservation, MOFA+ and scMVP are viable options. In terms of scalability, Seurat, LIGER, and MOFA+ are efficient. For ease of use, scDART, scJoint, and Seurat are highlighted for their detailed guidance.
Title: A benchmark of batch-effect correction methods for single-cell RNA sequencing data
Authors: Hoa Thi Nhu Tran, et al.
Journal Info: Genome Biology, January 2020
Description: The authors compared 14 methods in terms of computational runtime, the ability to handle large datasets, and batch-effect correction efficacy while preserving cell type purity.
Tools/methods compared: Seurat2, Seurat3, Harmony, fastMNN, MNN Correct, ComBat, Limma, scGen, Scanorama, MMD-ResNet, ZINB-WaVe, scMerge, LIGER, BBKNN
Recommendation(s): Based on the benchmarking results, the authors suggest Harmony, LIGER, and Seurat3 as the best methods for batch integration.
Authors: Shiquan Sun, et al.
Journal Info: bioRxiv, October 2019
Description: A mammoth comparison of 18 different dimension reduction methods on 30 publicly available scRNAseq data sets in addition to 2 simulated datasets for a variety of purposes ranging from cell clustering to trajectory inference to neighborhood preservation.
Tools/methods compared: factor analysis (FA), principal component analysis (PCA), independent component analysis (ICA), Diffusion Map, nonnegative matrix factorization (NMF), Poisson NMF, zero-inflated factor analysis (ZIFA), zero-inflated negative binomial based wanted variation extraction (ZINB-WaVE), probabilistic count matrix factorization (pCMF), deep count autoencoder network (DCA), scScope, generalized linear model principal component analysis (GLMPCA), multidimensional scaling (MDS), locally linear embedding (LLE), local tangent space alignment (LTSA), Isomap, uniform manifold approximation and projection (UMAP), t-distributed stochastic neighbor embedding (tSNE).
Recommendation(s): Varies depending on use case. Factor analysis and principal component analysis performed well for most use cases. See figure 5 for practical guidelines.
Additional links: The authors have made their benchmarking code available on Github.
Title: Benchmarking principal component analysis for large-scale single-cell RNA-sequencing
Authors: Koki Tsuyuzaki, et al.
Journal Info: Genome Biology, January 2020
Description: This study compared 21 implementations of 10 algorithms across Python, R, and Julia for principal component analysis for scRNA-seq data, measuring scalability, computational efficiency, outlier robustness, t-SNE/UMAP replication, ease of use, and more using both synthetic and real datasets.
Tools/methods compared: PCA (sklearn, full), fit (MultiVariateStats.jl), Downsampling, IncrementalPCA (sklearn), irlba (irlba), svds (RSpectra), propack.svd (svd), PCA (sklearn, arpack), irlb (Cell Ranger), svds (Arpack.jl), orthiter (OnlinePCA.jl), gd (OnlinePCA.jl), sgd (OnlinePCA.jl), rsvd (rsvd), oocPCA_CSV (oocRPCA), PCA (sklearn, randomized), randomized_svd (sklearn), PCA (dask-ml), halko (OnlinePCA.jl), algorithm971 (OnlinePCA.jl).
Recommendation(s): Author recommendations vary based on the language being used and matrix size. See figure 8 for recommendations along with recommended parameter settings.
Additional links: The authors published their benchmarking scripts on Github.
Title: A comparison of automatic cell identification methods for single-cell RNA sequencing data
Authors: Tamim Abdelaal*, Lieke Michielsen*, et al.
Journal Info: Genome Biology, September 2019
Description: The authors benchmarked 22 classification methods that automatically assign cell identities including single-cell-specific and general-purpose classifiers across 27 publicly available single-cell RNA sequencing datasets of different sizes, technologies, species, and levels of complexity. Two types of experimental setups were used evaluate the performance of each method for within dataset predictions (intra-dataset) and across datasets (inter-dataset) based on accuracy, percentage of unclassified cells, and computation time.
Tools/methods compared: Garnett, Moana, DigitalCellSorter, SCINA, scVI, Cell-BLAST, ACTINN, LAmbDA, scmapcluster, scmapcell, scPred, CHETAH, CaSTLe, SingleR, scID, singleCellNet, LDA, NMC, RF, SVM, SVM<sub>rejection</sub>, kNN.
Recommendation(s): All classifiers performed well. The authors recommended the SVM<sub>rejection</sub> classifier (with a linear kernel). Other classifiers, including SVM, singleCellNet, scmapcell, and scPred, also performed well.
In their experiments, incorporating prior knowledge in the form of marker genes did not improve performance.
Additional links: A Snakemake workflow, scRNAseq_Benchmark, was provided to automate the benchmarking analyses.
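As a rough illustration of what a rejection option and the "percentage of unclassified cells" metric mean in practice, the sketch below leaves a cell unlabeled when its top class probability falls below a threshold. The function names, the `"Unknown"` label, the 0.7 threshold, and the toy probabilities are all assumptions made for illustration, and accuracy is computed over assigned cells only, which may differ from the paper's exact scoring.

```python
def predict_with_rejection(class_probs, threshold=0.7):
    """Pick the most probable label per cell, or "Unknown" if confidence is low.

    `class_probs` is a list with one dict per cell mapping label -> probability.
    The threshold value is an illustrative assumption.
    """
    predictions = []
    for probs in class_probs:
        label, prob = max(probs.items(), key=lambda kv: kv[1])
        predictions.append(label if prob >= threshold else "Unknown")
    return predictions


def evaluate(predictions, truth):
    """Return (accuracy over assigned cells, percentage of unclassified cells)."""
    assigned = [(p, t) for p, t in zip(predictions, truth) if p != "Unknown"]
    pct_unclassified = 100.0 * (len(predictions) - len(assigned)) / len(predictions)
    if assigned:
        accuracy = sum(p == t for p, t in assigned) / len(assigned)
    else:
        accuracy = float("nan")
    return accuracy, pct_unclassified


# Hypothetical classifier output for four cells.
toy_probs = [
    {"T cell": 0.90, "B cell": 0.10},
    {"T cell": 0.55, "B cell": 0.45},  # below threshold, rejected
    {"T cell": 0.20, "B cell": 0.80},
    {"T cell": 0.75, "B cell": 0.25},  # confident but wrong
]
toy_truth = ["T cell", "B cell", "B cell", "B cell"]

toy_pred = predict_with_rejection(toy_probs)
toy_acc, toy_unclassified = evaluate(toy_pred, toy_truth)
print(toy_pred, toy_acc, toy_unclassified)
```

Raising the threshold trades a higher share of unclassified cells for fewer confident mistakes, which is why both numbers are worth reporting together.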
Authors: Fenglin Liu*, Yuanyuan Zhang*, et al.
Journal Info: Genome Biology, November 2019
Description: This paper compared seven variant callers using both simulation and real scRNA-seq datasets and identified several elements influencing their performance, including read depth, variant allele frequency, and specific genomic contexts. Sensitivity and specificity were the benchmarking metrics used.
Tools/methods compared: SAMtools, GATK, CTAT, FreeBayes, MuTect2, Strelka2, VarScan2.
Recommendation(s): Varies, see figure 7 for a flowchart breakdown. Generally, SAMtools (most sensitive, lower specificity in intronic or high-identity regions), Strelka2 (good performance when read depth >5), FreeBayes (good specificity/sensitivity in cases with high variant allele frequencies), and CTAT (no alignment step necessary) were top performers.
Additional links: The authors made their benchmarking code available on GitHub.
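For readers less familiar with the two metrics named above, the following sketch computes sensitivity and specificity for a call set against a truth set. It assumes a fixed universe of candidate sites so that true negatives are countable; the function name and the toy coordinates are hypothetical and are not taken from the paper.

```python
def sensitivity_specificity(called, truth, all_sites):
    """Return (sensitivity, specificity) for a set of called variant sites.

    `all_sites` is the universe of evaluated positions; sites in it that are
    neither called nor true variants count as true negatives.
    """
    called, truth, all_sites = set(called), set(truth), set(all_sites)
    tp = len(called & truth)               # true variants that were called
    fn = len(truth - called)               # true variants that were missed
    fp = len(called - truth)               # calls that are not true variants
    tn = len(all_sites - called - truth)   # correctly uncalled sites
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical toy example with ten candidate sites on one chromosome.
toy_sites = [("chr1", pos) for pos in range(100, 110)]
toy_truth = {("chr1", 101), ("chr1", 104), ("chr1", 107), ("chr1", 109)}
toy_calls = {("chr1", 101), ("chr1", 104), ("chr1", 105)}

sens, spec = sensitivity_specificity(toy_calls, toy_truth, toy_sites)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```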
Title: Assessment of computational methods for the analysis of single-cell ATAC-seq data
Authors: Caleb Lareau*, Tommaso Andreani*, Michael E. Vinyard*, et al.
Journal Info: Genome Biology, November 2019
Description: This study compares 10 methods for scATAC-seq processing and featurizing using 13 synthetic and real datasets from diverse tissues and organisms.
Tools/methods compared: BROCKMAN, chromVAR, cisTopic, Cicero, Gene Scoring, Cusanovich2018, scABC, Scasat, SCRAT, SnapATAC.
Recommendation(s): SnapATAC, Cusanovich2018, and cisTopic were the top performers for separating cell populations of different coverages and noise levels. SnapATAC was the only method capable of analyzing a large dataset (>80k cells).
Additional links: The authors have made their benchmarking code available on GitHub.
Title: A practical guide to methods controlling false discoveries in computational biology
Authors: Keegan Korthauer*, Patrick K. Kimes*, et al.
Journal Info: Genome Biology, June 2019
Description: A benchmark comparison of the accuracy, applicability, and ease of use of two classic and six modern methods that control the false discovery rate (FDR). The authors used simulation studies as well as six case studies in computational biology (specifically differential expression testing in bulk RNA-seq, differential expression testing in single-cell RNA-seq, differential abundance testing and correlation analysis in 16S microbiome data, differential binding testing in ChIP-seq, genome-wide association testing, and gene set analysis).
Tools/methods compared: Benjamini-Hochberg, Storey’s q-value, conditional local FDR (LFDR), FDR regression (FDRreg), independent hypothesis weighting (IHW), adaptive shrinkage (ASH), Boca and Leek’s FDR regression (BL), and adaptive p-value thresholding (AdaPT).
Recommendation(s): Modern FDR methods that use an informative covariate (as opposed to only p-values) provide more power than classic methods while still controlling the FDR. The improvement of the modern FDR methods over the classic methods increases with the informativeness of the covariate, the total number of hypothesis tests, and the proportion of truly non-null hypotheses.
Additional links: Full analyses of the in silico experiments, simulations, and case studies are provided in Additional files 2–41 at https://pkimes.github.io/benchmark-fdr-html/. The source code to reproduce all results in the manuscript and additional files, as well as all figures, is available on GitHub. An ExperimentHub package containing the full set of results objects is available through the Bioconductor project, and a Shiny application for interactive exploration of these results is also available on GitHub. The source code, ExperimentHub package, and Shiny application are all made available under the MIT license.
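As a point of reference for the classic end of this comparison, here is a minimal sketch of the Benjamini-Hochberg adjustment, which uses only the p-values and no covariate. The function name and the toy p-values are hypothetical; in practice an established implementation (for example `p.adjust` in R) is the safer choice.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value is the minimum of p * m / rank over all equal or larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from five tests.
toy_pvalues = [0.01, 0.04, 0.03, 0.005, 0.5]
toy_adjusted = benjamini_hochberg(toy_pvalues)

alpha = 0.1  # target FDR level, chosen for illustration
discoveries = [i for i, q in enumerate(toy_adjusted) if q <= alpha]
print(toy_adjusted, discoveries)
```

Covariate-aware methods such as IHW can be read as replacing the uniform treatment of tests in this procedure with weights informed by the covariate.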
Title: Evaluating Bioinformatic Pipeline Performance for Forensic Microbiome Analysis
Authors: Sierra F. Kaszubinski*, Jennifer L. Pechal*, et al.
Journal Info: Journal of Forensic Sciences, 2019
Description: Sequence reads from postmortem microbiome samples were analyzed with mothur v1.39.5, QIIME2 v2018.11, and MG-RAST v4.0.3. For postmortem data, MG-RAST had a much smaller effect size than mothur and QIIME2 due to the twofold reduction in samples. QIIME2 and mothur returned similar results, with mothur showing inflated richness due to unclassified taxa. Adjusting minimum library size had significant effects on microbial community structure, while sample size had less of an effect, except for low-abundance taxa.
Tools/methods compared: mothur, QIIME2, MG-RAST.
Recommendation(s): QIIME2 was deemed the most appropriate choice for forensic analysis in this study.
Additional links: Sequence data are archived through the European Bioinformatics Institute European Nucleotide Archive (www.ebi.ac.uk/ena) under accession number: PRJEB22642. Pipeline parameters and microbial community analyses are available on GitHub.
Title: Comparison of computational methods for the identification of topologically associating domains
Authors: Marie Zufferey*, Daniele Tavernari*, et al.
Journal Info: Genome Biology, December 2018
Description: In this study, the authors compared the performance of 22 TAD callers, each on 20 different conditions (4 map resolutions each normalized with 2 independent strategies, plus 12 additional contact maps with variable sequencing depth) and assessed their performance via concordance, robustness to data resolution and normalization method, and ability to recapitulate biological features typically associated with TADs and TAD boundaries. Assessments were performed on high-resolution Hi-C data from GM12878 and validated in other datasets for select callers.
Tools/methods compared: 3DNetMod, armatus, arrowhead, CaTCH, CHDF, chromoR, ClusterTAD, DI, EAST, GMAP, HiCExplorer, HiCseg, HiTAD, ICFinder, IS, matryoshka, MrTADFinder, PSYCHIC, spectral, TADbit, TADtree, and TopDom.
Recommendation(s): See Table 2 for a succinct results summary. In general, the authors found that TopDom, HiCseg, and CaTCH satisfied at least four out of five criteria: robustness with respect to bin size (resolution) and normalization strategy (ICE and LGF); cost-effective performance based on the ability of the caller to identify concordant TADs with <1% of reads; reproducibility of TADs called by other callers; computational efficiency; and biological relevance based on previously reported TAD-associated features. The authors do, however, note that callers that attempt to call hierarchical TAD structures generally performed worse than those that do not, potentially because they require higher data resolution.
Title: Comparison of normalization methods for Hi-C data
Authors: Hongqiang Lyu, Erhu Liu, Zhifang Wu
Journal Info: Biotechniques, October 2019
Description: In this study, the authors compared the performance of 6 Hi-C normalization methods at 8 different resolution levels (2.5M, 1M, 500k, 250k, 100k, 50k, 10k, and 5k) using data from 4 different Hi-C studies. These methods were compared for heat map texture, statistical quality, influence of resolution, consistency of distance stratum, and reproducibility of TAD architecture. The authors assessed the quality of statistics by comparing the distribution of interaction frequency, correlation of replicates, and comparability of replicates between contexts.
Tools/methods compared: SCN, HiCNorm, ICE, KR, chromoR, and multiHiCcompare.
Recommendation(s): See Table 2 for a succinct results summary. In summary, the authors found that the use case determines the best tool. All of these methods except multiHiCcompare are single-sample methods, which led to multiHiCcompare performing the best in many considerations, including distribution of interaction frequency, comparability between contexts, and consistency of distance stratum. However, multiHiCcompare is by far the most computationally expensive and requires a large number of samples to be useful. SCN and KR show the best reproducibility of TAD architecture across various resolutions. chromoR was found to blur interaction heatmaps at lower resolutions due to its de-noising procedure, though it was also found to achieve the best correlation of replicates at mid-resolutions (500k, 250k).
Additional links: The authors have made their implementations to run the various methods available on GitHub.
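Several of the single-sample methods above are matrix-balancing approaches. The sketch below shows the general idea in the spirit of iterative correction (ICE): repeatedly divide each entry by the relative coverage of its row and column until all bins have equal total contacts. It is a simplified illustration, not the implementation benchmarked in the paper; the function name, the toy matrix, and the assumption that every bin has nonzero coverage are mine (real pipelines typically filter low-coverage bins first).

```python
def balance_contact_matrix(matrix, n_iter=200, tol=1e-10):
    """Balance a symmetric contact matrix so that all row sums become equal.

    Returns (balanced_matrix, bias), where
    matrix[i][j] is approximately balanced[i][j] * bias[i] * bias[j].
    Assumes every row has a nonzero sum.
    """
    n = len(matrix)
    w = [row[:] for row in matrix]  # work on a copy
    bias = [1.0] * n
    for _ in range(n_iter):
        row_sums = [sum(row) for row in w]
        mean_sum = sum(row_sums) / n
        delta = [s / mean_sum for s in row_sums]  # relative coverage per bin
        if max(abs(d - 1.0) for d in delta) < tol:
            break  # coverage is already uniform
        for i in range(n):
            for j in range(n):
                w[i][j] /= delta[i] * delta[j]
        for i in range(n):
            bias[i] *= delta[i]
    return w, bias


# Hypothetical 3-bin contact matrix with uneven coverage.
toy_contacts = [
    [10.0, 4.0, 2.0],
    [4.0, 20.0, 6.0],
    [2.0, 6.0, 5.0],
]
balanced, bias = balance_contact_matrix(toy_contacts)
print([round(sum(row), 6) for row in balanced])
```

Because the same factor is applied to rows and columns, the balanced matrix stays symmetric, and the accumulated `bias` vector can be inspected as a per-bin visibility estimate.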
- Jared Andrews (@j-andrews7)
- Kevin Blighe (@kevinblighe, biostars)
- Ludwig Geistlinger (@lgeistlinger)
- Jeremy Leipzig (@leipzig)
- Avi Srivastava (@k3yavi)
- Stephanie Hicks (@stephaniehicks)
- Sridhar N Srivatsan (@sridhar0605)
- Qingzhou Zhang (@zqzneptune)
- Guandong Shang (@shangguandong1996, @GuandongS)