nf-core/rnaseq benchmark: how do tool combinations in different pipeline versions affect the analysis outcome?
Five pipeline settings were run on three publicly available datasets from different organisms (human, plant, fish) of varying sizes (117 GB, 37 GB, 11 GB), each containing spike-ins of the External RNA Control Consortium (ERCC).
Two versions of the nf-core/rnaseq pipeline (v1.4.2 and v3.2) were run in five settings, differing in aligner and quantification tools. For the older pipeline version v1.4.2, the options --aligner salmon and hisat2 were used, while for the newer pipeline version v3.2 the options --aligner star_salmon and star_rsem, as well as the setting --pseudo_aligner salmon --skip_alignment true, were used.
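The five settings can be written down as a small table of pipeline version and options. The sketch below only assembles the command lines and does not launch anything. The option strings are copied from the description above (reading "hisat2" as --aligner hisat2) and have not been re-checked against either release. The helper name build_command is hypothetical, and required inputs such as reads and references are deliberately left out.

```python
# Five settings as listed in the text: (pipeline version, options).
# Flags are taken verbatim from the description, not verified per release.
SETTINGS = [
    ("1.4.2", ["--aligner", "salmon"]),
    ("1.4.2", ["--aligner", "hisat2"]),
    ("3.2", ["--aligner", "star_salmon"]),
    ("3.2", ["--aligner", "star_rsem"]),
    ("3.2", ["--pseudo_aligner", "salmon", "--skip_alignment", "true"]),
]


def build_command(version, options):
    """Return the argument list for one run; nothing is executed here.

    Input, reference and output parameters are omitted on purpose.
    """
    return ["nextflow", "run", "nf-core/rnaseq", "-r", version, *options]


commands = [build_command(version, options) for version, options in SETTINGS]

for command in commands:
    print(" ".join(command))
```

Keeping the settings in one list makes it easy to see that two runs belong to v1.4.2 and three to v3.2.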
- Human cell dataset (publication by Rapaport et al., 2013)
- Arabidopsis dataset (publication by Califar et al., 2020)
- Zebrafish dataset (publication by Schall et al., 2017)
The iGenomes Ensembl references for Homo sapiens (GRCh37), Arabidopsis thaliana (TAIR10) and Danio rerio (GRCz10) were used for analysis after adding the ERCC sequences and annotations to the .fasta and .gtf files.
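Adding the spike-ins amounts to appending the ERCC records to the reference files. The following is a minimal sketch of that step, assuming uncompressed plain-text .fasta and .gtf files. The function name append_spike_ins and all file names in the usage comment are hypothetical, and this is not necessarily how the references for this benchmark were built.

```python
def append_spike_ins(reference, spike_ins, combined):
    """Write `reference` followed by `spike_ins` into the new file `combined`.

    Files are copied in chunks so that large genomes are never held in memory.
    A newline is added after a file that does not end with one, so the first
    spike-in record always starts on its own line.
    """
    with open(combined, "wb") as out:
        for source in (reference, spike_ins):
            last_byte = b"\n"
            with open(source, "rb") as handle:
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    out.write(chunk)
                    last_byte = chunk[-1:]
            if last_byte != b"\n":
                out.write(b"\n")


# Hypothetical usage; the same function serves sequences and annotations:
# append_spike_ins("genome.fa", "ERCC.fa", "genome_with_ERCC.fa")
# append_spike_ins("genes.gtf", "ERCC.gtf", "genes_with_ERCC.gtf")
```

The originals are left untouched and a new combined file is written, so the unmodified references stay available for comparison.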
The qbic-pipelines/rnadeseq pipeline was used for downstream analysis of the rnaseq output with DESeq2 to identify differentially expressed (DE) genes.
Analysis and visualization of the DESeq2 output were performed in a Python Jupyter Notebook (6.3.0), mainly using the packages pandas (1.2.4), numpy (1.20.2), scipy.stats (1.7.0) and scikit-learn (1.0). Graphs were generated with the Python packages matplotlib (3.3.4) and seaborn (0.11.2). Venn diagrams were drawn using the R (4.2.2) library VennDiagram (1.7.3).
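As an illustration of the kind of comparison behind a Venn diagram, the sketch below counts DE genes that are shared between two settings or unique to one of them. It is a standard-library example, not the notebook code. The tab-separated layout, the gene_id column name and the padj cutoff of 0.05 are assumptions; padj itself is the adjusted p-value column reported by DESeq2.

```python
import csv
import io


def de_genes(handle, padj_cutoff=0.05):
    """Collect gene IDs whose adjusted p-value is below the cutoff.

    Assumes a tab-separated table with `gene_id` and `padj` columns.
    Rows with a non-numeric padj (for example NA) are skipped.
    """
    genes = set()
    for row in csv.DictReader(handle, delimiter="\t"):
        try:
            padj = float(row["padj"])
        except ValueError:
            continue
        if padj < padj_cutoff:
            genes.add(row["gene_id"])
    return genes


def venn_counts(first, second):
    """Return the sizes of the three regions of a two-set Venn diagram."""
    return {
        "only_first": len(first - second),
        "shared": len(first & second),
        "only_second": len(second - first),
    }


# Tiny made-up tables standing in for the results of two settings.
table_a = "gene_id\tpadj\ng1\t0.01\ng2\tNA\ng3\t0.2\n"
table_b = "gene_id\tpadj\ng1\t0.001\ng3\t0.04\n"

counts = venn_counts(de_genes(io.StringIO(table_a)), de_genes(io.StringIO(table_b)))
print(counts)
```

With real result files, an open file handle can be passed in place of the io.StringIO objects.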
The results were submitted to the journal NAR Genomics and Bioinformatics and pre-published on bioRxiv: How tool combinations in different pipeline versions affect the outcome in RNA-seq analysis. The Author's Original Version and Supplements can also be found in the Paper/ folder.