---
title: "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database"
author:
- Luke Zappia^1,2^
- Belinda Phipson^1^
- Alicia Oshlack^1,2^*
output:
  bookdown::word_document2:
    reference_docx: style/style.docx
    pandoc_args: [ "--csl", "style/plos.csl" ]
bibliography: style/references.bib
---
^1^ Bioinformatics, Murdoch Children's Research Institute, Melbourne, Victoria,
Australia
^2^ School of Biosciences, Faculty of Science, University of Melbourne,
Melbourne, Victoria, Australia
\* Corresponding author
E-mail: alicia.oshlack@mcri.edu.au
```{r setup, include = FALSE}
knitr::opts_chunk$set(autodep        = TRUE,
                      cache          = FALSE,
                      cache.path     = "cache/",
                      cache.comments = TRUE,
                      echo           = FALSE,
                      error          = FALSE,
                      fig.path       = "figures/",
                      fig.width      = 10,
                      fig.height     = 8,
                      dev            = c('png', 'pdf'),
                      message        = FALSE,
                      warning        = FALSE)
```
```{r libraries}
# Plotting
library("cowplot")
library("RColorBrewer")
# Presentation
library("knitr")
# Strings
library("stringr")
# JSON
library("jsonlite")
# Tidyverse
library("tidyverse")
```
```{r source}
source("R/plotting.R")
source("R/utils.R")
```
```{r palette}
pal <- c(pink   = "#EC008C", # Pink
         blue   = "#00ADEF", # Blue
         green  = "#8DC63F", # Green
         teal   = "#00B7C6", # Teal
         orange = "#F47920", # Orange
         purple = "#7A52C7", # Purple
         grey   = "#999999") # Grey

phase.pal <- c(brewer.pal(5, "Set1"), "#999999")
```
```{r load}
# Tools table: one row per tool; parse dates and flag tools added before
# October 2016
tools <- read_json("data/tools-table.json", simplifyVector = TRUE) %>%
    select(-Refs) %>%
    mutate(Added = as.Date(Added),
           Updated = as.Date(Updated)) %>%
    mutate(IsOld = Added < as.Date("2016-10-01"))

# Daily and cumulative counts of tools added, with missing days filled as zero
date.counts <- tools %>%
    select(Date = Added) %>%
    group_by(Date = as.Date(Date)) %>%
    summarise(Count = n()) %>%
    complete(Date = full_seq(Date, 1), fill = list(Count = 0)) %>%
    mutate(Total = cumsum(Count))

# Category descriptions, with analysis phases in display order
cats <- read_json("data/descriptions.json", simplifyVector = TRUE) %>%
    mutate(Phase = factor(Phase,
                          levels = c("Phase 1", "Phase 2", "Phase 3", "Phase 4",
                                     "Multiple", "Other")))

# Number and proportion of tools per category (columns 7 to 38 are the
# category flags); CamelCase category names are split into separate words
cat.counts <- tools %>%
    summarise_at(7:38, sum) %>%
    gather(key = Category, value = Count) %>%
    arrange(-Count, Category) %>%
    mutate(Prop = Count / nrow(tools)) %>%
    left_join(cats, by = "Category") %>%
    mutate(Category = str_replace_all(Category, "([[:upper:]])", " \\1")) %>%
    mutate(Category = str_trim(Category)) %>%
    mutate(Category = ifelse(Category == "U M Is", "UMIs", Category)) %>%
    mutate(Category = factor(Category, levels = Category))
```
# Abstract
As single-cell RNA-sequencing (scRNA-seq) datasets have become more widespread,
the number of tools designed to analyse these data has dramatically increased.
Navigating the vast sea of tools now available is becoming increasingly
challenging for researchers. In order to better facilitate selection of
appropriate analysis tools we have created the scRNA-tools database
(www.scRNA-tools.org) to catalogue and curate analysis tools as they become
available. Our database collects a range of information on each scRNA-seq
analysis tool and categorises them according to the analysis tasks they
perform. Exploration of this database gives insights into the areas of rapid
development of analysis methods for scRNA-seq data. We see that many tools
perform tasks specific to scRNA-seq analysis, particularly clustering and
ordering of cells. We also find that the scRNA-seq community embraces an
open-source and open-science approach, with most tools available under
open-source licenses and preprints being extensively used as a means to
describe methods. The scRNA-tools database provides a valuable resource for
researchers embarking on scRNA-seq analysis and records the growth of the
field over time.
# Author summary
In recent years single-cell RNA-sequencing technologies have emerged that allow
scientists to measure the activity of genes in thousands of individual cells
simultaneously. This means we can start to look at what each cell in a sample
is doing instead of considering an average across all cells in a sample, as was
the case with older technologies. However, while access to this kind of data
presents a wealth of opportunities it comes with a new set of challenges.
Researchers across the world have developed new methods and software tools to
make the most of these datasets but the field is moving at such a rapid pace it
is difficult to keep up with what is currently available. To make this easier
we have developed the scRNA-tools database and website (www.scRNA-tools.org).
Our database catalogues analysis tools, recording the tasks they can be used
for, where they can be downloaded from and the publications that describe how
they work. By looking at this database we can see that developers have focused
on methods specific to single-cell data and that they embrace an open-source
approach with permissive licensing, sharing of code and release of preprint
publications.
# Introduction
Single-cell RNA-sequencing (scRNA-seq) has rapidly gained traction as an
effective tool for interrogating the transcriptome at the resolution of
individual cells. Since the first protocols were published in 2009
[@Tang2009-jd] the number of cells profiled in individual scRNA-seq experiments
has increased exponentially, outstripping Moore’s Law [@Svensson2018-dx]. This
new kind of transcriptomic data brings a demand for new analysis methods. Not
only is the scale of scRNA-seq datasets much greater than that of bulk
experiments but there are also a variety of challenges unique to the
single-cell context [@Stegle2015-mc]. Specifically, scRNA-seq data is extremely
sparse (there is no expression measured for many genes in most cells), it can
have technical artefacts such as low-quality cells or differences between
sequencing batches and the scientific questions of interest are often different
to those asked of bulk RNA-seq datasets. For example, many bulk RNA-seq
datasets are generated to discover differentially expressed genes through a
designed experiment while many scRNA-seq experiments aim to identify or
classify cell types in complex tissues.
The bioinformatics community has embraced this new type of data at an
astonishing rate, designing a plethora of methods for the analysis of scRNA-seq
data. Keeping up with the current state of scRNA-seq analysis is now a
significant challenge as the field is presented with a huge number of choices
for analysing a dataset. Since September 2016 we have collated and categorised
scRNA-seq analysis tools as they have become available. This database is being
continually updated and is publicly available at www.scRNA-tools.org. In order
to help researchers navigate the vast ocean of analysis tools we categorise
tools in the database in the context of the typical phases of an scRNA-seq
analysis. Through the analysis of this database we show trends in not only the
analysis applications these methods address but how they are published and
licensed, and the platforms they use. Based on this database we gain insight
into the current state of tools in this rapidly developing field.
# Design and implementation
## Database
The scRNA-tools database contains information on software tools specifically
designed for the analysis of scRNA-seq data. For a tool to be eligible for
inclusion in the database it must be available for download and public use.
This can be from a software package repository (such as Bioconductor
[@Huber2015-lz], CRAN or PyPI), a code sharing website such as GitHub or
directly from a private website. When new tools come to our attention they are
added to the scRNA-tools database. DOIs and publication dates are recorded for
any associated publications. As preprints may be frequently updated they are
marked as a preprint instead of recording a date. The platform used to build
the tool, links to code repositories, associated licenses and a short
description are also recorded. Each tool is categorised according to the
analysis tasks it can perform, receiving a true or false for each category
based on what is described in the accompanying paper or documentation. We also
record the date that each entry was added to the database and the date that it
was last updated. Most tools are added after a preprint or publication becomes
available but some have been added after being mentioned on social media or in
similar collections such as Sean Davis' awesome-single-cell page
(https://github.com/seandavi/awesome-single-cell).
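
As a rough illustration of the record structure described above, the following sketch builds a two-row table in base R with one logical column per analysis category. The tool names, dates and column names are hypothetical and chosen only for illustration; they are not the database's actual schema.

```r
# Hypothetical miniature of the tools table: one row per tool, one logical
# column per analysis category (all names and values are illustrative).
tools_example <- data.frame(
  Name          = c("ToolA", "ToolB"),
  Platform      = c("R", "Python"),
  License       = c("GPL-3", NA),   # NA: no license could be identified
  Added         = as.Date(c("2016-09-08", "2017-11-20")),
  Clustering    = c(TRUE, FALSE),
  Ordering      = c(FALSE, TRUE),
  Visualisation = c(TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Category columns are the logical ones
is_category <- vapply(tools_example, is.logical, logical(1))
category_cols <- names(tools_example)[is_category]

# List the categories assigned to each tool
tool_categories <- lapply(seq_len(nrow(tools_example)), function(i) {
  category_cols[unlist(tools_example[i, category_cols])]
})
names(tool_categories) <- tools_example$Name
```

Storing each category as a separate true or false column keeps the table flat, so category frequencies can be obtained with simple column sums.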
## Website
To build the scRNA-tools website we start with the table described above as a
CSV file which is processed using an R script. The lists of packages available
in the CRAN, Bioconductor, PyPI and Anaconda software repositories are
downloaded and matched with tools in the database. For tools with associated
publications the number of citations they have received is retrieved from the
Crossref database (www.crossref.org) using the rcrossref package
(v`r packageVersion("rcrossref")`) [@Chamberlain2017-ov]. We also make use of
the aRxiv package (v`r packageVersion("aRxiv")`) [@Ram2017-mn] to retrieve
information about arXiv preprints. JSON files describing the complete table,
tools and categories are produced and used to populate the website.
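
The repository matching step could look something like the sketch below. It is not the script used to build the website: the package lists are small hard-coded stand-ins for the downloaded repository indexes, the function name `match_repos` is hypothetical, and case-insensitive matching on the tool name is an assumption.

```r
# Hypothetical inputs: tool names from the database and the package names
# listed by each repository (in practice these lists would be downloaded).
tool_names <- c("Seurat", "scater", "Scanpy", "SCell")
repo_packages <- list(
  CRAN         = c("Seurat", "ggplot2"),
  Bioconductor = c("scater", "edgeR"),
  PyPI         = c("scanpy", "numpy")
)

# Build a tools-by-repositories logical matrix using case-insensitive name
# matching (an assumption made for this sketch)
match_repos <- function(tools, repos) {
  hits <- matrix(FALSE, nrow = length(tools), ncol = length(repos),
                 dimnames = list(tools, names(repos)))
  for (repo in names(repos)) {
    hits[, repo] <- tolower(tools) %in% tolower(repos[[repo]])
  }
  hits
}

repo_matches <- match_repos(tool_names, repo_packages)
```

Matching on names alone can produce false positives when unrelated packages share a name, so matches of this kind would normally need some manual checking.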
The website consists of three main pages. The home page shows an interactive
table with the ability to sort, filter and download the database. The second
page shows an entry for each tool, giving the description, details of
publications, details of the software code and license and the associated
software categories. Badges are added to tools to provide clearly visible
details of any associated software or GitHub repositories. The final page
describes the categories, providing easy access to the tools associated with
them. Both the tools and categories pages can be sorted in a variety of ways,
including by the number of associated publications or citations. An additional
page shows a live and up-to-date version of some of the analysis presented here
with visualisations produced using ggplot2 (v`r packageVersion("ggplot2")`)
[@Wickham2010-zq] and plotly (v`r packageVersion("plotly")`) [@Sievert2017-am].
We welcome contributions to the database from the wider community via
submitting an issue to the project GitHub page
(https://github.com/Oshlack/scRNA-tools) or by filling in the submission form
on the scRNA-tools website.
## Analysis
The most recent version of the scRNA-tools database as of
`r format(Sys.Date(), format = "%d %B %Y")` was
used for the analysis presented in this paper. Data was manipulated in R
(v`r str_extract(R.version.string, "[0-9.]+")`) using the dplyr package
(v`r packageVersion("dplyr")`) [@Wickham2017-ip] and plots produced using the
ggplot2 (v`r packageVersion("ggplot2")`) and cowplot
(v`r packageVersion("cowplot")`) [@Wilke2017-zt] packages.
# Results
## Overview of the scRNA-tools database
When the database was first constructed it contained 70 scRNA-seq analysis
tools representing the majority of work in the field during the three years
from the publication of SAMstrt [@Katayama2013-pq] in November 2013 up to
September 2016. In the time since then over 160 new tools have been added (Fig
\@ref(fig:overview-panel)A). This more than threefold increase in the number of
available tools in such a short time demonstrates the booming interest in
scRNA-seq and its maturation from a technique requiring custom-built equipment
with specialised protocols to a commercially available product.
```{r make-overview-panel}
number.plot <- plotToolsNumber(date.counts, pal)
pub.plot <- plotPublication(tools, pal)
pub.time.plot <- plotPubTime(tools, pal)
licenses.plot <- plotLicenses(tools, pal)
platforms.plot <- plotPlatforms(tools, pal)

pub.panel <- plot_grid(pub.plot, pub.time.plot, nrow = 1, ncol = 2,
                       labels = c("B - Publication status",
                                  "C - Publication status over time"),
                       hjust = -0.01)

plat.panel <- plot_grid(platforms.plot, licenses.plot, nrow = 1, ncol = 2,
                        labels = c("D - Platforms used by analysis tools",
                                   "E - Associated software licenses"),
                        hjust = -0.01)

panel <- plot_grid(number.plot, pub.panel, plat.panel, nrow = 3, ncol = 1,
                   labels = c("A - Increase in tools over time", rep("", 4)),
                   hjust = -0.01)

ggsave("figures/Fig1_overview_panel.png", panel, height = 22.2, width = 19,
       units = "cm")
ggsave("figures/Fig1_overview_panel.eps", panel, height = 22.2, width = 19,
       units = "cm")
```
```{r overview-panel, fig.cap = "(A) Number of tools in the scRNA-tools database over time. Since the scRNA-tools database was started in September 2016 more than 160 new tools have been released. (B) Publication status of tools in the scRNA-tools database. Over half of the tools in the full database have at least one published peer-reviewed paper while another third are described in preprints. (C) When stratified by the date tools were added to the database we see that the majority of tools added before October 2016 are published, while around half of newer tools are available only as preprints. Newer tools are also more likely to be unpublished in any form. (D) The majority of tools are built using either the R or Python programming languages. (E) Most tools are released under a standard open-source software license, with variants of the GNU General Public License (GPL) being the most common. However, licenses could not be found for a large proportion of tools. Up-to-date versions of these plots (with the exception of C) are available on the analysis page of the scRNA-tools website (https://www.scrna-tools.org/analysis)."}
include_graphics("figures/Fig1_overview_panel.png")
```
### Publication status
Most tools have been added to the scRNA-tools database after coming to our
attention in a publication or preprint describing their method and use. Of all
the tools in the database about half have at least one publication in a
peer-reviewed journal and another third are described in preprint articles,
typically on the bioRxiv preprint server (Fig \@ref(fig:overview-panel)B).
Tools can be split into those that were available when the database was created
and those that have been added since. We can see that the majority of older
tools have been published while more recent tools are more likely to only be
available as preprints (Fig \@ref(fig:overview-panel)C). This is a good
demonstration of the delay imposed by the traditional publication process. By
publishing preprints and releasing software via repositories such as GitHub,
scRNA-seq tool developers make their tools available to the community much
earlier, allowing them to be used for analysis and their methods improved prior
to formal publication [@Bourne2017-fc].
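
The stratified comparison described above can be sketched in base R as follows. The six tools, their dates and their statuses are invented toy data, and the status labels are assumptions; only the October 2016 cut-off comes from the analysis itself.

```r
# Hypothetical toy data: date added and publication status for six tools
toy <- data.frame(
  Added  = as.Date(c("2016-09-08", "2016-09-08", "2016-09-20",
                     "2017-03-01", "2017-08-15", "2018-01-10")),
  Status = c("Published", "Published", "Preprint",
             "Preprint", "Not published", "Published"),
  stringsAsFactors = FALSE
)

# Same cut-off as in the text: tools added before October 2016 are "old"
toy$IsOld <- toy$Added < as.Date("2016-10-01")

# Row-wise proportions: share of each status within each age group
status_counts <- table(IsOld = toy$IsOld, Status = toy$Status)
status_props <- prop.table(status_counts, margin = 1)
```

Normalising within each age group, rather than over the whole table, is what allows older and newer tools to be compared even though the groups differ in size.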
### Platforms and licensing
Developers of scRNA-seq analysis tools have choices to make about what
platforms they use to create their tools, how they make them available to the
community and whether they share the source code. We find that the most
commonly used platform for creating scRNA-seq analysis tools is the R
statistical programming language, with many tools made available through the
Bioconductor or CRAN repositories (Fig \@ref(fig:overview-panel)D). Python is
the second most popular language, followed by MATLAB, a proprietary programming
language, and the lower-level C++. The use of R and Python is consistent with
their popularity across a range of data science fields. In particular the
popularity of R reflects its history as the language of choice for the analysis
of bulk RNA-seq datasets and a range of other biological data types.
The majority of tools in the scRNA-tools database have been developed with an
open-source approach, making their code available under standard open-source
software licenses (Fig \@ref(fig:overview-panel)E). We feel this reflects the general
underlying sentiment and willingness of the bioinformatics community to share
and build upon the work of others. Variations of the GNU General Public License
(GPL) are the most common, covering almost half of tools. This license allows free
use, modification and distribution of source code, but also has a "copyleft"
nature which requires any derivatives to disclose their source code and use the
same license. The MIT license is the second most popular; it also allows use
of code for any purpose but does not restrict how derivatives are distributed
or licensed. The appropriate license could not be identified for almost a quarter
of tools. This is problematic as fellow developers must assume that source code
cannot be reused, potentially limiting the usefulness of the methods in those
tools. We strongly encourage tool developers to clearly display their license
in source code and documentation to provide certainty to the community as to
any restrictions on the use of their work.
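
Summarising licenses in this way requires collapsing license variants into families. The sketch below shows one possible way to do that in base R; the license strings are invented, the helper `license_family` is hypothetical, and the grouping rules are assumptions rather than the rules used to produce the figure.

```r
# Hypothetical license strings as they might be recorded for eight tools
licenses <- c("GPL-3", "GPL-2", "GPL (>= 2)", "MIT", "MIT + file LICENSE",
              "Apache-2.0", NA, NA)

# Collapse license variants into broad families. Note that this simple
# pattern would also count LGPL and AGPL as GPL variants.
license_family <- function(x) {
  family <- rep("Other", length(x))
  family[grepl("GPL", x)] <- "GPL"
  family[grepl("^MIT", x)] <- "MIT"
  family[is.na(x)] <- "Unknown"   # no license could be identified
  family
}

# Proportion of tools in each license family
license_props <- prop.table(table(license_family(licenses)))
```

Keeping tools without an identifiable license as an explicit "Unknown" family, instead of dropping them, makes the size of that group visible in the summary.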
## Categories of scRNA-seq analysis
Single-cell RNA-sequencing is often used to explore complex mixtures of cell
types in an unsupervised manner. As has been described in previous reviews a
standard scRNA-seq analysis in this setting consists of several tasks which can
be completed using various tools [@Bacher2016-en; @Wagner2016-zm;
@Miragaia2017-tq; @Poirion2016-da; @Rostom2017-dc]. In the scRNA-tools database
we categorise tools based on the analysis tasks they perform. Here we group
these tasks into four broad phases of analysis: data acquisition, data
cleaning, cell assignment and gene identification (Fig
\@ref(fig:analysis-phases)). The data acquisition phase (Phase 1) takes the raw
nucleotide sequences from the sequencing experiment and returns a matrix
describing the expression of each gene in each cell. This phase consists of
tasks common to bulk RNA-seq experiments, such as alignment to a reference
genome or transcriptome and quantification of expression, but is often extended
to handle Unique Molecular Identifiers (UMIs) [@Kivioja2012-bq]. Once an
expression matrix has been obtained it is vital to make sure the resulting data
is of high enough quality. In the data cleaning phase (Phase 2) quality control
of cells is performed as well as filtering of uninformative genes. Additional
tasks may be performed to normalise the data or impute missing values.
Exploratory data analysis tasks are often performed in this phase, such as
viewing the datasets in reduced dimensions to look for underlying structure.
The high-quality expression matrix is the focus of the next phases of analysis.
In Phase 3 cells are assigned, either to discrete groups via clustering or
along a continuous trajectory from one cell type to another. As high-quality
reference datasets become available it will also become feasible to classify
cells directly into different cell types. Once cells have been assigned the
focus of analysis turns to interpreting what those assignments mean.
Identifying interesting genes (Phase 4), such as those that are differentially
expressed across groups, marker genes expressed in a single group or genes that
change expression along a trajectory, is the typical way to do this. The
biological significance of those genes can then be interpreted to give meaning
to the experiment, either by investigating the genes themselves or by getting a
higher-level view through techniques such as gene set testing.
```{r analysis-phases, fig.cap = "Phases of a typical unsupervised scRNA-seq analysis process. In Phase 1 (data acquisition) raw sequencing reads are converted into a gene-by-cell expression matrix. For many protocols this requires the alignment of reads to a reference genome and the assignment and de-duplication of Unique Molecular Identifiers (UMIs). The data is then cleaned (Phase 2) to remove low-quality cells and uninformative genes, resulting in a high-quality dataset for further analysis. The data can also be normalised and missing values imputed during this phase. Phase 3 assigns cells, either in a discrete manner to known (classification) or unknown (clustering) groups or to a position on a continuous trajectory. Interesting genes (e.g. differentially expressed, markers, specific patterns of expression) are then identified to explain these groups or trajectories (Phase 4)."}
include_graphics("figures/Fig2_phase_diagram.png")
```
While there are other approaches that could be taken to analyse scRNA-seq data
these phases represent the most common path from raw sequencing reads to
biological insight applicable to many studies. An exception to this may be
experiments designed to test a specific hypothesis where cell populations may
have been sorted or the interest lies in differences between experimental
conditions rather than cell types. In this case Phase 3 may not be required, and
slightly different tools or approaches may be used, but many of the same
challenges will apply. In addition, as the field expands and develops it is
likely that data will be used in new ways to answer other biological questions,
requiring new analysis techniques. Descriptions of the categories in the
scRNA-tools database are given in Table \@ref(tab:categories-table), along with
the associated analysis phases.
```{r categories-table}
cap <- "Descriptions of categories for tools in the scRNA-tools database"
cats %>%
    mutate(Category = str_replace_all(Category, "([[:upper:]])", " \\1")) %>%
    mutate(Category = str_trim(Category)) %>%
    mutate(Category = ifelse(Category == "U M Is", "UMIs", Category)) %>%
    arrange(Phase) %>%
    select(Phase, everything()) %>%
    kable(caption = cap)
```
### Trends in scRNA-seq analysis tasks
Each of the tools in the database is assigned to one or more analysis
categories. We investigated these categories in further detail to give insight
into the trends in scRNA-seq analysis. Fig \@ref(fig:categories-panel)A shows
the frequency of tools performing each of the analysis tasks. Visualisation is
the most commonly included task and is important across all stages of analysis
for exploring and displaying data and results. Tasks for assigning cells
(ordering and clustering) are the next most common. This has been the biggest
area of development in single-cell analysis with clustering tools such as
Seurat [@Satija2015-or; @Butler2018-js], SC3 [@Kiselev2017-an] and BackSPIN
[@Zeisel2015-rd] being used to identify cell types in a sample and trajectory
analysis tools (for example Monocle [@Trapnell2014-he; @Qiu2017-bi;
@Qiu2017-mq], Wishbone [@Setty2016-yh] and DPT [@Haghverdi2016-zt]) being used
to investigate how genes change across developmental processes. These areas
reflect the new opportunities for analysis provided by single-cell data that
are not possible with bulk RNA-seq experiments.
```{r make-categories-panel}
cats.plot <- plotCategories(cat.counts, phase.pal)
cats.time.plot <- plotCatsTime(tools, cat.counts, pal)
phases.date.plot <- plotPhasesDate(tools, date.counts, phase.pal)
cats.tools.plot <- plotCatsPerTool(tools, pal)
cats.tools.time.plot <- plotCatsPerToolTime(tools, pal)

cats.panel <- plot_grid(cats.time.plot, phases.date.plot, nrow = 1, ncol = 2,
                        labels = c("B - Change in analysis categories",
                                   "C - Analysis phases over time"),
                        hjust = -0.01)

tools.panel <- plot_grid(cats.tools.plot, cats.tools.time.plot, nrow = 1,
                         ncol = 2,
                         labels = c("D - Number of categories per tool",
                                    "E - Categories per tool by date added"),
                         hjust = -0.01)

panel <- plot_grid(cats.plot, cats.panel, tools.panel,
                   nrow = 3, ncol = 1, rel_heights = c(1, 1.4, 0.8),
                   labels = c("A - Analysis categories", rep("", 4)),
                   hjust = -0.01)

ggsave("figures/Fig3_categories_panel.png", panel, height = 22.2, width = 19,
       units = "cm")
ggsave("figures/Fig3_categories_panel.eps", panel, height = 22.2, width = 19,
       units = "cm")
```
```{r categories-panel, fig.cap = "(A) Categories of tools in the scRNA-tools database. Each tool can be assigned to multiple categories based on the tasks it can complete. Categories associated with multiple analysis phases (visualisation, dimensionality reduction) are among the most common, as are categories associated with the cell assignment phase (ordering, clustering). (B) Changes in analysis categories over time, comparing tools added before and after October 2016. There have been significant increases in the percentage of tools associated with visualisation, dimensionality reduction, gene networks and simulation. Categories including expression patterns, ordering and interactivity have seen relative decreases. (C) Changes in the percentage of tools associated with analysis phases over time. The percentage of tools involved in the data acquisition and data cleaning phases has increased, as has the percentage of tools designed for alternative analysis tasks. The gene identification phase has seen a relative decrease in the number of tools. (D) The number of categories associated with each tool in the scRNA-tools database. The majority of tools perform few tasks. (E) Most tools that complete many tasks are relatively recent."}
include_graphics("figures/Fig3_categories_panel.png")
```
Dimensionality reduction is also a common task and has applications in
visualisation (via techniques such as t-SNE [@Maaten2008-ne]), quality control
and as a starting point for analysis. Testing for differential expression (DE)
is perhaps the most common analysis performed on bulk RNA-seq datasets and it is
also commonly applied by many scRNA-seq analysis tools, typically to identify
genes that are different in one group of cells compared to the rest. However
it should be noted that the DE testing applied by scRNA-seq tools is often not
as sophisticated as the rigorous statistical frameworks of tools developed for
bulk RNA-seq such as edgeR [@Robinson2010-pt; @McCarthy2012-gc], DESeq2
[@Love2014-tw] and limma [@Ritchie2015-te], often using simple statistical tests
such as the likelihood ratio test. While methods designed to test DE
specifically in single-cell datasets do exist (such as
SCDE [@Kharchenko2014-tb] and scDD [@Korthauer2016-yq]) it is still unclear
whether they improve on methods that have been established for bulk data
[@Jaakkola2016-ko; @Miao2016-eo; @Dal_Molin2017-pl], with the most comprehensive
comparison to date finding that bulk methods do not perform significantly worse
than those designed for scRNA-seq data [@Soneson2018-fb].
To investigate how the focus of scRNA-seq tool development has changed over
time we again divided the scRNA-tools database into tools added before and
after October 2016. This allowed us to see which analysis tasks are more common
in recently released tools. We looked at the percentage of tools in each time
period that performed tasks in the different analysis categories (Fig
\@ref(fig:categories-panel)B). Some categories show little change in the
proportion of tools that perform them while other areas have changed
significantly. Specifically, both visualisation and dimensionality reduction
are more commonly addressed by recent tools. The UMIs category has also seen a
big increase recently as UMI based protocols have become commonly used and
tools designed to handle the extra processing steps required have been
developed (e.g. UMI-tools [@Smith2017-nz], umis [@Svensson2017-zo], zUMIs
[@Parekh2018-ug]). Simulation is a valuable technique for developing, testing
and validating scRNA-seq tools. More packages are now including their
simulation functions and some tools have been developed for the specific
purpose of generating realistic synthetic scRNA-seq datasets (e.g. powsimR
[@Vieth2017-vc], Splatter [@Zappia2017-sg]). Classification of cells into known
groups has also increased as reference datasets become available and more tools
are identifying or making use of co-regulated gene networks.
Some categories have seen a decrease in the proportion of tools they represent,
most strikingly testing for expression patterns along a trajectory.
This is likely related to the change in cell ordering analysis, which is the
focus of a lower percentage of tools added after October 2016. The ordering of
cells along a trajectory was one of the first developments in scRNA-seq
analysis and a decrease in the development of these tools could indicate that
researchers have moved on to other techniques or that use has converged on a
set of mature tools.
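
The before and after comparison of categories can be sketched as below. The nine tools and their category assignments are invented toy data and the helper `pct_by_period` is hypothetical; the sketch only illustrates how a percentage per period and its change could be computed.

```r
# Hypothetical toy data: logical category columns plus the IsOld flag
# (TRUE for tools added before October 2016)
toy <- data.frame(
  IsOld         = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
  Ordering      = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE),
  Visualisation = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE)
)
category_cols <- setdiff(names(toy), "IsOld")

# Percentage of tools in one period that are assigned to each category
pct_by_period <- function(data, cols, old) {
  subset_rows <- data[data$IsOld == old, cols, drop = FALSE]
  100 * vapply(subset_rows, mean, numeric(1))
}

pct_old <- pct_by_period(toy, category_cols, old = TRUE)
pct_new <- pct_by_period(toy, category_cols, old = FALSE)

# Positive values: the category is more common among recently added tools
pct_change <- pct_new - pct_old
```

Because the two periods contain different numbers of tools, the comparison is made on percentages within each period and not on raw counts.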
By grouping categories based on their associated analysis phases we see similar
trends over time (Fig \@ref(fig:categories-panel)C). We see increases in the
percentage of tools performing tasks in Phase 1 (quantification), across
multiple phases (such as visualisation and dimensionality reduction) and
alternative analysis tasks. In contrast the percentage of tools that perform
gene identification tasks (Phase 4) has decreased and the percentage assigning
cells (Phase 3) has remained steady. Phase 2 (quality control and filtering)
has fluctuated over time but currently sits at a level slightly above when the
database was first created. This also indicates a maturation of the analysis
space as developers shift away from the tasks that were the focus of bulk
RNA-seq analysis and continue to focus on those specific to scRNA-seq while
working on methods for handling data from new protocols and performing
alternative analysis tasks.
### Pipelines and toolboxes
While there are a considerable number of scRNA-seq tools that only perform a
single analysis task, many perform at least two (Fig
\@ref(fig:categories-panel)D). Some tools (dropEst [@Petukhov2017-lm], DrSeq2
[@Zhao2017-ju], scPipe [@Tian2017-zl]) are pre-processing pipelines, taking raw
sequencing reads and producing an expression matrix. Others, such as Scanpy
[@Wolf2018-na], SCell [@Diaz2016-fz], Seurat, Monocle and scater
[@McCarthy2017-ql] can be thought of as analysis toolboxes, able to complete a
range of complex analyses starting with a gene expression matrix. Most of the
tools that complete many tasks are relatively recent (Fig
\@ref(fig:categories-panel)E). Being able to complete multiple tasks using a
single tool can simplify analysis as problems with converting between different
data formats can be avoided. However, it is important to remember that it is
difficult for a tool with many functions to continue to represent the
state of the art in all of them. Support for common data formats, such as the
recently released SingleCellExperiment [@Lun2017-jj], anndata [@Wolf2018-na] or
loom (http://loompy.org/) objects, provides another way for developers to allow
easy use of their tools and for users to build custom workflows from
specialised tools.
### Alternative analyses
Some tools perform analyses that lie outside the common tasks performed on
scRNA-seq data described above. Simulation is one alternative task that has
already been mentioned but there is also a group of tools designed to detect
biological signals in scRNA-seq data apart from changes in expression. Examples
include alternative splicing (BRIE [@Huang2017-lp], Outrigger
[@Song2017-ym], SingleSplice [@Welch2016-dv]), single nucleotide variants
(SSrGE [@Poirion2016-ld]), copy number variants (inferCNV [@Patel2014-bl]) and
allele-specific expression (SCALE [@Jiang2017-sb]). Reconstruction of immune
cell receptors is another area that has received considerable attention from
tools such as BASIC [@Canzar2017-jb], TraCeR [@Stubbington2016-xw] and TRAPeS
[@Afik2017-qw]. While tools that complete these tasks are unlikely to ever
dominate scRNA-seq analysis, we expect to see an increase in methods for
tackling specialised analyses as researchers continue to push the boundaries of
what can be observed using scRNA-seq data.
# Availability and future directions
Since October 2016 we have seen the number of software tools for analysing
single-cell RNA-seq data more than triple, with more than 230 analysis tools
now available. As new tools have become available we have curated and
catalogued them in the scRNA-tools database where we record the analysis tasks
that they can complete, along with additional information such as any
associated publications. By analysing this database we have found that tool
developers have focused much of their efforts on methods for handling new
problems specific to scRNA-seq data, in particular clustering cells into groups
or ordering them along a trajectory. We have also seen that the scRNA-seq
community is generally open and willing to share its methods, which are often
described in preprints prior to peer-reviewed publication and released under
permissive open-source licenses for other researchers to re-use.
The next few years promise to produce significant new developments in scRNA-seq
analysis. New tools will continue to be produced, becoming increasingly
sophisticated and aiming to address more of the questions made possible by
scRNA-seq data. We anticipate that some existing tools will continue to improve
and expand their functionality while others will cease to be updated and
maintained. Detailed benchmarking and comparisons will show how tools perform
in different situations and those that perform well, continue to be developed
and provide a good user experience will become preferred for standard analyses.
As single-cell capture and sequencing technology continues to improve, analysis
tools will have to adapt to significantly larger datasets (in the millions of
cells) which may require specialised data structures and algorithms. Methods
for combining multiple scRNA-seq datasets as well as integration of scRNA-seq
data with other single-cell data types, such as DNA-seq, ATAC-seq or
methylation, will be another area of growth. In addition, projects such as the
Human Cell Atlas [@Regev2017-wg] will provide comprehensive cell type
references which will open up new avenues for analysis.
As the field expands, the scRNA-tools database will continue to be updated with
support from the community. We hope that it provides a resource for researchers
to explore when approaching scRNA-seq analyses as well as providing a record of
the analysis landscape and how it changes over time.
## Availability
The scRNA-tools database is publicly accessible via the website at
www.scRNA-tools.org. Suggestions for additions, updates and improvements are
warmly welcomed at the associated GitHub repository
(https://github.com/Oshlack/scRNA-tools) or via the submission form on the
website. The code and datasets used for the analysis in this paper are
available from https://github.com/Oshlack/scRNAtools-paper.
# Acknowledgements
We would like to acknowledge Sean Davis' work in managing the
awesome-single-cell page and producing a prototype of the script used to
process the database. Daniel Wells had the idea for recording software licenses
and provided licenses for the tools in the database at that time. Breon Schmidt
designed a prototype of the scRNA-tools website and answered many questions
about HTML and JavaScript. Our thanks also to Matt Ritchie for his thoughts on
early versions of the manuscript.
# References