-
Notifications
You must be signed in to change notification settings - Fork 2
/
README.Rmd
245 lines (179 loc) · 9.65 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
spliceclust [![Build Status](https://travis-ci.org/pkimes/spliceclust.svg?branch=master)](https://travis-ci.org/pkimes/spliceclust) [![codecov](https://codecov.io/gh/pkimes/spliceclust/branch/master/graph/badge.svg)](https://codecov.io/gh/pkimes/spliceclust)
=======================
## Contents
1. [Introduction](#intro)
2. [SpliceGraHM Examples](#splicegrahm)
3. [SplicePCA Examples](#splicepca)
4. [SplicePCP Examples](#splicepcp)
4. [SpliceGraHM2 Examples](#splicegrahm2)
## <a name="intro"></a> Introduction
This package may be used to plot exon and splice junction coverage across a large cohort
of RNA-seq samples. The plotting approach is based on the idea of transposing
expression heatmaps on splicing diagrams. The plots are generated using the `ggplot2` and
`ggbio` packages.
```{r, warning=FALSE, message=FALSE}
## load annotations packages
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
library("BSgenome.Hsapiens.UCSC.hg19")
## load current package
devtools::load_all()
```
## <a name="splicegrahm"></a> SpliceGraHM Examples
First we must construct a `concomp` object from a `GRangesList` object with exon and junction
boundaries and corresponding coverage values. For illustration purposes, we use a previously
constructed `GRanges` object, `klk12_lusc_gr`, containing annotation information for the KLK12 gene
locus along with coverage for 177 lung squamous cell carcinoma samples.
```{r}
data("klk12_lusc_gr")
```
The exon and junction boundaries, and the corresponding kind label are given by:
```{r}
ranges(klk12_lusc_gr)
mcols(klk12_lusc_gr)$kind
```
In addition to `kind`, the `klk12_lusc_gr` metadata columns also include the gene name (`gIdx`),
gene boundaries (`gStart`, `gStop`), and coverage values for the 177 samples.
```{r}
mcols(klk12_lusc_gr)[1:5, 1:6]
```
To construct the `concomp` (connected component) object for analysis, we first convert the
`GRanges` object into a `GRangesList` object of length 2, corresponding to exon and junction
information.
```{r}
klk12_gl <- split(klk12_lusc_gr, mcols(klk12_lusc_gr)$kind)
klk12_cc <- concomp(klk12_gl)
```
We first demonstrate the default and __basic SpliceGraHM__ (Splice Graph Heat Map) plotting procedure.
```{r, fig.width=12, fig.height=4}
splicegrahm(klk12_cc)
```
In the plot above, each box arranged horizontally corresponds to a contiguous exonic region
along the genome. Each box is colored by 177 horizontal lines, showing the expression level
for the 177 samples being analyzed. Note that the exons are plotted along genomic coordinates, and
log expression is shown in color.
In addition to the boxes, the plot contains arrows which correspond to a splicing events with
sufficient support in the data (e.g. at least 8 samples, each with at least 5 reads spanning the splice
junction). The arrows are colored such that darker arrows were present in a higher proportion of the
177 samples (scale shown on right).
The SpliceGraHM name is in reference to the fact that after removing the spacing between each exon
the plot simply reduces to a __standard heatmap__ of expression along a single gene. The corresponding
figure after removing the intronic gaps is shown below. The rectangle of color is a heatmap with rows
and columns corresponding to samples and exons, respectively.
```{r, fig.width=12, fig.height=4}
splicegrahm(klk12_cc, genomic = FALSE, ex_use = 1, log_base = 2)
```
It is possible to show the coverage of each splice junction using a similar convention with rows
corresponding to samples, and with color being used for coverage.
```{r, fig.width=12, fig.height=4}
splicegrahm(klk12_cc, genomic = FALSE, j_incl = TRUE)
```
If annotation information is available, e.g. from UCSC KnownGenes, these can be passed to the
function to add an additional track to the plot. The appropriate `GRangesList` object to be
passed is illustrated in the following example.
```{r, fig.width=12, fig.height=4}
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
isActiveSeq(txdb)[seqlevels(txdb)] <- FALSE
isActiveSeq(txdb)[paste0("chr", 1:22)] <- TRUE
exbytx <- exonsBy(txdb, "tx")
splicegrahm(klk12_cc, genomic = TRUE, j_incl = TRUE, txlist = exbytx)
```
If gene names are desired, the following can be used to match the transcript ID
in `txdb` against gene symbols (e.g. in `org.Hs.eg.db`).
```{r, fig.width=12, fig.height=4}
suppressPackageStartupMessages(library("org.Hs.eg.db"))
splicegrahm(klk12_cc, genomic = TRUE, j_incl = TRUE, txlist = exbytx,
txdb = txdb, orgdb = org.Hs.eg.db)
```
The `splicegrahm` function can now also plot gene models in non-genomic space
with an additional parameter `eps`. The `eps` parameter determines how far up/down
from the connected component to lookfor overlapping gene models. If `eps = NULL`,
all overlapping gene models are included. If `eps = 1000`, only overlapping gene
models which are fully contained within 1000bp of the connected component
are included.
```{r, fig.width=12, fig.height=4}
splicegrahm(klk12_cc, genomic = FALSE, j_incl = TRUE, txlist = exbytx,
txdb = txdb, orgdb = org.Hs.eg.db, eps = NULL)
```
```{r, fig.width=12, fig.height=4}
splicegrahm(klk12_cc, genomic = FALSE, j_incl = TRUE, txlist = exbytx,
txdb = txdb, orgdb = org.Hs.eg.db, eps = 1)
```
```{r, fig.width=12, fig.height=4}
splicegrahm(klk12_cc, genomic = FALSE, j_incl = TRUE, txlist = exbytx,
txdb = txdb, orgdb = org.Hs.eg.db, eps = 1000)
```
## <a name="splicepca"></a> SplicePCA Examples
Principal Component Analysis (PCA) is a popular exploratory analysis tool for studying low rank
structure in high-dimensional datasets by projecting the data along the directions of greatest
variation. These directions of greatest variation are often referred to as the PC "loading"
directions. The `splicepca` function is written to visualize the loading vectors of splicing
data.
As an example, the following code can be used to visualize the first 3 PC loadings in the
KLK12 dataset from above.
```{r, fig.width=10, fig.height=4}
splicepca(klk12_cc, npc = 3)
```
In the above PC loadings plot, red and blue are used to denote the magnitude of the PC loadings
for each exon and splice junction. From the plot above, we see that the greatest source of variation
in expression at the KLK12 gene locus is the overall expression of the gene. However, the second
PC loading shows interesting behavior where the primary source of variation is attributable to the
expression of the central exon. Scrolling up a little to the UCSC KnownGenes shown included above,
we see that the presence or absence of the central exon actually corresponds to differing isoforms
annotated to this gene. Note that above the PCs are computed separately for exons and junctions
information. As such, we obtain separate PC "scores" for the exon and junction PC loadings.
```{r, fig.width=6, fig.height=6}
splicepca(klk12_cc, npc = 3, scores = TRUE)
```
It is also possible to perform the PCA analysis using the concatenated exon and junction information by
setting the `pc_sep` parameter to `FALSE`, and specifying the relative "weight" of each with `ej_w`. The
exon and junction data are rescaled to each have sum-of-squares equal to the values specified by `ej_w`.
In the following example, we use equal weights for the two data sources.
```{r, fig.width=10, fig.height=4}
splicepca(klk12_cc, npc = 3, pc_sep = FALSE, ej_w = c(1, 1))
splicepca(klk12_cc, npc = 3, pc_sep = FALSE, ej_w = c(1, 1), scores = TRUE)
```
## <a name="splicepcp"></a> SplicePCP Examples
All above plots have used color to represent expression. However, low frequency or outlier events may
become lost in this particular view. To handle this problem, we also provide the option to plot the
splicing objects with vertical height (the y-axis) corresponding to expression. The name of the function is
a reference to [_parallel coordinates plots_](pcp). The same _KLK12_ data is shown below using the
`splicepcp` function.
```{r, fig.width=12, fig.height=5}
splicepcp(klk12_cc, genomic = TRUE, txlist = exbytx, txdb = txdb, orgdb = org.Hs.eg.db)
```
The plot includes 3 tracks:
1. log-expression values for each exon or region of an exon
2. the complete gene model and splice junctions
3. annotated transcripts from the UCSC KnownGene database.
Currently, the function is being rewritten to also include junction coverage using a
parallel coordinates plot in a separate track.
## <a name="splicegrahm2"></a> SpliceGraHM2 Examples
To make direct comparison of two populations possible, we have created `splicegrahm2` which
draws two `splicegrahm` plots in a single figure. Consider, for example, the task of comparing
the behavior of splicing between the LUSC samples and head and neck squamous cell carcinoma (HNSC)
samples. We have loaded a connected component for 279 HNSC samples as `klk12_hnsc_cc`.
```{r}
data("klk12_hnsc_cc")
```
```{r, fig.width=12, fig.height=5}
splicegrahm2(klk12_cc, klk12_hnsc_cc, genomic=FALSE)
```
When gene models are also desired, they are placed in a central track for easy comparison to
the two `splicegrahm`s.
```{r, fig.width=12, fig.height=6}
splicegrahm2(klk12_cc, klk12_hnsc_cc, genomic=FALSE,
txlist = exbytx, txdb = txdb, orgdb = org.Hs.eg.db)
```
Alternatively, the bottom plot can be drawn with the same orientation as the top plot by
setting `mirror=FALSE`.
```{r, fig.width=12, fig.height=6}
splicegrahm2(klk12_cc, klk12_hnsc_cc, genomic=FALSE, mirror=FALSE)
```
While not modified above, the option ``same_scale`` can be used to specify whether
the two plots should be drawn using the same or separate scale along the y-axis.
This can be helpful when the two populations are of substantially different sizes.
## <a name="sessioninfo"></a> Session Information
```{r}
sessionInfo()
```
[pcp]: http://en.wikipedia.org/wiki/Parallel_coordinates