---
title: "Clusters and Heatmaps"
author: "Jeff Oliver"
date: "`r format(Sys.time(), '%d %B, %Y')`"
---
```{r library-check, echo = FALSE, results = "hide"}
libs <- c("ggdendro", "ggplot2", "grid", "tidyr")
libs_missing <- libs[!(libs %in% installed.packages()[, "Package"])]
if (length(libs_missing) > 0) {
for (l in libs_missing) {
install.packages(l)
}
}
```
Want to combine the visualization of quantitative data with clustering
algorithms? Unsatisfied with the options provided by the base R packages? In
this hands-on workshop, we'll use the `ggdendro` package for R to make
publication-quality graphics.
#### Learning objectives
1. Run a clustering analysis and display a dendrogram
2. Visualize data in a heatmap
3. Use `grid` package to create multi-plot figures
***
## Getting started
First we need to set up our development environment. Open RStudio and create a
new project via:
+ File > New Project...
+ Select 'New Directory'
+ For the Project Type select 'New Project'
+ For Directory name, call it something like "r-graphing" (without the quotes)
+ For the subdirectory, select somewhere you will remember (like "My Documents"
or "Desktop")
We need to create two folders: 'data' will store the data we will be analyzing,
and 'output' will store the results of our analyses. In the RStudio console:
```{r setup, eval = FALSE}
dir.create("data")
dir.create("output")
```
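If the script is run more than once, `dir.create()` warns that the folder already exists. A minimal sketch of one way to avoid that, assuming a hypothetical helper named `ensure_dir` (not part of this lesson) and working in a temporary folder so nothing in your project is touched:

```{r sketch-ensure-dir}
# Hypothetical helper: create a folder only when it is missing
ensure_dir <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  # Report whether the folder exists afterwards
  dir.exists(path)
}

# Demonstrate inside a temporary location rather than the project itself
demo_root <- file.path(tempdir(), "r-graphing-demo")
ensure_dir(file.path(demo_root, "data"))
ensure_dir(file.path(demo_root, "output"))

# Calling it a second time is harmless
ensure_dir(file.path(demo_root, "data"))
```

In a real project you would pass `"data"` and `"output"` directly instead of the temporary paths used here.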
We will also need a few R packages that are not included in the standard
distribution of R. Those packages are `ggplot2`, `ggdendro`, and `tidyr`, and
they can be installed with the `install.packages()` function. The `grid`
package ships with R, so it does not need to be installed. Note these packages
need only be installed once on your machine.
```{r install-packages, eval = FALSE}
install.packages("ggplot2")
install.packages("ggdendro")
install.packages("tidyr")
```
Finally, download the data file from
[https://jcoliver.github.io/learn-r/data/otter-mandible-data.csv](https://jcoliver.github.io/learn-r/data/otter-mandible-data.csv)
or [http://tinyurl.com/otter-data-csv](http://tinyurl.com/otter-data-csv) (the
latter just re-directs to the former). These data are a subset of those used in
a study on skull morphology and diet specialization in otters
[doi: 10.1371/journal.pone.0143236](http://dx.doi.org/10.1371/journal.pone.0143236). **Save this file in the "data" folder we just created**.
***
## Data preparation
For any work that we do, we want to record all the steps we take, so instead of
typing commands directly into the R console, we keep all our work in an R
script. These scripts are just text files with R commands; by convention, we
start the script with a brief bit of information about what the script does.
```{r read-data}
# Cluster & heatmap on otter data
# Jeff Oliver
# jcoliver@email.arizona.edu
# 2017-08-15
# Load dependencies
library("ggplot2")
library("ggdendro")
library("tidyr")
library("grid")
# Read in data
otter <- read.csv(file = "data/otter-mandible-data.csv",
stringsAsFactors = TRUE)
```
If we look at the first few rows of data, we see the first three columns have
information about the specimen from which the measurements were taken, while
columns 4-9 have data for six mandible measurements (`m1` - `m6`):
```{r otter-head}
head(otter)
```
For our purposes, we only want to look at the data for two species, so we
subset the data, including only those rows that have either "A. cinerea" or
"L. canadensis" in the `species` column. Note, this _does not_ alter the data
in the original file we read into R; it only alters the data object `otter`
currently in R's memory.
```{r subset-data}
two_species <- c("A. cinerea", "L. canadensis")
otter <- otter[otter$species %in% two_species, ]
```
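Because the file was read with `stringsAsFactors = TRUE`, the `species` column is a factor, and subsetting rows leaves the dropped species behind as empty factor levels. This does not affect the plots in this lesson, but it can matter if you later tabulate by species. The sketch below shows the behavior on toy data (the values are invented for illustration, not taken from the otter file):

```{r sketch-subset}
# Toy data, invented for illustration (not the otter measurements)
toy <- data.frame(species = c("A. cinerea", "E. lutris", "L. canadensis",
                              "A. cinerea"),
                  m1 = c(1.2, 3.4, 2.2, 1.1),
                  stringsAsFactors = TRUE)
keep <- c("A. cinerea", "L. canadensis")

# %in% returns one TRUE/FALSE per row; only TRUE rows are kept
toy_two <- toy[toy$species %in% keep, ]
nrow(toy_two)

# The dropped species is still listed as a (now empty) factor level
levels(toy_two$species)

# droplevels() removes levels that no longer occur in the data
toy_two <- droplevels(toy_two)
levels(toy_two$species)
```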
The last thing we need to do is scale the data variables (columns 4-9) so
measurements have a mean value of 0 and a variance of 1.
```{r rescale-data}
otter_scaled <- otter
otter_scaled[, c(4:9)] <- scale(otter_scaled[, 4:9])
```
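To see what `scale()` does to each column, here is a small sketch on toy measurements (invented values, hypothetical object names). It checks the resulting means and standard deviations, and reproduces one column by hand:

```{r sketch-scale}
# Toy measurements, invented for illustration
toy_m <- data.frame(m1 = c(10, 12, 14, 16),
                    m2 = c(0.5, 0.7, 0.6, 1.0))

# scale() centers each column on its mean and divides by its standard deviation
toy_scaled <- scale(toy_m)

# Column means are (numerically) zero and standard deviations are one
round(colMeans(toy_scaled), digits = 10)
apply(toy_scaled, MARGIN = 2, FUN = sd)

# The same result, written out by hand for the first column
by_hand <- (toy_m$m1 - mean(toy_m$m1)) / sd(toy_m$m1)
all.equal(by_hand, as.vector(toy_scaled[, "m1"]))
```

Each column is scaled independently, so measurements recorded on different scales contribute comparably to the distance calculation used for clustering.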
***
## Clustering
To make our figure, we will build the two plots (the cluster diagram and the
heatmap) separately, then use the `grid` framework to put them together. We
start by making the dendrogram (or cluster).
```{r run-clustering}
# Run clustering
otter_matrix <- as.matrix(otter_scaled[, -c(1:3)])
rownames(otter_matrix) <- otter_scaled$accession
otter_dendro <- as.dendrogram(hclust(d = dist(x = otter_matrix)))
# Create dendro
dendro_plot <- ggdendrogram(data = otter_dendro, rotate = TRUE)
# Preview the plot
print(dendro_plot)
```
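The calls to `dist()` and `hclust()` above rely on their defaults, which are Euclidean distance and complete linkage. The following sketch spells those defaults out on a toy matrix (invented values and object names), so it is clear where you would change them:

```{r sketch-clustering}
# Toy matrix, invented for illustration: four specimens, two measurements
toy_matrix <- matrix(c(0, 0,
                       0, 1,
                       5, 5,
                       5, 6),
                     ncol = 2, byrow = TRUE,
                     dimnames = list(c("s1", "s2", "s3", "s4"),
                                     c("m1", "m2")))

# dist() defaults to Euclidean distance between rows
toy_dist <- dist(x = toy_matrix, method = "euclidean")

# hclust() defaults to complete linkage
toy_clust <- hclust(d = toy_dist, method = "complete")

# Cutting the tree into two groups separates s1/s2 from s3/s4
cutree(toy_clust, k = 2)

# Other linkage methods, such as "average", can produce different trees
toy_clust_avg <- hclust(d = toy_dist, method = "average")
```

Which distance and linkage are appropriate depends on your data; the defaults are used here only because the lesson uses them.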
We can make those labels a bit smaller:
```{r axis-labels}
dendro_plot <- dendro_plot + theme(axis.text.y = element_text(size = 6))
# Preview the plot
print(dendro_plot)
```
That's it for the cluster image (we'll make a few more changes later on).
***
## Heatmap
We need to start by transforming data to a "long" format, where each row only
contains values for one measurement. For example, our data started in this
format:
```{r print-wide, echo = FALSE}
otter_scaled[c(1:2), c(1:5)]
```
We want to transform it so there is only a single measurement in each row:
```{r print-long, echo = FALSE}
# For demonstration purposes only; print the first few rows and columns
otter_long_demo <- pivot_longer(data = otter_scaled[c(1:2), c(1:5)],
cols = -c(species, museum, accession),
names_to = "measurement",
values_to = "value")
otter_long_demo
```
This new data has columns that have information about the specimen (the
"species", "museum", and "accession" columns), but now has "measurement" and
"value" columns. Instead of one column for each mandible measurement, each row
only has one measurement in the "value" column, and the measurement type is
stored in the "measurement" column.
We use the `tidyr` package's `pivot_longer` function to handle the data
transformation:
```{r to-long}
# Heatmap
# Data wrangling
otter_long <- pivot_longer(data = otter_scaled,
cols = -c(species, museum, accession),
names_to = "measurement",
values_to = "value")
```
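To build intuition for what the reshaping does to each value, the same idea can be written out by hand in base R. This is only an illustration on toy data (invented values and column names), not a replacement for `pivot_longer`:

```{r sketch-long-format}
# Toy wide data, invented for illustration
toy_wide <- data.frame(accession = c("a1", "a2"),
                       m1 = c(0.1, 0.2),
                       m2 = c(1.1, 1.2),
                       m3 = c(2.1, 2.2))
measure_cols <- c("m1", "m2", "m3")

# One output row per specimen per measurement, specimen by specimen
toy_long <- data.frame(
  accession = rep(toy_wide$accession, each = length(measure_cols)),
  measurement = rep(measure_cols, times = nrow(toy_wide)),
  # t() puts specimens in columns, so as.vector() reads specimen by specimen
  value = as.vector(t(as.matrix(toy_wide[, measure_cols])))
)
toy_long

# The long table has (number of specimens) x (number of measurements) rows
nrow(toy_long) == nrow(toy_wide) * length(measure_cols)
```

By the same logic, `otter_long` should have six rows for every row of `otter_scaled`, one for each of `m1` through `m6`.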
Now we can use that new data frame, `otter_long`, to create our heatmap. In the
`ggplot2` package, we use the `geom_tile` layer for creating a heatmap. Given
our prior experience with the y-axis labels being large, we will again use
`theme` to make the accession numbers (the y-axis labels) a little smaller:
```{r heatmap-1}
heatmap_plot <- ggplot(data = otter_long, aes(x = measurement, y = accession)) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2() +
theme(axis.text.y = element_text(size = 6))
# Preview the heatmap
print(heatmap_plot)
```
***
## Putting it all together
OK. We made a dendrogram. We made a heatmap. Now how to get them into the same
figure? The `grid` package offers a way to combine separate plots.
```{r grid-plot-1}
grid.newpage()
print(heatmap_plot,
vp = viewport(x = 0.4, y = 0.5, width = 0.8, height = 1.0))
print(dendro_plot,
vp = viewport(x = 0.90, y = 0.445, width = 0.2, height = 1.0))
```
OK, that's a start, but there are a few things we need to address:
1. Notice the tips of the dendrogram are not in the same order as the y-axis of
the heatmap. We'll need to re-order the heatmap to match the structure of the
clustering.
2. It would be nice to have the dendrogram and heatmap closer together, without
a legend separating them.
3. The dendrogram should be vertically stretched so each tip lines up with a
row in the heatmap plot.
### Re-order heatmap rows to match dendrogram
We'll use the order of the tips in the dendrogram to re-order the rows in our
heatmap. The order of those tips is stored in the `otter_dendro` object, and
we can extract it with the `order.dendrogram` function:
```{r order-data}
otter_order <- order.dendrogram(otter_dendro)
```
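The value returned by `order.dendrogram` is a vector of row indices, not labels. A short sketch on a toy matrix (invented values and object names) shows how those indices map back to the tip labels:

```{r sketch-order}
# Toy matrix, invented for illustration
toy_points <- matrix(c(1, 9, 2, 10), ncol = 1,
                     dimnames = list(c("w", "x", "y", "z"), "m1"))
toy_dendro <- as.dendrogram(hclust(d = dist(x = toy_points)))

# order.dendrogram() returns row indices of the original matrix,
# listed in the order the tips appear in the tree
toy_order <- order.dendrogram(toy_dendro)
toy_order

# Indexing the row names with those indices gives the tip labels in tree order
rownames(toy_points)[toy_order]
identical(rownames(toy_points)[toy_order], labels(toy_dendro))
```

This is why the indices can be used to look up accession numbers in `otter_scaled`, whose rows are in the same order as the matrix that was clustered.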
And now for the fun part. By default, `ggplot` uses the level order of the
y-axis labels as the means of ordering the rows in the heatmap. That is, level
1 of the factor is plotted in the bottom row, level 2 is plotted in the row
above it, level 3 in the row above that, and so on. So we can dictate the order
in which these rows are plotted by _setting_ the order of the levels in the
column we use for the y-axis labels in the heatmap. The data we used for the
heatmap was the long-format `otter_long` object, and the column we used for the
y-axis is `accession`. Using the order of the dendrogram tips (we stored this
above in the `otter_order` vector), we can re-level the `otter_long$accession`
column to match the order of the tips in the tree.
```{r reorder-levels}
# Order the levels according to their position in the cluster
otter_long$accession <- factor(x = otter_long$accession,
levels = otter_scaled$accession[otter_order],
ordered = TRUE)
```
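Re-leveling a factor changes only the order of its levels, not the values stored in the column. The sketch below demonstrates this on toy labels (invented for illustration). Note that `factor()` expects the supplied levels to be unique, so the approach above assumes each accession number appears only once in `otter_scaled`:

```{r sketch-relevel}
# Toy labels, invented for illustration
toy_labels <- c("a3", "a1", "a2", "a1", "a3")

# By default, factor levels are sorted alphabetically
default_factor <- factor(x = toy_labels)
levels(default_factor)

# Supplying levels sets the order explicitly; the values themselves are unchanged
tree_order <- c("a2", "a3", "a1")
releveled <- factor(x = toy_labels, levels = tree_order)
levels(releveled)
as.character(releveled)

# The integer codes give each value's position along a discrete axis
as.integer(releveled)
```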
At this point, we need to re-create the heatmap, because the underlying data
(`otter_long`) has changed:
```{r heatmap-2}
# Heatmap
heatmap_plot <- ggplot(data = otter_long, aes(x = measurement, y = accession)) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2() +
theme(axis.text.y = element_text(size = 6))
# Preview the heatmap
print(heatmap_plot)
```
Now the rows have been re-arranged and we can see right away that the heatmap
now does a better job of grouping "red" rows together and "blue" rows together.
The combined figure shows the accession numbers in the heatmap are now in the
same order as the accession numbers in the dendrogram:
```{r grid-plot-2}
grid.newpage()
print(heatmap_plot,
vp = viewport(x = 0.4, y = 0.5, width = 0.8, height = 1.0))
print(dendro_plot,
vp = viewport(x = 0.90, y = 0.445, width = 0.2, height = 1.0))
```
### Re-position legend
We would like the legend to be somewhere other than between the two graphs,
so we will move it to the top of the heatmap. We do this through the
`theme` layer when creating the heatmap, setting the value of `legend.position`
to "top":
```{r heatmap-3}
# Heatmap
heatmap_plot <- ggplot(data = otter_long, aes(x = measurement, y = accession)) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2() +
theme(axis.text.y = element_text(size = 6),
legend.position = "top")
# Preview the heatmap
print(heatmap_plot)
```
The combined figure now has the heatmap and dendrogram closer together:
```{r grid-plot-3}
grid.newpage()
print(heatmap_plot,
vp = viewport(x = 0.4, y = 0.5, width = 0.8, height = 1.0))
print(dendro_plot,
vp = viewport(x = 0.90, y = 0.445, width = 0.2, height = 1.0))
```
### Align dendrogram tips with heatmap rows
Finally, we will need to adjust the height and alignment of the dendrogram to
ensure the tips line up with the corresponding rows in the heatmap.
Unfortunately, because the two figures are actually independent graphics
objects, this part involves considerable trial and error. In general, we will
need to change two parameters in the `viewport` function when we call `print`
on `dendro_plot`:
1. `y`: This argument sets the vertical position of the center of the viewport.
A value of 0.5 centers the graphic vertically in the plotting area. Values
closer to zero move the graphic towards the bottom of the plotting area and
values closer to one move the graphic towards the top. In our case, we
will probably want to move the dendrogram down a little bit, so we use a value
below 0.5.
2. `height`: This is the proportion of the plotting area's vertical space the
graphic should use. We initially set it to use all the vertical space
(`height = 1.0`), but since the legend now occupies the real estate at the top
of the graphic, we will need to make the dendrogram height smaller.
**Note:** The actual values you use will depend on the graphics device you
ultimately use; that is, the values presented here are designed for an HTML
page viewed on the web, but if you are displaying on the screen or writing to a
PDF file, you may need to use different values for `y` and `height` (e.g. when
calling `print` on `dendro_plot`, it may work better on your screen to set
`y = 0.4` and `height = 0.85`).
```{r grid-plot-4}
grid.newpage()
print(heatmap_plot,
vp = viewport(x = 0.4, y = 0.5, width = 0.8, height = 1.0))
print(dendro_plot,
vp = viewport(x = 0.90, y = 0.43, width = 0.2, height = 0.92))
```
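When experimenting with `y` and `height`, it can help to work out where the edges of a viewport fall. With the default centered justification used by `viewport`, `y` is the vertical center, so the edges follow from simple arithmetic. A minimal sketch, assuming a hypothetical helper named `viewport_edges` (not part of `grid`):

```{r sketch-viewport-edges}
# Hypothetical helper: bottom and top edges of a centered viewport,
# as proportions of the page height
viewport_edges <- function(y, height) {
  c(bottom = y - height / 2, top = y + height / 2)
}

# The heatmap viewport fills the page from bottom (0) to top (1)
viewport_edges(y = 0.5, height = 1.0)

# The adjusted dendrogram viewport stops short of the top of the page,
# leaving room beside the legend, and extends slightly past the bottom
viewport_edges(y = 0.43, height = 0.92)
```

The helper only describes the viewport rectangle. The plot drawn inside it still has its own margins, which is part of why aligning the tips takes some trial and error.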
And lastly, as a bonus, we can remove the accession numbers from the heatmap,
as they are already displayed at the tips of the dendrogram (and we made sure
the heatmap rows and dendrogram tips match, above). We do so by setting the
relevant theme elements to `element_blank()` in our call to `ggplot`:
```{r heatmap-grid}
# Heatmap
heatmap_plot <- ggplot(data = otter_long, aes(x = measurement, y = accession)) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2() +
theme(axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
legend.position = "top")
grid.newpage()
print(heatmap_plot,
vp = viewport(x = 0.4, y = 0.5, width = 0.8, height = 1.0))
print(dendro_plot,
vp = viewport(x = 0.90, y = 0.43, width = 0.2, height = 0.92))
```
Our final script would look like this:
```{r final-script, eval = FALSE}
# Cluster & heatmap on otter data
# Jeff Oliver
# jcoliver@email.arizona.edu
# 2017-08-15
################################################################################
# Load dependencies
library("ggplot2")
library("ggdendro")
library("tidyr")
library("grid")
# Read in data
otter <- read.csv(file = "data/otter-mandible-data.csv",
stringsAsFactors = TRUE)
# Restrict the data to only two species, A. cinerea & L. canadensis
two_species <- c("A. cinerea", "L. canadensis")
otter <- otter[otter$species %in% two_species, ]
# Force re-numbering of the rows after subsetting
rownames(otter) <- NULL
# Scale each measurement (independently) to have a mean of 0 and variance of 1
otter_scaled <- otter
otter_scaled[, c(4:9)] <- scale(otter_scaled[, 4:9])
# Run clustering
otter_matrix <- as.matrix(otter_scaled[, -c(1:3)])
rownames(otter_matrix) <- otter_scaled$accession
otter_dendro <- as.dendrogram(hclust(d = dist(x = otter_matrix)))
# Create dendrogram plot
dendro_plot <- ggdendrogram(data = otter_dendro, rotate = TRUE) +
theme(axis.text.y = element_text(size = 6))
# Heatmap
# Data wrangling
otter_long <- pivot_longer(data = otter_scaled,
cols = -c(species, museum, accession),
names_to = "measurement",
values_to = "value")
# Extract the order of the tips in the dendrogram
otter_order <- order.dendrogram(otter_dendro)
# Order the levels according to their position in the cluster
otter_long$accession <- factor(x = otter_long$accession,
levels = otter_scaled$accession[otter_order],
ordered = TRUE)
# Create heatmap plot
heatmap_plot <- ggplot(data = otter_long, aes(x = measurement, y = accession)) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2() +
theme(axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
legend.position = "top")
# All together
grid.newpage()
print(heatmap_plot,
vp = viewport(x = 0.4, y = 0.5, width = 0.8, height = 1.0))
print(dendro_plot,
vp = viewport(x = 0.90, y = 0.43, width = 0.2, height = 0.92))
```
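The script above draws to whichever graphics device is active. To write the combined figure to a file instead, one option is to open a PDF device before drawing and close it afterwards. The sketch below shows the pattern with placeholder rectangles in place of the two plots, so it runs on its own. The file name, page size, and colors are assumptions, and it writes to a temporary folder rather than the "output" folder:

```{r sketch-save-figure}
library("grid")

# Hypothetical output path; in the lesson project this could point to "output/"
out_file <- file.path(tempdir(), "otter-heatmap-sketch.pdf")

# Open the PDF device; width and height are in inches
pdf(file = out_file, width = 8, height = 6)
grid.newpage()
# Placeholder for print(heatmap_plot, vp = ...)
grid.rect(vp = viewport(x = 0.4, y = 0.5, width = 0.8, height = 1.0),
          gp = gpar(col = "steelblue", fill = NA))
# Placeholder for print(dendro_plot, vp = ...)
grid.rect(vp = viewport(x = 0.90, y = 0.43, width = 0.2, height = 0.92),
          gp = gpar(col = "firebrick", fill = NA))
# Close the device so the file is written to disk
dev.off()

file.exists(out_file)
```

Because the alignment values depend on the device, expect to re-tune `y` and `height` for the dendrogram after switching from the screen to a PDF.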
***
## Additional resources
+ Documentation for the [ggdendro package](https://cran.r-project.org/web/packages/ggdendro/vignettes/ggdendro.html)
+ An [example](https://stackoverflow.com/questions/6673162/reproducing-lattice-dendrogram-graph-with-ggplot2) on StackOverflow
+ A [pdf version](https://jcoliver.github.io/learn-r/008-ggplot-dendrograms-and-heatmaps.pdf) of this lesson