---
title: "Profiling R code"
format:
  html:
    highlight-style: "default"
---
> The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimisation is the root of all evil (or at least most of it) in programming.
>
> [*Donald Knuth*](https://en.wikiquote.org/wiki/Donald_Knuth)*, legendary computer scientist*
In our benchmarking explorations, we jumped in and started optimising the expression we *thought* would have the most impact.
But this is not the right way to approach optimisation: we can easily end up tuning parts of our code that contribute little to overall efficiency.
The better approach is to profile our code first and identify the bottlenecks that actually need our attention before we start optimising anything.
## `profvis()`
The best tool for this job in R is the `profvis()` function from the `profvis` package.
[Profvis](https://github.com/rstudio/profvis) is a tool for helping you to understand how R spends its time. It provides an interactive graphical interface for visualizing data from `Rprof`, R's built-in tool for collecting profiling data.
It is so fundamental that it ships with RStudio and can even be accessed directly from the IDE through the **Profile** menu.
![](assets/images/rstudio_profile.png){width="205"}
```{r}
library(profvis)
library(dplyr)
```
Let's use our previous example from the benchmarking section to see how `profvis` can guide our optimisation.
As a reminder, we're working with a data frame that has 151 columns and 400,000 rows. One of the columns contains an ID, and the other 150 columns contain numeric values.
For each numeric column, we want to subtract the column's mean from its values, so that the new mean of each column is zero.
Let's create our data:
```{r}
times <- 4e5  # number of rows
cols <- 150   # number of numeric columns
data <- as.data.frame(x = matrix(rnorm(times * cols, mean = 5), ncol = cols))
data <- cbind(id = paste0("g", seq_len(times)), data)
```
```{r}
profvis({
data1 <- data # Store in another variable for this run
# Get column means
means <- apply(data1[, names(data1) != "id"], 2, mean)
# Subtract mean from each column
for (i in seq_along(means)) {
data1[, names(data1) != "id"][, i] <- data1[, names(data1) != "id"][, i] - means[i]
}
})
```
### `profvis()` output
#### Flame graph view
The first view in the profvis output is the flame graph view.
In the top panel we see the code we profiled, annotated with the amount of time spent on each line as well as the memory allocated and deallocated.
The bottom panel contains the flame graph. In the flame graph, the horizontal direction represents time in milliseconds, and the vertical direction represents the call stack.
Profvis is interactive!
- As we mouse over the flame graph, information about each block will show in the info box.
- If we mouse over a line of code, all flame graph blocks that were called from that line will be highlighted.
- We can click and drag on the flame graph to pan up, down, left, right.
- We can double-click on a flame graph block to zoom the x axis to the width of that block.
- We can double-click on the background to zoom the x axis to its original extent.
#### The data view
In addition to the flame graph view, profvis provides a *data view*, which you can open by clicking on the **Data** tab. It shows a top-down tree view of the profile. Click the **code** column to expand the call stack under investigation, and use the following columns to understand how resources were used:
- **Memory**: memory allocated or deallocated (for negative numbers) by a given call stack.
- **Time**: time spent in milliseconds.
### Interpreting profvis results
Back to our example: what is profvis telling us? Well, the first thing it tells us is that we had not been optimising the slowest part of our code when we embarked on our benchmarking experiments! If anything, this shows the importance of taking the time to profile properly.
Most of the time is actually spent in the `apply` call that generates the `means` vector, so that is the best candidate for a first pass at optimisation. We can also see that the `apply` call results in a lot of memory being allocated and deallocated.
Looking at the flame graph (or data view), we can see that `apply` calls `as.matrix` and `aperm`. These two functions convert the data frame to a matrix and transpose it, so much of the time in this approach is spent, frankly, on unnecessary transformations of the data.
So let's think of some other approaches:
An obvious alternative is to use the `colMeans` function. We could also use `lapply` or `vapply` to apply the `mean` function over each column.
Let's compare the speed of these four different ways of getting column means, using `microbenchmark` for a quick speed test:
```{r}
microbenchmark::microbenchmark(
apply = apply(data[, names(data) != "id"], 2, mean),
colmeans = colMeans(data[, names(data) != "id"]),
lapply = lapply(data[, names(data) != "id"], mean),
vapply = vapply(data[, names(data) != "id"], mean, numeric(1)),
times = 5
)
```
`colMeans` is much faster than using `apply` with `mean`, but `lapply` and `vapply` are faster still.
To dig into the underlying causes we could use `profvis` again, profiling each of our candidate approaches.
```{r}
profvis({
data1 <- data
# Four different ways of getting column means
apply(data1[, names(data1) != "id"], 2, mean)
colMeans(data1[, names(data1) != "id"])
lapply(data1[, names(data1) != "id"], mean)
vapply(data1[, names(data1) != "id"], mean, numeric(1))
})
```
Now we see that `colMeans` still uses `as.matrix`, which takes a good chunk of time, while `lapply` and `vapply` spend most of their time calling `mean.default`.
`lapply` returns the values in a list, while `vapply` returns them in a numeric vector, which is the form we want, so `vapply` looks like the way to go for this part.
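To see that difference concretely, here's a quick illustration of the two return types, using the first few numeric columns of our data:
```{r}
# lapply() returns a list with one element per column
str(lapply(data[, 2:4], mean))
# vapply() returns a named numeric vector, which is the shape we want
str(vapply(data[, 2:4], mean, numeric(1)))
```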
So let's edit our code, replacing `apply` with `vapply` and try again:
```{r}
profvis({
data1 <- data
means <- vapply(data1[, names(data1) != "id"], mean, numeric(1))
for (i in seq_along(means)) {
data1[, names(data1) != "id"][, i] <- data1[, names(data1) != "id"][, i] - means[i]
}
})
```
Our code is now about 3x faster than the original version. Most of the time is spent on the assignment inside the `for` loop, and the majority of that is in the `[<-.data.frame` function. This is usually called with the syntax `x[i, j] <- y`, which is equivalent to `` `[<-`(x, i, j, y) ``. In addition to being slow, the code is ugly: on each side of the assignment operator we're indexing into `data1` twice with `[`. One way to tidy this up is to compute the means once and subtract them from all the numeric columns in a single assignment, for example with `purrr::map2`:
```{r}
profvis({
data1 <- data
means <- vapply(data1[, names(data1) != "id"], mean, numeric(1))
data1[, names(data1) != "id"] <- purrr::map2(
data1[, names(data1) != "id"],
means,
~.x - .y)
})
```
In this case, it's useful to take a step back and think about the broader problem. We want to normalize each column. Couldn't we apply a function over the columns that does both steps, taking the mean and subtracting it? Because a data frame is a list, and we want to assign a list of values back into the data frame, we can use `lapply` (or a purrr equivalent such as `map_dfc`, shown first below).
```{r}
profvis({
data1 <- data
# Given a column, normalize values and return them
col_norm <- function(col) {
col - mean(col)
}
# Apply the normalizer function over all columns except id
data1[, names(data1) != "id"] <- purrr::map_dfc(
data1[, names(data1) != "id"],
col_norm)
})
```
```{r}
profvis({
data1 <- data
# Given a column, normalize values and return them
col_norm <- function(col) {
col - mean(col)
}
# Apply the normalizer function over all columns except id
data1[, names(data1) != "id"] <- lapply(data1[, names(data1) != "id"],
col_norm)
})
```
Now we have code that's not only about 6x faster than our original -- it's shorter and more elegant as well. Not bad! The profiler data helped us to identify performance bottlenecks, and an understanding of the underlying data structures allowed us to approach the problem in a more efficient way.
Could we further optimise the code? It seems unlikely, given that all the time is now spent in functions implemented in C (`mean` and `-`). That doesn't necessarily mean there's no room for improvement (one possible avenue is sketched below), but this is a good place to move on to the next example.
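If you did want to experiment further, one possible avenue (a sketch only, not benchmarked here) is to let base R's `scale()` do the centering in a single call. Note that `scale()` coerces its input to a matrix internally, so the conversion cost may eat into any gains:
```{r, eval=FALSE}
profvis({
  data1 <- data
  # scale(center = TRUE, scale = FALSE) subtracts the column means in one call;
  # it coerces the selected columns to a matrix first, so profile/benchmark
  # before assuming this is faster than the lapply() version
  data1[, names(data1) != "id"] <- scale(data1[, names(data1) != "id"],
                                         center = TRUE, scale = FALSE)
})
```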
### `Rprof()`
I just wanted to mention that `profvis` effectively wraps the `utils` functions `Rprof()` and `Rprofmem()` that ship with R for profiling execution time and memory allocation.
Should your code generate a really complicated call stack that is hard to weed through with `profvis()`, you could try `Rprof()` (which writes its results to a file, here a temporary file) followed by `summaryRprof()` to drill down to the function calls that are taking the most time:
```{r, eval=FALSE}
prof_file <- tempfile()
Rprof(prof_file)
data1 <- data # Store in another variable for this run
# Get column means
means <- apply(data1[, names(data1) != "id"], 2, mean)
# Subtract mean from each column
for (i in seq_along(means)) {
data1[, names(data1) != "id"][, i] <- data1[, names(data1) != "id"][, i] - means[i]
}
Rprof(NULL)
summaryRprof(prof_file)
```
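If it's memory allocations you're interested in, `Rprofmem()` works in a similar way: it logs each allocation together with the call stack that triggered it. Here is a minimal sketch; note that `Rprofmem()` only records anything if R was compiled with memory profiling support (`--enable-memory-profiling`), as the CRAN binary builds typically are:
```{r, eval=FALSE}
mem_file <- tempfile()
Rprofmem(mem_file)   # start logging memory allocations to the file
means <- apply(data[, names(data) != "id"], 2, mean)
Rprofmem(NULL)       # stop logging
# Each line of the log records one allocation: its size in bytes and the call stack
head(readLines(mem_file))
```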
### Further reading
- [Advanced R: **Measuring performance**](https://adv-r.hadley.nz/perf-measure.html)
- [Garbage collection in R](https://stat.ethz.ch/R-manual/R-patched/library/base/html/gc.html)
- [Memory in RStudio](https://support.posit.co/hc/en-us/articles/1500005616261-Understanding-Memory-Usage-in-the-RStudio-IDE)