-
Notifications
You must be signed in to change notification settings - Fork 0
/
efficient-coding.qmd
473 lines (325 loc) · 17.9 KB
/
efficient-coding.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
---
title: "Efficient coding"
---
> “I think you can have a ridiculously enormous and complex data set, but if you have the right tools and methodology, then it’s not a problem.”
>
> -- Aaron Koblin
This chapter covers how to write a clear, concise, and optimised code to handle data and computations effectively in R. We going to cover what is a loop, how to make a function, and how to use an apply family function. For beginners, this chapter will be quite challenging. So, it is completely understandable if you need to read this chapter several times. In fact, you are free to skip this chapter for the time being, and come back again once you become more familiar with R.
Mastering the topics such as a loop, apply family and functions is crucial especially once you become more advanced in using R. More often than not, you will have a large data and you need to repeat the analysis several times, thus, knowing these topics, makes the R codes more efficient and concise (though not completely readable to beginners).
This chapter does not aim to cover everything on these topics. However, this chapter intends to provide a brief introduction to these topics for beginners and novices in R.
## Load packages
Please load a `tidyverse` package before moving to the next section.
```{r warning=FALSE, message=FALSE}
library(tidyverse)
```
## Loop
A loop is utilised in the case of an iterative process. Generally, there are two types of loops:
1. For loop
2. While loop
A for loop iterate iterates over a sequence. The general structure of the for loop is:
```{r eval=FALSE}
for (variable in sequence) {
# Code to execute
}
```
For example, if we want to add 1 over a sequence of numbers:
```{r}
# Number sequence 1 to 10
num_seq <- 1:10
num_seq
```
```{r}
# Make a for loop that adds 1 to each number in the sequence
for (i in num_seq) {
i = i + 1
print(i)
}
```
`i` reflects each number in the number sequence, while `print()` will print each number after 1 is added.
Now, let's try something more related to the data analysis. We going to use the `iris` dataset for this example. For those who might not be familiar, the `iris` dataset, is a built-in dataset in R. Further details can be read on the `Help` pane by typing `?iris`.
We going to calculate the mean for each numeric column in the `iris` dataset.
```{r}
# Loop through numeric columns of the iris dataset
for (col in names(iris)[1:4]) {
# Calculate the mean of the current column
column_mean <- mean(iris[[col]])
# Round the mean to 2 decimal points
column_mean <- round(column_mean, digits = 2)
# Print the column name and its mean
print(paste("Mean of", col, "is", column_mean))
}
```
Additionally, we can use for loop to do several plots. For example, we can plot a boxplot for each numeric column in the `iris` dataset.
```{r}
# Select numeric columns only from iris
df <- iris[, -5]
# Loop through numeric columns of the iris dataset
for (col_name in names(df)) {
# Create a boxplot
boxplot(df[[col_name]],
main = paste("Boxplot of", col_name),
ylab = col_name)
}
```
Thus, integrating a for loop in the data analysis code will make it more efficient. Next, let's see what is a while loop.
The while loop executes a block of code as long as a specified condition remains true. A general structure of the while loop is:
```{r eval=FALSE}
while (condition) {
# Code to execute
}
```
As a basic example of the while loop, we can add 1 to the sequence of numbers as long as the number is below 10.
```{r}
# Initialise the number
count <- 0
# A while loop that adds 1 to each number as long as the number is less than 10
while (count < 10) {
count <- count + 1
print(count)
}
```
So, as long as the `count` is not equal to 10, the while loop will keep adding 1 to the `count`. Next, let's try the while loop which is more related to the data analysis. By using the `iris` dataset, let's say we want to calculate the cumulative sum of `Sepal.Length` until the sum exceeds 60.
```{r}
# Initialize variables
index <- 1 #index for looping
cumulative_sum <- 0 #variable to hold the cumulative sum
# While loop to calculate the cumulative sum
while (cumulative_sum <= 60 && index <= nrow(iris)) {
# Calculate the cumulative sum starting with the 1st row
cumulative_sum <- cumulative_sum + iris$Sepal.Length[index]
# Move to the next row
index <- index + 1
# Display the result
print(paste0("Row ", index - 1, ", Cumulative sum: ", cumulative_sum))
}
```
The conditions in the while loop above are:
1. The loop will run until the cumulative sum exceeds 60
2. Or the last row of the dataset is reached.
The last line of the codes is just to print the result. For now, you do not actually need to understand it.
Generally, the for loop is more popular and commonly used compared to the while loop. However, knowing both loops, at least at the basic level will definitely benefit you in future. Additionally, the loop functions are not only available in R but in other programming software such as Python, Julia, MATLAB and Stata.
## Apply
The apply family function in R is utilised to apply a specific operation over elements of data structures such as vectors, matrices, and data frames. This function is relatively similar to the loop function. However, in R, generally, the apply family functions are faster and more efficient compared to the loop functions.
There are 7 types of apply family functions:
1. `apply()`: applies a function over rows or columns of a matrix or array.
2. `lapply()`: applies a function to each element of a vector, data frame or list and returns a list.
3. `sapply()`: similar to `lapply()` but tries to simplify the result (e.g., into a vector or matrix).
4. `vapply()`: similar to `sapply()`, but requires specifying the output type.
5. `mapply()`: multivariate version of `sapply()`, applying a function to multiple arguments.
6. `tapply()`: applies a function over subsets of a vector grouped by a factor.
7. `rapply()`: recursive version of `lapply()` for nested lists.
We are not going to cover all the apply family functions in this book, but we going to cover the top three apply family functions (`apply()`, `lapply()`, `sapply()`).
**`apply()` function**
The `apply()` is used to apply the function over rows or columns. The basic syntax is:
```{r eval=FALSE}
apply(X, MARGIN, FUN, ...)
```
The basic arguments that we need to supply:
1. `X`: an array.
2. `MARGIN`: where the function should be applied, 1 is at row, and 2 is at column.
3. `FUN`: the function.
Let's try applying this function to our iris dataset. Let's say we want to calculate the mean for each numeric column in the dataset.
```{r}
apply(iris %>% select(-Species), MARGIN = 2 , FUN = mean)
```
`MARGIN = 2` indicates we want the mean of columns. If we set `MARGIN = 1` means that we want the mean of each row.
**`lapply()` function**
The `lapply()` applies a function to each element of a list and returns a list. The basic syntax is:
```{r eval=FALSE}
lapply(X, FUN)
```
Where:
1. `X`: a vector, data frame or a list.
2. `FUN`: the function.
As an example of `lapply()`, we can find the mean of each column and see the result is returned in a list format.
```{r}
lapply(iris %>% select(-Species), FUN = mean)
```
**`sapply()` function**
The `sapply()` function is similar to the `lapply()`, but it simplifies the result. The syntax is similar to the `lapply()`.
Let's use the same example as in the `lapply()` section and see how is the output is formatted.
```{r}
sapply(iris %>% select(-Species), FUN = mean)
```
The output is formatted in a more simplified, and in this case, the output format is similar to the one in the `apply()` function section.
So, we have seen three functions from the apply family functions. As the example in the for loop, we can also use the apply family function to plot several plots.
### `purrr`
`purrr` package is part of `tidyverse`, which contains many functions that are equivalent to base R apply family functions.
The equivalent of `lapply()` function is `map()`.
```{r}
map(iris %>% select(-Species), .f = mean)
```
There are many more functions in the `purrr` package that are highly beneficial to learn. However, most of these functions are more advanced and require a deeper understanding of R, which goes beyond what is covered in this book. As such, they may not be suitable for beginners and novices at this stage.
## Function {#sec-loop-function}
One of the flexibility in R is we can make our own function. Let's make a basic function that adds two numbers.
```{r}
add_num <- function(number1, number2) {
result <- number1 + number2
return(result)
}
```
Next, let's try the function.
```{r}
add_num(number1 = 100, number2 = 200)
```
Let's upgrade our function, instead of adding the number, we going to calculate the mean of the numbers.
```{r}
avg_num <- function(x_range) {
# Sum of all elements in the vector
sum_x <- sum(x_range)
# Determine the number of elements in the vector
n <- length(x_range)
# Calculate mean
mean_value <- sum_x / n
# Return the mean
return(mean_value)
}
```
Let's try our function using the `iris` dataset, and compare it with the `mean()` in the base R.
```{r}
# Our function
avg_num(iris$Sepal.Length)
# Mean function in R
mean(iris$Sepal.Length)
```
Additionally, we can also integrate the R functions in our own function. Let's say we want to build a function that can return the value of mean, standard deviation (SD), median, and interquartile range (IQR). Instead of typing the function one by one every time we need it, we can create a function that gives us the four statistical measures. So, it is more efficient to create this function if we need to run it more than two times. For those who are not familiar with these statistical measures, you can just ignore them for now as we will cover it in the later chapter.
```{r}
summary_func <- function(x_range) {
# Calculate mean and standard deviation, then round the decimal points to 2
mean_val <- mean(x_range) |> round(digits = 2)
std_val <- sd(x_range) |> round(digits = 2)
# Calculate median and interquartile range, then round the decimal points to 2
med_val <- median(x_range) |> round(digits = 2)
iqr_val <- IQR(x_range) |> round(digits = 2)
# Display the result
return(c(
`Mean (SD)` = paste0(mean_val, " (", std_val, ")"),
`Median (IQR)` = paste0(med_val, " (", iqr_val, ")")
))
}
```
Now, we have the function ready. Let's test out on the iris dataset.
```{r}
summary_func(iris$Sepal.Length)
```
Perfect! Now, every time we want to calculate the mean, SD, median, and IQR for any column, we can just call our function.
To be more efficient we can combine our function with loop or apply family functions that we have learnt in the previous sections. Instead of running `summary_func()` one by one for each column in the `iris` dataset, we integrate it in the for loop.
```{r}
# Loop through numeric columns of the iris dataset
for (col in names(iris)[1:4]) {
# Calculate mean of the current column
res <- summary_func(iris[[col]])
# Print the result
print(res)
}
```
In R, it is more efficient to use the apply family functions. Let's use `sapply()` for this.
```{r}
sapply(iris |> select(-Species), FUN = summary_func)
```
What we have learned just now, is known as named function. It is probably the most commonly used function.
### Anonymous function
There is another type of function in R, known as the anonymous function. It is defined without a name and is often used temporarily.
For example, let's create an anonymous function that squares a number.
```{r eval=FALSE}
function(x) { x^2 }
```
So, to use this function, we need to type the whole function again. In contrast, the named function can be called by the name of the function to be reused.
```{r eval=FALSE}
function(x) { x^2 }(3)
```
An example of a practical use of the anonymous function is to be used in line with apply family functions. For example, if we want to square the range of numeric values.
```{r}
sapply(1:10, function(x) x^2)
```
Another example of using anonymous functions with the apply family is to generate multiple plots, similar to the example in the for loop section. In this case, we will use `walk()`, which is the equivalent of the apply family function available in `purrr` package. Let's create a boxplot for each numeric column in the `iris` dataset using `walk()` and anonymous function.
```{r}
# Select numeric columns only from iris
df <- iris[, -5]
# Create boxplots for each numeric column using walk
walk(names(df), function(col_name) {
boxplot(df[[col_name]],
main = paste("Boxplot of", col_name),
ylab = col_name,
col = "skyblue")
})
```
The function `function(col_name)` defines an anonymous function to process each column name in `names(df)`. Using `walk()` allows us to generate multiple boxplots, as it executes the function for its side effects without returning any output. In contrast, using `lapply()` for this task would return additional information, specifically the statistics for each boxplot, alongside creating the plots.
To observe this difference, try running the code below on your machine and compare the behaviour of `walk()` and `lapply()` in this context.
```{r eval=FALSE}
# Select numeric columns only from iris
df <- iris[, -5]
# Create boxplots for each numeric column using lapply
lapply(names(df), function(col_name) {
boxplot(df[[col_name]],
main = paste("Boxplot of", col_name),
ylab = col_name,
col = "skyblue")
})
```
## Chapter summary
We have covered adequately about the loop, apply family, and function.
| **Concept** | **Brief Description** | **When to Use** |
|:-----------------------|:-----------------------|:-----------------------|
| **Loop** | Iterative control structures (e.g., `for`, `while`) used to repeat a block of code for a set number of iterations or conditions. | When flexibility and customization are needed for tasks that may not easily fit into vectorized operations. |
| **Apply Family** | A group of vectorized functions (`apply`, `lapply`, `sapply`, `mapply`, etc.) that simplify applying functions to data structures. | When performing repetitive operations over data (e.g., rows, columns, or lists) without writing explicit loops. |
| **Function** | A reusable block of code that performs a specific task. Functions can be named or anonymous (e.g., `function(x) x^2`). | Use when you need modular, repeatable tasks or computations that are applied across datasets. |
: Brief summary of loop, apply family, and function. {#tbl-summary-loop-apply-function}
For readers seeking a greater challenge, I recommend reading the [second edition of *R for Data Science* book](https://r4ds.hadley.nz/), particularly [Chapter 25](https://r4ds.hadley.nz/functions) on functions and [Chapter 27](https://r4ds.hadley.nz/base-r) on loops and the apply family [@wickham2023r]. Additionally, I suggest exploring [Chapter 21](https://r4ds.had.co.nz/iteration.html) of the [first edition of *R for Data Science*](https://r4ds.had.co.nz/index.html), as it covers the loops and apply family more extensively [@wickham2017r].
## Revision
1. Using `USJudgeRatings` dataset, a built-in dataset in R, write a for loop that calculates the median for the first two columns and prints the result. Ensure the medians are rounded to two decimal places.
```{r eval=FALSE}
data("USJudgeRatings")
```
2. Write a for loop that generates histograms for each numeric column in the `iris` dataset. Customize each histogram to include a title and an appropriate x-axis label.
3. Analyse the following code and explain why it does not produce the expected output. Correct the code to ensure it calculates the sum of the first 5 rows of the `Sepal.Width` column in the `iris` dataset:
```{r eval=FALSE}
count <- 0
for (i in 1:5) {
count = count + iris$Sepal.Width
}
print(count)
```
4. Review the following code and identify the issue:
```{r eval=FALSE}
apply(iris, MARGIN = 1, FUN = mean)
```
Why does this produce an error or unexpected output? Correct the code so it calculates the mean of all numeric columns for each row in the `iris` dataset.
5. Rewrite the following base R code using the equivalent `purrr` function:
```{r eval=FALSE}
# Load the library
library(dplyr)
# Change the codes below to the equivalent purrr function
lapply(iris %>% select(-Species), mean)
```
6. What are the difference between `lapply()` and `sapply()`.
7. Using the `summary_func()` function in @sec-loop-function, calculate the statistical summaries (mean, SD, median, IQR) for all numeric columns in the `USJudgeRatings` dataset, a built-in dataset in R. Use the `sapply()` function and for loop implementation.
```{r eval=FALSE}
data("USJudgeRatings")
```
8. Create a function called `multiply_num` that takes two arguments and returns their product. Use this function to multiply 10 and 20. The function should result an output as below.
```{r echo=FALSE}
multiply_num <- function(a, b) {
return(a * b)
}
```
```{r}
multiply_num(10, 20)
```
9. Modify the `avg_num()` function in @sec-loop-function to include an optional argument `round()` that rounds the mean to a specified number of decimal places. Test it with the `iris$Sepal.Length` column, rounding to 3 decimal places. The function should result an output as below.
```{r echo=FALSE}
avg_num <- function(x_range) {
# Sum of all elements in the vector
sum_x <- sum(x_range)
# Determine the number of elements in the vector
n <- length(x_range)
# Calculate mean
mean_value <- sum_x / n
# Return the mean
return(round(mean_value, digits = 3))
}
```
```{r}
avg_num(iris$Sepal.Length)
```
10. Write an anonymous function that takes a numeric vector and returns a vector of squared values only for numbers greater than 5. Test this using `sapply()` with a sequence from 1 to 10.