-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy path06-iterate.Rmd
458 lines (320 loc) · 9.8 KB
/
06-iterate.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
---
title: "Iterate and model"
date: "2019-03-25
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Vectors
```{r}
library(tidyverse)
```
Vectors are basic R data structures, even single value (scalar) is a vector (with lenght 1) or you can have empty vector (with lenght 0).
You can have two-dimentional vectors -- matixes.
data frames consist of vectors bound column wise.
There are two types of vectors:
- atomic vectors: most frequent types that you encounter are *logical*, *integer*, *double*, *character*.
- lists can contain any type of R data structure (including lists).
Main difference between atomic vectors and lists is that atomic vectors can contain only one type of data and are therefore *homogeneous* whereas list can be *heterogeneous*.
There is also
```{r}
NULL
```
which represents absence of vector and behaves as vector with length 0.
Every vector has two key properties *type* and *length*:
Look for type and class of object:
```{r}
typeof(1:10)
typeof(month.name)
class(1:10)
class(month.name)
```
Query length of object:
```{r}
length(1:10)
length(month.name)
length(numeric())
```
### Types of atomic vectors
#### Logical
Can contain only three values FALSE, TRUE, NA:
```{r}
1:10 %% 2 == 0
c(TRUE, FALSE, TRUE, NA)
```
#### Numeric
Integer and double vectors are known collectively as numeric vectors:
```{r}
typeof(1)
typeof(1L)
```
In case of doubles and integers, it's important to keep in mind that:
- doubles are approximations and represent floating-point numbers that are shown only within the limits of the largest double or integer and the machine's precision.
the smallest positive floating-point number:
```{r}
.Machine$double.eps
```
- Integers have one special value, NA, while doubles can have four: NA, NaN, Inf, -Inf
```{r}
c(-1, 0, 1) / 0
```
#### Character vectors are most complex type, because each element of character vector is string and string can contain arbitrary amount of data.
Importantly, R has global string pool and each unique string is stored in memory only once and every use of string points to that representation. This reduces the amount of memory to store duplicated strings:
```{r}
x <- "R stats TalTech"
pryr::object_size(x)
```
Let's replicate x 1000 times
```{r}
y <- rep(x, 1000)
pryr::object_size(y)
```
y does not take 1000x more memory as x, but 8 * 1000 + 120 = 8.11kB, where each pointer is 8 bytes.
### Using vectors
#### coersion
Vector coersion is converting vector from one type to another and can occur
- explicitly:
```{r}
as.character(1:10)
as.logical(c(0,1,1,0))
as.numeric(c("1", "2", "3.1"))
```
- implicitly, for example when you use logical vector in numeric summary function:
```{r}
sum(c(TRUE, TRUE, FALSE))
```
When you try to create vector with multiple types, most complex type wins:
```{r}
typeof(c(TRUE, 1L))
typeof(c(1L, 1.5))
typeof(c(1, "a"))
```
> Atomic vector cannot contain values from mix of different types, because type is property of complete vector.
#### Scalars and recycling rules
R also coerces the length of vectors, by recycling shorter vector to the same length as longer vector.
Here we add 100 to 6 random numbers sampled from range 1 to 10:
```{r}
sample(1:10, size = 6) + 100
```
or
```{r}
runif(10) < 0.4
```
> Basic mathematical operations are vectorised in R.
Together, it's easy to understand when you add two vectors with same length or vector and scalar (vector with length 1).
But what happens when you add vectors of different lengths?
```{r}
1:10 + 1:2
```
```{r}
paste(1:10, c("a", "b"))
```
Here, in these two examples above, R expands or recycles shortest vector to the same length as the longest.
And you will get warning if longer vector is not a multiple of shorter vector:
```{r}
1:10 + 1:3
```
Importantly, tidyverse functions will throw an error if recycling other than sclar:
```{r, eval=FALSE}
tibble(
a = 1:4,
b = 1:2
)
```
If you need to recycle in tidyverse, specify it explicitly with rep() function:
```{r}
tibble(
a = 1:4,
b = rep(1:2, 2)
)
```
#### Naming vectors
All types of vectors can be named:
```{r}
c(a = 1, b = 2, c = 3)
```
or by using purrr::set_names()
```{r}
purrr::set_names(1:3, nm = c("a", "b", "c"))
```
purrr::set_names() can be used within pipe!
#### Subsetting
Examples:
```{r}
```
### Lists
List are complex vectors, because they can contain other lists.
Lists can be used to represent hierarchical structures.
List is created by list() function:
```{r}
list(1, 2, 3)
```
structure of a list can be viewed with str() command:
```{r}
str(list(1, 2, 3))
```
List can contain different type of objects:
```{r}
y <- list(do = 3.1, int = 1L, lo = TRUE, ch = month.abb, li = list(1, 2, 3))
str(y)
```
#### Subsetting lists
There are three ways to subset a list:
- "[" extracts a sublist, result is always a list
```{r}
y[c(1, 4)]
```
```{r}
str(y[c(1, 4)])
```
- "[[" extracts single list element and it removes one level of hierarchy from the list:
```{r}
y[[1]]
```
```{r}
y[[4]]
```
- "$" is a shorthand for extracting names elements of a list. It works similarly to [[ except you don't need use quotes:
```{r}
y$ch
```
```{r}
y[["ch"]]
```
## Iteration with purrr
Iteration handles code duplication in cases **when you need to do same thing to multiple inputs**.
Reducing of code duplication is desirable, because
1. It improves code readability and makes easier to see the intent of your code because your eyes are drawn to what's different, not what's always the same.
2. It is easier to update and debug your code, because you need to do changes in only one place.
3. You are likely to have fewer bugs.
Reducing duplication is achieved by 1) **use of functions**, either generic (mean etc) or custom, and 2) **by iteration** (repetition) of the same operation on different columns or datasets/-splits.
Generally iteration is solved by using loops, but loops are verbose.
Base R has apply family functions to wrap loops into verbs:
- `apply(df, 1, mean)`
- `lapply(x, mean)`
- `sapply(x, mean)`
"tidyverse" has **purrr** package that contains map() family functions to loop over vectors and lists.
There is different map() function for each type of input:
- map() outputs list
- map_lgl() outputs logical vector
- map_int() outputs integer vector
- map_dbl() outputs double vector
- map_chr() outputs character vector
- map_df() outputs dataframe
Each function takes vector as input applies function to each element of the vector and outputs new vector with same length (and names) as the input.
```{r}
?map
```
The main benefit of using map() functions over loops is not speed but code clarity.
Let's see how they work.
Let's create toy data frame:
```{r}
tb <- tibble(
a = rnorm(10, 0),
b = rnorm(10, 1),
c = rnorm(10, 2),
d = rnorm(10, 3)
)
```
Here we calculate column means over data frame:
```{r}
map_dbl(tb, mean)
```
Previous code could be replaced by dplyr summarise functions, at least when input is a data frame.
Nevertheless, tibble and data.frame is also a list:
```{r}
is.list(tb)
```
You can use map_*() functions that output atomic vector when function returns single value.
Additional arguments can be passed to functions via ellipsis (...):
Let's pass na.rm=TRUE to mean function (in this example nothing changes because we have no NAs in out data):
```{r}
map_dbl(tb, mean, na.rm = TRUE)
```
### Split data and map on the fly
Load gapminder dataset from gapminder library
```{r}
library(gapminder)
gapminder
```
gapminder contains Gapminder https://www.gapminder.org data on life expectancy, GDP per capita, and population by country.
Split data by country and fit same linear model to all pices:
```{r}
models <- gapminder %>%
split(.$country) %>%
map(function(x) lm(lifeExp ~ year + gdpPercap, data = x))
```
Note that syntax to create anonymous function is quite verbose:
```{r}
function(x) lm(lifeExp ~ year + gdpPercap, data = x)
```
Whereas purrr allows to use one-sided formula as drop-in replacement:
```{r}
models <- gapminder %>%
split(.$country) %>%
map(~ lm(lifeExp ~ year + gdpPercap, data = .))
```
Let's have look at model summaries of all these models:
First, we can extract r.squared component from model objects:
```{r}
models %>%
map(summary) %>%
map_dbl(~.$r.squared)
```
Or more conveniently with purrr, you can directly extract a named component:
```{r}
models %>%
map(summary) %>%
map_dbl("r.squared")
```
You can also use integers to extract elements by position:
```{r}
x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))
x %>% map_dbl(2)
```
### Dealing with failure
One bad apple can spoil the whole bunch.
If one out of many iterations fail, then you will get error message and no output. This is annoying, as one failure prevents you from accessing all successful calculations.
purrr handles these situations with safely() function.
safely() takes a function and returns its modified version wrapped into error handler.
safely-modified function will never throw an error, but returns a list with two with two elements:
- result - the original result and in case of error NULL
- error - error object and when operation was successful, returns NULL
One of the list objects returned by safely is allways NULL
Let's create safe log function:
```{r}
safe_log <- safely(log)
```
In case of success:
```{r}
safe_log(10)
```
In case of error:
```{r}
safe_log("a")
```
Safely works with map:
```{r}
x <- list(1, 10, "a")
y <- x %>% map(safe_log)
y
```
You can get then two lists one for results and the other one for errors with purrr::transpose()
```{r}
y <- y %>% transpose()
y
```
Then you can subset good results:
```{r}
good <- y$error %>% map_lgl(is.null)
x[!good] # bad apple in original data
```
Extract good results and flatten into numeric vector
```{r}
y$result[good] %>% flatten_dbl()
```
flatten functions remove one level of hierarchy from a list:
```{r}
?purrr::flatten
```