06-iterate.Rmd

---
title: "Iterate and model"
date: "2019-03-25
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Vectors

```{r}
library(tidyverse)
```

Vectors are basic R data structures, even single value (scalar) is a vector (with lenght 1) or you can have empty vector (with lenght 0). 

You can have two-dimentional vectors -- matixes.
data frames consist of vectors bound column wise. 

There are two types of vectors:

- atomic vectors: most frequent types that you encounter are *logical*, *integer*, *double*, *character*. 
- lists can contain any type of R data structure (including lists).

Main difference between atomic vectors and lists is that atomic vectors can contain only one type of data and are therefore *homogeneous* whereas list can be *heterogeneous*.

There is also 
```{r}
NULL
```

which represents absence of vector and behaves as vector with length 0.

Every vector has two key properties *type* and *length*:

Look for type and class of object:
```{r}
typeof(1:10)
typeof(month.name)
class(1:10)
class(month.name)
```

Query length of object:
```{r}
length(1:10)
length(month.name)
length(numeric())
```

### Types of atomic vectors

#### Logical

Can contain only three values FALSE, TRUE, NA:
```{r}
1:10 %% 2 == 0
c(TRUE, FALSE, TRUE, NA)
```

#### Numeric

Integer and double vectors are known collectively as numeric vectors:
```{r}
typeof(1)
typeof(1L)
```

In case of doubles and integers, it's important to keep in mind that:
- doubles are approximations and represent floating-point numbers that are shown only within the limits of the largest double or integer and the machine's precision.

the smallest positive floating-point number:
```{r}
.Machine$double.eps
```

- Integers have one special value, NA, while doubles can have four: NA, NaN, Inf, -Inf
```{r}
c(-1, 0, 1) / 0
```

#### Character vectors are most complex type, because each element of character vector is string and string can contain arbitrary amount of data.

Importantly, R has global string pool and each unique string is stored in memory only once and every use of string points to that representation. This reduces the amount of memory to store duplicated strings:
```{r}
x <- "R stats TalTech"
pryr::object_size(x)
```

Let's replicate x 1000 times
```{r}
y <- rep(x, 1000)
pryr::object_size(y)
```

y does not take 1000x more memory as x, but 8 * 1000 + 120 = 8.11kB, where each pointer is 8 bytes.


### Using vectors

#### coersion
Vector coersion is converting vector from one type to another and can occur 

- explicitly:
```{r}
as.character(1:10)
as.logical(c(0,1,1,0))
as.numeric(c("1", "2", "3.1"))
```

- implicitly, for example when you use logical vector in numeric summary function:

```{r}
sum(c(TRUE, TRUE, FALSE))
```

When you try to create vector with multiple types, most complex type wins:
```{r}
typeof(c(TRUE, 1L))
typeof(c(1L, 1.5))
typeof(c(1, "a"))
```

> Atomic vector cannot contain values from mix of different types, because type is property of complete vector.

#### Scalars and recycling rules

R also coerces the length of vectors, by recycling shorter vector to the same length as longer vector.

Here we add 100 to 6 random numbers sampled from range 1 to 10:
```{r}
sample(1:10, size = 6) + 100
```

or 
```{r}
runif(10) < 0.4
```

> Basic mathematical operations are vectorised in R.

Together, it's easy to understand when you add two vectors with same length or vector and scalar (vector with length 1).
But what happens when you add vectors of different lengths?
```{r}
1:10 + 1:2
```

```{r}
paste(1:10, c("a", "b"))
```


Here, in these two examples above, R expands or recycles shortest vector to the same length as the longest.

And you will get warning if longer vector is not a multiple of shorter vector:

```{r}
1:10 + 1:3
```

Importantly, tidyverse functions will throw an error if recycling other than sclar:

```{r, eval=FALSE}
tibble(
  a = 1:4, 
  b = 1:2
  )
```

If you need to recycle in tidyverse, specify it explicitly with rep() function:
```{r}
tibble(
  a = 1:4, 
  b = rep(1:2, 2)
  )
```


#### Naming vectors

All types of vectors can be named:
```{r}
c(a = 1, b = 2, c = 3)
```

or by using purrr::set_names()
```{r}
purrr::set_names(1:3, nm = c("a", "b", "c"))
```

purrr::set_names() can be used within pipe!


#### Subsetting

Examples:
```{r}

```


### Lists

List are complex vectors, because they can contain other lists.
Lists can be used to represent hierarchical structures.


List is created by list() function:
```{r}
list(1, 2, 3)
```


structure of a list can be viewed with str() command:
```{r}
str(list(1, 2, 3))
```

List can contain different type of objects:
```{r}
y <- list(do = 3.1, int = 1L, lo = TRUE, ch = month.abb, li = list(1, 2, 3))
str(y)
```

#### Subsetting lists

There are three ways to subset a list:

- "[" extracts a sublist, result is always a list

```{r}
y[c(1, 4)]
```


```{r}
str(y[c(1, 4)])
```

- "[[" extracts single list element and it removes one level of hierarchy from the list:

```{r}
y[[1]]
```

```{r}
y[[4]]
```

- "$" is a shorthand for extracting names elements of a list. It works similarly to [[ except you don't need use quotes:

```{r}
y$ch
```


```{r}
y[["ch"]]
```

## Iteration with purrr

Iteration handles code duplication in cases **when you need to do same thing to multiple inputs**.

Reducing of code duplication is desirable, because

1. It improves code readability and makes easier to see the intent of your code because your eyes are drawn to what's different, not what's always the same.

2. It is easier to update and debug your code, because you need to do changes in only one place.

3. You are likely to have fewer bugs. 


Reducing duplication is achieved by 1) **use of functions**, either generic (mean etc) or custom, and 2) **by iteration** (repetition) of the same operation on different columns or datasets/-splits.

Generally iteration is solved by using loops, but loops are verbose. 

Base R has apply family functions to wrap loops into verbs:
- `apply(df, 1, mean)`
- `lapply(x, mean)`
- `sapply(x, mean)`


"tidyverse" has **purrr** package that contains map() family functions to loop over vectors and lists. 

There is different map() function for each type of input:

- map() outputs list
- map_lgl() outputs logical vector
- map_int() outputs integer vector
- map_dbl() outputs double vector
- map_chr() outputs character vector
- map_df() outputs dataframe

Each function takes vector as input applies function to each element of the vector and outputs new vector with same length (and names) as the input.

```{r}
?map
```


The main benefit of using map() functions over loops is not speed but code clarity.

Let's see how they work. 

Let's create toy data frame:
```{r}
tb <- tibble(
  a = rnorm(10, 0),
  b = rnorm(10, 1),
  c = rnorm(10, 2),
  d = rnorm(10, 3)
)
```

Here we calculate column means over data frame:
```{r}
map_dbl(tb, mean)
```

Previous code could be replaced by dplyr summarise functions, at least when input is a data frame.

Nevertheless, tibble and data.frame is also a list:
```{r}
is.list(tb)
```


You can use map_*() functions that output atomic vector when function returns single value. 

Additional arguments can be passed to functions via ellipsis (...):

Let's pass na.rm=TRUE to mean function (in this example nothing changes because we have no NAs in out data):
```{r}
map_dbl(tb, mean, na.rm = TRUE)
```

### Split data and map on the fly

Load gapminder dataset from gapminder library
```{r}
library(gapminder)
gapminder
```

gapminder contains Gapminder https://www.gapminder.org data on life expectancy, GDP per capita, and population by country.

Split data by country and fit same linear model to all pices:

```{r}
models <- gapminder %>% 
  split(.$country) %>% 
  map(function(x) lm(lifeExp ~ year + gdpPercap, data = x))
```

Note that syntax to create anonymous function is quite verbose:
```{r}
function(x) lm(lifeExp ~ year + gdpPercap, data = x)
```

Whereas purrr allows to use one-sided formula as drop-in replacement:
```{r}
models <- gapminder %>% 
  split(.$country) %>% 
  map(~ lm(lifeExp ~ year + gdpPercap, data = .))
```

Let's have look at model summaries of all these models:

First, we can extract r.squared component from model objects:
```{r}
models %>% 
  map(summary) %>% 
  map_dbl(~.$r.squared)
```

Or more conveniently with purrr, you can directly extract a named component:
```{r}
models %>% 
  map(summary) %>% 
  map_dbl("r.squared")
```

You can also use integers to extract elements by position:
```{r}
x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))
x %>% map_dbl(2)
```


### Dealing with failure

One bad apple can spoil the whole bunch. 

If one out of many iterations fail, then you will get error message and no output. This is annoying, as one failure prevents you from accessing all successful calculations.

purrr handles these situations with safely() function.

safely() takes a function and returns its modified version wrapped into error handler. 

safely-modified function will never throw an error, but returns a list with two with two elements:

- result - the original result and in case of error NULL

- error - error object and when operation was successful, returns NULL

One of the list objects returned by safely is allways NULL

Let's create safe log function:
```{r}
safe_log <- safely(log)
```

In case of success:
```{r}
safe_log(10)
```

In case of error:
```{r}
safe_log("a")
```

Safely works with map:
```{r}
x <- list(1, 10, "a")
y <- x %>% map(safe_log)
y
```

You can get then two lists one for results and the other one for errors with purrr::transpose()

```{r}
y <- y %>% transpose()
y
```

Then you can subset good results:

```{r}
good <- y$error %>% map_lgl(is.null)
x[!good] # bad apple in original data

```

Extract good results and flatten into numeric vector
```{r}
y$result[good] %>% flatten_dbl()
```

flatten functions remove one level of hierarchy from a list:
```{r}
?purrr::flatten
```