scripts/2_one-predictor-marriage.Rmd

---
title: "Simple regression -- adding a predictor"
author: "Taavi Päll and Ülo Maiväli"
date: "2021-10-02"
output: github_document
editor_options: 
  markdown: 
    wrap: 72
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Loading some required libraries

```{r}
library(tidyverse)
library(here)
library(brms)
library(bayesplot)
library(tidybayes)
library(modelr)
library(rethinking)
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
```

### Simple regression -- adding a predictor

We will use **WaffleDivorce** data for the individual States of the USA, 
describing number of Waffle House diners and various marriage and 
demographic facts.

Loading WaffleDivorce data from rethinking package.

```{r}
(waffledivorce <- read_csv2(here("data/waffledivorce.csv")))
```

Where **Divorce** is 2009 divorce rate per 1000 adults and **Marriage** is 2009 marriage rate per 1000 adults.

> We want to see how Marriage rate predicts Divorce rate. 

In order to get divorced you need to get married first, but there is no reason 
to believe that high marriage rate causes high divorce rate.

There are at least two possible scenarios with high marriage rate, first, 
high rate means that marriage is highly valued in a society or, second, 
marriage is inflated.

First scenario may imply that marriage rate is negatively associated with 
divorce rate. Second scenario is compatible with positive association.

```{r}
waffledivorce %>% 
  ggplot(aes(x = Marriage, y = Divorce)) +
  geom_point() +
  geom_smooth(method = "lm") +
  coord_fixed() +
  labs(x = "Marriage rate", y = "Divorce rate")
```

Let's calculate summary stats from non-scaled data:
```{r}
waffledivorce %>% 
  mutate_at("MedianAgeMarriage", ~.x / 10) %>% 
  summarise_at(
    c("Marriage", "Divorce", "MedianAgeMarriage"), 
    list(median = median, sd = sd)
    )
```

For modeling, we want to bring marriage and divorce rate into relative scale, 
given that negative divorce rate does not make sense when marriage rate equals 
zero.

We are using `standardize` function from **rethinking** package as it returns vector 
in contrast to `scale` function. `standardize` is a wrapper around `scale` 
anyway and keeping matrices in our tibbles causes trouble for some downstream functions.

```{r}
divorce <- waffledivorce %>% 
  select(Loc, Marriage, Divorce, MedianAgeMarriage) %>% 
  mutate_at(c("Marriage", "Divorce", "MedianAgeMarriage"), standardize)
```

#### Model

Here we specify our model by using priors from non-scaled original data.

$$D_i \sim \text{Normal}(\mu_i, \sigma)\quad\text{[likelihood]}$$
$$\mu_i = \alpha + \beta M_i\quad\text{[linear model]}$$
$$\alpha \sim \text{Normal}(97.5, 18.5)\quad[\alpha\ \text{prior}]$$
$$\beta \sim \text{Normal}(0, 2.5)\quad[\beta\ \text{prior}]$$
$$\sigma \sim \text{Normal}(0, 18.5)\quad[\sigma\ \text{prior}]$$

- Likelihood denotes that *i* on $\mu_i$ indicates that the mean depends upon the row. *D* denotes divorce rate.      
- The mean \mu is no longer a parameter to be estimated, $\mu_i$ is constructed from other parameters, $\alpha$ and $\beta$, and the predictor variable *M*, marriage rate. There is also deterministic relationship between $\mu$ and $\alpha$ and $\beta$, denoted by "=".   
- Then there are weakly informative priors for alpha, sigma and beta.    


We can see that by default, **brms** provides **flat** priors for beta coefficients,
so minimally we should provide these.
```{r}
get_prior(formula = Divorce ~ Marriage, data = divorce, family = gaussian())
```

First, we can test if our prior choice is reasonable by sampling only from prior:

Exercise, please choose your (better than default) priors for Intercept, beta and sigma parameters 
using **scaled divorce** data and fit the model.

```{r}
priors <- c(
  prior("normal(0, 0.2)", class = "Intercept"),
  prior("normal(0, 0.5)", class = "b"),
  prior("normal(0, 1)", class = "sigma")
)
mod4.0 <- brm(
  formula = Divorce ~ Marriage,
  data = divorce,
  family = gaussian(),
  prior = priors,
  chains = 3,
  file = here("models/Divorce~Marriage_prior_only"),
  sample_prior = "only",
  file_refit = "on_change"
)
```


```{r}
summary(mod4.0)
```


Run pp_check multiple times, to get impression of prior-only performance:
```{r}
pp_check(mod4.0)
```

Let's have a look how well our prior-only model recovers different distribution parameters
```{r}
y <- as.numeric(divorce$Divorce) 
yrep <- posterior_predict(mod4.0)
```

We can check mean and max values in replications:
```{r}
ppc_stat(y, yrep, stat = "mean", binwidth = 1)
```

```{r}
ppc_stat(y, yrep, stat = "sd", binwidth = 1)
```

```{r}
ppc_stat(y, yrep, stat = "max", binwidth = 1)
```


Let's visualize (keeping only small amount of samples to avoid over-plotting):
```{r}
samples4.0 <- as_tibble(posterior_samples(mod4.0)) %>% 
  sample_n(50)
```

Here we can see a sample of plausible regression lines, based on our prior-only model:
```{r}
ggplot() +
  # geom_point(data = divorce, aes(Marriage, Divorce)) +
  geom_abline(data = samples4.0, aes(slope = b_Marriage, intercept = b_Intercept), size = 0.3, alpha = 0.3) +
  labs(x = "Marriage rate", y = "Divorce rate") +
  scale_x_continuous(limits = c(-3, 3)) +
  scale_y_continuous(limits = c(-3, 3)) +
  coord_fixed() +
  theme_classic()
```

Next, let's sample fit model with data.

```{r}
priors <- c(
  prior("normal(0, 0.2)", class = "Intercept"),
  prior("normal(0, 0.5)", class = "b"),
  prior("normal(0, 1)", class = "sigma")
)
mod4.2 <- brm(
  formula = Divorce ~ Marriage,
  data = divorce,
  family = gaussian(),
  prior = priors,
  chains = 3,
  file = here("models/Divorce~Marriage"),
  sample_prior = "yes"
)
```


```{r}
summary(mod4.2)
```


```{r}
plot(mod4.2)
```

#### Visualize the model's inferences.


```{r}
samples4.2 <- as_tibble(posterior_samples(mod4.2)) %>% 
  sample_n(50)
```

We can see that, after including data to fit our model, region occupied by 
plausible regression lines becomes much more constrained:

```{r}
ggplot() +
  geom_point(data = divorce, aes(Marriage, Divorce)) +
  geom_abline(data = samples4.2, aes(slope = b_Marriage, intercept = b_Intercept), size = 0.3, alpha = 0.3) +
  labs(x = "Marriage rate", y = "Divorce rate") +
  theme_classic()
```


Let's add residuals and fitted values to our data:
```{r}
(post_sum4.2 <- divorce %>% 
   select(Marriage, Divorce) %>% 
   add_epred_draws(mod4.2) %>% 
   summarise_at(".epred", mean))
```

```{r}
coefs <- fixef(mod4.2)
post_sum4.2 %>% 
  ggplot(aes(x = Marriage)) +
  geom_point(aes(y = Divorce)) +
  geom_abline(slope = coefs[2, 1], intercept = coefs[1, 1]) +
  geom_segment(aes(xend = Marriage, y = .epred, yend = Divorce), size = 1/3, color = "blue")
```


#### Conditional effects

Model can be relatively easy visualized using `conditional_effects()` function, which displays 
conditional effects of one or more numeric and/or categorical predictors 
including two-way interaction effects.

Our simple model will look like so:
```{r}
plot(conditional_effects(mod4.2), points = TRUE, ask = FALSE)
```


#### Does marriage rate predict divorce rate?

We want to test one-sided hypothesis whether b_Marriage coefficient is bigger than 0.
```{r}
get_variables(mod4.2)
```

We can see that marriage rate predicts divorce rate with very high posterior probability.
```{r}
(h4.2 <- hypothesis(mod4.2, "Marriage > 0"))
```
Plot shows that most of the posterior distribution of Marriage coefficient lies on positive side.
```{r}
plot(h4.2, plot = FALSE)[[1]] +
  geom_vline(xintercept = 0, linetype = "dashed")
```


#### Individual work

We have heights data for adult individuals from Estonia (males and females) listed below, but 
we don't have their weights. Provide predicted weights and 90% quantile intervals 
for each of these individuals.
You can use `data(Howell1)` from **rethinking** package to fit the model.

```{r}
male_heights <- read_csv("
name, height
Martin Müürsepp, 208
Sten-Erik Jantson, 204
Rivo Vesik, 200
Kristjan Sarv, 198 
Jaanus Saks, 197 
Mart Sander, 196
Karl-Erik Taukar, 193
Koit Toome, 192
Rauno Märks, 192
Margus Vaher, 190
Sven Soiver, 189
Rasmus Mägi, 188
Hendrik Toompere jr, 187
Toomas Hendrik Ilves, 186
Hendrik Adler, 186
Martin Saar, 186
Tanel Padar, 186
Karl-Mihkel Salong, 185
Mikk Mäe, 185
Rolf Junior, 184.5 
Ženja Fokin, 184 
Rasmus Kaljujärv, 184
Uku Suviste, 184
Jüri Pootsmann, 183
Indrek Raadik, 182
Ott Lepland, 180.5
Mart Laar, 180
Rasmus Rändvee, 179
Ardo Kaljuvee, 179
Lauri Pedaja, 177
Allan Roosileht, 174
Erkki Sarapuu, 172")

female_heights <- read_csv("name, height
Kati Toots, 169
Monika Tuvi, 169
Kerli Kõiv, 157
Merlyn Uusküla, 165
Merliis Rinne, 164
Anna-Maria Galojan, 162
Laura Kõrgemäe, 178
Siiri Sõnajalg, 176.5
Viivi Sõnajalg, 178
Beatrice, 173
Ines Karu-Salo, 171
Jana Kask, 171
Jana Pulk, 168
Maarja-Liis Ilus, 170
Kristiina Aigro, 176
Kaia Kanepi, 181
Evelyn Sepp, 182
")

male_heights$male <- 1
female_heights$male <- 0
(heights_est <- bind_rows(male_heights, female_heights))
```

Howell data can be loaded like so:
```{r}
data(Howell1)
```


```{r}

```


### Multiple regression -- adding another predictor

So, we found that marriage rate can predict divorce rate.
But how are marriage and divorce rates related to age (at marriage). 


We can see that divorce rate decreases with increasing age at marriage.
```{r}
divorce %>% 
  ggplot(aes(MedianAgeMarriage, Divorce)) +
  geom_point() +
  geom_smooth(method = "lm")
```
So does marriage rate -- with increasing age marriage rate decreases.
```{r}
divorce %>% 
  ggplot(aes(MedianAgeMarriage, Marriage)) +
  geom_point() +
  geom_smooth(method = "lm")
```


#### Model

To accommodate more variables into our model, we just expand the linear model by adding more coefficients.

$$D_i \sim \text{Normal}(\mu_i, \sigma)\quad\text{[likelihood]}$$

$$\mu_i = \alpha + \beta_M M_i + \beta_A A_i \quad\text{[linear model]}$$

$$\alpha \sim \text{Normal}(0, 0.2)\quad[\alpha\ \text{prior}]$$
$$\beta_M \sim \text{Normal}(0, 0.5)\quad[\beta_M\ \text{prior}]$$

$$\beta_A \sim \text{Normal}(0, 0.5)\quad[\beta_A\ \text{prior}]$$

$$\sigma \sim \text{Normal}(0, 1)\quad[\sigma\ \text{prior}]$$

- Now, $\beta_M$ and $\beta_A$ are coefficients for the predictor variable *M*, 
marriage rate, and predictor variable *A*, median age at marriage, respectively.
   

Here are our default priors:
```{r}
get_prior(formula = Divorce ~ Marriage + MedianAgeMarriage, data = divorce, family = gaussian())
```


```{r}
priors <- c(
  prior("normal(0, 0.2)", class = "Intercept"),
  prior("normal(0, 0.5)", class = "b"),
  prior("normal(0, 1)", class = "sigma")
)
mod4.3 <- brm(
  formula = Divorce ~ Marriage + MedianAgeMarriage,
  data = divorce,
  family = gaussian(),
  prior = priors,
  chains = 3,
  file = here("models/Divorce~Marriage+MedianAgeMarriage"),
  sample_prior = "yes"
)
```

After adding MedianAgeMarriage to our model, we can see that Marriage is now 
close to zero with 95% CI spanning from negative to positive values.
Whereas, MedianAgeMarriage coefficient is clearly negative.

```{r}
summary(mod4.3)
```

```{r}
plot(mod4.3)
```


```{r}
pp_check(mod4.3)
```

If we leave marriage rate out of model altogether, median age at marriage is 
still very similar to previous model and clearly negative.

```{r}
mod4.4 <- brm(
  formula = Divorce ~ MedianAgeMarriage,
  data = divorce,
  family = gaussian(),
  prior = priors,
  chains = 3,
  file = here("models/Divorce~MedianAgeMarriage"),
  sample_prior = "yes"
)
```

```{r}
summary(mod4.4)
```


#### Spurius association

Let's compare posteriors of parameters from all these models, we can drop intercept as it's expected to be 0.

```{r}
samples4.2 <- posterior_samples(mod4.2)
samples4.3 <- posterior_samples(mod4.3)
samples4.4 <- posterior_samples(mod4.4)
(samples <- list(samples4.2 = samples4.2, samples4.3 = samples4.3, samples4.4 = samples4.4) %>% 
  bind_rows(.id = "id") %>% 
  select(id, b_Marriage, b_MedianAgeMarriage) %>% 
  pivot_longer(cols = starts_with("b_")) %>% 
  drop_na())
```

Here we can see that Marriage coefficient is clearly different from zero only 
when MedianAgeMarriage is left out from model. By adding Marriage to model, 
MedianAgeMarriage becomes slightly more uncertain. 

```{r}
samples %>% 
  ggplot() +
  stat_pointinterval(aes(id, value)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  facet_wrap(~name) +
  coord_flip() +
  theme(axis.title.y = element_blank())
```

Once we know median age at marriage, there is little or no additional predictive 
power in also knowing the rate of marriage. This does not mean that there is 
no value in knowing marriage rate, besides when median age at marriage is 
not available, then marriage would be very valuable.

The association between marriage rate and divorce rate is spurious, 
caused by the influence of age of marriage on both marriage rate and divorce rate.


#### Plots to evaluate multiple regression fit

R. McElreath proposes in his 'Rethinking' book (2020) three types of plots to 
assess multiple regression model. "None of these techniques is suitable for all 
jobs, and most do not generalize beyond linear regression."

- **Predictor residual plots** show the outcome against residual predictor values to better understanding the statistical model

- **Counterfactual plots** show the implied predictions for imaginary 
experiments (generated data) to explore the causal implications of manipulating 
one or more variables.

- **Posterior prediction plots** show model-based predictions against raw data, 
or otherwise display the error in prediction to check fit and assessing predictions.


##### Posterior prediction plot

In addition to understanding the estimates, it's important to check the model 
fit against the observed data. 

Here, we plot predicted values versus observed values at data point level:
```{r}
divorce %>% 
  add_epred_draws(mod4.4) %>% 
  mean_qi() %>% 
  ggplot(aes(x = Divorce, y = .epred)) +
  geom_abline(linetype = "dashed") +
  geom_point() +
  geom_linerange(aes(ymin = .lower, ymax = .upper)) +
  geom_text(data = . %>% filter(Loc %in% c("ID", "UT")),
            aes(label = Loc), 
            hjust = 0, nudge_x = - 0.3) +
  labs(x = "Observed divorce", 
       y = "Predicted divorce",
       caption = "Points denote mean and\nline denotes 95% quantile interval") +
  coord_fixed() +
  theme_classic()
```


##### Predictor residual plots

*A predictor residual is the average prediction error when we use all of the 
other predictor variables to model a predictor of interest.*

We have two predictors in our model -- Marriage and MedianAgeMarriage.
Let's fit two models.

First Marriage ~ MedianAgeMarriage.
```{r}
priors <- c(
  prior("normal(0, 0.2)", class = "Intercept"),
  prior("normal(0, 0.5)", class = "b"),
  prior("normal(0, 1)", class = "sigma")
)
mod4.5 <- brm(
  formula = Marriage ~ MedianAgeMarriage,
  data = divorce,
  family = gaussian(),
  prior = priors,
  chains = 3,
  file = here("models/Marriage~MedianAgeMarriage"),
  sample_prior = "yes"
)
```


```{r}
summary(mod4.5)
```

Then MedianAgeMarriage ~ Marriage.
```{r}
mod4.6 <- brm(
  formula = MedianAgeMarriage ~ Marriage,
  data = divorce,
  family = gaussian(),
  prior = priors,
  chains = 3,
  file = here("models/MedianAgeMarriage~Marriage"),
  sample_prior = "yes"
)
```


```{r}
summary(mod4.6)
```


```{r}
add_residual_draws(divorce, mod4.5) %>% 
  mean_qi() %>% 
  ggplot(aes(MedianAgeMarriage, Marriage)) +
  geom_point() +
  geom_abline(slope = fixef(mod4.5)[2,1], intercept = fixef(mod4.5)[1,1]) +
  geom_segment(aes(x = MedianAgeMarriage, xend = MedianAgeMarriage, y = Marriage, yend = Marriage - .residual), color = "blue")  +
  coord_fixed()
```

Residual variation in marriage rate shows little association with divorce rate.
This plot displays the linear relationship between divorce and marriage rates, 
having conditioned already on median age of marriage. 
```{r}
add_residual_draws(divorce, mod4.5) %>% 
  mean_qi() %>% 
  ggplot(aes(.residual, Divorce)) +
  geom_point() +
  geom_smooth(method = "lm") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(x = "Marriage rate residuals", y = "Divorce rate") +
  coord_fixed()
```


```{r}
add_residual_draws(divorce, mod4.6) %>% 
  mean_qi() %>% 
  ggplot(aes(Marriage, MedianAgeMarriage)) +
  geom_point() +
  geom_abline(slope = fixef(mod4.6)[2,1], intercept = fixef(mod4.6)[1,1]) +
  geom_segment(aes(x = Marriage, xend = Marriage, y = MedianAgeMarriage, yend = MedianAgeMarriage - .residual), color = "blue")
```

Divorce rate on age at marriage residuals, showing remaining variation, and 
this variation is associated with divorce rate.
States in which people marry older than expected for a given rate of marriage 
tend to have less divorce:
```{r}
add_residual_draws(divorce, mod4.6) %>% 
  mean_qi() %>% 
  ggplot(aes(.residual, Divorce)) +
  geom_point() +
  geom_smooth(method = "lm") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(x = "Median age of marriage residuals", y = "Divorce rate") +
  coord_fixed()
```


> These predictor-based models show that full regression models measure the 
remaining association of each predictor with the outcome, after already 
knowing the other predictors.


##### Counterfactual plots

Counterfactual plots display the causal implications of the model. 
They are called counterfactual (*relating to or expressing what has not happened or is not the case*), because they can be produced for any values of the predictor variables, even unobserved combinations.

Basic use would be to observe how the outcome would change as you change one 
predictor at a time.

Here we generate data for marriage rate and keep median age of marriage at 0 (never happens).
```{r}
tibble(Marriage = seq(from = -3, to = 3, length.out = 30),
       MedianAgeMarriage = mean(divorce$MedianAgeMarriage)) %>% 
  add_predicted_draws(mod4.3) %>% 
  mean_qi(.prediction) %>% 
  ggplot(aes(Marriage, .prediction)) +
  geom_line() +
  geom_ribbon(aes(ymin = .lower, ymax = .upper), alpha = 1/5) +
  labs(subtitle = "Counterfactual plot for which MedianAgeMarriage = 0",
       y = "Divorce", 
       caption = "Line denotes model prediction and\nlight gray ribbon denotes prediction 95% qi.") +
  coord_fixed() +
  theme_classic()
```

Here we generate data for median age of marriage and keeping marriage rate 0.
```{r}
tibble(MedianAgeMarriage = seq(from = -3, to = 3, length.out = 30),
       Marriage = mean(divorce$Marriage)) %>% 
  add_predicted_draws(mod4.3) %>% 
  mean_qi(.prediction) %>% 
  ggplot(aes(MedianAgeMarriage, .prediction)) +
  geom_line() +
  geom_ribbon(aes(ymin = .lower, ymax = .upper), alpha = 1/5) +
  labs(subtitle = "Counterfactual plot for which Marriage = 0",
       y = "Divorce", 
       caption = "Line denotes model prediction and\nlight gray ribbon denotes prediction 95% qi.") +
  coord_fixed() +
  theme_classic()
```