71_example_glmm_muscatine.Rmd

# GLMM, Binary Outcome: Muscatine Obesity

```{r, include=FALSE}
knitr::opts_chunk$set(comment     = "",
                      echo        = TRUE, 
                      warning     = FALSE, 
                      message     = FALSE,
                      fig.align   = "center", # center all figures
                      fig.width   = 6,        # set default figure width to 4 inches
                      fig.height  = 4)        # set default figure height to 3 inches
```

## Packages

### CRAN

```{r, message=FALSE, error=FALSE}
library(tidyverse)    # all things tidy
library(pander)       # nice looking genderal tabulations
library(furniture)    # nice table1() descriptives
library(texreg)       # Convert Regression Output to LaTeX or HTML Tables
library(psych)        # contains some useful functions, like headTail

library(lme4)         # Linear, generalized linear, & nonlinear mixed models
library(gee)          # Generalized Estimating Equations
library(effects)      # Plotting estimated marginal means
library(performance)
library(interactions)

library(patchwork)    # combining graphics
```


### GitHub

Helper `extract` functions for exponentiating parameters form generalized regression models within a `texreg` table of model parameters.

```{r, message=FALSE, error=FALSE}
# install.packages("devtools")
# library(devtools)
# install_github("SarBearSchwartz/texreghelpr")

library(texreghelpr)
```


## Data Prep

> Data on Obesity from the Muscatine Coronary Risk Factor Study.

**Source:** 

Table 10 (page 96) in Woolson and Clarke (1984). 
With permission of Blackwell Publishing.

**Reference:** 

Woolson, R.F. and Clarke, W.R. (1984). Analysis of categorical incompletel longitudinal data. Journal of the Royal Statistical Society, Series A, 147, 87-99.

**Description:**

The **Muscatine Coronary Risk Factor Study (MCRFS)** was a longitudinal study of coronary risk factors in school children in Muscatine, Iowa *(Woolson and Clarke 1984; Ekholm and Skinner 1998)*. Five cohorts of children were measured for `height` and `weight` in **1977**, **1979**, and **1981**. `Relative weight` was calculated as the **ratio** of a child's observed weight to the median weight for their age-sex-height group. Children with a relative weight greater than 110% of the median weight for their respective stratum were classified as `obese`. The analysis of this study involves binary data *(1 = obese, 0 = not obese)* collected at successive time points.


This data was also using in an article title *"Missing data methods in longitudinal studies: a review"* (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3016756/).


**Variable List:**

* Indicators   

    + `id` Child's unique identification number
    + `occas` Occasion number: 1, 2, 3
    
    
* Outcome or dependent variable    

    + `obesity` Obesity Status, 0 = no, 1 = yes  
    
    
* Main predictor or independent variable of interest   

    + `gender` 0 = Male, 1 = Female
    + `baseage` Baseline Age, mid-point of age-cohort 
    + `currage` Current Age, mid-point of age-cohort


### Import


```{r}
data_raw <- read.table("https://raw.githubusercontent.com/CEHS-research/data/master/MLM/Muscatine.txt", header=TRUE)
```


```{r}
str(data_raw)
```


```{r}
psych::headTail(data_raw, top = 10)
```



### Restrict to 350ID's of children with complete data for Class Demonstration

Dealing with missing-ness and its implications are beyond the score of this class.  Instead we are going to restrict our class analysis to a subset of 350 children who have complete data

> I am using the `set.seed()` function so that I can replicate the restults later.

```{r}
complete_ids <- data_raw %>% 
  dplyr::filter(obesity %in% c("0", "1")) %>%
  dplyr::group_by(id) %>% 
  dplyr::summarise(n = n()) %>% 
  dplyr::filter(n == 3) %>% 
  dplyr::pull(id)

set.seed(8892)  # needed?

use_ids <- complete_ids %>% sample(350)

head(use_ids)
```


### Long Format


```{r}
data_long <- data_raw %>%  
  dplyr::filter(id %in% use_ids) %>% 
  mutate(id       = id     %>% factor) %>% 
  mutate(gender   = gender %>% factor(levels = 0:1, 
                                      labels = c("Male", "Female"))) %>% 
  mutate(age_base = baseage %>% factor) %>% 
  mutate(age_curr = currage %>% factor) %>% 
  mutate(occation = occas   %>% factor) %>% 
  mutate(obesity  = obesity %>% factor(levels = 0:1, 
                                       labels = c("No", "Yes"))) %>% 
  select(id, gender, age_base, age_curr, occation, obesity)
```


```{r}
str(data_long)
```


```{r}
psych::headTail(data_long, top = 10)
```


### Wide Format

```{r}
data_wide <- data_long %>% 
  tidyr::pivot_wider(names_from = occation,
                     names_sep = "_",
                     values_from = c(obesity, age_curr)) %>% 
  mutate_if(is.character, factor)%>% 
  group_by(id) %>% 
  mutate(num_miss = sum(is.na(c(obesity_1, obesity_2, obesity_3)))) %>% 
  ungroup() %>% 
  mutate(num_miss = as.factor(num_miss))
```


```{r}
str(data_wide)
```


```{r}
psych::headTail(data_wide, top = 10)
```













## Exploratory Data Analysis

### Summary Statistics

#### Demographics and Baseline

```{r}
data_wide %>% 
  dplyr::group_by(gender) %>% 
  furniture::table1("Baseline Age" = age_base, 
                    "Baseline Obesity" = obesity_1,
                    total   = TRUE,
                    test    = TRUE,
                    na.rm   = FALSE,
                    output  = "markdown")
```


#### Status over Time


```{r}
data_summary <- data_long %>% 
  dplyr::group_by(gender, age_curr) %>% 
  dplyr::mutate(obesityN = case_when(obesity == "Yes" ~ 1,
                                     obesity == "No"  ~ 0)) %>% 
  dplyr::filter(complete.cases(gender, age_curr, obesityN)) %>% 
  dplyr::summarise(n = n(),
                   prob_est = mean(obesityN),
                   prob_SD  = sd(obesityN),
                   prob_SE  = prob_SD/sqrt(n))

data_summary
```



### Visualize

#### By cohort and gender

```{r}
data_long %>% 
  dplyr::group_by(gender, age_base, age_curr) %>% 
  dplyr::mutate(obesityN = case_when(obesity == "Yes" ~ 1,
                                     obesity == "No"  ~ 0)) %>% 
  dplyr::filter(complete.cases(gender, age_curr, obesityN)) %>% 
  dplyr::summarise(n = n(),
                   prob_est = mean(obesityN),
                   prob_SD  = sd(obesityN),
                   prob_SE  = prob_SD/sqrt(n)) %>% 
  ggplot(aes(x = age_curr,
             y = prob_est,
             group = age_base,
             color = age_base)) +
  geom_point() +
  geom_line() +
  theme_bw() + 
  labs(x = "Child's Age, years",
       y = "Proportion Obese") +
  facet_grid(. ~ gender)
```


#### BY only gender

```{r}
data_summary %>% 
  ggplot(aes(x = age_curr,
             y = prob_est,
             group = gender)) +
  geom_ribbon(aes(ymin = prob_est - prob_SE,
                  ymax = prob_est + prob_SE,
                  fill = gender),
              alpha = .3) +
  geom_point(aes(color = gender,
                 shape = gender)) +
  geom_line(aes(linetype = gender,
                color = gender)) +
  theme_bw() + 
  scale_color_manual(values = c("dodger blue", "hot pink")) +
  scale_fill_manual(values = c("dodger blue", "hot pink")) +
  labs(x = "Child's Age, years",
       y = "Proportion Obese")
```

Smooth out the trends

```{r}
data_summary %>% 
  ggplot(aes(x = age_curr,
             y = prob_est,
             group = gender,
             color = gender)) +
  geom_smooth(method = "lm",
              formula = y ~ poly(x, 2),
              se = FALSE) +
  theme_bw() + 
  scale_color_manual(values = c("dodger blue", "hot pink")) +
  scale_fill_manual(values = c("dodger blue", "hot pink")) +
  labs(x = "Child's Age, years",
       y = "Proportion Obese")
```



## Analysis Goal

> Does risk of obesity increase with age and are patterns of change similar for both sexes?

There are 5 age cohorts that were measured each for 3 years, baseage and currage are age midpoints of those cohort groups.  Which to include, current age or occasion?  **Assume no cohort effects.**  *If you do think this is an issue, include baseline age (`age_base`) and current age minus baseline age (`time`) in model.*


```{r}
data_long %>% 
  group_by(gender, age_base, occation) %>% 
  summarise(n       = n(),
            count   = sum(obesity == "Yes"),
            prop    = mean(obesity == "Yes"),
            se      = sd(obesity == "Yes")/sqrt(n)) %>%
  mutate(time = (occation %>% as.numeric) * 2 - 2) %>% 
  ggplot(aes(x = time,
             y = prop,
             fill = gender))  +
  geom_ribbon(aes(ymin = prop - se,
                  ymax = prop + se),
              alpha = 0.2) +
  geom_point(aes(color = gender)) +
  geom_line(aes(color = gender)) +
  theme_bw() +
  facet_wrap(~ age_base, labeller = label_both) +
  labs(title    = "Observed Obesity Rates, by Gender within Cohort",
       subtitle = "Subset of 350 children with complete data",
       x        = "Time, years from 1977",
       y        = "Proportion of Children Characterized as Obese") +
  scale_fill_manual(values = c("dodgerblue3", "red")) +
  scale_color_manual(values = c("dodgerblue3", "red")) +
  scale_x_continuous(breaks = seq(from = 0, to = 4, by = 2)) +
  theme(legend.position = c(1, 0),
        legend.justification = c(1, 0),
        legend.background = element_rect(color = "black"))
```


```{r}
data_long %>% 
  group_by(gender, age_curr) %>% 
  summarise(n       = n(),
            count   = sum(obesity == "Yes"),
            prop    = mean(obesity == "Yes"),
            se      = sd(obesity == "Yes")/sqrt(n)) %>%
  ggplot(aes(x = age_curr %>% as.character %>% as.numeric,
             y = prop,
             group = gender,
             fill = gender))  +
  geom_ribbon(aes(ymin = prop - se,
                  ymax = prop + se),
              alpha = 0.2) +
  geom_point(aes(color = gender)) +
  geom_line(aes(color = gender)) +
  theme_bw() +
  geom_vline(xintercept = 12, 
             linetype   = "dashed", 
             size       = 1, 
             color      = "navyblue") +
  labs(title    = "Observed Obesity Rates, by Gender (collapsing cohorts)",
       subtitle = "Subset of 350 children with complete data",
       x        = "Age of Child, years",
       y        = "Proportion of Children Characterized as Obese") +
  scale_fill_manual(values = c("dodgerblue3", "red")) +
  scale_color_manual(values = c("dodgerblue3", "red")) +
  scale_x_continuous(breaks = seq(from = 6, to = 18, by = 2)) +
  theme(legend.position = c(0, 1),
        legend.justification = c(-0.05, 1.05),
        legend.background = element_rect(color = "black"))
```


### Center time at twelve years old

```{r}
data_long <- data_long %>% 
  dplyr::mutate(age_center = age_curr %>% as.character %>% as.numeric -12) %>% 
  dplyr::mutate(obesity_num = obesity %>% as.numeric - 1)

psych::headTail(data_long)
```



## GLM Analysis

### Standard logistic regression

```{r}
fit_glm_1 <- glm(obesity_num ~ gender*age_center + gender*I(age_center^2), 
                 data   = data_long,
                 family = binomial(link = "logit"))

fit_glm_2 <- glm(obesity_num ~ gender + age_center + I(age_center^2), 
                 data   = data_long,
                 family = binomial(link = "logit"))
```


```{r, results='asis'}
texreg::knitreg(list(extract_glm_exp(fit_glm_1), 
                     extract_glm_exp(fit_glm_2)),
                  custom.model.names = c("Interaction",
                                         "Main Effects"),
                  caption = "GLM: Parameter EStimates",
                  single.row = TRUE,
                  ci.test = 1)
```

```{r}
plot_pred_glm <- Effect(c("gender", "age_center"), 
       fit_glm_2,
       xlevels = list(age_center = seq(from = -6, to = 6, by = 0.25))) %>% 
   data.frame %>%
   mutate(age = age_center + 12) %>% 
  dplyr::mutate(gender = forcats::fct_rev(gender)) %>% 
   ggplot(aes(x        = age,
              y        = fit,
              group    = gender,
              linetype = gender,
              fill     = gender)) +
  geom_ribbon(aes(ymin = fit - se,
                  ymax = fit + se),
              alpha = .3) +
   geom_line(aes(color = gender),
             size = 1.5) +
   theme_bw() +
   labs(title    = "Generalized Linear Model: Model #2",
        x        = "Child's Age, years",
        y        = "Predicted Probability of Obesity",
        linetype = "Gender",
        fill     = "Gender",
        color    = "Gender") +
   theme(legend.position = c(0, 1),
         legend.justification = c(-0.05, 1.05),
         legend.background = element_rect(color = "black"),
         legend.key.width = unit(2, "cm")) +
  scale_x_continuous(breaks = seq(from = 6, to = 18, by = 3)) 

plot_pred_glm 
```



## GEE Analysis

> ALWAYS: fix the scale parameter to 1 with binomial data!!!

```{r}
fit_gee_1in <- gee::gee(obesity_num ~ gender*age_center + gender*I(age_center^2), 
                        id          = id, 
                        data        = data_long,
                        family      = binomial(link = "logit"),
                        corstr      = 'independence', 
                        scale.fix   = TRUE,
                        scale.value = 1)

fit_gee_1ex <- gee::gee(obesity_num ~ gender*age_center + gender*I(age_center^2), 
                        id          = id, 
                        data        = data_long,
                        family      = binomial(link = "logit"),
                        corstr      = 'exchangeable', 
                        scale.fix   = TRUE,
                        scale.value = 1)

fit_gee_1un <- gee::gee(obesity_num ~ gender*age_center + gender*I(age_center^2), 
                        id          = id, 
                        data        = data_long,
                        family      = binomial(link = "logit"),
                        corstr      = 'unstructured', 
                        scale.fix   = TRUE,
                        scale.value = 1)
```

```{r, results='asis'}
texreg::knitreg(list(extract_gee_exp(fit_gee_1in), 
                     extract_gee_exp(fit_gee_1ex),
                     extract_gee_exp(fit_gee_1un)),
                custom.model.names = c("Independent",
                                       "Exchangable",
                                       "Unstructured"),
                caption = "Gee Model Parameters: With Interactions",
                single.row = TRUE,
                ci.test = 1)
```

### Drop the interaction with `gender`.

```{r}
fit_gee_2in <- gee::gee(obesity_num ~ gender + age_center + I(age_center^2), 
                        id          = id, 
                        data        = data_long,
                        family      = binomial(link = "logit"),
                        corstr      = 'independence', 
                        scale.fix   = TRUE,
                        scale.value = 1)

fit_gee_2ex <- gee::gee(obesity_num ~ gender + age_center + I(age_center^2), 
                        id          = id, 
                        data        = data_long,
                        family      = binomial(link = "logit"),
                        corstr      = 'exchangeable', 
                        scale.fix   = TRUE,
                        scale.value = 1)

fit_gee_2un <- gee::gee(obesity_num ~ gender + age_center + I(age_center^2),  
                        id          = id, 
                        data        = data_long,
                        family      = binomial(link = "logit"),
                        corstr      = 'unstructured', 
                        scale.fix   = TRUE,
                        scale.value = 1)
```


```{r, results='asis'}
texreg::knitreg(list(extract_gee_exp(fit_gee_2in), 
                       extract_gee_exp(fit_gee_2ex),
                       extract_gee_exp(fit_gee_2un)),
                  custom.model.names = c("Independent",
                                         "Exchangable",
                                         "Unstructured"),
                  caption = "Gee Model Parameters: Main Effects Only",
                  single.row = TRUE,
                  ci.test = 1)
```

### Select the **"final"** model.


```{r}
fit_geeglm_2un <- geepack::geeglm(obesity_num ~ gender + age_center + I(age_center^2),  
                                  id          = id, 
                                  data        = data_long,
                                  family      = binomial(link = "logit"),
                                  corstr      = 'unstructured')

```


```{r}
interactions::interact_plot(model = fit_geeglm_2un,
                            pred = age_center,
                            modx = gender)
```



```{r}
plot_pred_gee <- fit_geeglm_2un %>% 
  emmeans::emmeans(~ gender*age_center,
                   at = list(age_center = seq(from = -6, to = 6, by = .1)),
                   type = "response",
                   level = .685) %>% 
  data.frame() %>% 
  mutate(gender = fct_rev(gender)) %>% 
  mutate(age = age_center + 12) %>% 
  ggplot(aes(x        = age,
             y        = prob,
             group    = gender)) +
  geom_ribbon(aes(ymin = asymp.LCL,
                  ymax = asymp.UCL,
                  fill = gender),
              alpha = .2) +
  geom_line(aes(linetype = gender,
                color    = gender),
            size = 1.5) +
  theme_bw() +
  labs(title = "Generalized Estimating Equation: Model #2, unstructured",
       x     = "Child's Age, years",
       y     = "Predicted Proportion with Obesity",
       linetype = "Gender",
       fill = "Gender",
       color = "Gender") +
  theme(legend.position = c(0, 1),
        legend.justification = c(-0.05, 1.05),
        legend.background = element_rect(color = "black"),
        legend.key.width = unit(2, "cm")) +
  scale_x_continuous(breaks = seq(from = 6, to = 18, by = 3)) 

  
plot_pred_gee
```



## GLMM Analysis

> IT IS GENERALLY NOT RECOMMENDED THAT RANDOM-SLOPES BE ESTIMATED FOR BINOMIAL GLMMS

```{r}
fit_glmm_1 <- lme4::glmer(obesity_num ~ age_center*gender + I(age_center^2)*gender + (1|id), 
                          data = data_long,
                          family      = binomial(link = "logit")) 

fit_glmm_2 <- lme4::glmer(obesity_num ~ gender + age_center + I(age_center^2) + (1|id), 
                          data = data_long,
                          family      = binomial(link = "logit")) 

```

Indicates smaller model is better, no interaction at higher level necessary

```{r}
anova(fit_glmm_1, fit_glmm_2)
```


```{r, results='asis'}
texreg::knitreg(list(extract_glmer_exp(fit_glmm_1), 
                     extract_glmer_exp(fit_glmm_2)),
                custom.model.names = c("Interaction",
                                       "Main Effects"),
                caption = "GLMM: Parameter EStimates",
                single.row = TRUE,
                ci.test = 1)
```

```{r}
interactions::interact_plot(model = fit_glmm_2,
                            pred = age_center,
                            modx = gender,
                            int.width = .685,
                            interval = TRUE)
```



```{r}
Effect(c("gender", "age_center"),fit_glmm_2) %>% 
  data.frame %>% 
  mutate(fit_exp = exp(fit))
```




```{r}
plot_pred_glmm <- fit_glmm_2 %>% 
  emmeans::emmeans(~ gender*age_center,
                   at = list(age_center = seq(from = -6, to = 6, by = .1)),
                   type = "response",
                   level = .685) %>% 
  data.frame() %>% 
  mutate(gender = fct_rev(gender)) %>% 
  mutate(age = age_center + 12) %>% 
  ggplot(aes(x        = age,
             y        = prob,
             group    = gender)) +
  geom_ribbon(aes(ymin = asymp.LCL,
                  ymax = asymp.UCL,
                  fill = gender),
              alpha = .2) +
  geom_line(aes(linetype = gender,
                color    = gender),
            size = 1.5) +
  theme_bw() +
  labs(title = "Generalized Linear Mixed Effects: Model #2, Random Interepts",
       x     = "Child's Age, years",
       y     = "Predicted Probability of Obesity",
       linetype = "Gender",
       fill = "Gender",
       color = "Gender") +
  theme(legend.position = c(0, 1),
        legend.justification = c(-0.05, 1.05),
        legend.background = element_rect(color = "black"),
        legend.key.width = unit(2, "cm")) +
  scale_x_continuous(breaks = seq(from = 6, to = 18, by = 3)) 

plot_pred_glmm 
```

## Compare Methods

```{r, results="asis"}
texreg::knitreg(list(extract_glm_exp(fit_glm_2),
                       extract_gee_exp(fit_gee_2un),
                       extract_glmer_exp(fit_glmm_2)),
                  custom.model.names = c("GLM",
                                         "GEE",
                                         "GLMM"),
                  caption = "Compare Methods: Parameter EStimates",
                  single.row = TRUE,
                  ci.test = 1)
```


```{block type='rmdlink', echo=TRUE}
The goal of patchwork is to make it ridiculously simple to combine separate `ggplots` into the same graphic. As such it tries to solve the same problem as `gridExtra::grid.arrange()` and `cowplot::plot_grid()` but using an API that incites exploration and iteration, and scales to arbitrarily complex layouts.

Website: https://patchwork.data-imaginist.com/index.html
```

```{r, fig.height=10, fig.width=7.5}
plot_pred_glm / plot_pred_gee / plot_pred_glmm
```

```{r, fig.height=10, fig.width=7.5}
(plot_pred_glm + theme(legend.position = "none")) / 
  (plot_pred_gee + theme(legend.position = "none")) / 
  plot_pred_glmm +
  plot_annotation(tag_levels = "A") +
  plot_layout(guides = "auto")  
```



```{r}
data_long %>% 
  dplyr::mutate(pred = predict(fit_glmm_2, re.form = ~ (1|id))) %>% 
  dplyr::mutate(odds = exp(pred)) %>% 
  dplyr::mutate(prob = odds/(1 + odds)) %>% 
  ggplot(aes(x = age_curr,
             y = prob,
             group = id)) +
  geom_line(aes(color = gender)) +
  theme_bw()

```