---
title: "Report"
subtitle: "Inhale, Exhale, Analyze: BMI's Imprint on Impulse Oscillometry Outcomes"
date: "today"
author: "Joshua J. Cook, M.S., ACRP-PM, CCRC, Syed Ahzaz H. Shah, B.S., Jacob Hernandez, B.S., Sara Basili, M.S."
bibliography: references.bib
csl: asa.csl
format:
html:
code-fold: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```
# **1 Introduction**
Linear mixed-effects models (LMMs) are advanced statistical tools designed to analyze data that exhibit complex structures, such as hierarchical organization, repeated measures, and random effects. These models are particularly useful when data violate the assumptions of traditional ANOVA or regression methods, such as the independence of observations, homoscedasticity, and normality of residuals. LMMs accommodate intra-subject differences, allowing for both fixed effects, which are consistent across individuals, and random effects, which vary among subjects or groups.
The implementation of LMMs has been facilitated by various software packages and programming languages. Brown [@brown_introduction_2021] provides a comprehensive guide to implementing LMMs in R, a widely used statistical programming language, offering a step-by-step walkthrough of model syntax without delving deeply into complex mathematical foundations. Additionally, the lme4 package, as detailed by Bates et al. [@bates_fitting_2015], represents a significant evolution in computational methods for fitting mixed models, offering efficient tools and simplified modeling processes for R users, especially for models with crossed random effects. Pymer4, developed by Jolly [@jolly_pymer4_2018], bridges R and Python, offering Python users a flexible and integrated tool for linear mixed modeling by leveraging the capabilities of R's lme4 package. This tool enhances the analytical capabilities within the Python ecosystem, making advanced statistical methods more accessible to a broader audience.
LMMs find applications across various scientific domains, each with its unique data structures and analytical challenges. The paper by Lee and Shang [@lee_estimation_nodate] explores the impact of missing data on estimation and selection in LMMs, highlighting the challenges and proposing a method to record missingness using an indicator-based matrix. This approach is critical for ensuring model accuracy in the presence of missing data, a common issue in real-world datasets. Wang et al. [@wang_statistical_2022] illustrate the application of LMMs in cardiothoracic surgery outcomes research, using a case study of homograft pulmonary valve replacement data to demonstrate the model's ability to handle repeated measurements and provide a more nuanced understanding of clinical outcomes. Aarts et al. [@aarts_2015] examine multilevel experimental designs in neuroscience and show how applying ordinary linear models to multilevel data can inflate false-positive rates. Magezi [@magezi_linear_2015] highlights the use of LMMs in within-participant psychology experiments, addressing the complexities of repeated measures and nested data structures common in psychological research. Harrison et al. [@harrison_brief_2018] and Bolker et al. [@bolker_generalized_2009] discuss the application of LMMs and generalized linear mixed models (GLMMs) in ecology, emphasizing their utility in analyzing ecological data with complex relationships and hierarchical structures. In another ecological contribution, Grueber et al. [@grueber_2011] focus on model averaging and information-theoretic approaches with LMMs as an alternative to traditional null-hypothesis testing. In the medical field, LMMs are employed to model pandemic-induced mortality changes, as demonstrated by Verbeeck et al. [@verbeeck_linear_2023], and to analyze longitudinal health-related quality of life data in cancer clinical trials, as discussed by Touraine et al. [@touraine_when_2023].
The paper "To transform or not to transform: using generalized linear mixed models to analyse reaction time data" by Lo and Andrews [@lo_transform_2015] challenges the common practice of transforming reaction time data in cognitive psychology, advocating for GLMMs as a more robust alternative. The "LEVEL" guidelines proposed by Monsalves et al. [@monsalves_level_2020] aim to standardize the reporting of multilevel data and analyses, enhancing comparability across studies. Piepho's study [@piepho_analysing_1999] on analyzing disease incidence data with GLMMs underscores the inadequacy of traditional methods like ANOVA for such data, highlighting GLMMs' flexibility. The simulation study by Pusponegoro et al. [@pusponegoro_linear_2017] on children's growth differences emphasizes the importance of choosing the appropriate covariance structure in LMMs for longitudinal data. Lastly, the framework introduced by Steibel et al. [@steibel_powerful_2009] for analyzing RT-PCR data with LMMs showcases the method's statistical power and flexibility, offering a significant advancement over traditional analysis methods. LMMs are used in a wide array of disciplines, but also in varying study designs, as shown in Table 1.
Table 1. Systematic Review of LMM Use-cases [@casals_methodological_2014]
![](images/LMM_uses.png)
The strengths of LMMs lie in their flexibility to model complex data structures and their ability to handle missing data, making them a powerful tool for a wide range of scientific inquiries. However, their application is not without challenges. Peng and Lu [@peng_model_2012] address the difficulty of variable selection and parameter estimation in LMMs, proposing an iterative procedure to improve model accuracy. Barr [@barr_random_2013] critiques existing guidelines for testing interactions within LMMs, proposing new guidelines to ensure more reliable results. The work by Tu [@tu_using_2015] on GLMMs for network meta-analyses showcases how mixed models have evolved to tackle complex data, enhancing the accuracy of combining different studies. On the other hand, Fokkema et al. [@fokkema_generalized_2021] introduce GLMM trees, merging machine learning with mixed models to improve predictions and analysis, particularly useful in mental health research. Despite their robustness, as noted by Schielzeth et al. [@schielzeth_robustness_2020], LMMs require careful evaluation of model assumptions and may present computational challenges, especially with high-dimensional datasets.
The literature reviewed here collectively emphasizes the versatility, robustness, and broad applicability of LMMs and GLMMs across various fields of research. Despite their advantages, the importance of careful model selection, acknowledgment of limitations, and the potential need for more complex models such as joint models in certain scenarios are also highlighted. As the use of LMMs continues to grow, the development of standardized processes, such as the LEVEL framework [@monsalves_level_2020] and the 10-step protocol put forth by [@zuur_2016], along with user-friendly tools, will be crucial in ensuring the accurate and effective application of these models in research.
# **2 Methods**
As mentioned by [@galecki_linear_2014], a LMM is:
> a parametric linear model for clustered, longitudinal, or repeated-measures data that quantifies the relationships between a continuous dependent variable and various predictor variables. An LMM may include both **fixed-effect** parameters associated with one or more continuous or categorical covariates and **random effects** associated with one or more random factors.
Fixed-effect parameters describe the relationships of the covariates to the dependent variable for the entire population. These covariates are typically distinct, clearly defined values used for classification, such as gender or co-morbidities, and are commonly utilized in analyses like ANOVA. Random effects are specific to clusters or subjects within a population. It is typically not possible to include every distinct level of a random factor, but the researcher should attempt to account for as many random effects as possible to improve the reliability of the LMM.
The selected dataset for this report specifically represents **longitudinal data**, which is data where the dependent variable is measured at several points in time for each unit of analysis. **Participant dropout** is often a concern in the analysis of longitudinal data, with early time points often having a higher compliance rate than later time points. Along with clustered and repeated-measures data, longitudinal data is **hierarchical** because the observations can be placed into hierarchies or levels.
## **2.1 Mathematical Foundations**
LMMs have a mathematical foundation stemming from **linear algebra.** We will be using notation for a 2-level longitudinal model since that is the structure of the dataset in this report. The index *i* is used to denote participants and *t* is used to denote the different time points of the observations. Given this notation *t* is the first level and *i* is the second level.
Simple LMMs can be defined as in Equation 1.
$$
y=X\beta + Zu+ \epsilon
$$ {#eq-1}
where:
- y is the response vector.
- X is the design matrix for fixed effects.
- β is the vector of fixed effects (parameters associated with the entire population or certain repeatable levels of experimental factors).
- Z is the design matrix for random effects.
- *u* is the vector of random effects (representing random deviations from the population parameters β for different subjects or experimental units, i.e., the variability not explained by the fixed effects).
- ϵ is the vector of residual errors.
Matrix and Vector Dimensions (Random Intercepts)
- y is an N x 1 vector, where N is the total number of repeated measures (observations) across all subjects
- X is an N x p matrix, where p is the number of fixed-effect parameters (including the intercept)
- β is a p x 1 column vector
- Z is an N x J matrix, where J is the number of subjects
- *u* is a J x 1 vector
- ϵ is an N x 1 vector
For a model with an intercept, the first column of the X matrix is all 1s and the first element of the β vector is the overall (fixed) intercept. The Z matrix in a random intercepts model is a block diagonal matrix, with the blocks defined by the Z~i~ matrices.
Adding random effects to the model also changes the dimensions of Z. If a random slope is added alongside the random intercept, the dimensions become N x 2J, which doubles the columns of the Z matrix, and *u* correspondingly doubles in length to 2J x 1.
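For intuition, the block-diagonal structure of Z in a random-intercepts model can be written out explicitly (an illustrative sketch consistent with the notation above); each block Z~i~ is simply a column of 1s of length n~i~, the number of observations for subject *i*:
$$
Z = \begin{bmatrix} Z_1 & 0 & \cdots & 0 \\ 0 & Z_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & Z_J \end{bmatrix}, \qquad Z_i = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}_{n_i \times 1}
$$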
### 2.1.1 Example
Now let's go over an example with a 2-level longitudinal structure where we have 100 students with 10 test scores per student and the associated study time for those tests. In this case, the dependent variable is the variable concerning test scores, the fixed effect is the study time and the random effect is the student. For the sake of simplicity, we will only consider a random intercepts model.
Variable Breakdown:
- N=1000: the number of observations which is the number of students multiplied by the number of test scores
- J = 100: the number of students
- p = 2: the fixed intercept and the fixed effect of study time
Matrix Notations and Dimension laid out:
$Y_{1000\times1} = X_{1000\times 2} \; \beta_{2\times1} + Z_{1000\times100}\;u_{100\times1} + \epsilon_{1000\times1}$
Example Matrices:
$y = \begin{bmatrix} Score\\ 75 \\ 80\\ ... \\ 90 \end{bmatrix} X = \begin{bmatrix} Intercept & Study Time \\1 & 2 \\1 & 3\\... & ... \\1 & 5\end{bmatrix}$
$\beta = \begin{bmatrix} 1.2\\2.3\end{bmatrix}$
The matrix multiplication can also be broken down into individual equations. In the case of our example we get the following equations:
Level 1 (Time):
$Y_{ti} = \beta_{0i} + \beta_{1} \cdot \text{StudyTime}_{ti} + e_{ti}$
Level 2 (Student):
$\beta_{0i} = \gamma_{00} + u_{0i}$
Since this is a random intercepts model, only the intercept equation is needed at level 2. γ~00~ is the grand mean intercept and u~0i~ is the deviation of the i~th~ student from it [@galecki_linear_2014].
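As a brief illustration (separate from the report's dataset), the sketch below simulates data matching this example and confirms the design-matrix dimensions with `lme4`; the names `dat_scores`, `student`, `study_time`, and `score` are hypothetical.
```{r}
#| eval: false
library(lme4)

# Simulate the 2-level example: 100 students x 10 tests = 1,000 observations
set.seed(1)
dat_scores <- data.frame(
  student    = factor(rep(1:100, each = 10)),
  study_time = round(runif(1000, min = 1, max = 10), 1)
)
dat_scores$score <- 70 + 2 * dat_scores$study_time +
  rep(rnorm(100, sd = 5), each = 10) +  # student-level (random-intercept) deviations
  rnorm(1000, sd = 3)                   # residual errors

# Random intercepts model: fixed effect of study time, random intercept per student
fit_scores <- lmer(score ~ study_time + (1 | student), data = dat_scores)

dim(getME(fit_scores, "X"))  # 1000 x 2   (fixed intercept + study time)
dim(getME(fit_scores, "Z"))  # 1000 x 100 (one indicator column per student)
```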
### 2.1.2 Parameter Estimation
LMMs are typically fit using maximum likelihood (ML) estimation or a variant called restricted maximum likelihood (REML) estimation. Both methods obtain the parameters β and θ by optimizing the likelihood function, where β contains the fixed-effect parameters and θ contains the covariance parameters; the dimension of θ depends on the number of random effects and the structure of the covariance matrix. Our models use the REML method because it gives less biased estimates of the covariance parameters and is better suited to modeling random effects [@galecki_linear_2014].
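Continuing the simulated example above, a minimal sketch of requesting each estimation method in `lmer()`; the point is the contrast in the estimated variance components rather than any specific numbers.
```{r}
#| eval: false
library(lme4)

# Same model as in the example, fit with REML (the default) and with ML
fit_reml <- lmer(score ~ study_time + (1 | student), data = dat_scores, REML = TRUE)
fit_ml   <- lmer(score ~ study_time + (1 | student), data = dat_scores, REML = FALSE)

# ML estimates of the variance components tend to be biased downward
VarCorr(fit_reml)
VarCorr(fit_ml)
```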
## **2.2 Assumptions**
Although more flexible than other methods such as ANOVA, there are **several assumptions** for LMMs:
1. The relationship between the predictors and response variable is assumed to be linear, within each level of random effects.
2. Random effects (*u*) are assumed to follow a normal distribution with mean zero and variance-covariance matrix G.
$u \sim N(0,G)$
3. Residual errors (ϵ ) are assumed to follow a normal distribution with mean zero and variance-covariance matrix R.
$\epsilon \sim N(0,R)$
4. Random effects (*u*) and residual errors (ϵ ) are assumed to be independent.
5. Homoscedasticity is assumed for the residuals across all levels of the independent variables.
There are several techniques that can be utilized to overcome violations in the LMM assumptions, including variable transformation (to achieve linearity or normality), using robust variance estimates, modifying the structure of random and fixed effects, and employing non-parametric methods or generalized linear mixed models (GLMMs) [@galecki_linear_2014].
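A short sketch of how these assumptions can be screened graphically for a fitted model, reusing the illustrative `fit_scores` from the simulated example in Section 2.1.1 (formal diagnostics for the report's own models appear in Section 3):
```{r}
#| eval: false
library(lme4)

# Residuals vs. fitted values: linearity (1) and homoscedasticity (5)
plot(fitted(fit_scores), resid(fit_scores), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality of residuals (3)
qqnorm(resid(fit_scores)); qqline(resid(fit_scores))

# Normality of the random intercepts (2)
u_hat <- ranef(fit_scores)$student[, "(Intercept)"]
qqnorm(u_hat); qqline(u_hat)
```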
## **2.3 Sample Data Structure**
LMMs require a tidy dataset where each variable is a column and each observation is a row. Smaller datasets are usually saved as CSV files, and data are often loaded from a database. The dataset can contain missing values, but they still need to be handled, whether omitted or imputed, depending on how many there are and which columns are affected. The lmer() function in the lme4 library will automatically drop rows with missing values, so it is important that the data are inspected and visualized before constructing any models. Below is an example of tidy data.
![Figure 1. Tidy data, as defined by Wickham et al.](images/tidy-4.png)
LMMs also require that the structure of both random and fixed effects be defined before the model is created. The variables that vary randomly across groups and those that are fixed must be identified. There are different hierarchies in LMMs, and the first distinction is between clustered and longitudinal data. Clustered data, as the name suggests, group the subjects (the units of analysis) into different clusters. For a two-level dataset, students can be the unit of analysis and classrooms the next level up; for a three-level dataset, schools can be added as the third level. Regardless of the number of levels, the first level is always the unit of analysis; in the example above, it is the students [@galecki_linear_2014].
There is also longitudinal data, where repeated measures are at the first level and the unit of analysis is at the second level. In a dataset of patient cholesterol measurements over time, the measurements at the different timepoints form the first level and the patients the second level [@galecki_linear_2014].
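For concreteness, a minimal sketch of a tidy, long-format longitudinal dataset (the column names and values here are invented for illustration): each row is one timepoint for one subject, so repeated measures sit at level 1 and subjects at level 2.
```{r}
#| eval: false
library(tibble)

# Long format: one row per subject x timepoint
tribble(
  ~Subject_ID, ~Observation_number, ~Cholesterol,
  "P01",       1,                   182,
  "P01",       2,                   176,
  "P02",       1,                   201,
  "P02",       2,                   197
)
```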
## **2.4 Implementation in R**
The implementation begins with importing the dataset into R from a file containing longitudinal retrospective data on the impact of BMI on IOS estimates of airway resistance and reactance in children with sickle cell disease (C-SCD) and African-American children with asthma (C-Asthma). This dataset spans from 2015 to 2020. Data import is executed using the appropriate function, with consideration for specifying file paths and handling header information. Following data importation, preprocessing steps, such as handling missing values and ensuring data integrity, are performed [@galecki_linear_2014].
### 2.4.1 Analysis Using lme() Function
After preprocessing the data, we proceed with fitting linear mixed-effects models (LMMs) using the lme() function from the `nlme` package.
Model formulation involves specifying a model formula that includes both fixed effects (e.g., BMI, diagnosis of asthma, relevant covariates) and random effects (e.g., random intercepts for subjects). The random argument specifies the random-effects structure, while the data argument indicates the dataset to be used. The estimation method (method = "REML") is specified to use restricted maximum likelihood estimation. `nlme` is advantageous because it offers a user interface for fitting models with structure in the residuals (including forms of heteroscedasticity and autocorrelation) and in the random-effects covariance matrices.
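A minimal sketch of the `lme()` call structure described above, using placeholder names (`y`, `x`, `time`, `subject`, `group`, `dat_long`) rather than the report's variables; the `correlation` and `weights` arguments illustrate the residual structures mentioned:
```{r}
#| eval: false
library(nlme)

fit_nlme <- lme(
  fixed       = y ~ x + time,                      # fixed effects
  random      = ~ 1 | subject,                     # random intercept for each subject
  correlation = corAR1(form = ~ time | subject),   # AR(1) autocorrelation within subject
  weights     = varIdent(form = ~ 1 | group),      # separate residual variance per group
  data        = dat_long,
  method      = "REML"
)
summary(fit_nlme)
```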
### 2.4.2 Hypothesis Testing
Hypotheses are tested to guide model selection and refinement. For instance, Hypothesis 3.1 \[1\] assesses whether the variance of random effects is greater than zero, while Hypothesis 3.2 \[2\] investigates the presence of heterogeneous residual variances across treatment groups. These hypotheses are evaluated using likelihood ratio tests or F-tests, depending on the context.
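A sketch of how such hypotheses can be evaluated with likelihood ratio tests in `nlme`, again with placeholder names; the random-effect variance test compares REML fits with identical fixed effects, while nested fixed-effect structures are compared with ML fits:
```{r}
#| eval: false
library(nlme)

# Hypothesis 3.1-style test: is the random-intercept variance greater than zero?
fit_lme <- lme(y ~ x + time, random = ~ 1 | subject, data = dat_long, method = "REML")
fit_gls <- gls(y ~ x + time, data = dat_long, method = "REML")  # same fixed effects, no random effect
anova(fit_lme, fit_gls)  # likelihood ratio test (p-value is conservative at the boundary)

# Fixed-effect hypotheses: compare nested mean structures with ML fits
fit_full_ml    <- lme(y ~ x + time, random = ~ 1 | subject, data = dat_long, method = "ML")
fit_reduced_ml <- lme(y ~ time,     random = ~ 1 | subject, data = dat_long, method = "ML")
anova(fit_reduced_ml, fit_full_ml)
```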
### 2.4.3 Model Refinement
Based on the outcomes of hypothesis testing and model diagnostics, the model may be refined by removing non-significant fixed effects or selecting an appropriate covariance structure for the residuals. This iterative process entails fitting alternative models and comparing their fit statistics or testing additional hypotheses.
### 2.4.4 Analysis Using lmer() Function
An alternative approach involves utilizing the lmer() function from the `lme4` package to fit LMMs. This function follows a similar syntax to lme() but differs in how it handles random effects specification. `lme4` offers several benefits compared to `nlme`, including: more efficient linear algebra tools (with associated performance enhancements), simpler syntax and more efficient implementation for fitting models with crossed random effects, implementation of profile likelihood confidence intervals on random-effects parameters, and the ability to fit GLMMs [@bates_fitting_2015]. Likelihood ratio tests and model diagnostics are employed to assess model fit and inform model selection [@bates_fitting_2015].
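A parallel sketch of the same placeholder model in `lmer()` syntax; the random-effects term moves into the model formula itself, and profile-likelihood confidence intervals on the variance components are available through `confint()`:
```{r}
#| eval: false
library(lme4)

# Same structure as the lme() sketch, with the random effect written inside the formula
fit_lmer <- lmer(y ~ x + time + (1 | subject), data = dat_long, REML = TRUE)
summary(fit_lmer)

# Profile-likelihood confidence intervals, including the random-effect standard deviation
confint(fit_lmer, method = "profile")
```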
### 2.4.5 Final Model Selection
The final model is selected based on a synthesis of statistical criteria, including model fit indices, significance of fixed effects, and the adequacy of the model's assumptions. This selected model is then employed for interpretation and inference concerning the relationships between the predictor variables (e.g., BMI) and the response variable (e.g., IOS measures).
# **3 Analysis and Results**
## **3.1 Packages**
```{r}
packages <- c("tidyverse", "lme4", "nlme", "Matrix", "gt", "RefManageR", "DataExplorer", "gtsummary", "car", "reshape2")
# requireNamespace() checks one package at a time, so test each and install only what is missing
missing_packages <- packages[!sapply(packages, requireNamespace, quietly = TRUE)]
if (length(missing_packages) > 0) {
  install.packages(missing_packages)
}
library(tidyverse)
library(lme4)
library(nlme)
library(gt)
library(gtsummary)
library(RefManageR)
library(DataExplorer)
library(Matrix)
library(car)
library(reshape2)
#references <- ReadBib("references.bib")
#summary(references)
```
- `tidyverse`: used for data wrangling and visualization.
- `lme4` and `nlme`: used for fitting LMMs within R.
- `Matrix`: used for sparse and dense matrix classes and methods.
- `gt`: used for table generation.
- `gtsummary`: used for summary tables of descriptive statistics.
- `RefManageR`: used for BibTeX reference management.
- `DataExplorer`: used for EDA.
- `car`: used for QQ plots.
- `reshape2`: used to reshape data.
::: callout-warning
## Matrix / lme4 installation error
An error was encountered with the Matrix and lme4 packages during model creation. If this error is encountered, please reinstall `lme4` from source:

    remove.packages("Matrix")
    remove.packages("lme4")
    install.packages("lme4", type = "source")
    library(lme4)
## **3.2 Data Ingestion**
```{r}
# Load the dataset
BMI <- read.csv("data/BMI_IOS_SCD_Asthma.csv")
colnames(BMI) <- c("Group", "Subject_ID", "Observation_number", "Hydroxyurea", "Asthma", "ICS", "LABA", "Gender", "Age_months", "Height_cm", "Weight_Kg", "BMI", "R5Hz_PP", "R20Hz_PP", "X5Hz_PP", "Fres_PP")
BMI$Group <- as.factor(BMI$Group)
BMI$Subject_ID <- as.factor(BMI$Subject_ID)
BMI$Observation_number <- as.factor(BMI$Observation_number)
```
### **`BMI` from Kaggle** ([Impact of BMI on IOS measures on children (kaggle.com)](https://www.kaggle.com/datasets/utkarshx27/impact-of-bmi-on-ios-measures))
- **Description**: This dataset is from a retrospective study to assess the impact of BMI on impulse oscillometry (IOS) estimates of airway resistance and reactance in children with sickle cell disease (C-SCD).
- **Detailed Description**: The dataset comprises various attributes and measurements across its columns. Categorical variables, such as Group, Subject ID, Observation_number, Hydroxyurea, Asthma, ICS, LABA, and Gender, denote different groupings, individual subjects, and attributes like medication usage and gender. Numerical variables like Age (months), Height (cm), Weight (Kg), BMI, R5Hz_PP, R20Hz_PP, X5Hz_PP, and Fres_PP provide quantitative data on subjects’ characteristics and test results. Notably, the summary also identifies missing values, such as the 14 instances in the Fres_PP variable, which warrant consideration in subsequent analysis. These columns provide measurements and estimates related to airway resistance and reactance obtained using impulse oscillometry (IOS), which is a non-invasive method for assessing respiratory function. These parameters are valuable in understanding the impact of BMI on respiratory measures in children with sickle cell disease (C-SCD) and African-American children with asthma (C-Asthma) participating in the study.
- **Why suitable for LMMs**: The dataset has multiple observations, over time, for the same set of participants.
## **3.3 Exploratory Data Analysis (EDA)**
The structure of the dataframe and variable descriptions are shown in Table 2 and Figure 2. Figures 3-10 systematically explore the features of the data and are described below.
```{r}
x <- BMI
str(x)
head(x)
variables <- colnames(x)
variables_table <- data.frame(
Variable = variables,
Description = c(
"This column indicates the group to which the subject belongs. There are two groups in the study: children with sickle cell disease (C-SCD) and African-American children with asthma (C-Asthma).",
"Each subject in the study is assigned a unique identifier or ID, which is listed in this column. The ID is used to differentiate between individual participants.",
"This column represents the number assigned to each observation or measurement taken for a particular subject. Since this is a longitudinal study, multiple observations may be recorded for each subject over time.",
"This column indicates whether the subject with sickle cell disease (C-SCD) received hydroxyurea treatment. Hydroxyurea is a medication commonly used for the treatment of sickle cell disease.",
"This column indicates whether the subject has a diagnosis of asthma. It distinguishes between children with sickle cell disease (C-SCD) and African-American children with asthma (C-Asthma).",
"This column indicates whether the subject is using inhaled corticosteroids (ICS). ICS is a type of medication commonly used for the treatment of asthma and certain other respiratory conditions.",
"This column indicates whether the subject is using a long-acting beta-agonist (LABA). LABA is a type of medication often used in combination with inhaled corticosteroids for the treatment of asthma.",
"This column represents the gender of the subject, indicating whether they are male or female",
"This column specifies the age of the subject at the time of the observation or measurement. Age is typically measured in months.",
"This column represents the height of the subject, typically measured in a standard unit of length, such as centimeters or inches. Height is an important variable to consider in assessing the impact of BMI on respiratory measures.",
"This column indicates the weight of the subject at the time of the observation or measurement. Weight is typically measured in kilograms (Kg) and is an important variable for calculating the body mass index (BMI).",
"Body Mass Index (BMI) is a measure that assesses body weight relative to height. It is calculated by dividing the weight of an individual (in kilograms) by the square of their height (in meters). The BMI column provides the calculated BMI value for each subject based on their weight and height measurements. BMI is commonly used as an indicator of overall body fatness and is often used to classify individuals into different weight categories (e.g., underweight, normal weight, overweight, obese).",
"This column represents the estimate of airway resistance at 5 Hz using impulse oscillometry (IOS). Airway resistance is a measure of the impedance encountered by airflow during respiration. The R5Hz_PP value indicates the airway resistance at the frequency of 5 Hz and is obtained through the IOS testing.",
"This column represents the estimate of airway resistance at 20 Hz using impulse oscillometry (IOS). Similar to R5Hz_PP, R20Hz_PP provides the measure of airway resistance at the frequency of 20 Hz based on the IOS testing.",
"This column represents the estimate of airway reactance at 5 Hz using impulse oscillometry (IOS). Airway reactance is a measure of the elasticity and stiffness of the airway walls. The X5Hz_PP value indicates the airway reactance at the frequency of 5 Hz and is obtained through the IOS testing.",
"This column represents the estimate of resonant frequency using impulse oscillometry (IOS). Resonant frequency is a measure of the point at which the reactance of the airways transitions from positive to negative during respiration. The Fres_PP value indicates the resonant frequency and is obtained through the IOS testing.:"
)
)
variables_table %>%
gt %>%
tab_header(
title = "Table 2. Variable Description"
) %>%
tab_footnote(
footnote = "Each variable in the dataset, accompanied by a qualitative description from the study team."
)
plot_str(x)
introduce(x)
plot_intro(x, title="Figure 2. Structure of variables and missing observations.")
```
### **3.3.1 Missing Values**
```{r}
plot_missing(x, title="Figure 3. Breakdown of missing observations.")
```
Based on the missing values count in Figure 3, it appears that there are no missing values in most of the columns, except for Fres_PP, where there are 14 missing values (6.39%). In this case, omitting missing values for Fres_PP is reasonable, considering the small proportion of missing data compared to the total number of observations.
### 3.3.2 Cleaning Data
```{r}
dim(x)
x_clean <- na.omit(x) # drops NAs, further analysis is without NA values
x_clean$Gender <- tolower(x_clean$Gender)
dim(x_clean)
str(x_clean)
```
**Count Plots for Categorical Variables**
The bar plots in Figure 4 show the frequency distribution of each categorical variable. This can aid in data cleaning and in checking for sparseness or class imbalances. It appears that:
- Most cases are C-SCD compared to C-Asthma (class imbalance).
- The number of observations decreases at subsequent measurements.
- Most cases have Asthma (class imbalance).
- Most cases have LABA (class imbalance).
- Hydroxyurea, ICS, and Gender are relatively evenly distributed.
```{r}
plot_bar(x_clean, title = "Figure 4. Frequency plots of categorical variables.")
```
**Histograms**
Histograms in Figure 5 show the frequency and distribution of the numerical variables. This helps identify distribution types among the different variables. Most of the variables below exhibit an approximately normal distribution, with some (e.g., BMI) showing a slight right skew.
```{r}
plot_histogram(x_clean, title = "Figure 5. Histogram plots of numerical variables.")
```
**Q-Q Plots**
The QQ plots in Figure 6 serve as a visual aid to assess normality of the covariates. The closer the points are to the straight diagonal line, the more normally the data are distributed. Most of the variables appear approximately normal. BMI has a substantial number of points in the upper-right corner of the plot that deviate from the diagonal line, potentially indicating a non-normal distribution; Weight_Kg shows a similar skew.
```{r}
plot_qq(na.omit(x), title = "Figure 6. QQ plots to assess normality of numerical variables.")
```
**Principal Component Analysis (PCA)**
The PCA plots in Figure 7 show the numerical variables in our data set split into principal components. More than half (54.8%) of the variance can be explained with just 4 principal components. This can be useful if we want to simplify our model by only keeping the principal components that explain most of the variance.
```{r}
plot_prcomp(na.omit(x), title = "Figure 7. PCA to assess key principal components that explain the variance.")
```
**Box Plots**
Based on the boxplots in Figure 8, it’s evident that all variables except “Age (months)” and “Height (cm)” contain outliers. Now, let’s pinpoint these outliers and calculate summary statistics (Table 3).
Figure 8. Boxplots of numerical variables.
```{r}
numeric_vars <- x_clean %>%
select_if(is.numeric)
# Boxplot for each numeric variable
par(mfrow=c(2, 2))
for (col in colnames(numeric_vars)) {
boxplot(numeric_vars[[col]], main=col)
}
# Adding a general title for the entire set of boxplots
#mtext("Figure 8. Box plots of numerical variables.", side=3, line=1, outer=TRUE, cex=1.5)
```
```{r}
# Define a function to detect outliers in each column
detect_outliers <- function(column) {
Q1 <- quantile(column, 0.25)
Q3 <- quantile(column, 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- column[column < lower_bound | column > upper_bound]
return(outliers)
}
# Iterate over each column and print outliers; not removed
for (col in names(numeric_vars)) {
outliers <- detect_outliers(numeric_vars[[col]])
if (length(outliers) > 0) {
cat("Outliers in", col, ":\n")
print(outliers)
cat("\n")
}
}
x_clean %>%
select(-2) %>%
tbl_summary( #gtSummary Table
by=Group,
type = list(
c('Age_months', 'Height_cm', 'Weight_Kg', 'BMI', 'R5Hz_PP', 'R20Hz_PP', 'X5Hz_PP', 'Fres_PP') ~ 'continuous2'),
statistic = all_continuous2() ~ c(
"{mean} ± {sd}",
"{median} ({p25}, {p75})",
"{min}, {max}"
),
digits = all_continuous2() ~ 2,
missing="ifany",
) %>%
bold_labels %>%
italicize_levels() %>%
as_gt() %>%
tab_header(
title = "Table. 3 Summary Statistics"
) %>%
tab_footnote(
footnote = "Summary statistics for all variables."
)
```
**Participant Dropout**
Figure 9 and Table 4 show how many subjects had data at each subsequent timepoint, which suggests that this study experienced significant participant dropout over time. This dropout may or may not be attributable to the study itself and should be investigated further. A strength of LMMs is that they can handle unbalanced groups (i.e., differing numbers of observations per patient), so we will continue with modeling regardless.
```{r}
x_clean_timepoints <- x_clean %>%
group_by(Observation_number) %>%
summarise(Unique_Subjects = n_distinct(Subject_ID))
x_clean_timepoints$Unique_Subjects <- as.numeric(x_clean_timepoints$Unique_Subjects)
ggplot(x_clean_timepoints, aes(x = Observation_number, y = Unique_Subjects)) +
geom_point(size = 3, color = "blue") + # Add points for each observation
geom_line(aes(group = 1), color = "blue") + # Connect the points with a line
theme_minimal() +
labs(title = "Figure 9. Participant dropout over time.",
x = "Timepoint",
y = "Number of Unique Subjects")
x_clean_timepoints %>%
gt() %>%
tab_header(
title = "Table 4. Number of participants at each timepoint."
) %>%
tab_footnote(
footnote = "Counts of unique subjects reveal an increasing amount of missing data at subsequent observation visits."
)
```
### 3.3.3 Correlations
Figure 10 highlights correlations between variables that should be assessed before any modeling.
- Age (months) and Height (cm): there is a strong positive correlation (0.914), implying that as age increases, height tends to increase as well. This is expected, as children grow taller as they get older.
- Weight (Kg) and BMI: there is a strong positive correlation (0.927), suggesting that as weight increases, BMI (Body Mass Index) tends to increase as well. This is expected because BMI is calculated from weight and height measurements.
- Airway resistance and reactance: there is a strong positive correlation (0.754) between R5Hz_PP and Fres_PP.
```{r}
plot_correlation(na.omit(x), maxcat=5L, title = "Figure 10. Correlation matrix of all variables.")
correlation_matrix <- cor(numeric_vars)
print(correlation_matrix)
```
## **3.4 Linear Mixed Modeling**
In this dataset, the variables of interest are the measures of airway resistance and reactance. Additionally, controlled variables are present such as group, age, weight, height, and other co-morbidities. These are the fixed effects. On the other hand, random variability may exist among individual observations, which are nested within each subject. These represent the random effects, as shown in Table 5. In the [**initial model**]{.underline}, Subject_ID was treated as the sole random effect. In the [**final model**]{.underline}, both random effects were incorporated (Subject_ID and Observation_number).
```{r}
#| label: FixedOrRandom
variables_table2 <- variables_table %>%
select(1) %>%
mutate(Type = c(
"Fixed",
"Random",
"Random",
"Fixed",
"Fixed",
"Fixed",
"Fixed",
"Fixed",
"Fixed",
"Fixed",
"Fixed",
"Fixed",
"Fixed",
"Fixed",
"Fixed",
"Fixed"
)
)
variables_table2 %>%
gt %>%
tab_header(
title = "Table 5. Variable Categorization"
) %>%
tab_footnote(
footnote = "A break down of random and fixed effects based on the purpose of the study. Variable categorization is a crucial step in the LMM process."
)
```
```{r}
#| label: InitialModeling
#lme()
# Fit models using a tidy and clear approach
model_lme <- lme(
fixed = cbind(R5Hz_PP, R20Hz_PP, X5Hz_PP, Fres_PP) ~ BMI + Asthma + ICS + LABA + Gender + Age_months + Height_cm + Weight_Kg,
random = list(Subject_ID = pdIdent(~1)),
data = x_clean,
method = "REML"
)
#lmer()
model_lmer <- lmer(
formula = R5Hz_PP + R20Hz_PP + X5Hz_PP + Fres_PP ~ BMI + Asthma + ICS + LABA + Gender + Age_months + Height_cm + Weight_Kg + (1 | Subject_ID),
data = x_clean
)
```
### **3.4.1 Initial Model**
![Equation 2. The initial linear mixed model.](images/initial_model.png){fig-align="center" .lightbox}
```{r}
#| label: InitialAIC
# Compare models based on AIC
aic_lme <- AIC(model_lme)
aic_lmer <- AIC(model_lmer)
cat(sprintf("AIC for lme model: %f\n", aic_lme))
cat(sprintf("AIC for lmer model: %f\n", aic_lmer))
# Correctly assign final_model based on AIC comparison
if (aic_lme < aic_lmer) {
final_model <- model_lme
model_type <- "lme"
} else {
final_model <- model_lmer
model_type <- "lmer"
}
cat(sprintf("Final model selected: %s\n", model_type))
# Since final_model is now correctly assigned, we can call summary on it
summary(final_model)
```
**Akaike Information Criterion (AIC)**
The AIC for both models was calculated. The AIC is a measure of the relative quality of statistical models for a given set of data. Lower AIC values indicate a model that better fits the data without unnecessary complexity.
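For reference, the AIC is computed from the maximized likelihood $\hat{L}$ and the number of estimated parameters $k$:
$$
\text{AIC} = 2k - 2\ln(\hat{L})
$$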
Here, the AIC for lme was 1898.95 while lmer was 2517.37.
The model with the lower AIC (lme) was selected as the **final model**, despite the computational performance improvements offered by the lme4 package. [All additional models were fit with lme.]{.underline}
**Residuals**
Residual plots (Residuals vs. Fitted Values) were created for the lme model to assess the goodness of fit in Figure 11. A horizontal line at y=0 was added as a reference. These plots help in identifying non-linearity, unequal variances, and outliers.
Based on the **residual plot**, the residuals show a roughly random scatter around zero with a few possible outliers.
```{r}
#| label: InitialResiduals
# Residuals
residuals_final <- resid(final_model)
# Calculate fitted values and residuals from the final model
fitted_values <- fitted(final_model)
residual_values <- residuals(final_model)
# Create a data frame explicitly for plotting
plot_data <- data.frame(Fitted = fitted_values, Residuals = residual_values)
# Plotting using ggplot2 for a more flexible and powerful approach
# Residuals vs Fitted Values
ggplot(plot_data, aes(x = Fitted, y = Residuals)) +
geom_point() +
geom_hline(yintercept = 0, color = "red") +
labs(x = "Fitted Values", y = "Residuals", title = "Figure 11. Residuals vs. Fitted Values")
```
**Histogram of Residuals and QQ Plots**
A histogram and a Q-Q (Quantile-Quantile) plot of the residuals were used to check the normality assumption of the residuals (Figure 12). Finally, a QQ plot with a QQ line was produced for a graphical normality check (Figure 13).
Based on the **histogram**, the model [visually]{.underline} had an ideal bell-shaped curve that resembles the normal distribution. Based on the **QQ plot**, the model [graphically]{.underline} may have had some residuals that were not normally distributed toward the ends.
```{r}
#| label: InitialQQ
# Histogram of Residuals
ggplot(plot_data, aes(x = Residuals)) +
geom_histogram(binwidth = 1, fill = "blue", color = "black") +
labs(title = "Figure 12. Histogram of Residuals")
# Q-Q Plot
qqPlot(residuals_final, main = "Figure 13. Q-Q Plot of Residuals")
```
```{r}
#| label: InitialNormality
# Shapiro-Wilk Normality Test
shapiro_test_results <- shapiro.test(residuals_final)
print(shapiro_test_results)
```
The Shapiro-Wilk test was conducted on the residuals to formally test for normality.
$H_o$: the residuals are normally distributed.
$H_a$: the residuals are not normally distributed.
$\alpha$ = 0.05
In this case, P = 0.00001163. P \< 0.05, so the null hypothesis was rejected, suggesting that the **residuals were not normally distributed.** This model does not satisfy the assumptions of LMMs.
### **3.4.2 Imputed Model**
Outliers (as mentioned above) were present in most variables, and the residuals of the initial model were not normally distributed. To improve model performance, outliers were imputed by Winsorization, capping values at the 10th and 90th percentile thresholds. The model was then regenerated and assessed using the same metrics as above (Figures 14-17).
Figure 14. Box plots of numerical variables.
```{r}
#| label: AddressingAssumptionsAndRandomEffects
# Copy the original dataset
x_clean_imputed <- x_clean
# Define a function for Winsorization
winsorize <- function(x, lower_percentile = 0.10, upper_percentile = 0.90) {
lower_threshold <- quantile(x, lower_percentile)
upper_threshold <- quantile(x, upper_percentile)
x[x < lower_threshold] <- lower_threshold
x[x > upper_threshold] <- upper_threshold
return(x)
}
# Apply imputation across numeric variables in the copied dataset
numeric_vars <- names(x_clean_imputed %>% select_if(is.numeric))
for (col in numeric_vars) {
x_clean_imputed[[col]] <- winsorize(x_clean_imputed[[col]])
}
# Visualization with ggplot2
# Plot boxplots for each numeric variable after imputation
for (col in numeric_vars) {
p <- ggplot(data = x_clean_imputed, aes(x = "", y = !!sym(col))) +
geom_boxplot(fill = "skyblue", color = "blue") +
labs(title = paste("Boxplot of", col), x = "", y = col)
print(p)
}
# Adding a general title for the entire set of boxplots
#mtext("Figure 14. Box plots of numerical variables.", side=3, line=1, outer=TRUE, cex=1.5)
# Modeling with Imputed Data
# Refit the model using the lme function with the cleaned data
model_lme_imputed <- lme(fixed = cbind(R5Hz_PP, R20Hz_PP, X5Hz_PP, Fres_PP) ~ BMI + Asthma + ICS + LABA + Gender + Age_months + Height_cm + Weight_Kg,
random = list(Subject_ID = pdIdent(~1)),
data = x_clean_imputed,
method = "REML")
aic_lme_imputed <- AIC(model_lme_imputed)
cat(sprintf("AIC for lme model: %f\n", aic_lme_imputed))
# Extract residuals
residuals_imputed <- resid(model_lme_imputed)
# Residuals vs Fitted Values Plot
ggplot(data = data.frame(Fitted = fitted(model_lme_imputed), Residuals = residuals_imputed), aes(x = Fitted, y = Residuals)) +
geom_point() +
geom_hline(yintercept = 0, color = "red") +
labs(x = "Fitted Values", y = "Residuals", title = "Figure 15. Residuals vs. Fitted Values")
# Histogram of Residuals
ggplot(data = data.frame(Residuals = residuals_imputed), aes(x = Residuals)) +
geom_histogram(binwidth = 1, fill = "blue", color = "black") +
labs(title = "Figure 16. Histogram of Residuals")
qqPlot(residuals_imputed, main = "Figure 17. Q-Q Plot of Residuals")
# Q-Q Plot and Shapiro-Wilk Test
shapiro_test_results <- shapiro.test(residuals_imputed)
print(shapiro_test_results)
```
The AIC was calculated as 1790.91, but cannot be used as a direct comparison to the original model due to imputation.
The Shapiro-Wilk test was conducted on the residuals to formally test for normality.
$H_o$: the residuals are normally distributed.
$H_a$: the residuals are not normally distributed.
$\alpha$ = 0.05
In this case, P = 0.05066. P \>0.05, so we failed to reject the null hypothesis, suggesting that the **residuals were normally distributed** after threshold imputation. This model now satisfies the assumptions of LMMs.
### **3.4.3 Final Model**
This was a longitudinal study involving multiple observations for each subject over time, and subjects are grouped into two categories (children with sickle cell disease and African-American children with asthma). Thus, in this final model, we modeled **`Group`** as a fixed effect since we were interested in the effect of the group itself on the outcome. **`Subject_ID`** should be a random effect to account for the repeated measures within subjects, and **`Observation_number`** was included as a random slope within **`Subject_ID`** (i.e., nested within Subject_ID). The same visualizations and tests were completed to assess the LMM assumptions (Figures 18-20). The residuals show a random pattern (Figure 18), the histogram is approximately normal (Figure 19), and the qq plot follows a straight line (Figure 20), indicating normality.
```{r}
model_lme_imputed_final <- lme(fixed = cbind(R5Hz_PP, R20Hz_PP, X5Hz_PP, Fres_PP) ~ BMI + Asthma + ICS + LABA + Gender + Age_months + Height_cm + Weight_Kg + Group,
data = x_clean_imputed,
random = list(Subject_ID = pdIdent(~1 + Observation_number)),
method = "REML")
str(model_lme_imputed_final)
aic_lme_imputed_final <- AIC(model_lme_imputed_final)
cat(sprintf("AIC for lme model: %f\n", aic_lme_imputed_final))
# Extract residuals
residuals_imputed <- resid(model_lme_imputed_final)
# Residuals vs Fitted Values Plot
ggplot(data = data.frame(Fitted = fitted(model_lme_imputed_final), Residuals = residuals_imputed), aes(x = Fitted, y = Residuals)) +
geom_point() +
geom_hline(yintercept = 0, color = "red") +
labs(x = "Fitted Values", y = "Residuals", title = "Figure 18. Residuals vs. Fitted Values")
# Histogram of Residuals
ggplot(data = data.frame(Residuals = residuals_imputed), aes(x = Residuals)) +
geom_histogram(binwidth = 1, fill = "blue", color = "black") +
labs(title = "Figure 19. Histogram of Residuals")
# Q-Q Plot of Residuals
qqPlot(residuals_imputed, main = "Figure 20. Q-Q Plot of Residuals")
# Shapiro-Wilk Test for Normality of Residuals
shapiro_test_results <- shapiro.test(residuals_imputed)
print(shapiro_test_results)
```
![Equation 3. The final linear mixed model.](images/final_model.png){fig-align="center" .lightbox}
The AIC was calculated as 1801.60, **which is not an improvement over the less complex imputed model, as shown in Figure 21.** The AIC penalizes model complexity to avoid overfitting, suggesting that the added effects of Group and Observation_number may not increase model accuracy enough to justify the added complexity. However, these effects may still be relevant given the research goals of the project despite the slight increase in AIC, **and thus they were left in the final model.**
```{r}
# Model names
model_names <- c("1. LME Model", "2. LME Imputed Model", "3. LME Imputed Final Model")
# Combining into a dataframe
aic_review <- data.frame(
Model = model_names,
AIC = c(aic_lme, aic_lme_imputed, aic_lme_imputed_final)
)
aic_review$Model <- as.factor(aic_review$Model)
aic_review$AIC <- round(aic_review$AIC, 2)
# Check the structure
str(aic_review)
ggplot(aic_review, aes(x = Model, y = AIC, fill = Model)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme_minimal() +
labs(title = "Figure 21. AIC Values for Different Models",
x = "Model",
y = "AIC Value") +
geom_text(aes(label = AIC), vjust = -0.3, size = 3.5) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip()
```
The Shapiro-Wilk test was conducted on the residuals to formally test for normality.
$H_o$: the residuals are normally distributed.
$H_a$: the residuals are not normally distributed.
$\alpha$ = 0.05
In this case, P = 0.0529. P \>0.05, so we failed to reject the null hypothesis, suggesting that the **residuals were normally distributed** after threshold imputation. This final model also satisfies the assumptions of LMMs.
### **3.4.4 Predictions**
```{r}
# Compute residuals and residual-based performance metrics (MSE, MAE) for each model
set.seed(43)
lme_resids = residuals(model_lme)
lme_imputed_resids = residuals(model_lme_imputed)
lme_imputed_final_resids = residuals(model_lme_imputed_final)
lme_mse = mean(lme_resids^2)
lme_mae = mean(abs(lme_resids))
lme_imputed_mse = mean(lme_imputed_resids^2)
lme_imputed_mae = mean(abs(lme_imputed_resids))
lme_imputed_final_mse = mean(lme_imputed_final_resids^2)
lme_imputed_final_mae = mean(abs(lme_imputed_final_resids))
mse_review <- data.frame(
Model = model_names,
MSE = c(lme_mse, lme_imputed_mse, lme_imputed_final_mse)
)
mse_review$MSE <- round(mse_review$MSE, digits = 2)
mae_review <- data.frame(
Model = model_names,
MAE = c(lme_mae, lme_imputed_mae, lme_imputed_final_mae)
)
mae_review$MAE <- round(mae_review$MAE, digits = 2)
```
**MSE and MAE**
**Mean Squared Error (MSE)** and **Mean Absolute Error (MAE)** are metrics used to assess the performance of a model. MSE is the mean of the squared residuals, and MAE is the mean of the absolute values of the residuals. As shown in Figures 22 and 23, the imputed final model outperforms the other two models by a significant margin. It is important to note that MSE is affected more by larger errors or outliers because it squares the residuals.
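For $n$ residuals $e_i = y_i - \hat{y}_i$, the two metrics are defined as:
$$
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} e_i^{2}, \qquad \text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert e_i \rvert
$$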
```{r}
ggplot(mse_review, aes(x = Model, y = MSE, fill = Model)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme_minimal() +
labs(title = "Figure 22. MSE Values for Different Models",
x = "Model",
y = "MSE Value") +
geom_text(aes(label = MSE), vjust = -0.3, size = 3.5) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip()
```
```{r}
ggplot(mae_review, aes(x = Model, y = MAE, fill = Model)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme_minimal() +
labs(title = "Figure 23. MAE Values for Different Models",
x = "Model",
y = "MAE Value") +
geom_text(aes(label = MAE), vjust = -0.3, size = 3.5) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip()
```
**Sample Predictions vs Actual**
The bar graph below compares the [actual]{.underline} `R5Hz_PP` to the [predicted]{.underline} `R5Hz_PP` (a measure of airway resistance at 5 Hz) for **10 randomly sampled observations**. The difference between the bars for each observation is the **residual error.** The small residual error present for each observation suggests that the model is accurate at predicting `R5Hz_PP`.
```{r}
lme_imputed_final_predictions = predict(model_lme_imputed_final)
lme_imputed_final_preds_actuals = data.frame(cbind(lme_imputed_final_predictions, x_clean_imputed$R5Hz_PP))
colnames(lme_imputed_final_preds_actuals) <- c("Predicted_R5Hz_PP", "Actual_R5Hz_PP")
set.seed(42)
sample_indices <- sample(nrow(x_clean_imputed), 10)
sample_pred_actuals = lme_imputed_final_preds_actuals[sample_indices, ]
sample_pred_actuals$row <-1:10
sample_pred_actuals_melt <- melt(sample_pred_actuals, id.vars = "row")
ggplot(sample_pred_actuals_melt, aes(x = factor(row), y = value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Observation", y = "R5Hz_PP", fill = "") +
theme_minimal() +
theme(legend.position = "top") +
ggtitle("Figure 24. Sample Comparison of Predicted and Actual Values")
```
# **4 Conclusion**
Linear mixed models are versatile tools for modeling complex relationships with multiple effects (fixed and random), as well as missing and non-independent data. For the given capstone dataset, the final linear mixed model can reliably predict measures of airway resistance and reactance given demographic and co-morbidity data. This model can be reliably used for both children with sickle cell disease and those with asthma to provide insights into their respiratory function.
# **5 References**
::: {#refs}
:::