---
title: "Report"
subtitle: "Inhale, Exhale, Analyze: BMI's Imprint on Impulse Oscillometry Outcomes"
date: "today"
author: "Joshua J. Cook, M.S., ACRP-PM, CCRC, Syed Ahzaz H. Shah, B.S., Jacob Hernandez, B.S., Sara Basili, M.S."
bibliography: references.bib
csl: asa.csl
format:
html:
code-fold: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```
# **1 Introduction**
Linear mixed-effects models (LMMs) are advanced statistical tools designed to analyze data that exhibit complex structures, such as hierarchical organization, repeated measures, and random effects. These models are particularly useful when data violate the assumptions of traditional ANOVA or regression methods, such as the independence of observations, homoscedasticity, and normality of residuals. LMMs accommodate intra-subject differences, allowing for both fixed effects, which are consistent across individuals, and random effects, which vary among subjects or groups.
The implementation of LMMs has been facilitated by various software packages and programming languages. Brown [@brown_introduction_2021] provides a comprehensive guide to implementing LMMs in R, a widely used statistical programming language, offering a step-by-step walkthrough of model syntax without delving deeply into complex mathematical foundations. Additionally, the lme4 package, as detailed by Bates et al. [@bates_fitting_2015], represents a significant evolution in computational methods for fitting mixed models, offering efficient tools and simplified modeling processes for R users, especially for models with crossed random effects. Pymer4, developed by Jolly [@jolly_pymer4_2018], bridges R and Python, offering Python users a flexible and integrated tool for linear mixed modeling by leveraging the capabilities of R's lme4 package. This tool enhances the analytical capabilities within the Python ecosystem, making advanced statistical methods more accessible to a broader audience.
LMMs find applications across various scientific domains, each with its unique data structures and analytical challenges. The paper by Lee and Shang [@lee_estimation_nodate] explores the impact of missing data on estimation and selection in LMMs, highlighting the challenges and proposing a method to record missingness using an indicator-based matrix. This approach is critical for ensuring model accuracy in the presence of missing data, a common issue in real-world datasets. Wang et al. [@wang_statistical_2022] illustrate the application of LMMs in cardiothoracic surgery outcomes research, using a case study of homograft pulmonary valve replacement data to demonstrate the model's ability to handle repeated measurements and provide a more nuanced understanding of clinical outcomes. Aarts et al. [@aarts_2015] examine multilevel experimental designs in neuroscience and show how applying ordinary linear models to multilevel data can inflate false-positive rates. Magezi [@magezi_linear_2015] highlights the use of LMMs in within-participant psychology experiments, addressing the complexities of repeated measures and nested data structures common in psychological research. Harrison et al. [@harrison_brief_2018] and Bolker et al. [@bolker_generalized_2009] discuss the application of LMMs and generalized linear mixed models (GLMMs) in ecology, emphasizing their utility in analyzing ecological data with complex relationships and hierarchical structures. In another ecological contribution, Grueber et al. [@grueber_2011] focus on model averaging and information-theoretic approaches with LMMs as an alternative to traditional null-hypothesis testing. In the medical field, LMMs are employed to model pandemic-induced mortality changes, as demonstrated by Verbeeck et al. [@verbeeck_linear_2023], and to analyze longitudinal health-related quality of life data in cancer clinical trials, as discussed by Touraine et al. [@touraine_when_2023].
The paper "To transform or not to transform: using generalized linear mixed models to analyse reaction time data" by Lo and Andrews [@lo_transform_2015] challenges the common practice of transforming reaction time data in cognitive psychology, advocating for GLMMs as a more robust alternative. The "LEVEL" guidelines proposed by Monsalves et al. [@monsalves_level_2020] aim to standardize the reporting of multilevel data and analyses, enhancing comparability across studies. Piepho's study [@piepho_analysing_1999] on analyzing disease incidence data with GLMMs underscores the inadequacy of traditional methods like ANOVA for such data, highlighting GLMMs' flexibility. The simulation study by Pusponegoro et al. [@pusponegoro_linear_2017] on children's growth differences emphasizes the importance of choosing the appropriate covariance structure in LMMs for longitudinal data. Lastly, the framework introduced by Steibel et al. [@steibel_powerful_2009] for analyzing RT-PCR data with LMMs showcases the method's statistical power and flexibility, offering a significant advancement over traditional analysis methods. LMMs are used in a wide array of disciplines, but also in varying study designs, as shown in Table 1.
Table 1. Systematic Review of LMM Use-cases [@casals_methodological_2014]
![](images/LMM_uses.png)
The strengths of LMMs lie in their flexibility to model complex data structures and their ability to handle missing data, making them a powerful tool for a wide range of scientific inquiries. However, their application is not without challenges. Peng and Lu [@peng_model_2012] address the difficulty of variable selection and parameter estimation in LMMs, proposing an iterative procedure to improve model accuracy. Barr [@barr_random_2013] critiques existing guidelines for testing interactions within LMMs, proposing new guidelines to ensure more reliable results. The work by Tu [@tu_using_2015] on GLMMs for network meta-analyses showcases how mixed models have evolved to tackle complex data, enhancing the accuracy of combining different studies. On the other hand, Fokkema et al. [@fokkema_generalized_2021] introduce GLMM trees, merging machine learning with mixed models to improve predictions and analysis, particularly useful in mental health research. Despite their robustness, as noted by Schielzeth et al. [@schielzeth_robustness_2020], LMMs require careful evaluation of model assumptions and may present computational challenges, especially with high-dimensional datasets.
The literature reviewed here collectively emphasizes the versatility, robustness, and broad applicability of LMMs and GLMMs across various fields of research. Despite their advantages, the importance of careful model selection, acknowledgment of limitations, and the potential need for more complex models such as joint models in certain scenarios are also highlighted. As the use of LMMs continues to grow, the development of standardized processes, such as the LEVEL framework [@monsalves_level_2020] and the 10-step protocol put forth by [@zuur_2016], along with user-friendly tools, will be crucial in ensuring the accurate and effective application of these models in research.
# **2 Methods**
As mentioned by [@galecki_linear_2014], a LMM is:
> a parametric linear model for clustered, longitudinal, or repeated-measures data that quantifies the relationships between a continuous dependent variable and various predictor variables. An LMM may include both **fixed-effect** parameters associated with one or more continuous or categorical covariates and **random effects** associated with one or more random factors.
Fixed-effect parameters describe the relationships of the covariates to the dependent variable for the entire population. These covariates are typically distinct, clearly defined values used for classification, such as gender or co-morbidities, and are commonly utilized in analyses like ANOVA. Random effects are specific to clusters or subjects within a population. It is typically not possible to include every distinct level of a random factor, but the researcher should attempt to account for as many random effects as possible to improve the reliability of the LMM.
The selected dataset for this report specifically represents **longitudinal data**, which is data where the dependent variable is measured at several points in time for each unit of analysis. **Participant dropout** is often a concern in the analysis of longitudinal data, with early time points often having a higher compliance rate than later time points. Along with clustered and repeated-measures data, longitudinal data is **hierarchical** because the observations can be placed into hierarchies or levels.
## **2.1 Mathematical Foundations**
LMMs have a mathematical foundation stemming from **linear algebra.** We will be using notation for a 2-level longitudinal model since that is the structure of the dataset in this report. The index *i* is used to denote participants and *t* is used to denote the different time points of the observations. Given this notation *t* is the first level and *i* is the second level.
Simple LMMs can be defined as in Equation 1.
$$
y=X\beta + Zu+ \epsilon
$$ {#eq-1}
where:
- y is the response vector.
- X is the design matrix for fixed effects.
- β is the vector of fixed effects (parameters associated with the entire population or certain repeatable levels of experimental factors).
- Z is the design matrix for random effects.
- *u* is the vector of random effects (representing random deviations from the population parameters β for different subjects or experimental units, i.e., the variability not explained by the fixed effects).
- ϵ is the vector of residual errors.
Matrix and Vector Dimensions (Random Intercepts)
- y is an N x 1 vector, where N is the total number of repeated measures (observations) across all subjects
- X is an N x p matrix, where p is the number of fixed-effect parameters (including the intercept)
- β is a p x 1 column vector
- Z is an N x J matrix, where J is the number of subjects
- *u* is a J x 1 vector
- ϵ is an N x 1 vector
For a model with an intercept, the first column of the X matrix is all 1s and the first element of the β vector is the overall (fixed) intercept. The Z matrix in a random intercepts model is a block diagonal matrix, with the blocks defined by the Z~i~ matrices.
Adding random effects to the model also changes the dimensions of Z. If a random slope is added alongside the random intercept, the dimensions become N x 2J, which doubles the columns of the Z matrix, and *u* correspondingly doubles in length to 2J x 1.
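For intuition, the block-diagonal structure of Z in a random-intercepts model can be written out explicitly (an illustrative sketch consistent with the notation above); each block Z~i~ is simply a column of 1s of length n~i~, the number of observations for subject *i*:
$$
Z = \begin{bmatrix} Z_1 & 0 & \cdots & 0 \\ 0 & Z_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & Z_J \end{bmatrix}, \qquad Z_i = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}_{n_i \times 1}
$$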
### 2.1.1 Example
Now let's go over an example with a 2-level longitudinal structure where we have 100 students with 10 test scores per student and the associated study time for those tests. In this case, the dependent variable is the variable concerning test scores, the fixed effect is the study time and the random effect is the student. For the sake of simplicity, we will only consider a random intercepts model.
Variable Breakdown:
- N=1000: the number of observations which is the number of students multiplied by the number of test scores
- J = 100: the number of students
- p = 2: the fixed intercept and the fixed effect of study time
Matrix Notations and Dimension laid out:
$Y_{1000\times1} = X_{1000\times 2} \; \beta_{2\times1} + Z_{1000\times100}\;u_{100\times1} + \epsilon_{1000\times1}$
Example Matrices:
$y = \begin{bmatrix} Score\\ 75 \\ 80\\ ... \\ 90 \end{bmatrix} X = \begin{bmatrix} Intercept & Study Time \\1 & 2 \\1 & 3\\... & ... \\1 & 5\end{bmatrix}$
$\beta = \begin{bmatrix} 1.2\\2.3\end{bmatrix}$
The matrix multiplication can also be broken down into individual equations. In the case of our example we get the following equations:
Level 1 (Time):
$Y_{ti} = \beta_{0i} + \beta_{1} \cdot \text{StudyTime}_{ti} + e_{ti}$
Level 2 (Student):
$\beta_{0i} = \gamma_{00} + u_{0i}$
Since this is a random intercepts model, only the intercept equation is needed at level 2. γ~00~ is the grand mean intercept and u~0i~ is the deviation of the i~th~ student from it [@galecki_linear_2014].
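As a brief illustration (separate from the report's dataset), the sketch below simulates data matching this example and confirms the design-matrix dimensions with `lme4`; the names `dat_scores`, `student`, `study_time`, and `score` are hypothetical.
```{r}
#| eval: false
library(lme4)

# Simulate the 2-level example: 100 students x 10 tests = 1,000 observations
set.seed(1)
dat_scores <- data.frame(
  student    = factor(rep(1:100, each = 10)),
  study_time = round(runif(1000, min = 1, max = 10), 1)
)
dat_scores$score <- 70 + 2 * dat_scores$study_time +
  rep(rnorm(100, sd = 5), each = 10) +  # student-level (random-intercept) deviations
  rnorm(1000, sd = 3)                   # residual errors

# Random intercepts model: fixed effect of study time, random intercept per student
fit_scores <- lmer(score ~ study_time + (1 | student), data = dat_scores)

dim(getME(fit_scores, "X"))  # 1000 x 2   (fixed intercept + study time)
dim(getME(fit_scores, "Z"))  # 1000 x 100 (one indicator column per student)
```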
### 2.1.2 Parameter Estimation
LMMs are typically fit using maximum likelihood (ML) estimation or a variant called restricted maximum likelihood (REML) estimation. Both methods obtain the parameters β and θ by optimizing the likelihood function, where β contains the fixed-effect parameters and θ contains the covariance parameters; the dimension of θ depends on the number of random effects and the structure of the covariance matrix. Our models use the REML method because it gives less biased estimates of the covariance parameters and is better suited to modeling random effects [@galecki_linear_2014].
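Continuing the simulated example above, a minimal sketch of requesting each estimation method in `lmer()`; the point is the contrast in the estimated variance components rather than any specific numbers.
```{r}
#| eval: false
library(lme4)

# Same model as in the example, fit with REML (the default) and with ML
fit_reml <- lmer(score ~ study_time + (1 | student), data = dat_scores, REML = TRUE)
fit_ml   <- lmer(score ~ study_time + (1 | student), data = dat_scores, REML = FALSE)

# ML estimates of the variance components tend to be biased downward
VarCorr(fit_reml)
VarCorr(fit_ml)
```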
## **2.2 Assumptions**
Although more flexible than other methods such as ANOVA, there are **several assumptions** for LMMs:
1. The relationship between the predictors and response variable is assumed to be linear, within each level of random effects.
2. Random effects (*u*) are assumed to follow a normal distribution with mean zero and variance-covariance matrix G.
$u \sim N(0,G)$
3. Residual errors (ϵ ) are assumed to follow a normal distribution with mean zero and variance-covariance matrix R.
$\epsilon \sim N(0,R)$
4. Random effects (*u*) and residual errors (ϵ ) are assumed to be independent.
5. Homoscedasticity is assumed for the residuals across all levels of the independent variables.
There are several techniques that can be utilized to overcome violations in the LMM assumptions, including variable transformation (to achieve linearity or normality), using robust variance estimates, modifying the structure of random and fixed effects, and employing non-parametric methods or generalized linear mixed models (GLMMs) [@galecki_linear_2014].
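A short sketch of how these assumptions can be screened graphically for a fitted model, reusing the illustrative `fit_scores` from the simulated example in Section 2.1.1 (formal diagnostics for the report's own models appear in Section 3):
```{r}
#| eval: false
library(lme4)

# Residuals vs. fitted values: linearity (1) and homoscedasticity (5)
plot(fitted(fit_scores), resid(fit_scores), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality of residuals (3)
qqnorm(resid(fit_scores)); qqline(resid(fit_scores))

# Normality of the random intercepts (2)
u_hat <- ranef(fit_scores)$student[, "(Intercept)"]
qqnorm(u_hat); qqline(u_hat)
```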
## **2.3 Sample Data Structure**
LMMs require a tidy dataset where each variable is a column and each observation is a row. Smaller datasets are usually saved as CSV files, and data are often loaded from a database. The dataset can contain missing values, but they still need to be handled, whether omitted or imputed, depending on how many there are and which columns are affected. The lmer() function in the lme4 library will automatically drop rows with missing values, so it is important that the data are inspected and visualized before constructing any models. Below is an example of tidy data.
![Figure 1. Tidy data, as defined by Wickham et al.](images/tidy-4.png)
LMMs also require that the structure of both random and fixed effects be defined before the model is created. The variables that vary randomly across groups and those that are fixed must be identified. There are different hierarchies in LMMs, and the first distinction is between clustered and longitudinal data. Clustered data, as the name suggests, group the subjects (the units of analysis) into different clusters. For a two-level dataset, students can be the unit of analysis and classrooms the next level up; for a three-level dataset, schools can be added as the third level. Regardless of the number of levels, the first level is always the unit of analysis; in the example above, it is the students [@galecki_linear_2014].
There is also longitudinal data, where repeated measures are at the first level and the unit of analysis is at the second level. In a dataset of patient cholesterol measurements over time, the measurements at the different timepoints form the first level and the patients the second level [@galecki_linear_2014].
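For concreteness, a minimal sketch of a tidy, long-format longitudinal dataset (the column names and values here are invented for illustration): each row is one timepoint for one subject, so repeated measures sit at level 1 and subjects at level 2.
```{r}
#| eval: false
library(tibble)

# Long format: one row per subject x timepoint
tribble(
  ~Subject_ID, ~Observation_number, ~Cholesterol,
  "P01",       1,                   182,
  "P01",       2,                   176,
  "P02",       1,                   201,
  "P02",       2,                   197
)
```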
## **2.4 Implementation in R**
The implementation begins with importing the dataset into R from a file containing longitudinal retrospective data on the impact of BMI on IOS estimates of airway resistance and reactance in children with sickle cell disease (C-SCD) and African-American children with asthma (C-Asthma). This dataset spans from 2015 to 2020. Data import is executed using the appropriate function, with consideration for specifying file paths and handling header information. Following data importation, preprocessing steps, such as handling missing values and ensuring data integrity, are performed [@galecki_linear_2014].
### 2.4.1 Analysis Using lme() Function
After preprocessing the data, we proceed with fitting linear mixed-effects models (LMMs) using the lme() function from the `nlme` package.
Model formulation involves specifying a model formula that includes both fixed effects (e.g., BMI, diagnosis of asthma, relevant covariates) and random effects (e.g., random intercepts for subjects). The random argument specifies the random-effects structure, while the data argument indicates the dataset to be used. The estimation method (method = "REML") is specified to use restricted maximum likelihood estimation. `nlme` is advantageous because it offers a user interface for fitting models with structure in the residuals (including forms of heteroscedasticity and autocorrelation) and in the random-effects covariance matrices.
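A minimal sketch of the `lme()` call structure described above, using placeholder names (`y`, `x`, `time`, `subject`, `group`, `dat_long`) rather than the report's variables; the `correlation` and `weights` arguments illustrate the residual structures mentioned:
```{r}
#| eval: false
library(nlme)

fit_nlme <- lme(
  fixed       = y ~ x + time,                      # fixed effects
  random      = ~ 1 | subject,                     # random intercept for each subject
  correlation = corAR1(form = ~ time | subject),   # AR(1) autocorrelation within subject
  weights     = varIdent(form = ~ 1 | group),      # separate residual variance per group
  data        = dat_long,
  method      = "REML"
)
summary(fit_nlme)
```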
### 2.4.2 Hypothesis Testing
Hypotheses are tested to guide model selection and refinement. For instance, Hypothesis 3.1 \[1\] assesses whether the variance of random effects is greater than zero, while Hypothesis 3.2 \[2\] investigates the presence of heterogeneous residual variances across treatment groups. These hypotheses are evaluated using likelihood ratio tests or F-tests, depending on the context.
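A sketch of how such hypotheses can be evaluated with likelihood ratio tests in `nlme`, again with placeholder names; the random-effect variance test compares REML fits with identical fixed effects, while nested fixed-effect structures are compared with ML fits:
```{r}
#| eval: false
library(nlme)

# Hypothesis 3.1-style test: is the random-intercept variance greater than zero?
fit_lme <- lme(y ~ x + time, random = ~ 1 | subject, data = dat_long, method = "REML")
fit_gls <- gls(y ~ x + time, data = dat_long, method = "REML")  # same fixed effects, no random effect
anova(fit_lme, fit_gls)  # likelihood ratio test (p-value is conservative at the boundary)

# Fixed-effect hypotheses: compare nested mean structures with ML fits
fit_full_ml    <- lme(y ~ x + time, random = ~ 1 | subject, data = dat_long, method = "ML")
fit_reduced_ml <- lme(y ~ time,     random = ~ 1 | subject, data = dat_long, method = "ML")
anova(fit_reduced_ml, fit_full_ml)
```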
### 2.4.3 Model Refinement
Based on the outcomes of hypothesis testing and model diagnostics, the model may be refined by removing non-significant fixed effects or selecting an appropriate covariance structure for the residuals. This iterative process entails fitting alternative models and comparing their fit statistics or testing additional hypotheses.
### 2.4.4 Analysis Using lmer() Function
An alternative approach involves utilizing the lmer() function from the `lme4` package to fit LMMs. This function follows a similar syntax to lme() but differs in how it handles random effects specification. `lme4` offers several benefits compared to `nlme`, including: more efficient linear algebra tools (with associated performance enhancements), simpler syntax and more efficient implementation for fitting models with crossed random effects, implementation of profile likelihood confidence intervals on random-effects parameters, and the ability to fit GLMMs [@bates_fitting_2015]. Likelihood ratio tests and model diagnostics are employed to assess model fit and inform model selection [@bates_fitting_2015].
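A parallel sketch of the same placeholder model in `lmer()` syntax; the random-effects term moves into the model formula itself, and profile-likelihood confidence intervals on the variance components are available through `confint()`:
```{r}
#| eval: false
library(lme4)

# Same structure as the lme() sketch, with the random effect written inside the formula
fit_lmer <- lmer(y ~ x + time + (1 | subject), data = dat_long, REML = TRUE)
summary(fit_lmer)

# Profile-likelihood confidence intervals, including the random-effect standard deviation
confint(fit_lmer, method = "profile")
```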
### 2.4.5 Final Model Selection
The final model is selected based on a synthesis of statistical criteria, including model fit indices, significance of fixed effects, and the adequacy of the model's assumptions. This selected model is then employed for interpretation and inference concerning the relationships between the predictor variables (e.g., BMI) and the response variable (e.g., IOS measures).
# **3 Analysis and Results**
## **3.1 Packages**
```{r}
packages <- c("tidyverse", "lme4", "nlme", "Matrix", "gt", "RefManageR", "DataExplorer", "gtsummary", "car", "reshape2")
# requireNamespace() checks one package at a time, so test each and install only what is missing
missing_packages <- packages[!sapply(packages, requireNamespace, quietly = TRUE)]
if (length(missing_packages) > 0) {
  install.packages(missing_packages)
}
library(tidyverse)
library(lme4)
library(nlme)
library(gt)
library(gtsummary)
library(RefManageR)
library(DataExplorer)
library(Matrix)
library(car)
library(reshape2)
#references <- ReadBib("references.bib")
#summary(references)
```
- `tidyverse`: used for data wrangling and visualization.
- `lme4` and `nlme`: used for fitting LMMs within R.
- `Matrix`: used for sparse and dense matrix classes and methods.
- `gt`: used for table generation.
- `gtsummary`: used for summary tables of descriptive statistics.
- `RefManageR`: used for BibTeX reference management.
- `DataExplorer`: used for EDA.
- `car`: used for QQ plots.
- `reshape2`: used to reshape data.
::: callout-warning
## Matrix / lme4 installation error
An error was encountered with the Matrix and lme4 packages during model creation. If this error is encountered, please reinstall `lme4` from source:

    remove.packages("Matrix")
    remove.packages("lme4")
    install.packages("lme4", type = "source")
    library(lme4)
## **3.2 Data Ingestion**
```{r}
# Load the dataset
BMI <- read.csv("data/BMI_IOS_SCD_Asthma.csv")
colnames(BMI) <- c("Group", "Subject_ID", "Observation_number", "Hydroxyurea", "Asthma", "ICS", "LABA", "Gender", "Age_months", "Height_cm", "Weight_Kg", "BMI", "R5Hz_PP", "R20Hz_PP", "X5Hz_PP", "Fres_PP")
BMI$Group <- as.factor(BMI$Group)
BMI$Subject_ID <- as.factor(BMI$Subject_ID)
BMI$Observation_number <- as.factor(BMI$Observation_number)
```
### **`BMI` from Kaggle** ([Impact of BMI on IOS measures on children (kaggle.com)](https://www.kaggle.com/datasets/utkarshx27/impact-of-bmi-on-ios-measures))
- **Description**: This dataset is from a retrospective study to assess the impact of BMI on impulse oscillometry (IOS) estimates of airway resistance and reactance in children with sickle cell disease (C-SCD).
- **Detailed Description**: The dataset comprises various attributes and measurements across its columns. Categorical variables, such as Group, Subject ID, Observation_number, Hydroxyurea, Asthma, ICS, LABA, and Gender, denote different groupings, individual subjects, and attributes like medication usage and gender. Numerical variables like Age (months), Height (cm), Weight (Kg), BMI, R5Hz_PP, R20Hz_PP, X5Hz_PP, and Fres_PP provide quantitative data on subjects’ characteristics and test results. Notably, the summary also identifies missing values, such as the 14 instances in the Fres_PP variable, which warrant consideration in subsequent analysis. These columns provide measurements and estimates related to airway resistance and reactance obtained using impulse oscillometry (IOS), which is a non-invasive method for assessing respiratory function. These parameters are valuable in understanding the impact of BMI on respiratory measures in children with sickle cell disease (C-SCD) and African-American children with asthma (C-Asthma) participating in the study.
- **Why suitable for LMMs**: The dataset has multiple observations, over time, for the same set of participants.
## **3.3 Exploratory Data Analysis (EDA)**
The structure of the dataframe and variable descriptions are shown in Table 2 and Figure 2. Figures 3-10 systematically explore the features of the data and are described below.
```{r}
x <- BMI
str(x)
head(x)
variables <- colnames(x)
variables_table <- data.frame(
Variable = variables,
Description = c(
"This column indicates the group to which the subject belongs. There are two groups in the study: children with sickle cell disease (C-SCD) and African-American children with asthma (C-Asthma).",
"Each subject in the study is assigned a unique identifier or ID, which is listed in this column. The ID is used to differentiate between individual participants.",
"This column represents the number assigned to each observation or measurement taken for a particular subject. Since this is a longitudinal study, multiple observations may be recorded for each subject over time.",
"This column indicates whether the subject with sickle cell disease (C-SCD) received hydroxyurea treatment. Hydroxyurea is a medication commonly used for the treatment of sickle cell disease.",
"This column indicates whether the subject has a diagnosis of asthma. It distinguishes between children with sickle cell disease (C-SCD) and African-American children with asthma (C-Asthma).",
"This column indicates whether the subject is using inhaled corticosteroids (ICS). ICS is a type of medication commonly used for the treatment of asthma and certain other respiratory conditions.",
"This column indicates whether the subject is using a long-acting beta-agonist (LABA). LABA is a type of medication often used in combination with inhaled corticosteroids for the treatment of asthma.",
"This column represents the gender of the subject, indicating whether they are male or female",
"This column specifies the age of the subject at the time of the observation or measurement. Age is typically measured in months.",
"This column represents the height of the subject, typically measured in a standard unit of length, such as centimeters or inches. Height is an important variable to consider in assessing the impact of BMI on respiratory measures.",
"This column indicates the weight of the subject at the time of the observation or measurement. Weight is typically measured in kilograms (Kg) and is an important variable for calculating the body mass index (BMI).",
"Body Mass Index (BMI) is a measure that assesses body weight relative to height. It is calculated by dividing the weight of an individual (in kilograms) by the square of their height (in meters). The BMI column provides the calculated BMI value for each subject based on their weight and height measurements. BMI is commonly used as an indicator of overall body fatness and is often used to classify individuals into different weight categories (e.g., underweight, normal weight, overweight, obese).",
"This column represents the estimate of airway resistance at 5 Hz using impulse oscillometry (IOS). Airway resistance is a measure of the impedance encountered by airflow during respiration. The R5Hz_PP value indicates the airway resistance at the frequency of 5 Hz and is obtained through the IOS testing.",
"This column represents the estimate of airway resistance at 20 Hz using impulse oscillometry (IOS). Similar to R5Hz_PP, R20Hz_PP provides the measure of airway resistance at the frequency of 20 Hz based on the IOS testing.",
"This column represents the estimate of airway reactance at 5 Hz using impulse oscillometry (IOS). Airway reactance is a measure of the elasticity and stiffness of the airway walls. The X5Hz_PP value indicates the airway reactance at the frequency of 5 Hz and is obtained through the IOS testing.",
"This column represents the estimate of resonant frequency using impulse oscillometry (IOS). Resonant frequency is a measure of the point at which the reactance of the airways transitions from positive to negative during respiration. The Fres_PP value indicates the resonant frequency and is obtained through the IOS testing.:"
)
)
variables_table %>%
gt %>%
tab_header(
title = "Table 2. Variable Description"
) %>%
tab_footnote(
footnote = "Each variable in the dataset, accompanied by a qualitative description from the study team."
)
plot_str(x)
introduce(x)
plot_intro(x, title="Figure 2. Structure of variables and missing observations.")
```
### **3.3.1 Missing Values**
```{r}
plot_missing(x, title="Figure 3. Breakdown of missing observations.")
```
Based on the missing values count in Figure 3, it appears that there are no missing values in most of the columns, except for Fres_PP, where there are 14 missing values (6.39%). In this case, omitting missing values for Fres_PP is reasonable, considering the small proportion of missing data compared to the total number of observations.
### 3.3.2 Cleaning Data
```{r}
dim(x)
x_clean <- na.omit(x) # drops NAs, further analysis is without NA values
x_clean$Gender <- tolower(x_clean$Gender)
dim(x_clean)
str(x_clean)
```
**Count Plots for Categorical Variables**
The bar plots in Figure 4 show the frequency distribution of each categorical variable. This can aid in data cleaning and in checking for sparseness or class imbalances. It appears that:
- Most cases are C-SCD compared to C-Asthma (class imbalance).
- The number of observations decreases at subsequent measurements.
- Most cases have Asthma (class imbalance).
- Most cases have LABA (class imbalance).
- Hydroxyurea, ICS, and Gender are relatively evenly distributed.
```{r}
plot_bar(x_clean, title = "Figure 4. Frequency plots of categorical variables.")
```
**Histograms**
Histograms in Figure 5 show the frequency and distribution of the numerical variables. This helps identify distribution types among the different variables. Most of the variables below exhibit an approximately normal distribution, with some (e.g., BMI) showing a slight right skew.
```{r}
plot_histogram(x_clean, title = "Figure 5. Histogram plots of numerical variables.")
```
**Q-Q Plots**
The QQ plots in Figure 6 serve as a visual aid to assess normality of the covariates. The closer the points are to the straight diagonal line, the more normally the data are distributed. Most of the variables appear approximately normal. BMI has a substantial number of points in the upper-right corner of the plot that deviate from the diagonal line, potentially indicating a non-normal distribution; Weight_Kg shows a similar skew.
```{r}
plot_qq(na.omit(x), title = "Figure 6. QQ plots to assess normality of numerical variables.")
```
**Principal Component Analysis (PCA)**
The PCA plots in Figure 7 show the numerical variables in our data set split into principal components. More than half (54.8%) of the variance can be explained with just 4 principal components. This can be useful if we want to simplify our model by only keeping the principal components that explain most of the variance.
```{r}
plot_prcomp(na.omit(x), title = "Figure 7. PCA to assess key principal components that explain the variance.")
```
**Box Plots**
Based on the boxplots in Figure 8, it’s evident that all variables except “Age (months)” and “Height (cm)” contain outliers. Now, let’s pinpoint these outliers and calculate summary statistics (Table 3).
Figure 8. Boxplots of numerical variables.
```{r}
numeric_vars <- x_clean %>%
select_if(is.numeric)
# Boxplot for each numeric variable
par(mfrow=c(2, 2))
for (col in colnames(numeric_vars)) {
boxplot(numeric_vars[[col]], main=col)
}
# Adding a general title for the entire set of boxplots
#mtext("Figure 8. Box plots of numerical variables.", side=3, line=1, outer=TRUE, cex=1.5)
```
```{r}
# Define a function to detect outliers in each column
detect_outliers <- function(column) {
Q1 <- quantile(column, 0.25)
Q3 <- quantile(column, 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
outliers <- column[column < lower_bound | column > upper_bound]
return(outliers)
}
# Iterate over each column and print outliers; not removed
for (col in names(numeric_vars)) {
outliers <- detect_outliers(numeric_vars[[col]])
if (length(outliers) > 0) {
cat("Outliers in", col, ":\n")
print(outliers)
cat("\n")
}
}
x_clean %>%
select(-2) %>%
tbl_summary( #gtSummary Table
by=Group,
type = list(
c('Age_months', 'Height_cm', 'Weight_Kg', 'BMI', 'R5Hz_PP', 'R20Hz_PP', 'X5Hz_PP', 'Fres_PP') ~ 'continuous2'),
statistic = all_continuous2() ~ c(
"{mean} ± {sd}",
"{median} ({p25}, {p75})",
"{min}, {max}"
),
digits = all_continuous2() ~ 2,
missing="ifany",
) %>%
bold_labels %>%
italicize_levels() %>%
as_gt() %>%
tab_header(
title = "Table. 3 Summary Statistics"
) %>%
tab_footnote(
footnote = "Summary statistics for all variables."
)
```
**Participant Dropout**
Figure 9 and Table 4 show how many subjects had data at each subsequent timepoint, which suggests that this study experienced significant participant dropout over time. This dropout may or may not be attributable to the study itself and should be investigated further. A strength of LMMs is that they can handle unbalanced groups (i.e., differing numbers of observations per patient), so we will continue with modeling regardless.
```{r}
x_clean_timepoints <- x_clean %>%
group_by(Observation_number) %>%
summarise(Unique_Subjects = n_distinct(Subject_ID))
x_clean_timepoints$Unique_Subjects <- as.numeric(x_clean_timepoints$Unique_Subjects)
ggplot(x_clean_timepoints, aes(x = Observation_number, y = Unique_Subjects)) +
geom_point(size = 3, color = "blue") + # Add points for each observation
geom_line(aes(group = 1), color = "blue") + # Connect the points with a line
theme_minimal() +
labs(title = "Figure 9. Participant dropout over time.",
x = "Timepoint",
y = "Number of Unique Subjects")
x_clean_timepoints %>%
gt() %>%
tab_header(
title = "Table 4. Number of participants at each timepoint."
) %>%
tab_footnote(
footnote = "Counts of unique subjects reveal an increasing amount of missing data at subsequent observation visits."
)
```
### 3.3.3 Correlations
Figure 10 highlights correlations between variables that should be assessed before any modeling.
- Age (months) and Height (cm): there is a strong positive correlation (0.914), implying that as age increases, height tends to increase as well. This is expected, as children grow taller as they get older.
- Weight (Kg) and BMI: there is a strong positive correlation (0.927), suggesting that as weight increases, BMI (Body Mass Index) tends to increase as well. This is expected because BMI is calculated from weight and height measurements.
- Airway resistance and reactance: there is a strong positive correlation (0.754) between R5Hz_PP and Fres_PP.
```{r}
plot_correlation(na.omit(x), maxcat=5L, title = "Figure 10. Correlation matrix of all variables.")
correlation_matrix <- cor(numeric_vars)
print(correlation_matrix)
```
## **3.4 Linear Mixed Modeling**
In this dataset, the variables of interest are the measures of airway resistance and reactance. Additionally, controlled variables are present such as group, age, weight, height, and other co-morbidities. These are the fixed effects. On the other hand, random variability may exist among individual observations, which are nested within each subject. These represent the random effects, as shown in Table 5. In the [**initial model**]{.underline}, Subject_ID was treated as the sole random effect. In the [**final model**]{.underline}, both random effects were incorporated (Subject_ID and Observation_number).
```{r}
#| label: FixedOrRandom
variables_table2 <- variables_table %>%
select(1) %>%
mutate(Type = c(
"Fixed",
"Random",
"Random",
"Fixed",
"Fixed",
"Fixed",
"Fixed",
"Fixed",
"Fixed",
"Fixed",
"Fixed",
"Fixed",
"Fixed",
"Fixed",
"Fixed",
"Fixed"
)
)
variables_table2 %>%
gt %>%
tab_header(
title = "Table 5. Variable Categorization"
) %>%
tab_footnote(
footnote = "A break down of random and fixed effects based on the purpose of the study. Variable categorization is a crucial step in the LMM process."
)
```
```{r}
#| label: InitialModeling
#lme()
# Fit models using a tidy and clear approach
model_lme <- lme(
fixed = cbind(R5Hz_PP, R20Hz_PP, X5Hz_PP, Fres_PP) ~ BMI + Asthma + ICS + LABA + Gender + Age_months + Height_cm + Weight_Kg,
random = list(Subject_ID = pdIdent(~1)),
data = x_clean,
method = "REML"
)
#lmer()
model_lmer <- lmer(
formula = R5Hz_PP + R20Hz_PP + X5Hz_PP + Fres_PP ~ BMI + Asthma + ICS + LABA + Gender + Age_months + Height_cm + Weight_Kg + (1 | Subject_ID),
data = x_clean
)
```
### **3.4.1 Initial Model**
![Equation 2. The initial linear mixed model.](images/initial_model.png){fig-align="center" .lightbox}
```{r}
#| label: InitialAIC
# Compare models based on AIC
aic_lme <- AIC(model_lme)
aic_lmer <- AIC(model_lmer)
cat(sprintf("AIC for lme model: %f\n", aic_lme))
cat(sprintf("AIC for lmer model: %f\n", aic_lmer))
# Correctly assign final_model based on AIC comparison
if (aic_lme < aic_lmer) {
final_model <- model_lme
model_type <- "lme"
} else {
final_model <- model_lmer
model_type <- "lmer"
}
cat(sprintf("Final model selected: %s\n", model_type))
# Since final_model is now correctly assigned, we can call summary on it
summary(final_model)
```
**Akaike Information Criterion (AIC)**
The AIC for both models was calculated. The AIC is a measure of the relative quality of statistical models for a given set of data. Lower AIC values indicate a model that better fits the data without unnecessary complexity.
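For reference, the AIC is computed from the maximized likelihood $\hat{L}$ and the number of estimated parameters $k$:
$$
\text{AIC} = 2k - 2\ln(\hat{L})
$$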
Here, the AIC for lme was 1898.95 while lmer was 2517.37.
The model with the lower AIC (lme) was selected as the **final model**, despite the computational performance improvements offered by the lme4 package. [All additional models were fit with lme.]{.underline}
**Residuals**
Residual plots (Residuals vs. Fitted Values) were created for the lme model to assess the goodness of fit in Figure 11. A horizontal line at y=0 was added as a reference. These plots help in identifying non-linearity, unequal variances, and outliers.
Based on the **residual plot**, the residuals show a roughly random scatter around zero with a few possible outliers.
```{r}
#| label: InitialResiduals
# Residuals
residuals_final <- resid(final_model)
# Calculate fitted values and residuals from the final model
fitted_values <- fitted(final_model)
residual_values <- residuals(final_model)
# Create a data frame explicitly for plotting
plot_data <- data.frame(Fitted = fitted_values, Residuals = residual_values)
# Plotting using ggplot2 for a more flexible and powerful approach
# Residuals vs Fitted Values
ggplot(plot_data, aes(x = Fitted, y = Residuals)) +
geom_point() +
geom_hline(yintercept = 0, color = "red") +
labs(x = "Fitted Values", y = "Residuals", title = "Figure 11. Residuals vs. Fitted Values")
```
**Histogram of Residuals and QQ Plots**
A histogram and a Q-Q (Quantile-Quantile) plot of the residuals were used to check the normality assumption of the residuals (Figure 12). Finally, a QQ plot with a QQ line was produced for a graphical normality check (Figure 13).
Based on the **histogram**, the model [visually]{.underline} had an ideal bell-shaped curve that resembles the normal distribution. Based on the **QQ plot**, the model [graphically]{.underline} may have had some residuals that were not normally distributed toward the ends.
```{r}
#| label: InitialQQ
# Histogram of Residuals
ggplot(plot_data, aes(x = Residuals)) +
geom_histogram(binwidth = 1, fill = "blue", color = "black") +
labs(title = "Figure 12. Histogram of Residuals")
# Q-Q Plot
qqPlot(residuals_final, main = "Figure 13. Q-Q Plot of Residuals")
```
```{r}
#| label: InitialNormality
# Shapiro-Wilk Normality Test
shapiro_test_results <- shapiro.test(residuals_final)
print(shapiro_test_results)
```
The Shapiro-Wilk test was conducted on the residuals to formally test for normality.
$H_o$: the residuals are normally distributed.
$H_a$: the residuals are not normally distributed.
$\alpha$ = 0.05
In this case, P = 0.00001163. P \< 0.05, so the null hypothesis was rejected, suggesting that the **residuals were not normally distributed.** This model does not satisfy the assumptions of LMMs.
### **3.4.2 Imputed Model**
Outliers (as mentioned above) were present in most variables, and the residuals of the initial model were not normally distributed. To improve model performance, outliers were imputed by Winsorization, capping values at the 10th and 90th percentile thresholds. The model was then regenerated and assessed using the same metrics as above (Figures 14-17).
Figure 14. Box plots of numerical variables.
```{r}
#| label: AddressingAssumptionsAndRandomEffects
# Copy the original dataset
x_clean_imputed <- x_clean
# Define a function for Winsorization
winsorize <- function(x, lower_percentile = 0.10, upper_percentile = 0.90) {
lower_threshold <- quantile(x, lower_percentile)
upper_threshold <- quantile(x, upper_percentile)
x[x < lower_threshold] <- lower_threshold
x[x > upper_threshold] <- upper_threshold
return(x)
}
# Apply imputation across numeric variables in the copied dataset
numeric_vars <- names(x_clean_imputed %>% select_if(is.numeric))
for (col in numeric_vars) {
x_clean_imputed[[col]] <- winsorize(x_clean_imputed[[col]])
}
# Visualization with ggplot2
# Plot boxplots for each numeric variable after imputation
for (col in numeric_vars) {
p <- ggplot(data = x_clean_imputed, aes(x = "", y = !!sym(col))) +
geom_boxplot(fill = "skyblue", color = "blue") +
labs(title = paste("Boxplot of", col), x = "", y = col)
print(p)
}
# Adding a general title for the entire set of boxplots
#mtext("Figure 14. Box plots of numerical variables.", side=3, line=1, outer=TRUE, cex=1.5)
# Modeling with Imputed Data
# Refit the model using the lme function with the cleaned data
model_lme_imputed <- lme(fixed = cbind(R5Hz_PP, R20Hz_PP, X5Hz_PP, Fres_PP) ~ BMI + Asthma + ICS + LABA + Gender + Age_months + Height_cm + Weight_Kg,
random = list(Subject_ID = pdIdent(~1)),
data = x_clean_imputed,
method = "REML")
aic_lme_imputed <- AIC(model_lme_imputed)
cat(sprintf("AIC for lme model: %f\n", aic_lme_imputed))
# Extract residuals
residuals_imputed <- resid(model_lme_imputed)
# Residuals vs Fitted Values Plot
ggplot(data = data.frame(Fitted = fitted(model_lme_imputed), Residuals = residuals_imputed), aes(x = Fitted, y = Residuals)) +
geom_point() +
geom_hline(yintercept = 0, color = "red") +
labs(x = "Fitted Values", y = "Residuals", title = "Figure 15. Residuals vs. Fitted Values")
# Histogram of Residuals
ggplot(data = data.frame(Residuals = residuals_imputed), aes(x = Residuals)) +
geom_histogram(binwidth = 1, fill = "blue", color = "black") +
labs(title = "Figure 16. Histogram of Residuals")
qqPlot(residuals_imputed, main = "Figure 17. Q-Q Plot of Residuals")
# Q-Q Plot and Shapiro-Wilk Test
shapiro_test_results <- shapiro.test(residuals_imputed)
print(shapiro_test_results)
```
The AIC was calculated as 1790.91, but cannot be used as a direct comparison to the original model due to imputation.
The Shapiro-Wilk test was conducted on the residuals to formally test for normality.
$H_o$: the residuals are normally distributed.
$H_a$: the residuals are not normally distributed.
$\alpha$ = 0.05
In this case, P = 0.05066. P \>0.05, so we failed to reject the null hypothesis, suggesting that the **residuals were normally distributed** after threshold imputation. This model now satisfies the assumptions of LMMs.
### **3.4.3 Final Model**
This was a longitudinal study involving multiple observations for each subject over time, and subjects are grouped into two categories (children with sickle cell disease and African-American children with asthma). Thus, in this final model, we modeled **`Group`** as a fixed effect since we were interested in the effect of the group itself on the outcome. **`Subject_ID`** should be a random effect to account for the repeated measures within subjects, and **`Observation_number`** was included as a random slope within **`Subject_ID`** (i.e., nested within Subject_ID). The same visualizations and tests were completed to assess the LMM assumptions (Figures 18-20). The residuals show a random pattern (Figure 18), the histogram is approximately normal (Figure 19), and the qq plot follows a straight line (Figure 20), indicating normality.
```{r}
model_lme_imputed_final <- lme(fixed = cbind(R5Hz_PP, R20Hz_PP, X5Hz_PP, Fres_PP) ~ BMI + Asthma + ICS + LABA + Gender + Age_months + Height_cm + Weight_Kg + Group,
data = x_clean_imputed,
random = list(Subject_ID = pdIdent(~1 + Observation_number)),
method = "REML")
str(model_lme_imputed_final)
aic_lme_imputed_final <- AIC(model_lme_imputed_final)
cat(sprintf("AIC for lme model: %f\n", aic_lme_imputed_final))
# Extract residuals
residuals_imputed <- resid(model_lme_imputed_final)
# Residuals vs Fitted Values Plot
ggplot(data = data.frame(Fitted = fitted(model_lme_imputed_final), Residuals = residuals_imputed), aes(x = Fitted, y = Residuals)) +
geom_point() +
geom_hline(yintercept = 0, color = "red") +
labs(x = "Fitted Values", y = "Residuals", title = "Figure 18. Residuals vs. Fitted Values")
# Histogram of Residuals
ggplot(data = data.frame(Residuals = residuals_imputed), aes(x = Residuals)) +
geom_histogram(binwidth = 1, fill = "blue", color = "black") +
labs(title = "Figure 19. Histogram of Residuals")
# Q-Q Plot of Residuals
qqPlot(residuals_imputed, main = "Figure 20. Q-Q Plot of Residuals")
# Shapiro-Wilk Test for Normality of Residuals
shapiro_test_results <- shapiro.test(residuals_imputed)
print(shapiro_test_results)
```
![Equation 3. The final linear mixed model.](images/final_model.png){fig-align="center" .lightbox}
The AIC was calculated as 1801.60, **which is not an improvement over the less complex imputed model, as shown in Figure 21.** The AIC penalizes model complexity to avoid overfitting, suggesting that the added effects of Group and Observation_number may not increase model accuracy enough to justify the added complexity. However, these effects may still be relevant given the research goals of the project despite the slight increase in AIC, **and thus they were left in the final model.**
```{r}
# Model names
model_names <- c("1. LME Model", "2. LME Imputed Model", "3. LME Imputed Final Model")
# Combining into a dataframe
aic_review <- data.frame(
Model = model_names,
AIC = c(aic_lme, aic_lme_imputed, aic_lme_imputed_final)
)
aic_review$Model <- as.factor(aic_review$Model)
aic_review$AIC <- round(aic_review$AIC, 2)
# Check the structure
str(aic_review)
ggplot(aic_review, aes(x = Model, y = AIC, fill = Model)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme_minimal() +
labs(title = "Figure 21. AIC Values for Different Models",
x = "Model",
y = "AIC Value") +
geom_text(aes(label = AIC), vjust = -0.3, size = 3.5) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip()
```
The Shapiro-Wilk test was conducted on the residuals to formally test for normality.
$H_o$: the residuals are normally distributed.
$H_a$: the residuals are not normally distributed.
$\alpha$ = 0.05
In this case, P = 0.0529. P \>0.05, so we failed to reject the null hypothesis, suggesting that the **residuals were normally distributed** after threshold imputation. This final model also satisfies the assumptions of LMMs.
### **3.4.4 Predictions**
```{r}
# Compute residuals and residual-based performance metrics (MSE, MAE) for each model
set.seed(43)
lme_resids = residuals(model_lme)
lme_imputed_resids = residuals(model_lme_imputed)
lme_imputed_final_resids = residuals(model_lme_imputed_final)
lme_mse = mean(lme_resids^2)
lme_mae = mean(abs(lme_resids))
lme_imputed_mse = mean(lme_imputed_resids^2)
lme_imputed_mae = mean(abs(lme_imputed_resids))
lme_imputed_final_mse = mean(lme_imputed_final_resids^2)
lme_imputed_final_mae = mean(abs(lme_imputed_final_resids))
mse_review <- data.frame(
Model = model_names,
MSE = c(lme_mse, lme_imputed_mse, lme_imputed_final_mse)
)
mse_review$MSE <- round(mse_review$MSE, digits = 2)
mae_review <- data.frame(
Model = model_names,
MAE = c(lme_mae, lme_imputed_mae, lme_imputed_final_mae)
)
mae_review$MAE <- round(mae_review$MAE, digits = 2)
```
**MSE and MAE**
**Mean Squared Error (MSE)** and **Mean Absolute Error (MAE)** are metrics used to assess the performance of a model. MSE is the mean of the squared residuals, and MAE is the mean of the absolute values of the residuals. As shown in Figures 22 and 23, the imputed final model outperforms the other two models by a significant margin. It is important to note that MSE is affected more by larger errors or outliers because it squares the residuals.
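For $n$ residuals $e_i = y_i - \hat{y}_i$, the two metrics are defined as:
$$
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} e_i^{2}, \qquad \text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert e_i \rvert
$$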
```{r}
ggplot(mse_review, aes(x = Model, y = MSE, fill = Model)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme_minimal() +
labs(title = "Figure 22. MSE Values for Different Models",
x = "Model",
y = "MSE Value") +
geom_text(aes(label = MSE), vjust = -0.3, size = 3.5) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip()
```
```{r}
ggplot(mae_review, aes(x = Model, y = MAE, fill = Model)) +
geom_bar(stat = "identity", show.legend = FALSE) +
theme_minimal() +
labs(title = "Figure 23. MAE Values for Different Models",
x = "Model",
y = "MAE Value") +
geom_text(aes(label = MAE), vjust = -0.3, size = 3.5) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip()
```
**Sample Predictions vs Actual**
The bar graph below compares the [actual]{.underline} `R5Hz_PP` to the [predicted]{.underline} `R5Hz_PP` (a measure of airway resistance at 5 Hz) for **10 randomly sampled observations**. The difference between the bars for each observation is the **residual error.** The small residual error present for each observation suggests that the model is accurate at predicting `R5Hz_PP`.
```{r}
lme_imputed_final_predictions = predict(model_lme_imputed_final)
lme_imputed_final_preds_actuals = data.frame(cbind(lme_imputed_final_predictions, x_clean_imputed$R5Hz_PP))
colnames(lme_imputed_final_preds_actuals) <- c("Predicted_R5Hz_PP", "Actual_R5Hz_PP")
set.seed(42)
sample_indices <- sample(nrow(x_clean_imputed), 10)
sample_pred_actuals = lme_imputed_final_preds_actuals[sample_indices, ]
sample_pred_actuals$row <-1:10
sample_pred_actuals_melt <- melt(sample_pred_actuals, id.vars = "row")
ggplot(sample_pred_actuals_melt, aes(x = factor(row), y = value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Observation", y = "R5Hz_PP", fill = "") +
theme_minimal() +
theme(legend.position = "top") +
ggtitle("Figure 24. Sample Comparison of Predicted and Actual Values")
```
# **4 Conclusion**
Linear mixed models are versatile tools for modeling complex relationships with multiple effects (fixed and random), as well as missing and non-independent data. For the given capstone dataset, the final linear mixed model can reliably predict measures of airway resistance and reactance given demographic and co-morbidity data. This model can be reliably used for both children with sickle cell disease and those with asthma to provide insights into their respiratory function.
# **5 References**
::: {#refs}
:::