---
title: "Statistical Relationships"
subtitle: "Multiple linear regression"
author: "Andrew Rate"
date: "`r Sys.Date()`"
output: html_document
---
<style type="text/css">
body{
font-size: 12pt;
}
</style>
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r load-packages-etc, include=FALSE}
library(car)
library(flextable)
library(officer)
library(viridis)
library(png)
library(lmtest)
library(effects)
set_flextable_defaults(theme_fun = "theme_zebra",
font.size = 10, fonts_ignore = TRUE)
BorderDk <- officer::fp_border(color = "#B5C3DF", style = "solid", width = 1)
BorderLt <- officer::fp_border(color = "#FFFFFF", style = "solid", width = 1)
addImg <- function(obj, x = NULL, y = NULL, width = NULL, interpolate = TRUE){
if(is.null(x) | is.null(y) | is.null(width)){stop("Must provide args 'x', 'y', and 'width'")}
USR <- par()$usr ; PIN <- par()$pin ; DIM <- dim(obj) ; ARp <- DIM[1]/DIM[2]
WIDi <- width/(USR[2]-USR[1])*PIN[1] ; HEIi <- WIDi * ARp
HEIu <- HEIi/PIN[2]*(USR[4]-USR[3])
rasterImage(image = obj, xleft = x-(width/2), xright = x+(width/2),
ybottom = y-(HEIu/2), ytop = y+(HEIu/2), interpolate = interpolate)
}
```
<div style="border: 2px solid #039; padding: 8px;">
<p style="text-align:center; font-size:12pt;">
<em>Relationships pages</em>: [Correlations](correl.html){style="color:#04b;"} |
[Simple linear regression](regression.html){style="color:#04b;"} |
[Grouped linear regression](reg-group.html){style="color:#04b;"} |
[Multiple linear regression](reg-multi.html){style="color:#04b;"}</p>
</div>
## Introduction
Another way we can improve the predictive ability of a linear regression model,
compared with a [simple regression model](regression.html), is to allow prediction of our
dependent variable with more than one independent variable – the idea of
multiple predictors, or *multiple regression*. (We have already examined the use
of [grouped linear regression](reg-group.html) to improve prediction relative to
[simple linear regression](regression.html).)
## Developing a multiple linear regression model
<div style="border: 2px solid #039; background-color:#e8e8e8; padding: 8px;">
**NOTE: This section on multiple regression is optional** – it is more advanced material which we will not cover in class, but which may be useful for analysing data from the class project.
</div>
<p> </p>
Sometimes the unexplained variation in our dependent variable (*i.e*. the
residuals) may be explained, in part, by one or more additional
predictors in our data. If we suspect this to be true, we can add predictors to
build a *multiple* linear regression model.
**Multiple regression models** predict the value of one variable
(the *dependent variable*) from two or more *predictor variables*
(or just 'predictors'). They can be very useful in environmental
science, but there are several steps we need to take to make sure
that we have a valid model.
In this example we're going to develop a regression model to
**predict gadolinium (Gd) concentrations** in sediment from several predictors.
It makes sense to choose predictors that represent bulk sediment properties
that could *plausibly* control trace element concentrations. So,
we choose variables like **pH, EC, organic carbon, cation exchange capacity, and some major elements** as predictors
(<span style="color: #B04030;">**but NOT other trace elements**</span>;
different trace elements may be highly correlated (due to common sources &
sinks), but their concentrations are most likely too low to control the
concentration of anything else!)
Since we don't have organic carbon or cation exchange capacity
in this dataset, and there are many missing values for EC, our initial
predictors will be **Al, Ca, Fe, K, Mg, Na, pH, and S**.
Both the predictors and dependent variable need to be
**appropriately transformed** before we start! (Remember that linear regression assumptions are based on the residuals, but we are less likely to fulfil these assumptions if our variables are very skewed. A good option is often to
log~10~-transform variables if necessary.)
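As an example of what 'appropriately transformed' might look like in practice, here is a minimal sketch (not evaluated here, and using the `afs1923` data read in below) of checking a variable's distribution and keeping a log~10~-transformed copy if the raw variable is strongly skewed:
```{r transform-sketch, eval=FALSE}
# Sketch only (not evaluated): compare distributions before and after a
# log10 transformation, then keep the version that looks more symmetrical
hist(afs1923$Ca)                      # strongly right-skewed on the raw scale?
hist(log10(afs1923$Ca))               # closer to symmetrical after log10?
afs1923$Ca.log <- log10(afs1923$Ca)   # store the transformed variable
```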
Also, some of our initial predictors may be highly correlated
(co-linear) with each other. In multiple regression, we don't want to
include co-linear predictors, since then we'll have two (or more)
predictors which effectively contain the same information – see below.
For practical purposes, we prefer [linear] regression models which are:
1. good at prediction (*i.e*. having the greatest possible R^2^ value);
2. as simple as possible (*i.e*. not having predictors which are co-linear, and only accepting a more complex model if it significantly improves prediction, *e.g*. by `anova()`);
3. meet the assumptions of linear regression (although we have argued previously that even if some assumptions are not met, a regression model can still be used for prediction).
## Read input data
As previously, our first step is to read the data, and change anything we need to:
```{r read-data, message=FALSE, warning=FALSE}
git <- "https://github.com/Ratey-AtUWA/Learn-R-web/raw/main/"
afs1923 <- read.csv(paste0(git,"afs1923.csv"), stringsAsFactors = TRUE)
# re-order the factor levels
afs1923$Type <- factor(afs1923$Type, levels=c("Drain_Sed","Lake_Sed","Saltmarsh","Other"))
afs1923$Type2 <-
factor(afs1923$Type2,
levels=c("Drain_Sed","Lake_Sed","Saltmarsh W","Saltmarsh E","Other"))
```
## Assess collinearity between initial set of predictors
First we inspect the correlation matrix. It's useful to include the
dependent variable as well, just to see which predictors are
most closely correlated.
(We try to generate a 'tidier' table by rounding numbers to 3 decimal
places, and setting the values of 1 on the diagonal to `NA`. We don't need to use
`corTest()` from the `psych` package, since we're not so interested in P-values
for this purpose.)
<p id="cormat">Note that all variables are **appropriately transformed**!<br>
*You will need to do this yourself...*
```{r invisibly transform variables, echo=FALSE, results='hide'}
afs1923$Ca.log <- log10(afs1923$Ca)
afs1923$Mg.pow <- afs1923$Mg^0.5
afs1923$Na.pow <- afs1923$Na^0.333
afs1923$S.log <- log10(afs1923$S)
afs1923$Gd.pow <- afs1923$Gd^0.5
```
```{r correlation matrix}
cor0 <-
cor(afs1923[,c("Al","Ca.log","Fe","K","Mg.pow","Na.pow","pH","S.log","Gd.pow")],
use="pairwise.complete")
cor0[which(cor0==1)] <- NA # change diagonal to NA
print(round(cor0,3), na.print="") # round to 3 decimal places and print nothing for NAs
rm(cor0) # tidy up
```
<div style="border: 2px solid #039; background-color:#e8e8e8; padding: 8px;">
The rule of thumb we use is that:
> If predictor variables are correlated with Pearson's r ≥ 0.8 or r ≤ -0.8, then the collinearity is too large and one of the correlated predictors should be omitted
</div>
<p> </p>
In the correlation table above this applies only to the correlation between
[transformed] Na and Mg, with **r=0.94**. In this example we will run two
versions of the model, one keeping both Na and Mg, and one omitting Mg.
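If the set of predictors is large, it can help to flag the collinear pairs programmatically. A minimal sketch (not evaluated here; it re-computes the correlation matrix since `cor0` was removed above):
```{r find-collinear-pairs, eval=FALSE}
# Sketch only (not evaluated): flag predictor pairs with |r| >= 0.8
cor0 <- cor(afs1923[,c("Al","Ca.log","Fe","K","Mg.pow","Na.pow","pH","S.log")],
            use = "pairwise.complete")
diag(cor0) <- NA                          # ignore the values of 1 on the diagonal
which(abs(cor0) >= 0.8, arr.ind = TRUE)   # row/column positions of collinear pairs
```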
<div style="border: 2px solid #039; background-color:#fec; padding: 8px;">
We have just added a fifth assumption for linear regression models, specific
to multiple regression:
5. Independent variables (predictors) should not be strongly correlated (collinear) with one another
</div>
<p> </p>
In either case, whether we run the model with or without omitting predictors,
it's a good idea to calculate *Variance Inflation Factors* on the predictor
variables in the model (see below) which can tell us if collinearity is a
problem.
## Generate multiple regression model for Gd (co-linear predictors NOT omitted)
We first delete any observations (rows) with missing (`NA`) values, otherwise
when we change the number of predictors later, we may not have the same number
of observations for all models, in which case we can't compare them.
```{r lm all variables, results='hold'}
# make new data object containing relevant variables with no missing values
afs1923_multreg <- na.omit(afs1923[c("Gd.pow","Al","Ca.log","Fe","K","Mg.pow",
"Na.pow","pH","S.log")])
row.names(afs1923_multreg) <- NULL # reset row indices
# run model using correctly transformed variables
lm_multi <- lm(Gd.pow ~ pH + Al + Ca.log + Fe + K + Mg.pow +
Na.pow + S.log, data=afs1923_multreg)
summary(lm_multi)
```
Note that the null hypothesis probability `Pr(>|t|)` for some predictors (`pH`,
`Ca.log` and `S.log`) is ≥ 0.05, so we can't reject the null hypothesis
– that these predictors have no effect on the dependent variable.
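If you just want the p-values without the rest of the summary output, they can be extracted from the summary object – a minimal sketch (not evaluated here):
```{r pvalue-sketch, eval=FALSE}
# Sketch only (not evaluated): extract the Pr(>|t|) column from the coefficient
# table, to see at a glance which predictors are not individually significant
summary(lm_multi)$coefficients[, "Pr(>|t|)"]
```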
## Calculate variance inflation factors (VIF) for the predictors in the 'maximal' model
To calculate variance inflation factors we use the function `vif()` from the
`car` package. The input for `vif()` is a `lm` object (in this case `lm_multi`).
```{r VIFs for maximal model, results='hold'}
require(car)
{cat("Variance Inflation Factors\n")
vif(lm_multi)}
```
The VIF can be thought of as the "penalty" for including two or more predictors
which (since they contain the same information) don't decrease the unexplained
variance of the model, but do add unnecessary complexity. A general rule of thumb is that if
**VIF > 4** we need to do some further investigation, while serious
multi-collinearity exists **requiring correction if VIF > 10** (Hebbali, 2018).
As we probably expected from the correlation coefficient (above), the VIFs for both
Na and Mg are > 10 in this model. This is too high, so we need to try a model
which omits one of Na or Mg (we'll choose to omit Mg).
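A quick way to see which predictors breach these rules of thumb is to subset the named vector returned by `vif()` – a minimal sketch (not evaluated here):
```{r vif-cutoff-sketch, eval=FALSE}
# Sketch only (not evaluated): subset the named vector of VIFs by the
# rule-of-thumb cutoffs described above
vifs <- car::vif(lm_multi)
vifs[vifs > 4]    # predictors needing further investigation
vifs[vifs > 10]   # predictors showing serious multi-collinearity
```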
## Generate multiple regression model for Gd, omitting co-linear predictors
```{r multi lm omitting collinear, results='hold'}
# make new data object containing relevant variables with no missing values
# run model using correctly transformed variables (omitting co-linear predictors)
lm_multi2 <- lm(Gd.pow ~ pH + Al + Ca.log + Fe + K +
Na.pow + S.log, data=afs1923_multreg)
summary(lm_multi2)
```
Note that again the null hypothesis probability `Pr(>|t|)` for some predictors
(`pH`, `Ca.log` and `S.log`) is ≥ 0.05, so we can't reject the null
hypothesis – that these predictors have no effect on the dependent
variable.
## Calculate variance inflation factors (VIF) for the model omitting co-linear predictors
```{r VIFs for maximal model omitting co-linear predictors, results='hold'}
require(car)
{cat("Variance Inflation Factors\n")
vif(lm_multi2)}
```
**With the co-linear variable(s) omitted (on the basis of |Pearson's r| > 0.8 and `vif` > 10), we now have no VIFs > 10**. We do have one
`vif` > 4 for `K`, but we'll retain this predictor as it may represent
illite-type clay minerals which can be reactive to trace elements like Gd in sediment. We now move on to stepwise refinement of our [new] 'maximal' model...<br>
## Stepwise refinement of maximal multiple regression model (omitting co-linear predictors)
We don't want to have too many predictors in our model – just the
predictors which explain significant proportions of the variance in our
dependent variable. The simplest possible model is best! In addition, our data
may be insufficient to generate a very complex model; one rule-of-thumb suggests
10-20 observations are needed to calculate coefficients for each predictor.
Reimann *et al*. (2008) recommend that the number of observations should be at
least 5 times the number of predictors. So, we use a systematic stepwise
procedure to test variations of the model, which omits unnecessary predictors.
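Before refining the model, we can check our data against the Reimann *et al*. (2008) rule of thumb – a minimal sketch (not evaluated here):
```{r obs-per-predictor-sketch, eval=FALSE}
# Sketch only (not evaluated): ratio of observations to predictors, which
# should be at least 5 by the Reimann et al. (2008) rule of thumb
n_obs   <- nrow(afs1923_multreg)
n_preds <- length(attr(terms(lm_multi2), "term.labels"))
n_obs / n_preds
```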
```{r stepwise refinement of multiple regression model}
lm_stepwise <- step(lm_multi2, direction="both", trace=0)
summary(lm_stepwise)
require(car)
{cat("==== Variance Inflation Factors ====\n")
vif(lm_stepwise)}
```
In the optimised model, we find that the stepwise procedure has generated a new
model with fewer predictor variables. You should notice that the p-values
(`Pr(>|t|)`) for intercept and predictors are all now ≤ 0.05, so we can
reject the null hypothesis for all predictors (*i.e*. none of them have 'no
effect' on Gd). Our VIFs are now all close to 1, meaning negligible collinearity
between predictors.
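To see exactly which predictors the stepwise procedure removed, we can compare the term labels of the two models – a minimal sketch (not evaluated here):
```{r stepwise-compare-sketch, eval=FALSE}
# Sketch only (not evaluated): which predictors were dropped by step()?
setdiff(attr(terms(lm_multi2), "term.labels"),
        attr(terms(lm_stepwise), "term.labels"))
```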
From the correlation matrix we made near the beginning, we see that Gd was most
strongly correlated with Al (r = 0.787).
In the [interests of parsimony](reg-group.html#Occam), it would be sensible to
check whether our multiple regression model is actually better than a simple
model just predicting Gd from Al:
```{r Gd-Al-simple-lm, warning=FALSE, message=FALSE, results='hold'}
lmGdAl <- lm(Gd.pow ~ Al, data = afs1923_multreg)
summary(lmGdAl)
```
Our simple model has R^2^ ≃ 63%, which is quite close to the
≃ 68% from the multiple linear regression, so we should check if
multiple regression really does give an improvement. Since the simple and
multiple models are nested we can use both `AIC()` and `anova()`:
```{r aic-simple-multi, message=FALSE, warning=FALSE, paged.print=FALSE, results='hold'}
AIC(lmGdAl, lm_stepwise)
cat("\n-------------------------------------\n")
anova(lmGdAl, lm_stepwise)
```
In this example, the lower AIC and anova p-value ≤ 0.05 support the idea that
the multiple regression model (`lm_stepwise`) describes our data better.
It's always a good idea to run diagnostic plots (see Figure 1 below) on a
regression model (simple or multiple), to check for (i) any systematic trends in
residuals, (ii) normally distributed residuals (or not), and (iii) any unusually
influential observations.
## Regression diagnostic plots
```{r diagnostic-plots, fig.height=6, fig.width=6, fig.cap="Figure 1: Diagnostic plots for the optimal multiple regression model following backward-forward stepwise refinement."}
par(mfrow=c(2,2), mar=c(3.5,3.5,1.5,1.5), mgp=c(1.6,0.5,0), font.lab=2, font.main=3,
cex.main=0.8, tcl=-0.2)
plot(lm_stepwise, col=4)
par(mfrow=c(1,1))
```
The optimal model is `Gd.pow ~ pH + Al + Fe + K`, where the suffix `.pow`
indicates a power-transformed variable (here Gd^0.5^). The point labelled `78`
does look problematic...
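One way to judge whether observation `78` really matters is to re-fit the stepwise model without it and compare the coefficients – a minimal sketch (not evaluated here):
```{r influence-78-sketch, eval=FALSE}
# Sketch only (not evaluated): re-fit without row 78 and compare coefficients;
# large changes would suggest that observation is overly influential
lm_no78 <- update(lm_stepwise, data = afs1923_multreg[-78, ])
cbind(full_model = coef(lm_stepwise), without_78 = coef(lm_no78))
```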
```{r testing regression assumptions, paged.print=FALSE, results="hold"}
require(lmtest)
require(car)
cat("------- Residual autocorrelation (independence assumption):")
bgtest(lm_stepwise) # Breusch-Godfrey test for autocorrelation (independence)
cat("\n------- Test of homoscedasticity assumption:")
bptest(lm_stepwise) # Breusch-Pagan test for homoscedasticity
cat("\n------- Test of linearity assumption:")
raintest(lm_stepwise) # Rainbow test for linearity
cat("\n------- Bonferroni Outlier test for influential observations:\n\n")
outlierTest(lm_stepwise) # Bonferroni outlier test for influential observations
cat("\n------- Test distribution of residuals:\n")
shapiro.test(lm_stepwise$residuals) # Shapiro-Wilk test for normality
cat("\n------- Cook's Distance for any observations which are possibly too influential:\n")
summary(cooks.distance(lm_stepwise)) # Cook's Distance
cat("\n------- Influence Measures for any observations which are possibly too influential:\n")
summary(influence.measures(lm_stepwise))
```
## Multiple regression effect plots
```{r effect-plots, fig.height=6, fig.width=6, fig.cap="Figure 2: Effect plots for individual predictors in the optimal multiple regression model following backward-forward stepwise refinement. Light blue shaded areas on plots represent 95% confidence limits."}
require(effects)
plot(allEffects(lm_stepwise, confidence.level=0.95))
```
## Scatterplot of observed *vs*. fitted values
An 'observed *vs*. fitted' plot (Figure 3) is a way
around trying to plot a function with multiple predictors (*i.e*. multiple
dimensions)! We can get the fitted values since these are stored in the `lm`
object, in our case `lm_stepwise` in an item called `lm_stepwise$fitted.values`.
We also make use of other information stored in the `lm` object, by calling
`summary(lm_stepwise)$adj.r.squared`. Finally we use the `predict()` function
with synthetic data in a dataframe `newPreds` to generate confidence and prediction intervals for our plot.
```{r obs-vs-fitted, fig.height=5, fig.width=5, fig.cap="Figure 3: Measured (observed) vs. predicted values in the optimal multiple regression model, showing uncertainty intervals."}
newPreds <- data.frame(Al=seq(0, max(afs1923_multreg$Al), l=100),
Fe=seq(0, max(afs1923_multreg$Fe), l=100),
K=seq(0, max(afs1923_multreg$K), l=100),
pH=seq(min(afs1923_multreg$pH)*0.9, max(afs1923_multreg$pH), l=100))
conf1 <- predict(lm_stepwise, newPreds, interval = "conf")
conf1 <- conf1[order(conf1[,1]),]
pred1 <- predict(lm_stepwise, newPreds, interval="prediction")
pred1 <- pred1[order(pred1[,1]),]
par(mar=c(4,4,1,1), mgp=c(2,0.5,0), font.lab=2, cex.lab=1,
lend="square", ljoin="mitre")
plot(afs1923_multreg$Gd.pow ~ lm_stepwise$fitted.values,
xlab="Gd.pow predicted from regression model",
ylab="Gd.pow measured values", type="n")
mtext(side=3, line=-5.5, adj=0.05, col="blue3",
text=paste("Adjusted Rsq =",signif(summary(lm_stepwise)$adj.r.squared,3)))
# lines(conf1[,1], conf1[,2], lty=2, col="red")
# lines(conf1[,1], conf1[,3], lty=2, col="red")
polygon(c(pred1[,1],rev(pred1[,1])), c(pred1[,2],rev(pred1[,3])),
col="#2000e040", border = "transparent")
polygon(c(conf1[,1],rev(conf1[,1])), c(conf1[,2],rev(conf1[,3])),
col="#ff000040", border = "transparent")
abline(0,1, col="gold4", lty=2, lwd=2)
points(afs1923_multreg$Gd.pow ~ lm_stepwise$fitted.values,
pch=3, lwd=2, cex=0.8, col="blue3")
legend("topleft", legend=c("Observations","1:1 line"), col=c("blue3","gold4"),
text.col=c("blue3","gold4"), pch=c(3,NA), lty=c(NA,2), pt.lwd=2, lwd=2,
box.col="grey", box.lwd=2, inset=0.02, seg.len=2.7, y.intersp=1.2)
legend("bottomright", bty="n", title="Intervals", pch=15, pt.cex=2,
legend=c("95% confidence", "95% prediction"), col=c("#ff000040", "#2000e040"))
```
## Some brief interpretation
- The adjusted R-squared value of the final model is 0.682, meaning that 68.2%, about two-thirds, of the variance in Gd is explained by variance in the model's predictors. (The remaining 31.8% of variance must therefore be due to random variations, or 'unknown' variables not included in our model.)
- From the model coefficients and the effect plots we can see that Gd increases as Al, Fe, and K increase, but Gd decreases as pH increases. This agrees with the relationships determined by correlation ([see above](reg-multi.html#cormat)); Gd **is** positively correlated with Al, Fe, and K, and negatively correlated with pH.<br />
(Note that this is not always the case - sometimes a predictor can be significant in a multiple regression model, but not individually correlated with the dependent variable!)
- Although we can't attribute a causal relationship to correlation or regression relationships, the observed effects in our model **are** consistent with real phenomena. For example, gadolinium and other rare earth elements are positively related to iron (Fe) in other estuarine sediments; see Morgan *et al*. (2012). We also know that rare-earth elements in estuarine sediments are often associated with clays (Marmolejo-Rodríguez *et al*., 2007; clays are measured by Al in our data, since clays are aluminosilicate minerals).
# References
Cohen, J. (1988). *Statistical Power Analysis for the Behavioral Sciences*, Second Edition. Erlbaum Associates, Hillsdale, NJ, USA.
Hebbali, A. (2018). Collinearity Diagnostics, Model Fit and Variable Contribution. Vignette for R Package 'olsrr'. Retrieved 2018.04.05, from [https://cran.r-project.org/web/packages/olsrr/vignettes/regression_diagnostics.html](https://cran.r-project.org/web/packages/olsrr/vignettes/regression_diagnostics.html){target="_blank"}.
Marmolejo-Rodríguez, A. J., Prego, R., Meyer-Willerer, A., Shumilin, E., & Sapozhnikov, D. (2007). Rare earth elements in iron oxy-hydroxide rich sediments from the Marabasco River-Estuary System (Pacific coast of Mexico). REE affinity with iron and aluminium. *Journal of Geochemical Exploration*, **94**(1-3), 43-51. [https://doi.org/10.1016/j.gexplo.2007.05.003](https://doi.org/10.1016/j.gexplo.2007.05.003){target="_blank"}
Morgan, B., Rate, A. W., Burton, E. D., & Smirk, M. (2012). Enrichment and fractionation of rare earth elements in FeS-rich eutrophic estuarine sediments receiving acid sulfate soil drainage. *Chemical Geology*, **308-309**, 60-73. [https://doi.org/10.1016/j.chemgeo.2012.03.012](https://doi.org/10.1016/j.chemgeo.2012.03.012){target="_blank"}
Reimann, C., Filzmoser, P., Garrett, R. G., & Dutter, R. (2008). *Statistical Data Analysis Explained: Applied Environmental Statistics with R*. John Wiley & Sons, Chichester, England (see Chapter 16).