-
Notifications
You must be signed in to change notification settings - Fork 15
/
Copy pathregression.qmd
1174 lines (830 loc) · 73.7 KB
/
regression.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Regression analysis (OLS method)"
---
*Last modified on `r Sys.Date()`*
```{r, echo = FALSE, message = FALSE}
source("libs/Common.R")
```
```{r Load fonts, echo=FALSE, eval=TRUE, warning=FALSE, message=FALSE}
library(extrafont)
font_import(prompt=FALSE, pattern="[A/a]rchitects")
loadfonts(device="postscript")
# loadfonts(device="win") # TO view in windows
std = function(x,y) {
(x - mean(x,na.rm=T)) / sd(x,na.rm=T)
}
```
Packages used in this tutorial:
```{r packages, message=FALSE, results='hide'}
library(car)
library(boot)
library(scatterplot3d)
```
# The simple model
The objective of statistical modeling is to come up with the most parsimonious model that does a good job in predicting some variable. The simplest mode is the sample mean. We can express this model as:
$$
Y = \beta_0 + \varepsilon
$$
where $Y$ is the variable we are trying to predict, $\beta_0$ is the mean, and $\varepsilon$ is the difference (or error) between the model, $\beta_0$, and the actual value $Y$. How well a model fits the actual data is assessed by measuring the error or deviance between the actual data and the predicted data.
In the following working example, we will look at the median per capita income for the State of Maine aggregated by state (the data are provided from the US Census ACS 2007 - 2011 dataset). The dollar amounts are adjusted to 2010 inflation dollars.
```{r tidy=FALSE}
y <- c(23663, 20659, 32277, 21595, 27227,
25023, 26504, 28741, 21735, 23366,
20871, 28370, 21105, 22706, 19527,
28321)
```
Next, we will define the simple model (the mean of all 16 values) for this batch of data:
$\hat Y = \beta_0 = \bar Y =$ `r sprintf("%.0f",mean(y))`
The "hat" on top of $\hat Y$ indicates that this is a value _predicted_ by the model and not the actual value $Y$.
We can plot the true values $Y$ (black dots) along with the values predicted from our simple model $\hat Y$ (blue dots and lines).
```{r tidy=FALSE, fig.width=6,fig.height=4}
plot( y , pch = 16, ylab=NA, mgp=c(4,1,.5) , las=1, bty="n", xaxs="i", xpd = TRUE)
abline(h = mean(y), col="blue")
points(rep(mean(y), length(y)), col="blue", type="b", pch = 21, bg="white", xpd = TRUE)
mtext("Income($)", side=3, adj= -0.1)
```
The function `abline` adds a blue horizontal line whose value is $\bar Y$. Blue dots are added using the `points()` function to highlight the associated predicted values, $\hat Y$, for each value $Y$. The x-axis serves only to _spread_ out the $Y$ values across its range. The `las = 1` option in the `plot` function sets all axes labels to display horizontally (a value of `0` would set the orientation of all labels parallel to the axis which is the default setting).
One way to measure the error between the predicted values $\hat Y$ and $Y$ is to compute the difference (error $\varepsilon$) between both values.
```{r echo=FALSE, results='asis'}
library(knitr)
y.df <- data.frame(Y = y, Predicted = mean(y), E = y - mean(y))
colnames(y.df) <- c("$Y$", "$\\hat Y$", "$\\varepsilon = Y - \\hat Y$" )
kable(y.df, digits=0, align="c", format="markdown")
```
The errors (or differences between $Y$ and $\hat Y$) are displayed as red dashed lines in the following figure:
```{r tidy=FALSE, echo=FALSE, fig.width=6,fig.height=4}
OP <- par("mar"=c(2,5,1,1))
sd.x = sd(seq(1:length(y)),na.rm=T); sd.y = sd(y,na.rm=T)
asp2 = sd.x/sd.y * .5
plot ( y , pch = 16, axes=FALSE,ylab=NA,asp = asp2)
loc = par("usr")
axis(side = 1, tck = -.01, at=seq(1,length(y)+1,1))
axis(side = 2, las = 1, tck = -.01, at=pretty(range(y)))
text(loc[1], loc[4], "Income($)", pos = 2, xpd = T)
abline(h=mean(y),col="blue")
points(rep(mean(y), length(y)), col="blue", type="b", pch = 21, bg="white")
M = lm(y ~ rep(mean(y), length(y)))
segments(seq(1,length(y)),M$fitted.values,seq(1,length(y)), y, col="red", lt=3)
par(OP)
```
To get the magnitude of the total error, we could sum these errors, however, doing so would return a sum of errors of 0 (i.e. the positive errors and negative errors cancel each other out). One solution to circumvent this problem is to take the **absolute** values of the errors then sum these values (this is also known as the **least absolute error**, **LAE**):
$$
LAE = \sum \mid Y - \hat Y \mid
$$
However, by convention, we quantify the difference between $Y$ and $\hat Y$ using the **sum of the squared errors** (**SSE**) instead, or
$$
SSE = \sum (Y - \hat Y)^2
$$
The following table shows the difference between LAE and SSE values
```{r echo=FALSE, results='asis'}
y.df <- data.frame(y, mean(y), abs(y - mean(y)), (y-mean(y))^2)
colnames(y.df) <- c("$Y$", "$\\hat Y$", "$\\mid Y - \\hat Y \\mid$", "$(Y - \\hat Y)^2$")
kable(y.df, digits=0, align="c", format = 'markdown')
```
```{r internal1, echo=FALSE}
LAE.mean = sum(abs(y - mean(y)))
SE.mean = (y-mean(y))^2
SSE.mean = sum((y-mean(y))^2)
```
$LAE_{mean}$ = `r sprintf("%.0f", LAE.mean)` and $SSE_{mean}$ = `r sprintf("%.0f", SSE.mean)`.
Squaring the error has the effect of **amplifying** large errors in the dataset. The following figure highlights the difference by drawing the _square_ of the differences as red shaded squares.
```{r tidy=FALSE, echo=FALSE, fig.width=6,fig.height=4}
OP <- par("mar"=c(2,5,1,1))
sd.x = sd(seq(1:length(y)),na.rm=T); sd.y = sd(y,na.rm=T)
asp2 = sd.x/sd.y * .5
plot( y ~ seq(1:length(y)) , pch = 16, asp = asp2,xlab="Index", axes=FALSE,ylab=NA)
#plot ( y , pch = 16, col="grey", axes=FALSE,ylab=NA)
loc = par("usr")
axis(side = 1, tck = -.01, at=seq(1,length(y)+1,1))
axis(side = 2, las = 1, tck = -.01, at = pretty(range(y)) )
text(loc[1], loc[4], "Income($)", pos = 2, xpd = T)
abline(h = mean(y), col="blue")
points(rep(mean(y), length(y)), col="blue", type="b", pch = 21, bg="white")
segments(seq(1,length(y)),M$fitted.values,seq(1,length(y)), y, col="red", lt=3)
rect(seq(1,length(y)) - abs(M$res)/2*asp2, M$fitted, seq(1,length(y)) + abs(M$res)/2*asp2, M$fitted + M$res, col=rgb(1,0,0,0.3) ,border=NA)
par(OP)
```
It is more natural to visualize a squared amount using a geometric primitive whose area changes proportionately to the squared error changes. Note how the larger error at index 3 has an error line length about 3 times that associated with index 5 (these correspond to the records number three and five in the above table). Yet, the area for index 3 appears almost _9 times_ bigger than that associated with index 5 (as it should since we are _squaring_ the difference). Note too that the unit used to measure the area of each square (the squared error) is the one associated with the _Y_-axis (recall that the error is associated with the prediction of $Y$ and not of $X$).
This simple model (i.e. one where we are using the mean value to predict $Y$ for each county) obviously has its limits. A more common approach in predicting $Y$ is to use a _predictor_ variable (i.e. a covariate).
# The bivariate regression model
The bivariate model augments the simple (mean) model by adding a variable, $X$, that is believed to _co-vary_ with $Y$. In other words, we are using another variable (aka, an **independent** or **predictor** variable) to estimate $Y$. This augmented model takes on the form:
$$
Y = \beta_0 + \beta_1X + \varepsilon
$$
This is the equation of a slope where $\beta_0$ tells us where on the y-axis the slope intersects the axis and $\beta_1$ is the slope of the line (a positive $\beta_1$ value indicates that the slope is upward, a negative $\beta_1$ value indicates that the slope is downward).
Continuing with our example, we will look at the percentage of Mainers having attained a Bachelor's degree or greater by county (the data are provided by the US Census ACS 2007 - 2011 dataset).
The following block of code defines the new variable $X$ (i.e. fraction of Mainers having attained a bachelor's degree or greater within each county) and visualizes the relationship between income and education attainment using a scatter plot. In the process, a `dataframe` called `dat` is created that stores both $X$ and $Y$.
```{r tidy=FALSE, fig.width=6,fig.height=4}
# X represents the fraction of residents in each county having attained a
# bachelor's degree or better
x <- c(0.19, 0.16, 0.40, 0.24, 0.31, 0.24, 0.28,
0.31, 0.18, 0.23, 0.17, 0.31, 0.15, 0.25,
0.19, 0.28)
# Let's combine Income and Education into a single dataframe
dat <- data.frame(Income = y, Education = x)
# We will add county names to each row
row.names(dat) <- c("Androscoggin", "Aroostook", "Cumberland", "Franklin", "Hancock",
"Kennebec", "Knox", "Lincoln", "Oxford", "Penobscot", "Piscataquis",
"Sagadahoc", "Somerset", "Waldo", "Washington", "York")
```
The following displays the contents of our new dataframe `dat`:
```{r echo=FALSE, results='asis'}
kable(dat, format="markdown")
```
Now let's plot the data
```{r tidy=FALSE, fig.width=6,fig.height=4}
plot( Income ~ Education , dat, pch = 16, ylab=NA, mgp=c(4,1,.5) , las=1, bty="n", xaxs="i",
xpd = TRUE, xlab="Fraction having attained a Bachelor's degree")
mtext("Income($)", side=3, adj= -0.1)
```
It's clear from the figure that there appears to be a positive relationship between income and education attainment. The next step is to _capture_ the nature of this relationship from the perspective of a linear equation. We seek to fit a slope that will minimize the overall distance between the line and the true value, hence we seek $\beta_0$ and $\beta_1$ that minimizes the error. The estimate for $\beta_1$ is sometimes represented as the letter $b_1$ or $\hat \beta_1$. We'll stick with the latter notation throughout this exercise. $\hat \beta_1$ is estimated using the linear **least squares** technique:
$$
\hat \beta_1 = \frac{\sum (X_i - \bar X)(Y_i-\bar Y)}{\sum (X_i-\bar X)^2}
$$
The breakdown of the terms in this equation are listed in the following table:
```{r echo=FALSE, results='asis'}
y.df <- data.frame(x,y, (x - mean(x))*(y - mean(y)) , (x - mean(x))^2)
colnames(y.df) <- c("$X$","$Y$","$(X - \\bar X)(Y - \\bar Y)$", "$(X - \\bar X)^2$" )
y.dif2 <- rbind(y.df,SUM =c(NA,NA,colSums(y.df)[3:4]) )
kable(y.dif2, digits=3 , format="markdown" )
```
The bottom of the last two columns show the sums of the numerator and denominator (`r colSums(y.df)[3]` and `r colSums(y.df)[4]` respectively). $\hat \beta_1$, the estimated slope, is therefore `r round(colSums(y.df)[3],2)`/`r round(colSums(y.df)[4],3)` or `r sprintf("%.0f",colSums(y.df)[3] / colSums(y.df)[4])`.
Knowing $\hat \beta_1$ we can solve for the intercept $b_0$:
$$
\hat \beta_0 = \hat Y - \hat \beta_1 X
$$
It just so happens that the slope passes through $\bar Y$ and $\bar X$. So we can substitute $X$ and $\hat Y$ in the equation with their respective means: `r round(mean(x),2)` and `r sprintf("%0.0f",mean(y))`. This gives us $\hat \beta_0$ = `r sprintf("%0.0f",mean(y) - colSums(y.df)[3] / colSums(y.df)[4] * mean(x) )`. Our final linear equation looks like this:
$\hat Y$ = `r sprintf("%0.1f",lm(y~x)$coef[1])` + `r sprintf("%0.1f",lm(y~x)$coef[2])` $X$ + $\varepsilon$
It's no surprise that `R` has a built in function, `lm()`, that will estimate these regression coefficients for us. The function can be called as follows:
```{r tidy=FALSE}
M <- lm( y ~ x, dat)
```
The output of the linear model is saved to the variable `M`. The formula `y ~ x` tells the function to regress the variable `y` against the variable `x`. You must also supply the name of the table or dataframe that stores these variables (i.e. the dataframe `dat`).
The regression coefficients $\hat \beta_0$ and $\hat \beta_1$ computed by `lm` can be extracted from output `M` as follows:
```{r tidy =FALSE}
M$coefficients
```
The regression coefficients are identical to those computed manually from the table.
We can also use the model output to add the regression line to our data. The function `abline()` reads the contents of the model output `M` and is smart enough to extract the slope equation for plotting:
```{r tidy=FALSE, fig.width=6,fig.height=4}
plot( y ~ x , dat, pch = 16, ylab=NA, mgp=c(4,1,.5) , las=1, bty="n", xaxs="i", xpd = TRUE,
xlab="Fraction having attained a Bachelor's degree")
mtext("Income($)", side=3, adj= -0.1)
abline(M, col="blue")
```
Next, we will look at the errors (or residuals) between the estimated values $\hat Y$ from our model and the true values $Y$ from our observations. We will represent these differences as squared errors:
```{r tidy=FALSE, echo=FALSE, fig.width=6, fig.height=4}
OP <- par("mar"=c(6,5,2,1))
plot( y ~ x , dat, pch = 16, ylab=NA, mgp=c(4,1,.5) , las=1, bty="n", xaxs="i", xpd = TRUE,
xlab="Fraction having attained a Bachelor's degree")
mtext("Income($)", side=3, adj= -0.1)
abline(M, col="blue")
#points(rep(mean(y), length(y)), col="blue", type="b", pch = 21, bg="white")
asp <- ( par("pin")[2]/diff(par("usr")[3:4]) ) / ( par("pin")[1]/diff(par("usr")[1:2]) )
segments(x,M$fitted.values,x, y, col="red", lt=3)
rect(x - abs(M$res)/2*asp, M$fitted, x + abs(M$res)/2*asp, M$fitted + M$res, col=rgb(1,0,0,0.3) ,border=NA)
par(OP)
```
Notice how the sizes of the red boxes are smaller then the ones for the simple model (i.e the model defined by the mean $\bar Y$). In fact, we can sum these squared differences to get our new sum of squared error (SSE) as follows:
```{r tidy=FALSE}
SSE <- sum( summary(M)$res^2 )
```
This returns an $SSE$ value of `r sprintf("%.0f", SSE)`.
This is less than the sum of squared errors for the _mean_ model $\bar Y$, $SSE_{mean}$, whose value was `r sprintf("%.2f", SSE.mean)`. $SSE$ is sometimes referred to as the **residual error**. The difference between the residual error $SSE$ and $SSE_{mean}$ is the error _explained_ by our new (bivariate) model or the **sum of squares reduced**, $SSR$:
$$
SSR = SSE_{mean} - SSE
$$
In `R`, this can easily be computed as follows:
```{r tidy=FALSE}
SSR <- SSE.mean - SSE
```
giving us an $SSR$ value of `r sprintf("%.0f", SSR)`.
Graphically, the square root of each squared residual making up $SSR$ represents the distance between the _mean_ model (which is a horizontal line) and the _bivariate_ model (which usually has a none zero slope) for every point. This is _not_ the distance between a point and one of the lines.
```{r tidy=FALSE, echo=FALSE, fig.width=6, fig.height=4, warning=FALSE}
OP <- par("mar"=c(6,5,2,1))
SE = summary(M)$res
SE.mean = (y - mean(y))
SR = SE.mean - SE
plot( y ~ x , dat, pch = 16, ylab=NA, mgp=c(4,1,.5) , las=1, bty="n", xaxs="i", xpd = TRUE, col="grey90",
xlab="Fraction having attained a Bachelor's degree")
mtext("Income($)", side=3, adj= -0.1)
abline(M, col="blue")
asp <- ( par("pin")[2]/diff(par("usr")[3:4]) ) / ( par("pin")[1]/diff(par("usr")[1:2]) )
#segments(x,M$fitted.values,x, y, col="red", lt=3)
abline(h = mean(y), lty=1, col="blue")
segments(x,mean(y),x, mean(y) + SR, col="red", lt=3, lw=2)
text(min(x), mean(y),"mean of Y", adj=c(0,-.5), col="blue", family= "Architects Daughter", cex=1.0)
text(max(x), predict(M, data.frame(x = max(x))),"bivariate model", adj=c(1.2,-.2), col="blue",
family= "Architects Daughter", srt=180/pi*atan(M$coef[2]*asp), cex=1.0 )
par(OP)
```
The linear models are shown in blue lines, the square root of each residual element making up $SSR$ (the difference between the two models) is shown in dashed red lines. Had we shown each _squared_ residual, red boxes would have been drawn as in earlier figures (the boxes were not drawn here to prevent clutter). The take away point here is that $SSR$ is the error that is explained by the new bivariate model.
The following table summarizes some of these key error terms:
Error Notation
----------------- ----------------- -----------------------
Mean model Total error $SSE_{mean}$ or $SST$
Bivariate model Residual error $SSE$
Bivariate - Mean Explained error $SSR$
## Testing if the bivariate model is an improvement over the mean model
### The coefficient of determination, $R^2$
In this working example, the percent error explained by the bivariate model is,
$$
\frac{SSE_{mean} - SSE}{SSE_{mean}} \times 100 = \frac{SSR}{SSE_{mean}} \times 100
$$
or (`r sprintf("%.0f", SSE.mean)` - `r sprintf("%.0f", SSE)`) / `r sprintf("%.0f", SSE.mean)` * 100 = **`r sprintf("%0.1f%%",( SSE.mean - SSE) / SSE.mean * 100)`**. This is a substantial reduction in error.
The ratio between $SSR$ and $SSE_{mean}$ is also referred to as the **proportional error reduction score** (also referred to as the **coefficient of determination**), or $R^2$, hence:
$$
R^2 = \frac{SSR}{SSE_{mean}}
$$
$R^2$ is another value computed by the linear model `lm()` that can be extracted from the output `M` as follows:
```{r tidy=FALSE}
summary(M)$r.squared
```
The $R^2$ value computed by $M$ is the same as that computed manually using the ratio of errors (except that the latter was presented as a percentage and not as a fraction). Another way to describe $R^2$ is to view its value as the fraction of the variance in $Y$ explained by $X$. A $R^2$ value of $0$ implies complete lack of fit of the model to the data whereas a value of $1$ implies perfect fit. In our working example the $R^2$ value of 0.86 implies that the **model explains 86%** of the variance in $Y$.
`R` also outputs $adjusted\; R^2$, a better measure of overall model fit. It 'penalizes' $R^2$ for the number of predictors in the model _vis-a-vis_ the number of observations. As the ratio of predictors to number of observation increases, $R^2$ can be artificially inflated thus providing us with a false sense of model fit quality. If the number of predictors to number of observations ratio is small, $adjusted\; R^2$ will be smaller than $R^2$. $adjusted\; R^2$ can be extracted from the model output as follows:
```{r tidy=FALSE, eval=FALSE}
summary(M)$adj.r.squared
```
The $adjusted\; R^2$ of **`r round(summary(M)$adj.r.squared,2)`** is very close to our $R^2$ value of **`r round(summary(M)$r.squared,2)`**. So our predictive power hasn't changed given our sample size and one predictor variable.
### The F-ratio test
$R^2$ is one way to evaluate the strength of the linear model. Another is to determine if the reduction in error between the bivariate model and mean model is _significant_. We've already noted a decrease in overall residual errors when augmenting our mean model with a predictor variable $X$ (i.e. the fraction of residents having attained at least a bachelor's degree). Now we need to determine if this difference is significant.
It helps to conceptualize residual errors as _spreads_ or variances. The following two plots show the errors explained by the model (left) and the errors not explained by the model for each of the 16 data points. Note that we can derive $SSR$ and $SSE$ from the left and right graphs respectively by _squaring_ then _summing_ the horizontal error bars.
```{r tidy=FALSE, echo=FALSE, fig.width=6, fig.height=4, warning=FALSE, message=FALSE}
library(gplots)
SR <- SE.mean - SE
OP <- par("mar"=c(6,1,1,1), mfrow=c(1,2))
plotCI(y=1:length(y), x=rep(0,length(y)), uiw=abs(SR), err="x",
ylab = "", xlab="", axes=F, col="#777777",sfrac=0.000, cex=.5, type="b",
barcol = "red", pch=21, pt.bg=par("bg"),
xlim=c(-max(abs(c(SE,SR))),max(abs(c(SE,SR)))))
axis(side = 1, lwd = .5, line = 0, family= "Architects Daughter", las = 1, tck = -.025)
mtext(side=1, text="Explained error", family= "Architects Daughter",
las=1,outer = FALSE , adj=0.5, line=3, , cex=1.0)
plotCI(y=1:length(y), x=rep(0,length(y)), uiw=abs(SE), err="x",
ylab = "", xlab="", axes=F, col="#777777",sfrac=0.000, cex=.5, type="b",
barcol = "red", pch=21, pt.bg=par("bg"),
xlim=c(-max(abs(c(SE,SR))),max(abs(c(SE,SR)))))
axis(side = 1, lwd = .5, line = 0, family= "Architects Daughter", las = 1, tck = -.025)
mtext(side=1, text="Unexplained error", family= "Architects Daughter",
las=1,outer = FALSE , adj=0.5, line=3, cex=1.0)
par(OP)
```
We want to assess whether the total amount of _unexplained error_ (right) is significantly less than the total amount of _explained error_ (left). We have 16 records (i.e. data for sixteen counties) therefore we have 16 measures of error. Since the model assumes that the errors are the same across the entire range of $Y$ values, we will take the average of those errors; hence we take the average of $SSR$ and the average of $SSE$. However, because these data are from a _subset_ of possible $Y$ values and not _all_ possible $Y$ values we will need to divide $SSR$ and $SSE$ by their respective $degress\; of\; freedom$ and not by the total number of records, $n$. These "averages" are called **mean squares** ($MS$) and are computed as follows:
$$
MS_{model} = \frac{SSR}{df_{model}} = \frac{SSR}{parameters\; in\; the\; bivariate\; model - parameters\; in\; the\; mean\; model}
$$
$$
MS_{residual} = \frac{SSE}{df_{residual}} = \frac{SSE}{n - number\; of\; parameters}
$$
where the $parameters$ are the number of coefficients in the mean model (where there is just one parameter: $\beta_0 = \hat Y$) and the number of coefficient in the bivariate model (where there are two parameters: $\beta_0$ and $\beta_1$). In our working example, $df_{model}$ = `r anova(M)[1,1]` and $df_{residual}$ = `r anova(M)[2,1]`.
It's important to remember that these measures of error are measures of **spread** and not of a _central_ value. Since these are measures of spread we use the $F$-test (and _not_ the $t$-test) to determine if the difference between $MS_{model}$ and $MS_{residual}$ is significant (recall that the $F$ test is used to determine if two measures of spread between samples are significantly different). The $F$ ratio is computed as follows:
$$
F = \frac{MS_{model}}{MS_{residual}}
$$
A simple way to implement an $F$ test with on our bivariate regression model is to call the `anova()` function as follows:
```{r tidy = FALSE, eval=FALSE}
anova(M)
```
where the variable `M` is the output from our regression model created earlier. The output of `anova()` is summarized in the following table:
```{r echo=FALSE, results='asis'}
kable(anova(M) , format="markdown" , align="c")
```
The first row in the table (labeled as `x`) represents the error terms $SSR$ and $MS_{model}$ for the basic model (the _mean_ model in our case) while the second row labeled `Residuals` represents error terms $SSE$ and $MS_{residual}$ for the current model. The column labeled `Df` displays the degrees of freedom, the column labeled `Sum Sq` displays the sum of squares $SSR$ and $SSE$, the column labeled `Mean Sq` displays the _mean squares_ $MS$ and the column labeled `F value` displays the $F$ ratio $MS_{model}/MS_{residual}$.
The larger the $F$ ratio, the greater the difference between the bivariate model and mean model. The question then becomes 'how significant is the observed ratio?'. To answer this question, we must setup a hypothesis test. We setup the test as follows:
$H_o$: The addition of the term $\beta_1$ (and hence the predictor $X$) does not improve the prediction of $Y$ over the simple _mean_ model.
$H_a$: The addition of the term $\beta_1$ (and hence the predictor $X$) helps improve the prediction of $Y$ over the simple _mean_ model.
The next step is to determine how likely we are (as a measure of probability) to have a $F$ ratio as large as the one observed under the assumption $H_o$ that the bivariate model does _not_ improve on our prediction of $Y$. The last column in the `anova` output gives us this probability (as a fraction): $P$ = `r round(anova(M)[1,5], 3)`. In other words, there is a `r round(anova(M)[1,5], 3)* 100`% chance that we could have computed an $F$ ratio as extreme as ours had the _bivariate_ model not improved over the _mean_ model.
The following graph shows the frequency distribution of $F$ values we would expect if indeed $H_o$ was true. Most of the values under $H_o$ fall between 0 and 5. The red line to the far right of the curve is where our observed $F$ value lies along this continuum--this is not a value one would expect to get if $H_o$ were true. It's clear from this output that our _bivariate_ model is a significant improvement over our _mean_ model.
```{r echo=FALSE, message=FALSE, fig.width=7, fig.height=3.0, warning=FALSE }
OP <- par("mar"=c(4,4,3,1), xpd=NA)
df1 = 1
df2 = 14
Ft = 86
xmin = 0
xmax = 100
p10 = qf(0.10,df1,df2)
p025 = qf(0.025,df1,df2)
p975 = qf(0.975,df1,df2)
ncurve.x = seq(xmin, xmax,length.out= 200)
ncurve.y = df(ncurve.x, df1, df2 )
plot( ncurve.y ~ ncurve.x, type="l", ylab=NA, xlab="F ratios",axes=FALSE,main=NA,
family= "Architects Daughter")
axis(1, family= "Architects Daughter")
lines(x = c(Ft,Ft), y=c(-.01,df(Ft,df1,df2)), col="red" , lty=1, lw=2)
text(x=Ft, y = df(Ft,df1,df2), eval(Ft), col="red", pos=3, family= "Architects Daughter")
ncurve.x.mSE <- seq(Ft, xmax, length.out= 100)
ncurve.y.mSE <- df(ncurve.x.mSE, df1, df2 )
ncurve.x.mSE <- c(ncurve.x.mSE, max(ncurve.x.mSE), min(ncurve.x.mSE))
ncurve.y.mSE <- c(ncurve.y.mSE, 0,0)
polygon( ncurve.x.mSE, ncurve.y.mSE, col=rgb(1, 0, 0,.3), border=NA)
par(OP)
```
The F-ratio is not only used to compare the bivariate model to the mean, in fact, it's more commonly used to compare a _two-predictor_ variable model with a _one-predictor_ variable model or a _three-predictor_ model with a _two-predictor_ model and so on. In each case, one model is compared with a similar model _minus_ one predictor variable.
When two different models are compared, the function `anova()` requires two parameters: the two regression model outputs (we'll call `M1` and `M2`).
```{r tidy = FALSE, eval=FALSE}
anova(M1, M2)
```
An example will be provided in a later section when we tackle multivariate regression.
## Testing if the estimated slope is significantly different from 0
We've just assessed that our _bivariate_ model is a significant improvement over the much _simpler_ mean model. We can also test if each regression coefficient $\beta$ is significantly different from 0. If we are dealing with a model that has just one predictor $X$, then the $F$ test just described will also tell us if the regression coefficient $\beta_1$ is significant. However, if more than one predictor variable is present in the regression model, then you should perform an $F$-test to test _overall_ model improvement, then test each regression term independently for significance.
When assessing if a regression coefficient is significantly different from zero, we are setting up yet another hypothesis test where:
$H_o$: $\beta_1$ = 0 (i.e. the predictor variable _does not_ contribute significantly to the overall improvement of the predictive ability of the model).
$H_a$: $\beta_1$ < 0 or $\beta_1$ > 0 for one-tailed test, or $\beta_1 \neq$ 0 for a two-tailed test (i.e. the predictor variable _does_ contribute significantly to the overall improvement of the predictive ability of the model).
Our hypothesis in this working example is that the level of education attainment ($X$) can predict per-capita income ($Y$), at least at the county level (which is the level at which our data is aggregated). So what we set out to test is the **null hypothesis**, $H_o$, that $X$ and $Y$ are **not linearly related**. If $H_o$ is true, then we would expect the regression coefficient $\beta_1$ to be very close to zero. So this begs the question 'how close should $\beta_1$ be to 0 for us **not to reject** $H_o$?'
To answer this question, we need to perform a $t$-test where we compare our calculated regression coefficient $\hat \beta_1$ to what we would expect to get if $X$ did not contribute significantly to the model's predictive capabilities. So the statistic $t$ is computed as follows:
$$
t = \frac{\hat \beta_1 - \beta_{1_o}}{s_{\hat \beta_1}}
$$
Since $\beta_1$ is zero under $H_o$ , we can rewrite the above equation as follows:
$$
t = \frac{\hat \beta_1 - 0}{s_{\hat \beta_1}} = \frac{\hat \beta_1 }{s_{\hat \beta_1}}
$$
where the estimate $s_{\hat \beta_1}$ is the **standard error of $\beta$** which can be computed from:
$$
s_{\hat \beta_1} = \frac{\sqrt{MS_{residual}}}{\sqrt{SSX}}
$$
where $SSX$ is the sum of squares of the variable $X$ (i.e. $\sum (X - \bar X)^2$) which was computed in an earlier table.
In our working example, $s_{\hat \beta_1}$ is `r sqrt(anova(M)[2,3])` / `r sqrt( sum((x - mean(x))^2))` or $s_{\hat \beta_1}$ = `r round(sqrt(anova(M)[2,3]) / sqrt(sum((x - mean(x))^2)), 1)`. The test statistic $t$ is therefore `r sprintf("%0.2f",M$coef[2])` / `r round(sqrt(anova(M)[2,3]) / sqrt(sum((x - mean(x))^2)), 1)` or $t$ = `r sprintf("%0.2f",M$coef[2] / (sqrt(anova(M)[2,3]) / sqrt(sum((x - mean(x))^2))))`.
Next we need to determine if the computed value of $t$ is significantly different from the values of $t$ expected under $H_o$. The following figure plots the frequency distribution of $\beta_{1_o}$ (i.e. the kind of $\beta_1$ values we could expect to get under the assumption that the predictor $X$ does not contribute to the model) along with the red regions showing where our observed $\hat \beta_1$ must lie for us to safely reject $H_o$ at the 5% confidence level (note that because we are performing a two-sided test, i.e. that $\hat \beta_1$ is different from 0, the red regions each represent 2.5%; when combined both tails give us a 5% rejection region).
```{r echo=FALSE, message=FALSE, fig.width=7, fig.height=3.0, warning=FALSE }
OP <- par("mar"=c(4,3,3,1), xpd=NA)
n = 16
obs = 49670
ho = 0
se = 5355
z = 9.28 #z statistic
ncurve.x = seq(-9, +9, length.out= 200)
ncurve.y = dnorm(ncurve.x, mean=0, sd=1 )
plot( ncurve.y ~ ncurve.x, type="l", ylab=NA, xlab="slope beta",axes=FALSE,main=NA,
family= "Architects Daughter")
axis(1, family= "Architects Daughter", at=seq(-10,10,2),
labels=round(seq(-10,10,2)*se + ho,2))
lines(x = c(z,z), y=c(0,dnorm(0,0,1)), col="red" , lty=1, lw=1)
lines(x = c(0,0), y=c(0,dnorm(0,0,1)), col="black", lty=2)
lines(x = c(-1.96,-1.96) , y=c(0,dnorm(-1.96,0,1)) , col="red", lty=2)
lines(x = c(+1.96,+1.96) , y=c(0,dnorm(+1.96,0,1)) , col="red", lty=2)
text(x=0, y = max(ncurve.y), "Ho (b = 0)", col="black", pos=3, family= "Architects Daughter")
text(x=z, y = dnorm(0,0,1)/2, "Observed b", col="red", pos=2, family= "Architects Daughter")
text(x=-1.96, y = dnorm(-1.96,0,1), "p = -0. 025", col=rgb(1, 0, 0,.5), pos=2, family= "Architects Daughter")
text(x=+1.96, y = dnorm(+1.96,0,1), "p = +0. 025", col=rgb(1, 0, 0,.5), pos=4, family= "Architects Daughter")
# Right rejection region at 0.975
ncurve.x.mSE <- seq(1.96,4, by=.03)
ncurve.y.mSE <- dnorm(ncurve.x.mSE, 0,1)
ncurve.x.mSE <- c(ncurve.x.mSE, max(ncurve.x.mSE), min(ncurve.x.mSE))
ncurve.y.mSE <- c(ncurve.y.mSE, 0,0)
polygon( ncurve.x.mSE, ncurve.y.mSE, col=rgb(1, 0, 0,.3), border=NA)
# Left rejection region at 0.975
ncurve.x.mSE <- seq(-4,-1.96, by=.03)
ncurve.y.mSE <- dnorm(ncurve.x.mSE, 0,1)
ncurve.x.mSE <- c(ncurve.x.mSE, max(ncurve.x.mSE), min(ncurve.x.mSE))
ncurve.y.mSE <- c(ncurve.y.mSE, 0,0)
polygon( ncurve.x.mSE, ncurve.y.mSE, col=rgb(1, 0, 0,.3), border=NA)
par(OP)
```
It's clear that seeing where our observed $\hat \beta_1$ lies along the distribution allows us to be fairly confident that our $\hat \beta_1$ is far from being a _typical value_ under $H_o$, in fact we can be quite confident in rejecting the null hypothesis, $H_o$. This implies that $X$ (education attainment) **does** contribute information towards the prediction of $Y$ (per-capita income) when modeled as a _linear_ relationship.
Fortunately, we do not need to burden ourselves with all these calculations. The aforementioned standard error, $t$ value and $P$ value are computed as part of the `lm` analysis. You can list these and many other output parameters using the `summary()` function as follows:
```{r eval=FALSE}
summary(M)[[4]]
```
The `[[4]]` option displays the pertinent subset of the `summary` output that we want. The content of this output is displayed in the following table:
```{r echo=FALSE, results='asis'}
library(xtable)
kable(xtable(summary(M)), format="markdown", align="c")
```
The second line of the output displays the value of $\hat \beta_1$, its standard error, its $t$-value and the probability of observing a $\beta_1$ value as extreme as ours under $H_o$.
### Extracting $\hat \beta_1$ confidence interval
$\hat \beta_1$'s standard error $s_{\hat \beta_1}$ can also be used to derive a confidence interval for our estimate. The standard error of **`r round(summary(M)$coef[2,2],0)`** tells us that we can be ~68% confident that our true $\beta_1$ value lies between $\hat \beta_1$ - `r round(summary(M)$coef[2,2],0)` and $\hat \beta_1$ + `r round(summary(M)$coef[2,2],0)`. We can use the standard error to construct different confidence intervals depending on our desired $\alpha$ level. The following table shows upper and lower limits for a few common alpha levels:
```{r echo=FALSE, warning=FALSE}
se68 = qt(.84,M$df) * summary(M)$coef[2,2]
se95 = qt(.975,M$df) * summary(M)$coef[2,2]
se99 = qt(.995,M$df) * summary(M)$coef[2,2]
b0 = M$coef[1]
b1 = M$coef[2]
```
Confidence level $\alpha$ Lower limit Upper limit
-------------------------- ------------- --------------------------------------------
68% `r sprintf("%.0f",b1-se68)` `r sprintf("%.0f",b1+se68)`
95% `r sprintf("%.0f",b1-se95)` `r sprintf("%.0f",b1+se95)`
99% `r sprintf("%.0f",b1-se99)` `r sprintf("%.0f",b1+se99)`
The following figure shows the range of the distribution curve covered by the 68% and 95% confidence intervals for our working example.
```{r echo=FALSE, message=FALSE, fig.width=8, fig.height=3.2, warning=FALSE }
OP <- par("mar"= c(4,3,1,1))
df = 14
obs = 14.16 # Difference
ho = 49670 # expected difference
se = 5355
p68 = qt(.84,df)
p95 = qt(.975,df)
p99 = qt(.995,df)
t = 9.28 # t statistic
ncurve.x = seq(-3.5, 3.5,length.out= 200)
ncurve.y = dt(ncurve.x, df = df, ncp=0 )
plot( ncurve.y ~ ncurve.x, type="l", ylab=NA, xlab=NA,axes=FALSE,main=NA)
axis(1, family= "Architects Daughter", at=c(-p99,-p95,-p68,0,p68,p95,p99),
labels=round(c( NA,-p95,-p68,0,p68,p95,NA)*se+ho,0), xlim=c(-p99,p99))
lines(x = c(0,0) , y=c(0,dt(0,df = df, ncp=0 )) , col="black", lty=1)
lines(x = -c(p68,p68) , y=c(0,dt(-p68,df = df, ncp=0 )) , col="grey", lty=2)
lines(x = +c(p68,p68) , y=c(0,dt(+p68,df = df, ncp=0 )) , col="grey", lty=2)
lines(x = -c(p95,p95) , y=c(0,dt(-p95,df = df, ncp=0 )) , col="grey", lty=2)
lines(x = +c(p95,p95) , y=c(0,dt(+p95,df = df, ncp=0 )) , col="grey", lty=2)
text(x=p68, y = dt(p68,df = df, ncp=0 ), "1 SE", col="#777777", pos=4, family= "Architects Daughter")
text(x=-p68, y = dt(-p68,df = df, ncp=0 ), "-1 SE", col="#777777", pos=2, family= "Architects Daughter")
text(x=-p95, y = dt(-p95,df = df, ncp=0 ), "-2 SE", col="#777777", pos=2, family= "Architects Daughter")
text(x=+p95, y = dt(+p95,df = df, ncp=0 ), "+2 SE", col="#777777", pos=4, family= "Architects Daughter")
arrows(x0=-p68, y0=max(ncurve.y)/2,
x1=+p68, y1=max(ncurve.y)/2,
lty=1,col="blue", length=.15, code=3)
text(x= 0, y = max(ncurve.y)/2, labels="68%",
col="blue", family= "Architects Daughter",pos=3)
arrows(x0=-p95, y0=max(ncurve.y)/11,
x1=+p95,y1=max(ncurve.y)/11,
lty=1,col="blue", length=.15, code=3)
text(x= 0, y = max(ncurve.y)/11, labels="95%",
col="blue", family= "Architects Daughter",pos=3)
par(OP)
```
The confidence intervals for $\hat \beta_1$ can easily be extracted in `R` using the `confint()` function. for example, to get the confidence interval for an $\alpha$ of 95%, call the following function:
```{r tidy=FALSE}
confint(M, parm = 2, level = 0.95)
```
The first parameter, `M`, is the linear regression model output computed earlier. The second parameter `parm = 2` tells `R` which parameter to compute the confidence interval for (a value of `1` means that the confidence interval for the intercept is desired), and the third parameter in the function, `level = 0.95` tells `R` which $\alpha$ value you wish to compute a confidence interval for.
## Checking the residuals
We have already used the residuals $\varepsilon$ to give us and assessment of model fit and the contribution of $X$ to the prediction of $Y$. But the _distribution_ of the residuals can also offer us insight into whether or not all of the following key assumptions are met:
1. The **mean** of the residuals, $\bar \varepsilon = 0$, is close to **0**.
2. The **spread** of the residuals is the same for all values of $X$--i.e. they should be **homoscedastic**.
3. The probability distribution of the errors follows a **normal (Gaussian)** distribution.
4. The residuals are **independent** of one another--i.e. they are not **autocorrelated**.
We will explore each assumption in the following sections.
### Assumption 1: $\bar \varepsilon = 0$
The `lm` regression model outputs not only parameters as seen in earlier sections, but residuals as well. For each $Y$ value, a residual (or error) accounting for the difference between the model $\hat Y$ and actual $Y$ value is computed. We can plot the residuals as a function of the predictor variable $X$. We will also draw the line at $\varepsilon$ = 0 to see how $\varepsilon$ fluctuates around 0.
```{r tidy=FALSE, fig.width=4, fig.height=4.5, warning=FALSE, message=FALSE}
plot(M$residuals ~ x, pch = 16, ylab= "Residuals")
abline(h = 0, lty = 3)
```
We can compute the mean of the residuals, $\bar \varepsilon$, as follows:
```{r }
mean(M$residuals)
```
which gives us a value very close to zero as expected.
### Assumption 2: homoscedasticity
It helps to demonstrate via figures what does and does not constitute a homoscedastic residual:
```{r tidy=FALSE, echo=FALSE, fig.width=9, fig.height=4, warning=FALSE, message=FALSE}
set.seed(6)
y.h1 = rep(1,20) + rnorm(20, 0,1)
y.h2 = rep(1,20) + rnorm(20, 0,1) * seq(0,1,length.out=20)
set.seed(11)
y.h3 = rep(1,20) + rnorm(20, 0,1) * seq(1,0,length.out=20)
OP <- par(pty = "s",mfrow=c(1,3), mar= c(0,6,3,0) )
plot(y.h1, pch=16, xlab=NA, main="Homoscedastic", ylab="Residual", cex.main=1.8,family= "Architects Daughter", cex=1.3, cex.lab=1.8, col.axis = "white")
plot(y.h2, pch=16, xlab=NA,main="Not homoscedastic", ylab=NA, cex.main=1.8, family= "Architects Daughter", cex=1.3, col.axis = "white")
plot(y.h3, pch=16, xlab=NA,main="Not homoscedastic", ylab=NA, cex.main=1.8, family= "Architects Daughter", cex=1.3, col.axis = "white")
par(OP)
```
The figure on the left shows no obvious trend in the _variability_ of the residuals as a function of $X$. The middle figure shows an _increase_ in variability of the residuals with increasing $X$. The figure on the right shows a _decrease_ in variability in the residuals with increasing $X$. When the variances of $\varepsilon$ are not uniform across all ranges of $X$, the residuals are said to be **heteroscedastic**.
Our plot of the residuals from the bivariate model may show some signs of heteroscedasticity, particularly on the right hand side of the graph. When dealing with relatively small samples, a visual assessment of the distribution of the residuals may not be enough to assess whether or not the assumption of homoscedasticity is met. We therefore turn to the **Breusch-Pagan test**. A Breusch-Pagan function, `ncvTest()`, is available in the library `car` which will need to be loaded into our `R` session before performing the test. The package `car` is usually installed as part of the normal `R` installation.
```{r tidy=FALSE, warning=FALSE, message=FALSE}
library(car)
ncvTest(M)
```
The Breusch-Pagan tests the (null) hypothesis that the variances are constant across the full range of $X$. The $P$ value tells the likelihood that our distribution of residual variances are consistent with homoscedasticity. A non-significant $P$ value (one usually greater than 0.05) should indicate that our residuals are homoscedastic. Given our high $P$ value of `r round(ncvTest(M)$p,2)`, we can be fairly confident that we have satisfied our second assumption.
### Assumption 3: Normal (Gaussian) distribution of residuals
An important assumption is that the distribution of the residual values follow a normal (Gaussian) distribution. Note that this assumption is **only** needed when computing **confidence intervals** or regression coefficient **$P$-values**. If your interest is solely in finding the best linear _unbiased_ estimator (BLUE), then only the three other assumptions need to be met.
The assumption of normality of residuals is often confused with the belief that $X$ (the predictor variables) must follow a normal distribution. This is incorrect! The least-squares regression model makes **no** assumptions about the distribution of $X$. However, if the _residuals_ do not follow a normal distribution, then one method of resolving this problem is to transform the values of $X$ (which may be the source of confusion).
It helps to first plot the histogram of our residuals then to fit a kernel density function to help see the overall trend in distribution.
```{r tidy=FALSE, fig.width=4, fig.height=4.5, warning=FALSE, message=FALSE}
hist(M$residuals, col="bisque", freq=FALSE, main=NA)
lines(density(M$residuals), col="red")
```
A good visual test of normality is the use of a **Q-Q plot**.
```{r tidy=FALSE, fig.width=4, fig.height=4.5, warning=FALSE, message=FALSE}
plot(M, which = 2)
```
What we are hoping to see is an alignment of the residual values (hollow points on the plot) along the sloped dashed line. Any deviation from the sloped line may indicate lack of normality in our residuals. You will seldom encounter residuals that fit the line exactly. What you are assessing is whether or not the difference is significant. In our working example, there seems to be disagreement between the distribution of our residuals and the _normal_ line on the plot.
A visual assessment of normality using the Q-Q plot is usually the preferred approach, but you can also use the **Shapiro-Wilk** test to assess the _significance_ of deviation from normality. But use this test with caution, large sample sizes tend to always indicate deviation from normality, regardless how close they follow the Gaussian curve. `R` has a function called `shapiro.test()` which we can call as follows:
```{r}
shapiro.test(M$residuals)
```
The output we are interested in is the $P$-value. The _Shapiro-Wilk Test_ tests the null hypothesis that the residuals come from a normally distributed population. A large $P$-value indicates that there is a good chance that the null is true (i.e. that the residuals are close to being normally distributed). If the $P$ value is small, then there is a good chance that the null is false. Our $P$ value is `r round(shapiro.test(M$residuals)$p,2)` which would indicate that we cannot reject the null. Despite the Shapiro-Welk test telling us that there is a good chance that our $\varepsilon$ values come from a normally distributed population, our visual assessment of the Q-Q plot leads us to question this assessment.
Again, remember that the assumption of normality of residuals really only matters when you are seeking a confidence interval for $\hat \beta_1$ or $P$-values.
### Assumption 4: Independence of residuals
Our final assumption pertains to the independence of the residuals. In other words, we want to make sure that a residual value is not auto-correlated with a _neighboring_ residual value. The word _neighbor_ can mean different things. It may refer to the _order_ of the residuals in which case we are concerned with residuals being more similar within a narrow range of $X$, or it may refer to a _spatial_ neighborhood in which case we are concerned with residuals of similar values being _spatially_ clustered. To deal with the former, we can use the **Durbin-Watson** test available in the `R` package `car`.
```{r}
library(car) # Load this package if not already loaded
durbinWatsonTest(M)
```
Ideally, the `D-W` statistic returned by the test should fall within the range of 1 to 3. The $P$-value is the probability that our residual distribution is consistent with what we would expect if there was no auto-correlation. Our test statistic of `r round(durbinWatsonTest(M)$dw,2)` and $P$ value of `r round(durbinWatsonTest(M)$p,2)` suggests that the assumption of independence is met with our model. You might note that the $P$-value changes every time the tests is re-run. This is because the Durbin Watson test, as implemented in `R`, uses a Monte-Carlo approach to compute $P$. If you want to nail $P$ down to a greater precision, you can add the `reps =` parameter to the function (by default, the test runs 1000 bootstrap replications). For example, you could rerun the test using 10,000 bootstrap replications (`durbinWatsonTest(M, reps = 10000)`).
## Influential Observations
We want to avoid a situation where a single, or very small subset of points, have a disproportionately large influence on the model results. It is usually best to remove such influential points from the regression analysis. This is not to say that the influential points should be ignored from the overall analysis, but it may suggest that such points may behave differently then the bulk of the data and therefore may require a different model.
Several tests are available to determine if one or several points are influential. Two of which are covered here: **Cook's distance** and **hat values**.
Cook's distance can be computed from the model using the `cooks.distance()` function. But it's also available as one of `plot.lm`'s diagnostic outputs.
```{r fig.width=4, fig.height=4.5}
plot( M, which = 4)
```
We could have called the `plot.lm` function, but because `R` recognizes that the object `M` is the output of an `lm` regression, it automatically passes the call to `plot.lm`. The option `which = 4` tells `R` which of the 6 diagnostic plots to display (to see a list of all diagnostic plots offered, type `?plot.lm`). Usually, any point with a Cook's distance value **greater than 1** is considered overly influential. In our working example, none of the points are even close to 1 implying that none of our observations wield undue influence.
The other test that can be used to assess if a point has strong leverage is the **hat values** test. The technique involves calculating an average leverage value for all data points; this is simply the ratio between the number of regression coefficients in our model (this includes the intercept) and the number of observations. Once the average leverage is computed, we use a cutoff of either twice this average or three times this average (these are two popular cutoff values). We then look for hat values greater then these cutoffs. In the following chunk of code, we first compute the mean leverage values (whose value is assigned to the `cut` object), we then compute the leverage values for each point using the `hatvalues()` function and plot the resulting leverage values along with the two cutoff values (in dashed red lines).
```{r tidy=FALSE, fig.width=4, fig.height=4.5, warning=FALSE, message=FALSE}
cut <- c(2, 3) * length(coefficients(M)) / length(resid(M))
M.hat <- hatvalues(M)
plot(M.hat)
abline(h = cut, col = "red", lty = 2)
text( which(M.hat > min(cut) ) , M.hat[M.hat > min(cut)] ,
row.names(dat)[M.hat > min(cut)], pos=4, col="red")
```
The last line of code (the one featuring the `text` function) labels only the points having a value greater than the smaller of the two cutoff values.
The influential point of interest is associated with the third record in our `dat` dataset, Cumberland county. Let's see where the point lies relative to the regression slope
```{r tidy=FALSE, fig.width=4, fig.height=4.5, warning=FALSE, message=FALSE}
plot(Income ~ Education, dat)
abline(M, col="blue")
points(dat[M.hat > min(cut),]$Education,
dat[M.hat > min(cut),]$Income, col="red", pch=16)
text( dat[M.hat > min(cut),]$Education,
dat[M.hat > min(cut),]$Income ,
labels = row.names(dat)[M.hat > min(cut)],
col="red", pos=2) # Add label to the left of point
```
The point lies right on the line and happens to be at the far end of the distribution. The concern is that this point might have undue leverage potential. It helps to think of the regression line as a long straight bar hinged somewhere near the center. It requires less _force_ to move the bar about the imaginary hinge point when applied to the ends of the bar then near the center of the bar. The _hat values_ test is suggesting that observation number 3 may have unusual leverage potential. We can assess observation number 3's influence by running a new regression analysis without observation number 3.
```{r}
# Create a new dataframe that omits observation number 3
dat.nolev <- dat[ - which(M.hat > min(cut)),]
# Run a new regression analysis
M.nolev <- lm( Income ~ Education, dat.nolev)
```
We can view a full summary of this analysis using the `summary()` function.
```{r}
summary(M.nolev)
```
You'll note that the $adjusted\; R^2$ value drops from `r round(summary(M)$adj,2)` to `r round(summary(M.nolev)$adj,2)`.
If we compare the estimates (and their standard errors) between both model (see the summary tables below), we'll note that the estimates are nearly identical, however, the standard errors for $\hat \beta_1$ increase by almost 30%.
The original model `M`:
```{r echo=FALSE, results='asis'}
kable(xtable(summary(M)), digits=2, format="markdown", align="c")
```
The new model `M.nolev`:
```{r echo=FALSE, results='asis'}
kable(xtable(summary(M.nolev)), digits=2, format="markdown", align="c")
```
Given that our estimates are nearly identical, but that the overall confidence in the model decreases with the new model, there seems to be no reason why we would omit the 3rd observation based on this analysis.
This little exercise demonstrates the need to use as many different tools as possible to evaluate whether an assumption is satisfied or not.
## What to do if some of the assumptions are not met
### Data transformation
Data transformation (usually via some non-linear re-expression) may be required when one or more of the following apply:
* The residuals are skewed or are heteroscedastic.
* When theory suggests.
* To force a linear relationship between variables
**Do not** transform your data if the sole purpose is to "correct" for outliers.
We observed that our residuals did not follow a normal (Gaussian) distribution. Satisfying this assumption is important if we are to use this model to derive confidence intervals around our $\beta$ estimates or if we are to use this model to compute $P$ values. We can see if transforming our data will help. This means re-expressing the $X$ and/or the $Y$ values (i.e. converting each value in $X$ and/or $Y$ by some expression such as `log(x)` or `log(y)`)
It helps to look at the distribution of the variables (this is something that should normally be done prior to performing a regression analysis). We'll use the `hist()` plotting function.
```{r eval=FALSE}
hist(dat$Income, breaks = with(dat, seq(min(Income), max(Income), length.out=5)) )
hist(dat$Education, breaks = with(dat, seq(min(Education), max(Education), length.out=5)) )
```
```{r echo=FALSE, fig.width=6, fig.height=4}
OP <- par(mfrow=c(1,2))
hist(dat$Income, breaks = with(dat, seq(min(Income), max(Income), length.out=5)) )
hist(dat$Education, breaks = with(dat, seq(min(Education), max(Education), length.out=5)) )
par(OP)
```
Two common transformations are the natural log (implemented as `log` in `R`) and the square. These are special cases of what is called a **Box-Cox** family of transformations.
Data transformation is an iterative process which relies heavily on **Exploratory Data Analysis (EDA)** skills. It's important to keep in mind the goal of the transformation. In our example, its purpose is to help improve the symmetry of the residuals. The following figures show different Q-Q plots of the regression residuals for different transformations of $X$ and $Y$.
Several data transformation scenarios were explored using square root and logarithmic transformation of the $X$'s and/or $Y$'s. Of course, a thorough analysis would consist of a wide range of Box-Cox transformations. The scatter plots (along with the regression lines) are displayed on the left hand side for different transformations (the axes labels indicate which, if any, transformation was applied). The accompanying Q-Q plots are displayed to the right of the scatter plots.
```{r fig.height=12, fig.width=6.5, echo=FALSE, results='hide'}
M.nlog <- lm( Income ~ log(Education), dat)
M.logn <- lm( log(Income) ~ Education, dat)
M.nsqr <- lm( Income ~ sqrt(Education), dat)
M.sqrn <- lm( sqrt(Income) ~ Education, dat)
OP <- par(mfcol = c(4,2), cex.lab=1.3 )
plot( Income ~ log(Education), dat, pch=16); abline(M.nlog, col="red")
plot( log(Income) ~ Education, dat, pch=16); abline(M.logn, col="red")
plot( Income ~ sqrt(Education), dat, pch=16); abline(M.nsqr, col="red")
plot( sqrt(Income) ~ Education, dat, pch=16); abline(M.sqrn, col="red")
qqPlot(M.nlog, lw=1,pch=16, cex=1, ylab=NA, envelope=FALSE)
qqPlot(M.logn, lw=1,pch=16, cex=1, ylab=NA, envelope=FALSE)
qqPlot(M.nsqr, lw=1,pch=16, cex=1, ylab=NA, envelope=FALSE)
qqPlot(M.sqrn, lw=1,pch=16, cex=1, ylab=NA, envelope=FALSE)
par(OP)
```
A visual assessment of the Q-Q plots indicates some mild improvements near the middle of the residual distribution (note how the mid-points line up nicely with the dashed line), particularly in the second and fourth transformation (i.e. for `log(Income) ~ Education` and `sqrt(Income) ~ Education`).
Note that transforming the data can come at a cost, particularly when interpretation of the results is required. For example, what does a log transformation of Income imply? When dealing with concentrations, it can make theoretical sense to take the log of a value--a good example being the concentration of hydrogen ions in a solution which is usually expressed as a log (pH). It's always good practice to take a step back from the data transformation workflow and assess where your regression analysis is heading.
### Bootstrapping
If the assumption of normality of residual is not met, one can overcome this problem by using bootstrapping techniques to come up with regression parameters confidence intervals and $P$-values. The bootstrap technique involves rerunning an analysis, such as the regression analysis in our case, many times, while randomly resampling from our data each time. In concept, we are acting as though our original _sample_ is the actual population and we are sampling, at random (with replacement), from this _pseudo_ population.
One way to implement a bootstrap is to use the `boot` package's function `boot()`. But before we do, we will need to create our own regression function that will be passed to `boot()`. This custom function will not only compute a regression model, but it will also return the regression coefficients for each simulation. The following block of code defines our new custom function `lm.sim`:
```{r message=FALSE, warning=FALSE }
lm.sim <- function(formula, data, i)
{
d.sim <- data [i,]
M.sim <- lm(formula, data=d.sim)
return(M.sim$coef)
}
```
We can now use the `boot()` function to run the simulation. Note the call to our custom function `lm.sim` and the number of bootstrap replicates (i.e. `R = 999`). Also, don't forget to load the `boot` library into the current `R` session (if not already loaded).
```{r message=FALSE, warning=FALSE , tidy=FALSE}
library(boot)
M.boot <- boot(statistic = lm.sim, formula = Income ~ Education, data=dat, R = 999)
```
The results of our bootstrap simulation are now stored in the object `M.boot`. In essence, a new regression line is created for each simulation. In our working example, we created 999 different regression lines. The following plot shows the first 100 regression lines in light grey. Note how they fluctuate about the original regression line (shown in red). The distribution of these slopes is what is used to compute the confidence interval.
```{r echo=FALSE, fig.height=4.5, fig.width=4 }
plot(Income ~ Education, dat, pch=16)
for(i in 1:100 ){
abline(a=M.boot$t[i,1], b = M.boot$t[i,2], col=rgb(.8,.8,.8,.5))
}
abline(M,col="red")
```
We can now use the function `boot.ci` (also in the `boot` package) to extract confidence intervals for our $\beta$ parameters for a given $\alpha$ confidence value. For example, to get the confidence intervals for the parameters $\hat \beta_0$ at a 95% confidence interval, you can type the following command:
```{r message=FALSE, warning=FALSE }
boot.ci(M.boot, conf = 0.95, type = "bca", index=1)
```
Note that your values may differ since the results stem from Monte Carlo simulation. Index `1`, the first parameter in our lm.sim output, is the coefficient for $\hat \beta_0$. `boot.ci` will generate many difference confidence intervals. In this example, we are choosing `bca` (adjusted bootstrap percentile).
To get the confidence interval for $\hat \beta_1$ at the 95% $\alpha$ level just have the function point to index 2:
```{r message=FALSE, warning=FALSE }
boot.ci(M.boot, conf = 0.95, type="bca", index=2)
```
```{r echo=FALSE }
# Internal calculations
b0.sim <- boot.ci(M.boot, conf = 0.95, type="bca", index=1)$bca[4:5]
b1.sim <- boot.ci(M.boot, conf = 0.95, type="bca", index=2)$bca[4:5]
```
The following table summarizes the 95% confidence interval for $\hat \beta_1$ from our simulation. The original confidence interval is displayed for comparison.
Model $\hat \beta_1$ interval
----------- --------------------------
Original [`r sprintf("%.0f, %.0f",b1-se68,b1+se68)`]
Bootstrap [`r sprintf("%.0f, %.0f",b1.sim[1],b1.sim[2])`]
# Multivariate regression model
We do not need to restrict ourselves to one predictor variable, we can add as many predictor variables to the model as needed (if suggested by theory). But we only do so in the hopes that the extra predictors help account for the unexplained variances in $Y$.
Continuing with our working example, we will add a second predictor variable, `x2`. This variable represents the fraction of the employed civilian workforce employed in a professional field (e.g. scientific, management and administrative industries)
```{r tidy=FALSE}
x2 <- c(0.079, 0.062, 0.116, 0.055, 0.103, 0.078,
0.089, 0.079, 0.067, 0.073, 0.054, 0.094,
0.061, 0.072, 0.038, 0.084)
```
We will add this new variable to our `dat` dataframe and name the variable `Professional`.
```{r}
dat$Professional <- x2
```
We now have an updated dataframe, `dat`, with a third column.
```{r results='asis', echo=FALSE}
kable(dat, format="markdown", align="c")
```
Let's rerun our original model, but this time we will call the output of the one predictor model `M1` (this naming convention will help when comparing models).
```{r}
M1 <- lm(Income ~ Education, dat)
```
Let's review the summary. This time we'll have `R` display the parameters _and_ the statistics in a single output:
```{r}
summary(M1)
```
The output should be the same as before.
We now generate an augmented (i.e. adding a second predictor) regression model, `M2`, as follows:
```{r}
M2 <- lm( Income ~ Education + Professional, dat)
```
At this point, it's helpful to visualize what a two-variable model represents. A two-variable model defines the equation of a plane that best fits a 3-D set of points. The regression plane is to a two-variable model what a regression line is to a one-variable model. We can plot the 3-D scatter plot where _x-axis_ = fraction with bachelor's, _y-axis_ = fraction with a professional job and _z-axis_ = income. We draw the 3-D scatter plot using the function `scatterplot3d()` from the `scatterplot3d()` package. We also add the regression plane to the plot.
```{r, echo=TRUE, tidy=FALSE}
library(scatterplot3d)
s3d <- scatterplot3d(dat$Education, dat$Professional , dat$Income,
highlight.3d=TRUE, angle=55, scale.y=0.7, pch=16,
xlab = "Education", ylab = "Professional", zlab="Income")
# Add the 3-D regression plane defined by our model M2
s3d$plane3d(M2, lty.box="solid")
```
## Reviewing the multivariate regression model summary
We can, of course, generate a full summary of the regression results as follows:
```{r}
summary(M2)
```
The interpretation of the model statistics is the same with a multivariate model as it is with a bivariate model. The one difference is that the extra regression coefficient $\hat \beta_2$ (associated with the _Professional_ variable) is added to the list of regression parameters. In our example, $\hat \beta_2$ is significantly different from $0$ ($P$ = 0.0066).
One noticeable difference in the summary output is the presence of a **Multiple R-squared** statistic in lieu of the bivariate simple R-squared. Its interpretation is, however, the same and we can note an increase in the models' overall $R^2$ value. The $F$-statistic's $P$ value of $0$ tells us that the amount of residual error from this multivariate model is significantly less than that of the simpler model, the mean $\bar Y$. So far, things look encouraging.
The equation of the modeled plane is thus:
$\widehat{Income}$ = **`r sprintf("%.0f",M2$coef[1])`** + **`r sprintf("%.0f",M2$coef[2])`** ($Education$) + **`r sprintf("%.0f",M2$coef[3])`** ($Professional$)
## Comparing the two-variable model with the one-variable model
We can compare the two-variable model, `M2`, with the one-variable model (the bivariate model), `M1`. Using the `anova()` function. This test assesses whether or not adding the second variable, `x2`, _significantly_ improves our ability to predict $Y$. It's important to remember that a good model is a parsimonious one. If an augmented regression model does not significantly improve our overall predictive ability, then we should always revert back to the simpler model.
```{r}
anova(M1, M2)
```
The $P$ value is very small indicating that the reduction in residual error in model `M2` over model `M1` is significant. Model `M2` looks promising so far.
## Looking for multi-collinearity
When adding explanatory variables to a model, one must be very careful not to add variables that explain the same 'thing'. In other words, we don't want (nor need) to have two or more variables in the model explain the **same variance**. One popular test for Multi-collinearity is the **Variance Inflation Factor (VIF)** test. The VIF is computed for all regression parameters. For example, VIF for $X_2$ is computed by first regressing $X_1$ against all other predictors (in our example, we would regress $X_1$ against $X_2$) then taking the resulting $R^2$ and plugging it in the following VIF equation:
$$
VIF = \frac{1}{1 - R^2_1}
$$
where the subscript $1$ in $R^2_1$ indicates that the $R^2$ is for the model `x1 ~ x2`. The VIF for $X_2$ is computed in the same way.
What we are avoiding are _large_ VIF values (a _large_ VIF value may indicate high multicollinearity). What constitutes a large VIF value is open for debate. Typical values range from 3 to 10, so VIF's should be interpreted with caution.
We can use `R`'s `vif()` function to compute VIF for the variables _Education_ and _Professional_
```{r}
vif(M2)
```
The _moderately high_ VIF should be a warning. It's always good practice to generate a **scatter plot matrix** to _view_ any potential relationship between _all_ variables involved. We can use `R`'s `pairs()` function to display the scatter plots:
```{r fig.height=6, fig.width= 6}
pairs(dat, panel = panel.smooth)