Set up by loading the relevant libraries, reading the data file, and assigning it to a variable:
```{r, results='hide'}
library(leaps) #for the regsubsets() function
library(DAAG) #for cross-validation
set.seed(93)
```
```{r}
crimedata <- read.table('uscrime.txt', header = TRUE) # the file has a header row, so header = TRUE
head(crimedata) #First 6 rows of the data
```
First, I fit a model that uses all the variables to predict the crime rate:
```{r}
all_model <- lm(Crime~.,crimedata)
all_model
```
```{r}
summary(all_model)
```
As we can see, R-squared is 0.803, which means the model accounts for 80.3% of the variability in the data. However, this may just be a case of overfitting, since there are 15 predictors and only 47 data points. Adjusted R-squared penalizes the number of predictors and drops to 70.8%.
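The penalty behind adjusted R-squared can be checked by hand with the usual formula 1 - (1 - R²)(n - 1)/(n - p - 1). The following is a minimal sketch on the built-in mtcars data as a stand-in (not the crime data); `adj_r2()` is a hypothetical helper written for this illustration.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Stand-in example on the built-in mtcars data (not the crime data)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
s <- summary(fit)

manual <- adj_r2(s$r.squared, n = nrow(mtcars), p = 3)
c(manual = manual, from_summary = s$adj.r.squared)
```

Applied to the crime model, `adj_r2(summary(all_model)$r.squared, nrow(crimedata), 15)` should reproduce the adjusted value that `summary(all_model)` reports.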
Let's try predicting the crime rate for a new city with the following data, using the linear regression model above:
M = 14.0
So = 0
Ed = 10.0
Po1 = 12.0
Po2 = 15.5
LF = 0.640
M.F = 94.0
Pop = 150
NW = 1.1
U1 = 0.120
U2 = 3.6
Wealth = 3200
Ineq = 20.1
Prob = 0.04
Time = 39.0
```{r}
all_predictions <- predict(all_model,data.frame(M = 14.0
,So = 0
,Ed = 10.0
,Po1 = 12.0
,Po2 = 15.5
,LF = 0.640
,M.F = 94.0
,Pop = 150
,NW = 1.1
,U1 = 0.120
,U2 = 3.6
,Wealth = 3200
,Ineq = 20.1
,Prob = 0.04
,Time = 39.0)
,interval ="confidence")
all_predictions
```
As we can see, even though the model predicts a crime rate of 155.43, the confidence interval is extremely wide, from -1477 to 1633. Let's also look at the range of the original data:
```{r}
cat('Crime rate ranges from', min(crimedata$Crime),'to',max(crimedata$Crime))
```
The predicted crime rate is also far below the smallest crime rate in the original data set. This is a symptom of overfitting.
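Why a high R-squared alone says little when there are many predictors and few rows can be illustrated with simulated data. In this sketch (all names and numbers are assumptions made up for the illustration), only `x1` truly drives `y`, yet adding 14 pure-noise predictors still raises the in-sample R-squared.

```r
set.seed(1)  # arbitrary seed for this illustration
n <- 50      # a small sample, chosen only for the example

x1 <- rnorm(n)
y <- 2 * x1 + rnorm(n)                                   # only x1 truly drives y
noise <- as.data.frame(matrix(rnorm(n * 14), nrow = n))  # 14 pure-noise predictors

small <- lm(y ~ x1)
full  <- lm(y ~ ., data = cbind(y = y, x1 = x1, noise))

# In-sample R-squared can only go up when predictors are added
c(r2_small = summary(small)$r.squared,
  r2_full  = summary(full)$r.squared)
```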
As such, I will try to determine the best predictors to use. I will employ the regsubsets() function from the 'leaps' library. regsubsets() performs best subset selection: for each number of predictors, it identifies the model with the lowest residual sum of squares (RSS).
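The idea of best subset selection can be sketched by brute force in base R. This is only an illustration on the built-in mtcars data with a hypothetical helper `best_of_size()`; it is not how leaps implements regsubsets(), and the chosen candidate predictors are assumptions for the example.

```r
# Stand-in data: built-in mtcars with a handful of candidate predictors
response <- "mpg"
candidates <- c("wt", "hp", "qsec", "drat", "disp")

# For one subset size k, fit every combination and keep the lowest-RSS one
best_of_size <- function(k) {
  combos <- combn(candidates, k, simplify = FALSE)
  rss <- sapply(combos, function(vars) {
    fit <- lm(reformulate(vars, response = response), data = mtcars)
    sum(residuals(fit)^2)
  })
  list(vars = combos[[which.min(rss)]], rss = min(rss))
}

best <- lapply(seq_along(candidates), best_of_size)

# Best predictor set and its RSS for each model size
for (k in seq_along(best)) {
  cat(k, "predictor(s):", paste(best[[k]]$vars, collapse = ", "),
      "| RSS =", round(best[[k]]$rss, 2), "\n")
}
```

Exhaustive search like this grows quickly with the number of candidates, which is one reason to rely on a dedicated function for 15 predictors.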
```{r}
subset_model <- regsubsets(Crime~.,crimedata,nvmax =15) # nvmax = 15 considers subsets of up to 15 predictors
subset_summary <- summary(subset_model)
```
```{r}
subset_summary$outmat
```
The * indicates which predictor(s) are used in each model. For example, the 1-predictor model uses Po1, and the 2-predictor model uses Po1 and Ineq. We can see that regardless of the number of predictors, all models employ Po1. This is interesting because in our all-predictor model above, Po1 has a relatively high p-value of 0.07889. However, now it seems that Po1 helps minimize RSS for all models. Maybe the p-value alone cannot determine which factor should be kept or removed.
I am interested to see which other results are included in the summary of the regsubsets() output:
```{r}
names(subset_summary)
```
I see that I can also access RSS, adjusted R-squared, and BIC. First let's plot elbow diagrams for RSS and R-squared:
```{r}
par(mfrow=c(1,2)) # two plots side by side
plot(subset_summary$rss, type = 'b', xlab = 'Number of Predictors', ylab = 'Residual Sum of Squares - RSS')
plot(subset_summary$rsq, type = 'b', xlab = 'Number of Predictors', ylab = 'R-squared')
```
As we can see, RSS falls at a decreasing rate as the number of predictors increases. The marginal benefit of adding a predictor to the model becomes minimal at 6 predictors. Likewise, R-squared increases at a decreasing rate as the number of predictors increases, and its marginal benefit also levels off at 6 predictors. Any improvement beyond that is likely due to overfitting.
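The "marginal benefit" read off an elbow plot can also be computed directly as the drop in RSS from one model size to the next. The sketch below uses nested models on the built-in mtcars data in a fixed, arbitrary predictor order (an assumption for illustration; these are not best subsets and not the crime data).

```r
# Stand-in: nested models on built-in mtcars, adding one predictor at a time
predictors <- c("wt", "hp", "qsec", "drat", "disp")
rss <- sapply(seq_along(predictors), function(k) {
  fit <- lm(reformulate(predictors[1:k], response = "mpg"), data = mtcars)
  sum(residuals(fit)^2)
})

# Drop in RSS from each additional predictor (first entry: going from 1 to 2)
marginal_gain <- -diff(rss)
round(marginal_gain, 2)
```

With the notebook's objects, the same idea would be `-diff(subset_summary$rss)`.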
At 6 predictors, the RSS is:
```{r}
subset_summary$rss[6]
```
At 6 predictors, the R-squared is:
```{r}
subset_summary$rsq[6]
```
Next, I'll plot the BIC graph:
```{r}
plot(subset_summary$bic, type = 'b', xlab = 'Number of Predictors', ylab = 'Bayesian Information Criterion - BIC')
```
The number of predictors that yields the minimum BIC value is:
```{r}
which.min(subset_summary$bic)
```
Here, the model with 6 predictors has the minimum BIC value, which suggests that it is the better model.
At 6 predictors, the BIC is:
```{r}
subset_summary$bic[which.min(subset_summary$bic)]
```
To confirm, I will calculate the absolute difference in BIC between the 15-predictor and 6-predictor models:
```{r}
diff_15vs6 <-abs(subset_summary$bic[6]-subset_summary$bic[15])
diff_15vs6
```
The absolute difference is greater than 10, which means that the 6-predictor model is 'very likely' to be the better model. Together with our conclusions from RSS and R-squared, the 6-predictor model selected by regsubsets() is the optimal model to use.
Here are the coefficients and y-intercept associated with the model using 6 predictors:
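For reference, BIC trades goodness of fit against model size: it is -2 times the log-likelihood plus the number of estimated parameters times log(n), and lower is better. The sketch below compares two stand-in models on the built-in mtcars data using base R's `BIC()`; `bic_manual()` is a hypothetical helper, and nothing here is a claim about how regsubsets() computes its own `bic` values.

```r
# Stand-in models on built-in mtcars (not the crime data)
small <- lm(mpg ~ wt + hp, data = mtcars)
large <- lm(mpg ~ ., data = mtcars)

# BIC by hand: -2 * log-likelihood + (number of estimated parameters) * log(n)
bic_manual <- function(fit) {
  ll <- logLik(fit)
  -2 * as.numeric(ll) + attr(ll, "df") * log(nobs(fit))
}

# The manual value should agree with the built-in BIC()
c(manual = bic_manual(small), builtin = BIC(small))

# Absolute difference between the two models, checked against the threshold of 10
bic_values <- c(small = BIC(small), large = BIC(large))
bic_diff <- abs(diff(bic_values))
unname(bic_diff) > 10
```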
```{r}
model6_coef <- coef(subset_model ,6)
model6_coef
```
Interestingly, the 6-predictor model excludes exactly the predictors whose p-values were higher than 0.1 in the all-predictor model I saw earlier. The equation for the linear regression is therefore:
```{r}
cat('Crime =',model6_coef[1],'+', model6_coef[2],names(model6_coef)[2],'+', model6_coef[3],names(model6_coef)[3],'+', model6_coef[4],names(model6_coef)[4],'+', model6_coef[5],names(model6_coef)[5],'+', model6_coef[6],names(model6_coef)[6],'+', model6_coef[7],names(model6_coef)[7])
```
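The same kind of equation string can be assembled without indexing each coefficient by hand, which keeps working if the number of predictors changes. A minimal sketch, using a hypothetical named coefficient vector shaped like the output of coef():

```r
# Hypothetical named coefficient vector, shaped like the output of coef()
coefs <- c("(Intercept)" = 1.5, x1 = 2, x2 = -0.25)

# Pair each slope with its predictor name, then join all terms with " + "
slope_terms <- paste(coefs[-1], names(coefs)[-1])
equation <- paste("y =", paste(c(coefs[1], slope_terms), collapse = " + "))
equation
```

Substituting `model6_coef` for `coefs` and "Crime" for "y" would give the equation for the 6-predictor model.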
Now let's fit the 6 predictors back into the lm() function for prediction:
```{r}
model6_predictors_name <- names(model6_coef)[-1] # names of the predictors, excluding the intercept
model6_formula <- reformulate(model6_predictors_name, response = "Crime") # Crime ~ the 6 best predictors
model6_fit <- lm(model6_formula, crimedata) # fit the regression model with the 6 best predictors
summary(model6_fit)
```
To confirm that the 6-predictor model above is better than the all-predictor model, I will perform a quick 4-fold cross-validation on both models:
Cross validation for all-predictor-model:
```{r}
all_model_cv <- cv.lm(crimedata,all_model,m=4,seed=93,plotit = FALSE)
```
Cross validation for 6-predictor-model:
```{r}
model6_cv <- cv.lm(crimedata,model6_fit,m=4,seed=93,plotit = FALSE)
```
After cross-validation, the model with 6 predictors shows a smaller mean squared error than the model with all 15 predictors (59,000 vs 81,736). This again suggests that the 6-predictor model is better, because it means less discrepancy between the actual data and the model's estimates. To confirm, I will calculate the R-squared of the cross-validated models:
R-squared is SSR/SST = (SST - SSE)/SST. SST should be the same for both models:
```{r}
SST <- sum((crimedata$Crime - mean(crimedata$Crime))^2) # total sum of squares
SSE_model15 <- attr(all_model_cv,'ms')*nrow(crimedata) # cross-validated mean square times n gives SSE
Rsquare_model15 <- (SST - SSE_model15)/SST
Rsquare_model15
```
```{r}
SST <- sum((crimedata$Crime - mean(crimedata$Crime))^2) # total sum of squares, same for both models
SSE_model6 <- attr(model6_cv,'ms')*nrow(crimedata) # cross-validated mean square times n gives SSE
Rsquare_model6 <- (SST - SSE_model6)/SST
Rsquare_model6
```
As expected, once the models are cross-validated, the 6-predictor model has the higher R-squared (0.597 vs 0.442). This means it is the better model, accounting for 59.7% of the variation in the data.
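The cross-validated mean squared error and R-squared used above can also be computed by hand, which makes the mechanics explicit. This is a base R sketch on the built-in mtcars data with a stand-in formula; it is a hand-rolled illustration of k-fold cross-validation and not the internals of DAAG's cv.lm().

```r
set.seed(93)  # same seed as the notebook; any seed works here
k <- 4
dat <- mtcars              # stand-in for the crime data
form <- mpg ~ wt + hp      # stand-in model formula

# Randomly assign each row to one of k folds of near-equal size
fold <- sample(rep(seq_len(k), length.out = nrow(dat)))

# Hold out each fold in turn, fit on the rest, predict the held-out rows
cv_pred <- numeric(nrow(dat))
for (i in seq_len(k)) {
  test_rows <- fold == i
  fit <- lm(form, data = dat[!test_rows, ])
  cv_pred[test_rows] <- predict(fit, newdata = dat[test_rows, ])
}

# Cross-validated mean squared error and R-squared = (SST - SSE) / SST
sse <- sum((dat$mpg - cv_pred)^2)
sst <- sum((dat$mpg - mean(dat$mpg))^2)
cv_mse <- sse / nrow(dat)
cv_r2 <- (sst - sse) / sst
c(cv_mse = cv_mse, cv_r2 = cv_r2)
```

Because every row is predicted by a model that never saw it, the resulting R-squared is typically lower than the in-sample value.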
The given city has the following input for the relevant predictors of the model:
```{r}
model6_predictions <- predict(model6_fit,data.frame(M = 14.0
,Ed = 10.0
,Po1 = 12.0
,U2 = 3.6
,Ineq = 20.1
,Prob = 0.04)
,interval ="confidence")
model6_predictions
```
The predicted crime rate with our 6-predictor model is 1304 crime cases per 100,000 people. I also notice that the confidence interval is much narrower this time.
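When judging the width of these intervals, it may help to keep in mind that `interval = "confidence"` describes uncertainty in the mean response, while `interval = "prediction"` describes uncertainty for a single new observation such as one city. The sketch below contrasts the two on the built-in mtcars data; the model and the new observation are hypothetical stand-ins.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # stand-in model
new_car <- data.frame(wt = 3, hp = 120)   # hypothetical new observation

# Uncertainty in the mean response at these predictor values
conf_int <- predict(fit, new_car, interval = "confidence")
# Uncertainty for a single new observation at these predictor values
pred_int <- predict(fit, new_car, interval = "prediction")

rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])
```

Both intervals share the same fitted value, but the prediction interval is wider because it also includes the residual variation of an individual observation.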