-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathmcspadden2.Rmd
319 lines (216 loc) · 12.8 KB
/
mcspadden2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
---
title: "COMP 4299 Assignment 3"
output: html_document
date: "2023/02/11 - Briana McSpadden"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r, include = FALSE}
if(!require(tidyverse)) install.packages("tidyverse", repos = "http://cran.us.r-project.org")
#forggplot
if(!require(dslabs)) install.packages("dslabs", repos = "http://cran.us.r-project.org")
if(!require(ggplot2)) install.packages("ggplot2", repos = "http://cran.us.r-project.org")
if(!require(cowplot)) install.packages("cowplot", repos = "http://cran.us.r-project.org")
#this is for the plot_grid function to show the scatterplots in part one
if(!require(sjstats)) install.packages("sjstats", repos = "http://cran.us.r-project.org")
#this is for rmse values
# Getting and reading data:
# Get the Concrete Data from GitHub
url <- 'https://raw.githubusercontent.com/jholland5/COMP4299/main/Concrete_Data.csv'
# Read the data into R and name it concrete.
concrete <- read_csv(url)
```
## Part 1 - Concrete Data Analysis
The data we will be looking at in this report looks at 1,030 rows of 9 numeric variables relating to concrete. The variables are listed below and consist of different elements that go into a m^3 mixture of concrete along with age that serve as predictors for the the last variable, strength. The name in the parenthesis is the variable name that is in the data itself.
##### Variable Names and Descriptions:
* **Cement** (cem): Numeric variable that measures the kg of cement in a k^3 mixture of concrete.
* **Slag** (slag): Numeric variable that measures the kg of slag in a k^3 mixture of concrete.
* **Fly Ash** (FA): Numeric variable that measures the kg of fly ash in a k^3 mixture of concrete. Fly Ash is a light material added to mixtures of concrete to improve the work-ability of the mixture.(<https://www.thespruce.com/fly-ash-applications-844761>)
* **Water** (h2o): Numeric variable that measures the kg of water in a k^3 mixture of concrete.
* **Superplasticizer** (plast): Numeric variable that measures the kg of superplasticizers in a k^3 mixture of concrete. They are added to concrete mixtures to make it more pourable without having to add more water. (<https://www.concreteconstruction.net/how-to/materials/how-super-are-superplasticizers_o>)
* **Coarse Aggregate** (cAgg): Numeric variable that measures the kg of course aggregate in a k^3 mixture of concrete. This would be like the rock, stone, or gravel used to make concrete.
* **Fine Aggregate** (fAgg): Numeric variable that measures the kg of fly ash in a k^3 mixture of concrete. This would be the sand and finer natural materials used to make concrete.
* **Age** (age): Numeric variable representing the age of the concrete.
* **Strength** (strength): Numeric variable measuring the overall strength of the concrete.
<br>
##### Here is a look at the first 6 rows of the data:
```{r, echo = FALSE}
#making the data into a data frame
concrete_df <- as.data.frame(concrete)
#creating a variable with the first 6 rows of the data
table_data <- concrete_df[1:6, 1:9]
#printing a table of that variable.
knitr::kable(table_data)
```
<br>
##### Summary Statistics of the data:
```{r, echo = FALSE}
#grouping out the first three variables
ConcreteSummary1 <- summary(concrete_df[,1:3])
#running a summary on them and printing it in a table
knitr::kable(ConcreteSummary1)
#grouping out the second three variables
ConcreteSummary2 <- summary(concrete_df[,4:6])
#running a summary on them and printing it in a table
knitr::kable(ConcreteSummary2)
#grouping out the first three variables
ConcreteSummary3 <- summary(concrete_df[,7:9])
#running a summary on them and printing it in a table
knitr::kable(ConcreteSummary3)
```
<br>
##### Vizualization of the data with Scatterplots:
```{r, echo = FALSE}
#Cement Scatterplot
cem_plot <- concrete %>% ggplot()
cem_plot <- cem_plot + geom_point(aes(strength, cem)) + labs(title = "Strength vs Cement", x = "Strength", y = "Cement (kg)")
#Slag Scatterplot
slag_plot <- concrete %>% ggplot()
slag_plot <- slag_plot + geom_point(aes(strength, slag)) + labs(title = "Strength vs Slag", x = "Strength", y = "Slag (kg)")
#Fly Ash Scatterplot
FA_plot <- concrete %>% ggplot()
FA_plot <- FA_plot + geom_point(aes(strength, FA)) + labs(title = "Strength vs Fly Ash", x = "Strength", y = "Fly Ash (kg)")
#Water Scatterplot
h2o_plot <- concrete %>% ggplot()
h2o_plot <- h2o_plot + geom_point(aes(strength, h2o)) + labs(title = "Strength vs Water", x = "Strength", y = "Water (kg)")
#Superplasticizer Scatterplot
plast_plot <- concrete %>% ggplot()
plast_plot <- h2o_plot + geom_point(aes(strength, plast)) + labs(title = "Strength vs Superplasticizer", x = "Strength", y = "Superplasticizer (kg)")
#Course Aggregate Scatterplot
cAgg_plot <- concrete %>% ggplot()
cAgg_plot <- h2o_plot + geom_point(aes(strength, cAgg)) + labs(title = "Strength vs Course Aggregate", x = "Strength", y = "Course Aggregate (kg)")
#Fine Aggregate Scatterplot
fAgg_plot <- concrete %>% ggplot()
fAgg_plot <- h2o_plot + geom_point(aes(strength, fAgg)) + labs(title = "Strength vs Fine Aggregate", x = "Strength", y = "Fine Aggregate (kg)")
#Age Scatterplot
age_plot <- concrete %>% ggplot()
age_plot <- h2o_plot + geom_point(aes(strength, age)) + labs(title = "Strength vs Age", x = "Strength", y = "Age")
#Grid of the first four scatter plots
plot_grid(cem_plot, slag_plot, FA_plot, h2o_plot, ncol=2)
#Grid of the last four scatter plots
plot_grid(plast_plot, cAgg_plot, fAgg_plot, age_plot, ncol=2)
```
---
## Part 2 - Simple Linear Regression Model
This section will first look at the correlations in the data. After analyzing the correlations, the greatest value will be chosen and used to make a Simple Linear Regression Model.
<br>
##### Correlation Values:
```{r, echo = FALSE}
#Creating a linear model
knitr::kable(round(cor(concrete), 5))
```
<br>
For the purpose of this report, we are looking to see which correlation value is the greatest in regard to strength. The best predictor based on correlation is **cement** with a value of **0.49783.** So, using cement as the predictor, we will compute a simple linear regression summary.
<br>
##### Simple Linear Regression Summary Results:
```{r, echo = FALSE}
#summary stats of lm with strength and cement
simplelr <- lm(strength ~ cem, data = concrete)
summary(simplelr)
```
##### RMSE - Root Mean Square Error:
```{r, echo = FALSE}
rmse(simplelr)
```
<br>
##### Interpretation:
* **Min and Max of Residuals**: In the scatterplot shown below, you can see that the line fits the data well, but not perfect. The minimum and maximum values of residuals are the distances of the farthest points above and below the line of best fit. As you can see from the summary statistics, the value is **-40.593** for the farthest point below the line and **43.24** for the farthest above.
* **Estimate**: The values under the estimates column in the Coefficients section give the values of the equation for the line of best fit. The equation is **strength = 13.442528 + 0.079580(cem).** The y-intercept of the line will be at 13. 44 and the slope of the line will be 0.0796.
* **t-value**: The t-value for the cement predictor is **18.4.** This is calculated by taking the Estimate Value and dividing it by the Standard Error value: 0.079580/0.004324 = 18.40. It's a measure of the number of standard deviations the coefficient estimate is from 0. The greater, the better. 18.4 is a good t-value.
* **p-value**: The p-value is essentially 0 for the cement predictor. P value represents the probability of getting the ratio assuming that there is no relationship between cement and strength of the concrete. Because this value, **<2e-16** is essentially 0, we know that the relationship. The three stars next to the p-value indicate that it is significant.
* **Residual Standard Error**: This is a measure of variation in the residuals. Ideally, the lower the RSE, the better. For this report, we will look closer at the RMSE rather than Residual Standard Error because it uses the same units as the response variable.
* **Multiple R-Squared**: Multiple R-Squared can be helpful in determining how well the model describes the data. 1 would mean that it models the data perfect while 0 would mean the opposite. For these results, the value of Multiple R-Squared is **0.2478.** This means that about a quarter of the variation in strength is explained by cement.
* **RMSE**: RMSE stands for "Root Mean Squared Error". Ideally here, the lower the value the better. This value indicates how far the data points are from the regression line. If the value is higher that would mean that there is a large deviation between the actual value and the residual. We want a lower value to know that the model fits the dataset.
<br>
```{r, echo = FALSE}
#scatter plot with the line of best fit from the lm
ggplot(concrete, aes(x = strength, y = cem)) +
geom_point() +
stat_smooth(method = "lm") +
labs(title = "Strength as a function of Cement Value", x = "Strength", y = "Cement (kg)")
```
---
## Part 3 - Multiple Linear Regression Model Through Backward Elimination
This section will divide the data in half and use one half as the training set and one as the testing set. Backwards elimination will be used to produce the best Multiple Linear Regression Model as possible.
<br>
##### Using Training Data - Model 1:
```{r, echo = FALSE}
#setting the seed so the results are reproducible
set.seed(1965)
#Splitting the data into training and testing
index <- sample(1:1030,515,replace = FALSE)
train_data <- concrete_df[index,]
test_data <- concrete_df[-index,]
#running model one. Uses all of the variables.
model1 <- lm(strength ~., data = train_data)
summary(model1)
```
##### RMSE - Root Mean Square Error for Model 1:
```{r, echo = FALSE}
#caluclating the RMSE
rmse(model1)
```
To make the model better, will remove the value with the highest p-value. The value being removed is Course Aggregate (cAgg).
<br>
##### Using Training Data - Model 2:
```{r, echo = FALSE}
#setting the seed so the results are reproducible
model2 <- lm(strength ~ cem + slag + FA + h2o + plast + fAgg + age, data = train_data)
summary(model2)
```
##### RMSE - Root Mean Square Error for Model 2:
```{r, echo = FALSE}
#caluculating the RMSE
rmse(model2)
```
RMSE did not change significantly. We'll take out the next highest, fAgg.
<br>
##### Using Training Data - Model 3:
```{r, echo = FALSE}
#setting the seed so the results are reproducible
model3 <- lm(strength ~ cem + slag + FA + h2o + plast + age, data = train_data)
summary(model3)
```
##### RMSE - Root Mean Square Error Model 3:
```{r, echo = FALSE}
#caluculating the RMSE
rmse(model3)
```
Take out the last variable with a P-value greater than .05, which is plast.
<br>
##### Using Training Data - Model 4:
```{r, echo = FALSE}
#setting the seed so the results are reproducable
model4 <- lm(strength ~ cem + slag + FA + h2o + age, data = train_data)
summary(model4)
```
##### RMSE - Root Mean Square Error Model 4:
```{r, echo = FALSE}
#caluculating the RMSE
rmse(model4)
```
All of the P-values are significant.
**Model 4 is the chosen model.**
##### The equation of the model:
**strength = 36.463727 + 0.108062(cem) + 0.093776(slag) + 0.080383(FA) - 0.259348(h2o) + 0.108796(age)**
---
## Part 4 - One Sample t-test
This part will evaluate the model found in part three by preforming a two sample t-test on the test data using the chosen model, 4, from part three.
<br>
```{r, echo = FALSE}
#assigning a variable of predicted values
predicted_values <- predict(model4,test_data)
#running the t-test on the data
t.test(test_data$strength - predicted_values)
```
**Null Hypothesis:** The true mean is equal to 0.\
**Alternative Hypothesis:** The true mean is not equal to 0.
<br>
##### Interpretation:
When evaluating the model by running a two sample t-test, there are two values that help to indicate whether or not the model is good: A good p-value and a confidence interval that is tight and around 0. Here, since the p value is .424, not significant, we know that there is a 42.4% chance of getting data like this randomly and its average being 0. We accept the null and reject the alternative hypothesis. This is a sufficient p value for the validation. Along with that, we see that the confidence interval is small, covering a range of less than 2.0 and 0 is within the interval. From this, we can believe that the model is going to make adequate predictions for other data sets.
---
## Part 5 - Changing the Seed
After trying 4 different seeds on this data, we conclude that the results are reliable. In all instances, the RMSE of the model was around the same plus or minus .5 and all of the p values for the variables left in model four were significant. When running the t-test on the test-data, the P value and confidence interval indicated that the model would make adequate predictions. The consistency in changing the seed led to the conclusion that the results are good.
<br>
<br>