-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathRegression_Models_Automobiles.Rmd
151 lines (109 loc) · 5.51 KB
/
Regression_Models_Automobiles.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
---
title: "Relation between miles per gallon and gear transmission type"
author: "Harish Kumar Rongala"
date: "January 20, 2017"
output:
pdf_document: default
html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## 1. Executive Summary
This document explores the relationship between **miles per gallon** (MPG) and factors affecting it. We used "mtcars" data set from *datasets* package in R. It is extracted from the 1974 Motor Trend US magazine, which is about the automobile industry. It comprises fuel consumption and **10** other aspects of automobile design and performance for **32** automobiles (1973-74 models).
This document attempts to answer the following questions
+ Which provides better MPG, Automatic or manual transmission ?
+ Quantify the MPG difference between automatic and manual transmissions
## 2. Exploratory data analysis
After looking at individual variable's relationship with **mpg** [see appendix-6.1], we can find significant trends in the plots, showing a positive or negative impact on the mpg. In the **regression modeling section** we will quantify this relationship.
The following 5 variables are factor variables but they are labeled as a numeric class. We have to transform these variables in to factor class to make more sense of them in our modeling.
```{r echo=FALSE}
names(mtcars)[lapply(mtcars[1:11],FUN = function(x){length(unique(x))})<10]
```
```{r}
mtcars$cyl<-factor(mtcars$cyl);
mtcars$vs<-factor(mtcars$vs);
mtcars$am<-factor(mtcars$am);
levels(mtcars$am)<-c("automatic","manual");
mtcars$gear<-factor(mtcars$gear);
mtcars$carb<-factor(mtcars$carb);
```
## 3. Regression modeling
As the dependent variable **mpg** is not binomial or a count variable, we use a linear model to fit our data. In our first model we include all the independent variables in the model, as we found that they have certain degree of impact on mpg from exploratory data analysis.
```{r results="hide"}
## Output of this fit can be found in appendix #Fit0
fit0<-lm(mpg~.,data=mtcars);
round(summary(fit0)$coef,3);
```
P-values of most variables are insignificant, so we drop those variables and re-fit the model. However, we can't drop **am** variable which indicates transmission mode. Because we want to find the relation between **am** and **mpg**. In our second model, our independent variables are **am, hp and wt**.
```{r results="hide"}
## Output of this fit can be found in appendix #Fit1
fit1<-lm(mpg~am+hp+wt,data=mtcars);
round(summary(fit1)$coef,3);
```
We use a step wise model selection algorithm based on **AIC** Akaike's 'An Information Criterion' to fit a new linear model.
```{r results="hide"}
better_fit<-step(fit0,direction = "both");
```
Now, we got 3 models. However, we knew that first model - fit0 is too insignificant when compared to other two models. We use **ANOVA** variance analysis technique to analyze our models.
```{r}
anova(fit1,better_fit);
```
Our anova test shows that model 3 - 'better_fit' is significant than 'fit1', with a P-value of **0.1**. Let's look at the coefficients given by this model.
```{r}
round(summary(better_fit)$coef,5);
```
## 4. Inference & Conclusion
+ Cars with **Manual** transmission will have more miles per (US) gallon by a factor of **1.809**, over an **automatic** transmission. This value is adjusted by considering horsepower, weight and no. of cylinders.
+ **95%** of this factor lies between **-1.06 to 4.67** (see Appendix 6.3.4), as zero lies in the interval, we are not quite confident about its significance.
+ Our linear model explains **86.58%** of the total variance.
## 5. Diagnostics
Refer appendix 6.2, to find the diagnostics plot of the 'better_fit' model.
+ Residual Vs Fitted plot doesn't show any **systematic pattern** or **heteroskedasticity**, so our model fit is good.
+ Remaining plots show few outlier which have some **influence and leverage** on the model fit. The following car models are the outliers
```{r}
df<-as.data.frame(dfbetas(better_fit));
rownames(df[df$ammanual %in% tail(sort(df$ammanual),4),]);
```
## 6. Appendix
### 6.1. Relationship plots
```{r warning=FALSE, echo=FALSE}
library(datasets);
data("mtcars");
library(ggplot2);
library(gridExtra);
plot1<-ggplot(aes(mpg,cyl),data=mtcars)+geom_point()+geom_smooth(method = lm);
plot2<-ggplot(aes(mpg,disp),data=mtcars)+geom_point()+geom_smooth(method = lm);
plot3<-ggplot(aes(mpg,hp),data=mtcars)+geom_point()+geom_smooth(method = lm);
plot4<-ggplot(aes(mpg,drat),data=mtcars)+geom_point()+geom_smooth(method = lm);
plot5<-ggplot(aes(mpg,wt),data=mtcars)+geom_point()+geom_smooth(method = lm);
plot6<-ggplot(aes(mpg,qsec),data=mtcars)+geom_point()+geom_smooth(method = lm);
plot7<-ggplot(aes(mpg,vs),data=mtcars)+geom_point()+geom_smooth(method = lm);
plot8<-ggplot(aes(mpg,am),data=mtcars)+geom_point()+geom_smooth(method = lm);
plot9<-ggplot(aes(mpg,gear),data=mtcars)+geom_point()+geom_smooth(method = lm);
plot10<-ggplot(aes(mpg,carb),data=mtcars)+geom_point()+geom_smooth(method = lm);
grid.arrange(plot1,plot2,plot3,plot4,plot5,plot6,plot7,plot8,plot9,plot10,ncol=5);
```
### 6.2. Diagnostics plot
```{r echo=FALSE}
par(mfrow=c(2,2));
plot(better_fit);
```
### 6.3. Summary of linear models
#### 6.3.1. First model - fit0
```{r echo=FALSE}
round(summary(fit0)$coef,5);
```
#### 6.3.2. Second model - fit1
```{r echo=FALSE}
round(summary(fit1)$coef,5);
```
#### 6.3.3. Third model - better_fit
```{r echo=FALSE}
round(summary(better_fit)$coef,5);
```
#### 6.3.4. 95% Confidence interval
```{r}
# Get the 95% confidence interval
confint(better_fit,'ammanual');
```