#Practical Machine Learning: Prediction Assignment
##Background
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks.
One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this data set, the participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which the participants did the exercise.
The dependent variable or response is the "classe" variable in the training set.
##Data
Download and load the data.
```{r}
#download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", destfile = "./pml-training.csv")
#download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", destfile = "./pml-testing.csv")
# Adjust this path to wherever the CSV files were downloaded.
setwd("C:/Users/murughan/Desktop/coursera")
# Treat empty strings, "NA", and "NULL" as missing values on import.
trainingOrg = read.csv("pml-training.csv", na.strings=c("", "NA", "NULL"))
# data.train = read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", na.strings=c("", "NA", "NULL"))
testingOrg = read.csv("pml-testing.csv", na.strings=c("", "NA", "NULL"))
dim(trainingOrg)
```
```{r}
dim(testingOrg)
```
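Before any pre-processing, it helps to look at the response itself. A minimal sketch (not part of the original analysis) tabulating the five levels of "classe":
```{r}
# Distribution of the response; "classe" is a factor with levels A-E.
table(trainingOrg$classe)
```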
##Pre-screening the data
There are several approaches for reducing the number of predictors.
Remove variables that we believe have too many NA values.
```{r}
training.dena <- trainingOrg[ , colSums(is.na(trainingOrg)) == 0]
#head(training1)
#training3 <- training.decor[ rowSums(is.na(training.decor)) == 0, ]
dim(training.dena)
```
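The filter above keeps only columns with zero NAs. As a hedged sanity check, one could first inspect the fraction of missing values per column; in this data set columns tend to be either complete or almost entirely NA, so the strict threshold loses little information:
```{r}
# Fraction of NAs per column in the raw training data.
na.frac <- colMeans(is.na(trainingOrg))
summary(na.frac)
```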
Remove irrelevant variables. There are some variables that can be removed because they are unlikely to be related to the dependent variable.
```{r}
remove = c('X', 'user_name', 'raw_timestamp_part_1', 'raw_timestamp_part_2', 'cvtd_timestamp', 'new_window', 'num_window')
training.dere <- training.dena[, -which(names(training.dena) %in% remove)]
dim(training.dere)
```
Check for variables with extremely low variance; the nearZeroVar() function from the caret package is useful here.
```{r}
library(caret)
```
```{r}
# only numeric variables can be evaluated in this way.
zeroVar= nearZeroVar(training.dere[sapply(training.dere, is.numeric)], saveMetrics = TRUE)
training.nonzerovar = training.dere[,zeroVar[, 'nzv']==0]
dim(training.nonzerovar)
```
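To see what nearZeroVar() actually measured, one can inspect the metrics returned by saveMetrics = TRUE; a short sketch:
```{r}
# freqRatio, percentUnique, zeroVar and nzv per numeric predictor;
# nzv == TRUE flags near-zero-variance columns.
head(zeroVar)
sum(zeroVar$nzv)  # number of columns flagged
```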
Remove highly correlated variables (correlation above 90%), using for example findCorrelation().
```{r}
# only numeric variables can be evaluated in this way.
corrMatrix <- cor(na.omit(training.nonzerovar[sapply(training.nonzerovar, is.numeric)]))
dim(corrMatrix)
```
```{r}
# there are 52 numeric variables.
library(lattice)  # levelplot() comes from lattice
corrDF <- expand.grid(row = 1:52, col = 1:52)
corrDF$correlation <- as.vector(corrMatrix)
levelplot(correlation ~ row + col, corrDF)
```
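The level plot shows a few blocks of strong correlation. To list the exact predictor pairs above the 90% cutoff, a small sketch (not in the original analysis):
```{r}
# Predictor pairs with |r| > 0.9; upper triangle only, to avoid duplicates.
highCorr <- which(abs(corrMatrix) > 0.9 & upper.tri(corrMatrix), arr.ind = TRUE)
data.frame(var1 = rownames(corrMatrix)[highCorr[, 1]],
           var2 = colnames(corrMatrix)[highCorr[, 2]])
```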
We are going to remove those variables which have high correlation.
```{r}
removecor = findCorrelation(corrMatrix, cutoff = .90, verbose = TRUE)
training.decor = training.nonzerovar[,-removecor]
dim(training.decor)
```
##Split the data into training and testing sets for cross-validation
```{r}
inTrain <- createDataPartition(y=training.decor$classe, p=0.7, list=FALSE)
training <- training.decor[inTrain,]; testing <- training.decor[-inTrain,]
dim(training);dim(testing)
```
We got 13737 samples and 46 variables for training, 5885 samples and 46 variables for testing.
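createDataPartition() samples within each class, so the class proportions should be roughly the same in both splits; a quick hedged check:
```{r}
# Class proportions in each split; these should be close because
# createDataPartition() performs stratified sampling on classe.
round(prop.table(table(training$classe)), 3)
round(prop.table(table(testing$classe)), 3)
```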
##Analysis
Classification Tree
Now we fit a classification tree to these data, then summarize and plot it. First we use the 'tree' package; it is much faster than the 'caret' package.
```{r}
library(tree)
set.seed(12345)
tree.training=tree(classe~.,data=training)
summary(tree.training)
```
```{r}
plot(tree.training)
text(tree.training,pretty=0, cex =.8)
```
This is a bushy tree, and we are going to prune it.
##Rpart from caret (very slow)
```{r}
library(caret)
modFit <- train(classe ~ .,method="rpart",data=training)
print(modFit$finalModel)
```
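Much of the slowness comes from train()'s default of 25 bootstrap resamples. As an assumption-labeled alternative (not the settings used above), 5-fold cross-validation via trainControl() is usually much faster with similar tuning results:
```{r}
# 5-fold CV instead of the default 25 bootstrap resamples (an assumed
# speed-up, not part of the original analysis).
ctrl <- trainControl(method = "cv", number = 5)
modFitCV <- train(classe ~ ., method = "rpart", data = training, trControl = ctrl)
print(modFitCV)
```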
##Prettier plots
```{r}
library(rattle)
```
```{r}
fancyRpartPlot(modFit$finalModel)
```
##Out-of-sample performance
We are going to check the performance of the tree on the held-out testing data.
```{r}
tree.pred=predict(tree.training,testing,type="class")
predMatrix = with(testing,table(tree.pred,classe))
sum(diag(predMatrix))/sum(as.vector(predMatrix)) # accuracy
```
The accuracy of about 0.70 is not very high.
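caret's confusionMatrix() reports the same accuracy together with per-class sensitivity and specificity; a sketch as an alternative check:
```{r}
# Accuracy plus per-class statistics for the 'tree' model on the test set.
confusionMatrix(tree.pred, testing$classe)
```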
```{r}
tree.pred=predict(modFit,testing)
predMatrix = with(testing,table(tree.pred,classe))
sum(diag(predMatrix))/sum(as.vector(predMatrix)) # accuracy
```
The accuracy of about 0.50 from the 'caret' package is much lower than the result from the 'tree' package.
##Pruning tree
This tree was grown to full depth and might be too variable. We now use cross-validation to prune it.
```{r}
cv.training=cv.tree(tree.training,FUN=prune.misclass)
cv.training
```
```{r}
plot(cv.training)
```
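The choice best = 18 in the next chunk corresponds to a tree size whose cross-validated misclassification count is near the minimum; it can also be read off programmatically, e.g.:
```{r}
# Tree size with the lowest cross-validated misclassification count
# (cv.tree() stores it in $dev when FUN = prune.misclass).
cv.training$size[which.min(cv.training$dev)]
```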
```{r}
prune.training=prune.misclass(tree.training,best=18)
```
Now let's evaluate this pruned tree on the test data.
```{r}
tree.pred=predict(prune.training,testing,type="class")
predMatrix = with(testing,table(tree.pred,classe))
sum(diag(predMatrix))/sum(as.vector(predMatrix)) # accuracy
```
##Random Forests
Random forests build lots of bushy trees, and then average them to reduce the variance.
```{r}
require(randomForest)
```
```{r}
set.seed(12345)
```
Let's fit a random forest and see how well it performs.
```{r}
rf.training=randomForest(classe~.,data=training,ntree=100, importance=TRUE)
rf.training
```
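The printed OOB estimate can also be pulled directly out of the fitted object; a small sketch:
```{r}
# OOB error rate after the final (100th) tree; matches the printed estimate.
tail(rf.training$err.rate[, "OOB"], 1)
```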
```{r}
#plot(rf.training, log="y")
varImpPlot(rf.training)
```
##Out-of-Sample Accuracy
Our random forest model shows an OOB estimate of error rate of 0.72% on the training data. Now let's evaluate the model on the held-out test data to estimate its out-of-sample accuracy.
```{r}
tree.pred=predict(rf.training,testing,type="class")
predMatrix = with(testing,table(tree.pred,classe))
sum(diag(predMatrix))/sum(as.vector(predMatrix)) # accuracy
```
An accuracy of about 0.99 on the held-out testing set means the model generalizes very well.
##Conclusion
Now we can predict on the testing data downloaded from the website.
```{r}
answers <- predict(rf.training, testingOrg)
answers
```
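If the grader expects one small text file per test case (the original course submission format), a helper along these lines can write them out; the naming scheme here is an assumption:
```{r}
# Write one text file per prediction (problem_1.txt, problem_2.txt, ...);
# the file-naming convention is assumed, adjust to the grader's requirements.
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    write.table(x[i], file = paste0("problem_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
# pml_write_files(answers)
```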