---
title: "Human Activity Recognition"
author: "Christian Otto"
graphics: yes
output:
  html_document:
    toc: false
    theme: united
---
# Human Activity Recognition
Course: Practical Machine Learning
Author: Christian Otto
Date: `r format(Sys.Date(), "%B %d, %Y")`
### Preparation
It is necessary to load R packages and set global options to
`knitr`. Note that it is assumed that all required packages are
installed. In addition, the random generator is initialized with the
same seed in order to obtain reproducibility of the results.
```{r global_options, cache=TRUE, results="hide", message=FALSE, warning=FALSE}
library(plyr)
library(ggplot2)
library(knitr)
library(caret)
set.seed(1337)
opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)
```
### Data Processing
#### Retrieving and Loading the Data
Download the data (if necessary) and read the training and testing
data into data frames.
```{r data_loading, cache=TRUE}
## Set URLs accordingly
training.url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
training.file <- "pml-training.csv"
testing.url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
testing.file <- "pml-testing.csv"
## Download files
if (!file.exists(training.file)){
download.file(training.url, training.file, method="curl")
}
if (!file.exists(testing.file)){
download.file(testing.url, testing.file, method="curl")
}
## Load data into R
data <- read.csv(file=training.file, na.strings=c("NA", "", "#DIV/0!"))
testing <- read.csv(file=testing.file, na.strings=c("NA", "", "#DIV/0!"))
```
#### Partitioning the data into training and probing datasets
The data are partitioned into a 60% training set and a 40% probing
set so that the out-of-sample error can be estimated on observations
not used for model building.
```{r data_partitioning, cache=TRUE}
inTrain <- createDataPartition(data$classe, p = 0.6)[[1]]
training <- data[inTrain,]
probing <- data[-inTrain,]
```
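As an optional sanity check, the relative sizes of the two partitions
can be compared with the requested 60/40 split:
```{r data_partitioning_check, cache=TRUE}
## The partitions should cover roughly 60% and 40% of the full dataset
round(c(training = nrow(training), probing = nrow(probing)) / nrow(data), digits = 2)
```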
#### Filtering covariates
Identify predictors in the training set that consist primarily of NA
values (i.e., more than 50% of the values are NA). In fact, there are
100 columns with more than 97% missing values and 60 columns with no
missing values at all, so imputation is not a sensible option.
Moreover, identify columns that do not contain predictor variables.
The NA and non-predictor columns are then removed from all datasets
(training, probing, and testing). In addition, the classe variable is
converted to a factor variable for model building. In the end, the
datasets consist of `52` covariates.
```{r data_filtering, cache=TRUE}
NA.freq <- sapply(training, function(i){mean(is.na(i))})
NA.cols <- names(NA.freq[NA.freq > 0.5])
nonPred.cols <- c("X", "user_name", "raw_timestamp_part_1",
"raw_timestamp_part_2", "cvtd_timestamp",
"new_window", "num_window", "problem_id")
training <- subset(training, select=which(!colnames(training) %in% c(NA.cols, nonPred.cols)))
probing <- subset(probing, select=which(!colnames(probing) %in% c(NA.cols, nonPred.cols)))
testing <- subset(testing, select=which(!colnames(testing) %in% c(NA.cols, nonPred.cols)))
training$classe <- factor(training$classe)
probing$classe <- factor(probing$classe)
```
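These counts can be verified directly from the NA frequencies
computed above; the following optional check counts the
mostly-missing and complete columns of the original training data and
the covariates remaining after filtering:
```{r data_filtering_check, cache=TRUE}
## Expected from the text: 100 columns with > 97% NAs, 60 columns
## without NAs, and 52 covariates remaining (classe is not a covariate)
c(mostly.NA = sum(NA.freq > 0.97),
  complete = sum(NA.freq == 0),
  covariates = ncol(training) - 1)
```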
#### Building models on training dataset
Here, two different prediction models (random forests and boosting)
are built on the training set, since it was stated in the lectures
that these generally perform very well. The model parameters are
tuned during training using 10-fold cross-validation.
```{r model_fitting_control, cache=TRUE}
ctr <- trainControl(method = "cv", number = 10)
```
```{r model_fitting_rf, cache=TRUE}
model.rf <- train(classe ~ ., data = training, method = "rf", trControl=ctr)
```
```{r model_fitting_boost, cache=TRUE}
model.boost <- train(classe ~ ., data = training, method = "gbm", verbose=FALSE, trControl=ctr)
```
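The tuning parameter values selected by the 10-fold cross-validation
can be inspected through the `bestTune` element of each fitted
`train` object; this is only an optional look at the fitted models:
```{r model_fitting_tuning, cache=TRUE}
## Tuning parameters chosen by cross-validation for each model
model.rf$bestTune
model.boost$bestTune
```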
The maximum accuracies from the cross-validations can now be
extracted. Even though these are not good estimates of the
out-of-sample error (hence the separate probing dataset), they give a
first impression of the models. It turns out that both models are
highly accurate, with a minor advantage for the random forest model.
```{r model_cv_acc, cache=TRUE}
data.frame(model=c("random forest", "boosting"),
accuracy=c(round(max(model.rf$results$Accuracy), digits=3),
round(max(model.boost$results$Accuracy), digits=3)))
```
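Alternatively, caret's `resamples` function collects the
cross-validation results of both models so that their accuracy
distributions can be compared side by side; a brief optional sketch:
```{r model_cv_resamples, cache=TRUE}
## Side-by-side summary of the cross-validation resamples of both models
cv.results <- resamples(list(rf = model.rf, boosting = model.boost))
summary(cv.results)
```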
#### Getting estimates of the out-of-sample error for both models
First, predictions are made on the probing dataset with both
models. Second, the `confusionMatrix` command is used to obtain an
estimate of the out-of-sample error (i.e., 1 - accuracy). Note that
only the overall statistics are reported, since the full report is
rather large. Again, both models reach accuracies above 90%, with the
random forest performing best by a small margin. The error estimates
from the 10-fold cross-validations are in fact very close to those
obtained on the probing dataset, suggesting that this partitioning
may not have been necessary to get an accurate out-of-sample error
estimate and that the models could have been trained on more
observations in the first place.
For random forests:
```{r pred_probing_acc_rf, cache=TRUE}
pred.rf.probing <- predict(model.rf, probing)
round(confusionMatrix(pred.rf.probing, probing$classe)$overall, digits=3)
```
For boosting:
```{r pred_probing_acc_boosting, cache=TRUE}
pred.boost.probing <- predict(model.boost, probing)
round(confusionMatrix(pred.boost.probing, probing$classe)$overall, digits=3)
```
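Since the out-of-sample error is defined as 1 - accuracy, the
corresponding error estimates on the probing dataset can also be
reported explicitly; a small sketch based on the predictions above:
```{r pred_probing_error, cache=TRUE}
## Estimated out-of-sample error (1 - accuracy) on the probing dataset
err.rf <- 1 - unname(confusionMatrix(pred.rf.probing, probing$classe)$overall["Accuracy"])
err.boost <- 1 - unname(confusionMatrix(pred.boost.probing, probing$classe)$overall["Accuracy"])
round(c(rf = err.rf, boosting = err.boost), digits = 3)
```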
#### Predicting test cases
Instead of using model ensembling or stacking, the predictions are
simply made with both models and checked for agreement. If the
predictions agree, the models appear to work very well and the
test-case predictions can be trusted with more confidence. The
results are then written to files (using the supplied submission
code) and submitted to the course website.
```{r pred_testing, cache=TRUE}
pred.rf.testing <- predict(model.rf, testing)
pred.boost.testing <- predict(model.boost, testing)
cat("Do both models agree on all test cases?",
all(pred.rf.testing == pred.boost.testing), "\n")
answers <- pred.rf.testing
## Supplied helper: write each prediction to its own text file for submission
pml_write_files <- function(x){
  n <- length(x)
  for(i in 1:n){
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file=filename, quote=FALSE, row.names=FALSE, col.names=FALSE)
  }
}
pml_write_files(answers)
```
According to the automated grading, all test cases turned out to be correct.