#Practical Machine Learning: Prediction Assignment
##Background
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks.
One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this data set, the participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which the participants did the exercise.
The dependent variable or response is the "classe" variable in the training set.
##Data
Download and load the data.
```{r}
#download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", destfile = "./pml-training.csv")
#download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", destfile = "./pml-testing.csv")
# Adjust this path to wherever the CSV files were downloaded.
setwd("C:/Users/murughan/Desktop/coursera")
# Treat empty strings, "NA", and "NULL" as missing values on import.
trainingOrg = read.csv("pml-training.csv", na.strings=c("", "NA", "NULL"))
# data.train = read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", na.strings=c("", "NA", "NULL"))
testingOrg = read.csv("pml-testing.csv", na.strings=c("", "NA", "NULL"))
dim(trainingOrg)
```
```{r}
dim(testingOrg)
```
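Before any pre-processing, it helps to look at the response itself. A minimal sketch (not part of the original analysis) tabulating the five levels of "classe":
```{r}
# Distribution of the response; "classe" is a factor with levels A-E.
table(trainingOrg$classe)
```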
##Pre-screening the data
There are several approaches for reducing the number of predictors.
Remove variables that we believe have too many NA values.
```{r}
training.dena <- trainingOrg[ , colSums(is.na(trainingOrg)) == 0]
#head(training1)
#training3 <- training.decor[ rowSums(is.na(training.decor)) == 0, ]
dim(training.dena)
```
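The filter above keeps only columns with zero NAs. As a hedged sanity check, one could first inspect the fraction of missing values per column; in this data set columns tend to be either complete or almost entirely NA, so the strict threshold loses little information:
```{r}
# Fraction of NAs per column in the raw training data.
na.frac <- colMeans(is.na(trainingOrg))
summary(na.frac)
```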
Remove irrelevant variables. There are some variables that can be removed because they are unlikely to be related to the dependent variable.
```{r}
remove = c('X', 'user_name', 'raw_timestamp_part_1', 'raw_timestamp_part_2', 'cvtd_timestamp', 'new_window', 'num_window')
training.dere <- training.dena[, -which(names(training.dena) %in% remove)]
dim(training.dere)
```
Check for variables with extremely low variance; the nearZeroVar() function from the caret package is useful here.
```{r}
library(caret)
```
```{r}
# only numeric variables can be evaluated in this way.
zeroVar= nearZeroVar(training.dere[sapply(training.dere, is.numeric)], saveMetrics = TRUE)
training.nonzerovar = training.dere[,zeroVar[, 'nzv']==0]
dim(training.nonzerovar)
```
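To see what nearZeroVar() actually measured, one can inspect the metrics returned by saveMetrics = TRUE; a short sketch:
```{r}
# freqRatio, percentUnique, zeroVar and nzv per numeric predictor;
# nzv == TRUE flags near-zero-variance columns.
head(zeroVar)
sum(zeroVar$nzv)  # number of columns flagged
```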
Remove highly correlated variables (correlation above 90%), using for example findCorrelation().
```{r}
# only numeric variables can be evaluated in this way.
corrMatrix <- cor(na.omit(training.nonzerovar[sapply(training.nonzerovar, is.numeric)]))
dim(corrMatrix)
```
```{r}
# there are 52 numeric variables.
library(lattice)  # levelplot() comes from lattice
corrDF <- expand.grid(row = 1:52, col = 1:52)
corrDF$correlation <- as.vector(corrMatrix)
levelplot(correlation ~ row + col, corrDF)
```
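The level plot shows a few blocks of strong correlation. To list the exact predictor pairs above the 90% cutoff, a small sketch (not in the original analysis):
```{r}
# Predictor pairs with |r| > 0.9; upper triangle only, to avoid duplicates.
highCorr <- which(abs(corrMatrix) > 0.9 & upper.tri(corrMatrix), arr.ind = TRUE)
data.frame(var1 = rownames(corrMatrix)[highCorr[, 1]],
           var2 = colnames(corrMatrix)[highCorr[, 2]])
```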
We are going to remove those variables which have high correlation.
```{r}
removecor = findCorrelation(corrMatrix, cutoff = .90, verbose = TRUE)
training.decor = training.nonzerovar[,-removecor]
dim(training.decor)
```
##Split the data into training and testing sets for cross-validation
```{r}
inTrain <- createDataPartition(y=training.decor$classe, p=0.7, list=FALSE)
training <- training.decor[inTrain,]; testing <- training.decor[-inTrain,]
dim(training);dim(testing)
```
We got 13737 samples and 46 variables for training, 5885 samples and 46 variables for testing.
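createDataPartition() samples within each class, so the class proportions should be roughly the same in both splits; a quick hedged check:
```{r}
# Class proportions in each split; these should be close because
# createDataPartition() performs stratified sampling on classe.
round(prop.table(table(training$classe)), 3)
round(prop.table(table(testing$classe)), 3)
```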
##Analysis
Classification Tree
Now we fit a classification tree to these data, then summarize and plot it. First we use the 'tree' package; it is much faster than the 'caret' package.
```{r}
library(tree)
set.seed(12345)
tree.training=tree(classe~.,data=training)
summary(tree.training)
```
```{r}
plot(tree.training)
text(tree.training,pretty=0, cex =.8)
```
This is a bushy tree, and we are going to prune it.
##Rpart from caret (very slow)
```{r}
library(caret)
modFit <- train(classe ~ .,method="rpart",data=training)
print(modFit$finalModel)
```
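Much of the slowness comes from train()'s default of 25 bootstrap resamples. As an assumption-labeled alternative (not the settings used above), 5-fold cross-validation via trainControl() is usually much faster with similar tuning results:
```{r}
# 5-fold CV instead of the default 25 bootstrap resamples (an assumed
# speed-up, not part of the original analysis).
ctrl <- trainControl(method = "cv", number = 5)
modFitCV <- train(classe ~ ., method = "rpart", data = training, trControl = ctrl)
print(modFitCV)
```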
##Prettier plots
```{r}
library(rattle)
```
```{r}
fancyRpartPlot(modFit$finalModel)
```
##Out-of-sample performance
We are going to check the performance of the tree on the held-out testing data.
```{r}
tree.pred=predict(tree.training,testing,type="class")
predMatrix = with(testing,table(tree.pred,classe))
sum(diag(predMatrix))/sum(as.vector(predMatrix)) # accuracy
```
The accuracy of about 0.70 is not very high.
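caret's confusionMatrix() reports the same accuracy together with per-class sensitivity and specificity; a sketch as an alternative check:
```{r}
# Accuracy plus per-class statistics for the 'tree' model on the test set.
confusionMatrix(tree.pred, testing$classe)
```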
```{r}
tree.pred=predict(modFit,testing)
predMatrix = with(testing,table(tree.pred,classe))
sum(diag(predMatrix))/sum(as.vector(predMatrix)) # accuracy
```
The accuracy of about 0.50 from the 'caret' package is much lower than the result from the 'tree' package.
##Pruning tree
This tree was grown to full depth and might be too variable. We now use cross-validation to prune it.
```{r}
cv.training=cv.tree(tree.training,FUN=prune.misclass)
cv.training
```
```{r}
plot(cv.training)
```
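The choice best = 18 in the next chunk corresponds to a tree size whose cross-validated misclassification count is near the minimum; it can also be read off programmatically, e.g.:
```{r}
# Tree size with the lowest cross-validated misclassification count
# (cv.tree() stores it in $dev when FUN = prune.misclass).
cv.training$size[which.min(cv.training$dev)]
```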
```{r}
prune.training=prune.misclass(tree.training,best=18)
```
Now let's evaluate this pruned tree on the test data.
```{r}
tree.pred=predict(prune.training,testing,type="class")
predMatrix = with(testing,table(tree.pred,classe))
sum(diag(predMatrix))/sum(as.vector(predMatrix)) # accuracy
```
##Random Forests
Random forests build lots of bushy trees, and then average them to reduce the variance.
```{r}
require(randomForest)
```
```{r}
set.seed(12345)
```
Let's fit a random forest and see how well it performs.
```{r}
rf.training=randomForest(classe~.,data=training,ntree=100, importance=TRUE)
rf.training
```
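The printed OOB estimate can also be pulled directly out of the fitted object; a small sketch:
```{r}
# OOB error rate after the final (100th) tree; matches the printed estimate.
tail(rf.training$err.rate[, "OOB"], 1)
```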
```{r}
#plot(rf.training, log="y")
varImpPlot(rf.training)
```
##Out-of-Sample Accuracy
Our random forest model shows an OOB estimate of error rate of 0.72% on the training data. Now let's evaluate the model on the held-out test data to estimate its out-of-sample accuracy.
```{r}
tree.pred=predict(rf.training,testing,type="class")
predMatrix = with(testing,table(tree.pred,classe))
sum(diag(predMatrix))/sum(as.vector(predMatrix)) # accuracy
```
An accuracy of about 0.99 on the held-out testing set means the model generalizes very well.
##Conclusion
Now we can predict on the testing data downloaded from the website.
```{r}
answers <- predict(rf.training, testingOrg)
answers
```
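If the grader expects one small text file per test case (the original course submission format), a helper along these lines can write them out; the naming scheme here is an assumption:
```{r}
# Write one text file per prediction (problem_1.txt, problem_2.txt, ...);
# the file-naming convention is assumed, adjust to the grader's requirements.
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    write.table(x[i], file = paste0("problem_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
# pml_write_files(answers)
```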