---
title: "Bank Customer Churn Prediction"
author: "Mohanapriya Singaravelu"
date: "20/04/2022"
output: html_document
---

```{r}
library(dplyr)
library(ggplot2)
library(corrplot)
library(mltools)
library(caret)
library(caTools)
library(caretEnsemble)
```


Import the dataset 

```{r}
df = read.csv("Churn_Modelling.csv")
head(df)
#summary(df)
glimpse(df)
```


```{r}
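# Drop the first three columns, assumed here to be row and customer
# identifiers that carry no predictive signal
notRequired = c(1,2,3)
df1 = df[-notRequired]
head(df1)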
```

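Check for missing values
```{r}
# is.null() only tests whether the whole object is NULL;
# is.na() flags the individual missing cells
sum(is.na(df1))
```
No missing values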
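
A per-column count makes it easier to see where any gaps are. The sketch below uses a small made-up data frame (`toy_na` is illustrative and not part of the dataset); the call to `is.null()` is included only to show that it answers a different question from `is.na()`.

```{r}
# Hypothetical toy data with one missing cell per column
toy_na <- data.frame(
  Age     = c(42, NA, 39),
  Balance = c(0, 83807.86, NA),
  Gender  = c("Female", "Male", NA)
)

is.null(toy_na)                            # tests the object itself, not its cells
total_missing   <- sum(is.na(toy_na))      # missing cells overall
missing_per_col <- colSums(is.na(toy_na))  # missing cells in each column
missing_per_col
```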



Get the numeric columns of the dataset

```{r}

num_cols <- unlist(lapply(df1, is.numeric))
numData <- df1[ , num_cols]
numData = data.frame(numData)

```



```{r}
head(numData)
```
Correlations 

```{r}
res = cor(numData)
res
```


Correlation plot
```{r}

corrplot(res, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)
```

```{r}
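# One-hot encode the categorical columns; predict on df1, the same data
# the encoder was built from
dummy <- dummyVars(" ~ .", data=df1)
df2 <- data.frame(predict(dummy, newdata=df1))
head(df2)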

```


```{r}
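# dummyVars orders the indicator columns by factor level (alphabetical),
# so the Gender indicators come out as Female (column 5), then Male (column 6)
colnames(df2)[2] = "France"
colnames(df2)[3] = "Germany"
colnames(df2)[4] = "Spain"
colnames(df2)[5] = "Female"
colnames(df2)[6] = "Male"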

head(df2)
#df2[,14]
```

```{r}
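# Label the outcome so that 0 becomes "no" and 1 becomes "yes" (exited);
# assigning c("yes", "no") to the levels would invert the meaning
df2$Exited = factor(df2$Exited, levels = c(0, 1), labels = c("no", "yes"))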

```

```{r}
head(df2)
```


Test train split 

```{r}
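set.seed(100)
# Hold out 25% of the rows for testing; the index is called train_idx so
# that it does not mask base R's sample()
train_idx <- sample.int(n = nrow(df2), size = floor(.75*nrow(df2)), replace = FALSE)
train <- df2[train_idx, ]
test  <- df2[-train_idx, ]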

```

```{r}
nrow(test)
nrow(train)

```



Base model - logistic regression. 

This uses all the columns to train the model. 

```{r}
model_glm = glm(Exited ~ . , family="binomial", data = train)

#Predictions on the test set
y_pred1 = predict(model_glm, newdata = test, type = "response")
# Confusion matrix on test set
cm = table(test$Exited, y_pred1 >= 0.5)
```



```{r}
# Correct predictions sit on the diagonal of the confusion matrix
correct = sum(diag(cm))
correct
base_acc = correct/nrow(test)
base_acc

```


The baseline logistic regression model achieves an accuracy of 80.16% on the test set.
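
Accuracy alone can hide how well the customers who exit are identified, so it may be worth reading other quantities off the same kind of confusion matrix. A minimal sketch on made-up labels (`toy_actual` and `toy_predicted` are illustrative vectors, not model output):

```{r}
# Hypothetical actual and predicted labels for eight customers
toy_actual    <- factor(c("no", "no", "no", "yes", "yes", "no", "yes", "no"),
                        levels = c("no", "yes"))
toy_predicted <- factor(c("no", "no", "yes", "yes", "no", "no", "yes", "no"),
                        levels = c("no", "yes"))

toy_cm <- table(actual = toy_actual, predicted = toy_predicted)

# Share of all predictions that are correct
toy_accuracy <- sum(diag(toy_cm)) / sum(toy_cm)

# Share of actual "yes" cases that were predicted "yes" (sensitivity / recall)
toy_sensitivity <- toy_cm["yes", "yes"] / sum(toy_cm["yes", ])

toy_cm
c(accuracy = toy_accuracy, sensitivity = toy_sensitivity)
```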

We now build several ensemble models to see whether they improve on this accuracy.


Ensemble technique 1: Bagging


Bagging model 1: Bagged Decision Tree

Bagging, or bootstrap aggregation, is an ensemble method that trains the same algorithm many times on different bootstrap samples, each drawn with replacement from the training data. The final prediction combines the predictions of all the sub-models: by majority vote for classification, or by averaging for regression. The two most popular bagging techniques are Bagged Decision Trees and Random Forest.
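
The idea can be sketched by hand in base R. This is an illustration only, not what `treebag` does internally: it assumes a synthetic data frame (`bag_toy`) and uses logistic regression as the base learner instead of a decision tree, purely to keep the example dependency-free.

```{r}
set.seed(1)

# Synthetic two-class data (hypothetical, for illustration only)
n_toy   <- 200
bag_toy <- data.frame(x1 = rnorm(n_toy), x2 = rnorm(n_toy))
bag_toy$y <- factor(ifelse(bag_toy$x1 + bag_toy$x2 + rnorm(n_toy) > 0, "yes", "no"),
                    levels = c("no", "yes"))

n_models <- 25

# Each column holds one sub-model's 0/1 votes for every row of bag_toy
bag_votes <- sapply(seq_len(n_models), function(b) {
  boot_idx <- sample.int(n_toy, n_toy, replace = TRUE)   # bootstrap sample
  fit <- glm(y ~ x1 + x2, family = binomial, data = bag_toy[boot_idx, ])
  as.integer(predict(fit, newdata = bag_toy, type = "response") >= 0.5)
})

# Aggregate by majority vote across the sub-models
bagged_pred <- ifelse(rowMeans(bag_votes) >= 0.5, "yes", "no")
bagged_acc  <- mean(bagged_pred == bag_toy$y)
bagged_acc
```

The accuracy printed here is measured on the same rows the sub-models were drawn from, so it is optimistic; it is only meant to show the mechanics of resampling and voting.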

```{r}
set.seed(100)

control1 <- trainControl(sampling="rose",method="repeatedcv", number=5, repeats=5)

bagCART_model <- train(Exited ~., data=train, method="treebag", metric="Accuracy", trControl=control1) #treebag denotes bagged decision tree 

#Predictions on the test set
y_pred2 = predict(bagCART_model, newdata = test)
```


```{r}
# Confusion matrix on test set
cm1 = table(test$Exited, y_pred2)
c1 = data.frame(cm1)
correct1 = c1$Freq[1] + c1$Freq[4]
correct1
correct1/nrow(test) 
```

The bagged decision tree model has a slightly higher accuracy than the baseline: 82.56%.

Bagging model 2: Random Forest


Random Forest is an extension of bagged decision trees. Each tree is still grown on a bootstrap sample of the training data (drawn with replacement), but only a random subset of the predictors is considered at each split. This reduces the correlation between the individual trees, which in turn lowers the variance of the combined prediction.

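
The per-split feature sampling can be illustrated in a few lines. This is only a sketch: `p_toy = 13` is an assumed number of predictor columns, and `floor(sqrt(p))` is the usual default for classification in the randomForest package; caret tunes `mtry` itself, so the value actually used by the model may differ.

```{r}
set.seed(1)

p_toy    <- 13                    # assumed number of predictors
mtry_toy <- floor(sqrt(p_toy))    # predictors considered at each split

# Three hypothetical splits, each drawing its own random candidate set
# (one column per split, values are predictor indices)
split_candidates <- replicate(3, sort(sample.int(p_toy, mtry_toy)))
split_candidates
```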
```{r}
set.seed(100)

control2 <- trainControl(sampling="rose",method="repeatedcv", number=5, repeats=5)

rf_model <- train(Exited ~., data=train, method="rf", metric="Accuracy", trControl=control2) #rf denotes random forest 

y_pred3 = predict(rf_model, newdata = test, type = "raw")
```

```{r}
cm2 = table(test$Exited, y_pred3)
c2 = data.frame(cm2)
correct2 = c2$Freq[1] + c2$Freq[4]
correct2
correct2/nrow(test) 
```

Random forest improves further on the bagged decision tree, with an accuracy of 83.56%.



Ensemble technique 2: Stochastic Gradient Boosting

Boosting aims to combine a set of weak learners into a strong one; it applies to both classification and regression problems.
The models are trained sequentially, and each model learns from the errors of its predecessors.
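
The sequential idea can be sketched in base R. The example below is a simplified regression version with squared-error loss and one-split trees (stumps) on synthetic data. It is not the `gbm` implementation and leaves out the row subsampling that makes gradient boosting "stochastic"; the helper names `fit_stump` and `predict_stump` are hypothetical.

```{r}
set.seed(1)

# Synthetic 1-D regression data (illustration only)
boost_x <- seq(0, 1, length.out = 100)
boost_y <- sin(2 * pi * boost_x) + rnorm(100, sd = 0.1)

# Fit a stump: pick the split that minimises the squared error of the residuals
fit_stump <- function(x, r) {
  splits <- head(sort(unique(x)), -1)   # drop the largest value so both sides are non-empty
  sse <- sapply(splits, function(s) {
    left  <- r[x <= s]
    right <- r[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  s <- splits[which.min(sse)]
  list(split = s, left = mean(r[x <= s]), right = mean(r[x > s]))
}

predict_stump <- function(stump, x) ifelse(x <= stump$split, stump$left, stump$right)

shrinkage  <- 0.1
n_rounds   <- 50
boost_pred <- rep(mean(boost_y), length(boost_y))   # start from the mean
mse_path   <- numeric(n_rounds)

for (m in seq_len(n_rounds)) {
  residual   <- boost_y - boost_pred                # errors of the current ensemble
  stump      <- fit_stump(boost_x, residual)        # next learner fits those errors
  boost_pred <- boost_pred + shrinkage * predict_stump(stump, boost_x)
  mse_path[m] <- mean((boost_y - boost_pred)^2)
}

range(mse_path)
```

Each round fits the residuals left by the rounds before it, so the training error in `mse_path` shrinks as learners are added.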

```{r}
set.seed(100)
control3 <- trainControl(sampling="rose",method="repeatedcv", number=5, repeats=5)

gbm_model <- train(Exited ~., data=train, method="gbm", metric="Accuracy", trControl=control3)

y_pred4 = predict(gbm_model, newdata = test)

cm3 = table(test$Exited, y_pred4)
c3 = data.frame(cm3)
correct3 = c3$Freq[1] + c3$Freq[4]
correct3
correct3/nrow(test) 
```


Stochastic gradient boosting gives a minor improvement over random forest, with an accuracy of 83.8%.

Ensemble technique 3: Stacking


```{r}
set.seed(100)

control_stacking <- trainControl(method="repeatedcv", number=5, repeats=2, savePredictions=TRUE, classProbs=TRUE)

algorithms_to_use <- c('rpart', 'glm', 'knn', 'svmRadial')

stacked_models <- caretList(Exited ~., data=df2, trControl=control_stacking, methodList=algorithms_to_use)

stacking_results <- resamples(stacked_models)

summary(stacking_results)
```

```{r}
# stack using glm
stackControl <- trainControl(method="repeatedcv", number=5, repeats=3, savePredictions=TRUE, classProbs=TRUE)

set.seed(100)

glm_stack <- caretStack(stacked_models, method="glm", metric="Accuracy", trControl=stackControl)

print(glm_stack)
```


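The stacked model gives the highest accuracy, 85.6%. The models stacked together are a decision tree (rpart), KNN, SVM and logistic regression. Note that this figure is a resampling (cross-validation) estimate: the base models are trained on the full dataset (`df2`) rather than on `train`, so it is not directly comparable with the test-set accuracies reported for the earlier models.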
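
The mechanics of stacking can be sketched in base R: base models produce out-of-fold predictions, and a meta-model is trained on those predictions. This is an illustration under simplifying assumptions, not what `caretStack` does internally: the data (`stack_toy`) are synthetic and both base learners are logistic regressions on different features.

```{r}
set.seed(1)

# Synthetic binary outcome (illustration only)
n_stack   <- 300
stack_toy <- data.frame(x1 = rnorm(n_stack), x2 = rnorm(n_stack))
stack_toy$y <- as.integer(stack_toy$x1 + stack_toy$x2^2 + rnorm(n_stack) > 1)

# Assign every row to one of five folds
fold_id <- sample(rep(1:5, length.out = n_stack))
oof     <- data.frame(m1 = numeric(n_stack), m2 = numeric(n_stack))

for (k in 1:5) {
  fold_train <- stack_toy[fold_id != k, ]
  fold_test  <- stack_toy[fold_id == k, ]

  # Two deliberately different base learners
  base1 <- glm(y ~ x1,      family = binomial, data = fold_train)
  base2 <- glm(y ~ I(x2^2), family = binomial, data = fold_train)

  # Out-of-fold predictions: each row is scored by models that never saw it
  oof$m1[fold_id == k] <- predict(base1, newdata = fold_test, type = "response")
  oof$m2[fold_id == k] <- predict(base2, newdata = fold_test, type = "response")
}

# Meta-model: combine the base predictions with a logistic regression
oof$y      <- stack_toy$y
meta_model <- glm(y ~ m1 + m2, family = binomial, data = oof)
stack_acc  <- mean((predict(meta_model, type = "response") >= 0.5) == oof$y)
stack_acc
```

Training the meta-model on out-of-fold predictions, rather than on predictions for rows the base models were fitted to, is what keeps the combined estimate from being overly optimistic.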

```{r}
glm_stack$error$Accuracy
```

Plotting results 


```{r}
results_df = read.csv("results.csv")
results_df
```



```{r}
ggplot(data = results_df, aes(x = Threshold, y = Logistic.Regression)) +
  geom_line(color = "red") +
  geom_point()
```



```{r}
ggplot(data = results_df, aes(x = Threshold, y = Bagged.Decision.Tree)) +
  geom_line(color = "blue") +
  geom_point() +
  ylim(0.75, 0.85)

```

```{r}
ggplot(data = results_df, aes(x = Threshold, y = Random.Forest)) +
  geom_line(color = "orange") +
  geom_point() +
  ylim(0.75, 0.85)

```

```{r}
ggplot(data = results_df, aes(x = Threshold, y = Boosting)) +
  geom_line(color = "green") +
  geom_point() +
  ylim(0.75, 0.85)
```


```{r}
ggplot(data = results_df, aes(x = Threshold, y = Stacking)) +
  geom_line(color = "black") +
  geom_point() +
  ylim(0.75, 0.88)
```



```{r}
library(reshape2)

# Reshape to long format so every model can be drawn as its own line
results_long <- melt(results_df, id = "Threshold")
ggplot(results_long, aes(x = Threshold, y = value, color = variable)) +
  geom_line() +
  ggtitle("Accuracies of Models") +
  xlab("Threshold value for correlation") +
  ylab("Accuracy")
```