Predict which customers are likely to churn in the next month so that they can be retained, by analyzing all relevant customer data and developing focused customer retention programs.
- Each row represents a customer.
- Each column contains a customer attribute.
- The data set includes information about:
  - Customers who left within the last month – the column is called Churn
  - Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
  - Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
  - Demographic info about customers – gender, age range, and if they have partners and dependents
- The dataset was obtained from https://www.kaggle.com/blastchar/telco-customer-churn
Some hypotheses were formulated about which customers are likely to churn in the next month. These hypotheses were divided into three groups:
- Services information hypotheses
  - These hypotheses are related to the products and services contracted by the customer
- Customer information hypotheses
  - These hypotheses are related to the customer account information, such as contract, payment method and charges
- Sociodemographic information hypotheses
  - These hypotheses are related to the customer sociodemographic information, such as gender, age and so on
Hypotheses summary

| Group | # | Hypothesis |
|---|---|---|
| Services information | H1 | Customers with phone services are less likely to churn |
| | H2 | Customers with fewer phone lines are more likely to churn |
| | H3 | Customers with internet services are more likely to churn |
| | H4 | Customers with online security service are more likely to churn |
| | H5 | Customers with online backup service are more likely to churn |
| | H6 | Customers with device protection service are more likely to churn |
| | H7 | Customers with tech support service are more likely to churn |
| | H8 | Customers with TV streaming service are more likely to churn |
| | H9 | Customers with movie streaming service are more likely to churn |
| Customer information | H10 | Customers with a longer relationship with the company are less likely to churn |
| | H11 | Customers on a monthly contract are more likely to churn |
| | H12 | Customers with paperless billing are more likely to churn |
| | H13 | Customers with automatic payment methods are less likely to churn |
| | H14 | Customers with lower monthly expenses are less likely to churn |
| | H15 | Customers with lower total expenses are less likely to churn |
| Sociodemographic information | H16 | Female customers are less likely to churn |
| | H17 | Senior citizen customers are less likely to churn |
| | H18 | Customers with partners are more likely to churn |
| | H19 | Customers with dependents are more likely to churn |
The analysis follows the CRISP-DM methodology and is split into the following notebooks, each covering the CRISP-DM phases named after it:
- Exploratory Data Analysis - Business and Data understanding
- Data pre-processing - Data preparation
- Statistical modeling of a churn propensity model - Modeling and Evaluation
- Statistical modeling of a regression model of the customer charge - Modeling and Evaluation
- Statistical modeling of a customer clustering model - Modeling and Evaluation
- Customers with phone service seem to have a slightly greater propensity to churn than others
- Customers with multiple phone lines seem to have a slightly greater propensity to churn than others
- Customers with fiber optic internet service seem to have a greater propensity to churn than DSL customers. Also, customers with internet services seem to have a greater propensity to churn than others
- Customers without online security, online backup, device protection, and tech support services seem to have a slightly greater propensity to churn than others
- Customers without TV streaming and movie streaming seem to have a greater propensity to churn than other customers
- Customers with less tenure seem to have a greater propensity to churn than others
- Customers with lower monthly charges seem to have a lower propensity to churn than others
- Month-to-month customers seem to have a greater propensity to churn than others
- Paperless billing customers seem to have a greater propensity to churn than others
- Customers who pay by electronic check seem to have a greater propensity to churn than others
- Customers with an automatic payment method seem to have a lower propensity to churn than others
- There are no significant differences between the two groups in the gender variable
- Senior citizens seem to have a greater propensity to churn than others
- Customers without partner seem to have a greater propensity to churn than others
- Customers without dependents seem to have a greater propensity to churn than others
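Insights like the ones above usually come from comparing the churn rate across the categories of one attribute. The following is a minimal sketch of that comparison, not the notebook's actual code. The helper name `churn_rate_by`, the `Contract` column name, and the toy records are assumptions made for illustration; only the `Churn` column name comes from the dataset description.

```python
from collections import defaultdict


def churn_rate_by(rows, column, target="Churn", positive="Yes"):
    """Return {category: share of rows in that category that churned}."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for row in rows:
        category = row[column]
        totals[category] += 1
        if row[target] == positive:
            churned[category] += 1
    return {category: churned[category] / totals[category] for category in totals}


# Toy records invented for illustration (not taken from the dataset).
sample = [
    {"Contract": "Month-to-month", "Churn": "Yes"},
    {"Contract": "Month-to-month", "Churn": "Yes"},
    {"Contract": "Month-to-month", "Churn": "No"},
    {"Contract": "Two year", "Churn": "No"},
    {"Contract": "Two year", "Churn": "No"},
]

rates = churn_rate_by(sample, "Contract")
for category, rate in rates.items():
    print(f"{category}: {rate:.2f}")
```

A large gap between categories is only a hint; whether it is statistically meaningful would still need a proper test on the real data.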
The following models were used:
- Logistic Regression
- K-Nearest Neighbors
- Decision Tree
- Gaussian Naive-Bayes
- Support Vector Machine
- Gradient Boosting
- Extra Trees
- AdaBoost
- Stochastic gradient descent
- Random Forest
- XGBoost
- Multi Layer Perceptron Neural Network
Model training and evaluation followed the steps below:
- Scenario 1.1: Training all the models with the default hyperparameters on the original dataset
- Scenario 1.2: Training all the models with the default hyperparameters on the dataset with tenure as categorical
- Scenario 1.3: Training all the models with the default hyperparameters on the original dataset after PCA transformation
- Scenario 1.4: Training all the models with the default hyperparameters on the original dataset after SMOTE balancing
- Scenario 1.5: Training all the models with the default hyperparameters on the original dataset after undersampling balancing
- Scenario 1.6: Training all the models with the default hyperparameters on the original dataset after feature selection
- Scenario 2.1: Training all the models with the default hyperparameters using the three best scenarios found
- Hyperparameter optimization of the best models found in the best scenario
- Creation of voting classifiers with these best models
- Definition of the chosen model
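The voting step above can be illustrated with a plain majority ("hard") vote over the labels predicted by several models. This is a sketch under assumptions: the README does not say whether hard or soft voting was used, and the function name `hard_vote` and the toy predictions are hypothetical.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Combine per-model label lists into one list by majority vote.

    On a tie, the label predicted by the earliest model in the list wins,
    because Counter keeps labels in first-seen order.
    """
    combined = []
    for sample_predictions in zip(*predictions_per_model):
        label, _count = Counter(sample_predictions).most_common(1)[0]
        combined.append(label)
    return combined


# Toy predictions from three hypothetical models (1 = churn, 0 = no churn).
model_a = [1, 0, 1, 0]
model_b = [1, 1, 0, 0]
model_c = [0, 1, 1, 0]

ensemble = hard_vote([model_a, model_b, model_c])
print(ensemble)
```

Soft voting would average predicted probabilities instead of counting labels, which tends to help when the individual models are well calibrated.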
Model evaluation focused mainly on:
- F1-Score (test partition)
- Precision (test partition)
- Cross validation accuracy (in train partition)
- Stability (in cross validation train and the whole sample train)
- Accuracy (test partition)
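For reference, the classification metrics listed above can all be derived from the confusion-matrix counts. The sketch below uses the standard definitions; the function name `classification_metrics` and the toy labels are assumptions for illustration and do not reproduce the project's results.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1 and accuracy for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard against empty denominators (e.g. no positive predictions).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy labels invented for illustration (1 = churn, 0 = no churn).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Because churners are the minority class, F1 and precision are more informative here than accuracy alone.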
For this model, the most important feature was tenure: the higher the tenure, the lower the churn propensity. This is in line with the EDA insights and the hypotheses.
According to the lift analysis, customers in the first two deciles are the most important targets for a retention marketing campaign. The third and fourth deciles also have a higher churn rate than the average of the total base, so they should be reached by a marketing campaign as well.
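A lift table of this kind can be built by ranking customers by predicted churn score, splitting them into equal-sized groups, and dividing each group's churn rate by the overall rate. The sketch below shows that calculation; the function name `lift_by_decile`, the scores, and the labels are invented for illustration, and the example uses 5 groups instead of 10 only to keep the toy data small.

```python
def lift_by_decile(scores, churned, n_bins=10):
    """Return a list of (bin_number, churn_rate, lift), best-scored bin first.

    Lift is the bin's churn rate divided by the churn rate of the whole base.
    """
    ranked = sorted(zip(scores, churned), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    base_rate = sum(churned) / n
    if base_rate == 0:
        raise ValueError("lift is undefined when nobody churned")

    table = []
    for b in range(n_bins):
        group = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        if not group:
            continue  # fewer customers than bins
        rate = sum(flag for _score, flag in group) / len(group)
        table.append((b + 1, rate, rate / base_rate))
    return table


# Toy scores and outcomes invented for illustration (1 = churned).
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
churned = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

lift_table = lift_by_decile(scores, churned, n_bins=5)
for bin_number, rate, lift in lift_table:
    print(f"bin {bin_number}: churn rate {rate:.2f}, lift {lift:.2f}")
```

Bins with a lift above 1 churn more than the base average, which is the criterion used above to decide which deciles a campaign should reach.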
The following models were used:
- Linear regression (OLS)
- Logistic Regression
- K-Nearest Neighbors
- Decision Tree
- Gaussian Naive-Bayes
- Support Vector Machine
- Gradient Boosting
- Extra Trees
- AdaBoost
- Stochastic gradient descent
- Random Forest
- XGBoost
- Multi Layer Perceptron Neural Network
Model training and evaluation followed the steps below:
- Training all the models with the default hyperparameters in the original dataset
- Hyperparameter optimization of the best models found (if needed)
- Creation of voting regressors with these best models (if needed)
- Definition of the chosen model
Model evaluation focused mainly on:
- R2-score (test partition)
- MAE (test partition)
- MSE (test partition)
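The three regression metrics above follow directly from the prediction errors. This is a minimal sketch using the standard definitions; the function name `regression_metrics` and the toy charges are assumptions for illustration, not values from the project.

```python
def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE and R2 for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n

    # R2 compares the residual error with the variance around the mean.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot if ss_tot else 0.0
    return {"mae": mae, "mse": mse, "r2": r2}


# Toy monthly charges invented for illustration.
actual = [50, 60, 70, 80]
predicted = [52, 58, 71, 79]

scores = regression_metrics(actual, predicted)
print(scores)
```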
The best model was a linear regression.
In this experiment, a regression model was created to predict the monthly charge paid by the customer. This model is meant to be used together with the churn propensity model: in the final propensity table, it makes it possible to check whether the amount charged to a churn-prone customer is out of line with (greater than) the amount charged to customers who are not prone to churn.
The model also helped to understand the impact of each purchased service on the monthly cost, as well as the impact of services combined in bundles.
Using the elbow method, the KN clustering found 4 customer clusters.
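The elbow is often read visually from a plot of inertia (within-cluster sum of squares) against the number of clusters. One simple way to automate it is to pick the k where the curve bends most, measured by the largest second difference. The sketch below assumes the inertias were already computed for each candidate k; the function name `elbow_k` and the inertia values are invented for illustration and are not the project's numbers.

```python
def elbow_k(inertias, first_k=1):
    """Return the k at which the inertia curve bends most sharply.

    `inertias[i]` is the inertia for k = first_k + i. The bend is measured
    with the discrete second difference, a heuristic rather than a formal test.
    """
    if len(inertias) < 3:
        raise ValueError("need at least three inertia values")

    best_k, best_bend = None, float("-inf")
    for i in range(1, len(inertias) - 1):
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = first_k + i, bend
    return best_k


# Hypothetical inertias for k = 1..7, invented for illustration.
inertias = [1000, 700, 420, 160, 140, 125, 115]

chosen_k = elbow_k(inertias)
print(chosen_k)
```

The heuristic can disagree with a visual reading on noisy curves, so it is best treated as a starting point for choosing k.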
From this analysis:
Cluster 0:
- Customers are younger than those in clusters 1 and 3
- Has more phone services than internet services
- Usually has one phone line
- Customers with approximately 3 years of tenure
- Has a short-term contract (monthly)
- Uses an electronic check payment method
- Uses as much fiber optic internet as DSL internet
- Spends an average of 62 dollars per month
Cluster 1:
- Higher percentage of senior citizens than other clusters
- Higher percentage of married people with dependents
- Higher tenure (more than 5 years)
- Has all the internet and phone services
- Has multiple phone lines
- Uses fiber optic internet
- Higher monthly charges than other clusters
- Spends an average of 100 dollars per month
- Has paperless billing
- High use of TV and movie streaming
- Has a long-term contract (2 years)
- Uses an automatic payment method
- Lower propensity to churn
Cluster 2:
- Younger customers
- Has more phone services than internet services
- Has one phone line
- Customers with less than 1 year of tenure
- Has a short-term contract (monthly)
- Uses electronic check or mailed check as payment method
- Uses as much fiber optic internet as DSL internet
- Spends an average of 48 dollars per month
- High propensity to churn
Cluster 3:
- Has more internet services than phone services
- Has more than one phone line
- Customers with 4 years of tenure
- Has a mid-term contract (1 year)
- Uses an electronic check payment method
- Uses fiber optic internet
- Spends an average of 82 dollars per month