Applied the K-Means clustering algorithm to create customer segments. With customer segmentation, we can identify customer behavior and create a marketing campaign for each segment.
Marketing is crucial for the growth and sustainability of any business. Marketers can help build the company's brand, engage customers, grow revenue, and increase sales. One of the key pain points for marketers is knowing their customers and identifying their needs. By understanding the customer, marketers can launch targeted marketing campaigns tailored to specific needs.
In this project, we perform customer segmentation based on customer data. Segmentation lets us identify customer needs and behaviours, so that we can create effective marketing campaigns.
We used the Customer Personality Analysis dataset from Kaggle, which contains information on 2240 customers. The first two rows of the dataset are shown below:
ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 2012-04-09 | 58 | 635 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 2012-08-03 | 38 | 11 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
Attributes
People
- ID: Customer's unique identifier
- Year_Birth: Customer's birth year
- Education: Customer's education level
- Marital_Status: Customer's marital status
- Income: Customer's yearly household income
- Kidhome: Number of children in customer's household
- Teenhome: Number of teenagers in customer's household
- Dt_Customer: Date of customer's enrollment with the company
- Recency: Number of days since customer's last purchase
- Complain: 1 if the customer complained in the last 2 years, 0 otherwise
Products
- MntWines: Amount spent on wine in last 2 years
- MntFruits: Amount spent on fruits in last 2 years
- MntMeatProducts: Amount spent on meat in last 2 years
- MntFishProducts: Amount spent on fish in last 2 years
- MntSweetProducts: Amount spent on sweets in last 2 years
- MntGoldProds: Amount spent on gold in last 2 years
Promotion
- NumDealsPurchases: Number of purchases made with a discount
- AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
- AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
- AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
- AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
- AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
Place
- NumWebPurchases: Number of purchases made through the company's website
- NumCatalogPurchases: Number of purchases made using a catalogue
- NumStorePurchases: Number of purchases made directly in stores
- NumWebVisitsMonth: Number of visits to company's website in the last month
After importing the dataset, we check it for missing values. The Income column has 24 missing values, so let's fill them with the mean of the Income column.
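The imputation step can be sketched as follows. This is a minimal illustration on a tiny made-up frame rather than the real Kaggle file; only the Income column name comes from the dataset, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dataset (values are illustrative, not the real data).
df = pd.DataFrame({"Income": [58138.0, 46344.0, np.nan, 71613.0]})

# Count missing values per column before imputing.
missing_before = df.isna().sum()

# Fill missing incomes with the mean of the observed incomes.
income_mean = df["Income"].mean()
df["Income"] = df["Income"].fillna(income_mean)
```

Mean imputation keeps every row, at the cost of pulling the filled values toward the center of the distribution.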
Before we visualize the customer profiles, let's add a new column that contains each customer's age. The dataset with the new Age column looks like this:
ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | Age |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 2012-04-09 | 58 | 635 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 | 65 |
2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 2012-08-03 | 38 | 11 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 | 68 |
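A minimal sketch of how the Age column could be derived. The reference year 2022 is an inference from the table above (a customer born in 1957 is listed as 65), not something the write-up states explicitly.

```python
import pandas as pd

# Birth years of the two rows shown in the table above.
df = pd.DataFrame({"Year_Birth": [1957, 1954]})

# Assumed reference year, inferred from 1957 -> 65 in the table.
REFERENCE_YEAR = 2022
df["Age"] = REFERENCE_YEAR - df["Year_Birth"]
```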
Customers Age
The boxplot above shows extreme outliers in customer age; we should drop them later.
We can divide customer age into 4 categories: Young, Adult, Mature, and Senior.
Now we know that 54.1% of customers fall into the Mature category, 25.8% into the Adult category, 18.8% into the Senior category, and 1.25% into the Young category.
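One way to build such age groups is `pd.cut`. The bin edges below are hypothetical, since the write-up does not state the cut-offs it used; only the four labels come from the text.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [25, 40, 65, 68]})  # toy ages

# Hypothetical, right-inclusive bin edges; the original cut-offs are not given.
age_bins = [0, 30, 45, 65, np.inf]
age_labels = ["Young", "Adult", "Mature", "Senior"]
df["Age_group"] = pd.cut(df["Age"], bins=age_bins, labels=age_labels)

# Share of customers in each group, in percent.
age_share = df["Age_group"].value_counts(normalize=True) * 100
```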
Customers Education
Based on the chart above, the majority of customers hold a graduation (bachelor's) degree, followed by master's and PhD degrees. For our model, we can divide customer education into two categories: undergraduate and postgraduate.
Now we know that 88.5% of customers have a postgraduate education and 11.5% have an undergraduate education.
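The grouping can be expressed as a dictionary lookup. Both the raw labels and the mapping below are assumptions: the write-up does not list them, but since Graduation is the majority level and 88.5% of customers end up as postgraduate, Graduation is presumably counted as postgraduate.

```python
import pandas as pd

# Assumed raw education labels; not all of them are named in the write-up.
df = pd.DataFrame({"Education": ["Graduation", "PhD", "Master", "2n Cycle", "Basic"]})

# Hypothetical mapping, chosen to be consistent with the reported split.
education_map = {
    "Basic": "Undergraduate",
    "2n Cycle": "Undergraduate",
    "Graduation": "Postgraduate",
    "Master": "Postgraduate",
    "PhD": "Postgraduate",
}
df["Education"] = df["Education"].map(education_map)
```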
Customers Marital Status
Based on the chart above, there are many marital status values, including unusual ones such as Absurd and YOLO. Let's divide them into two categories: single and in-relationship.
After dividing marital status into two categories, we know that 64.5% of customers are in a relationship and 35.5% are single.
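A sketch of the two-way grouping. Which raw labels count as in-relationship is an assumption here (Married and Together); everything else, including Absurd and YOLO, is treated as single.

```python
import numpy as np
import pandas as pd

# Assumed raw labels; only Absurd and YOLO are named in the text.
df = pd.DataFrame({"Marital_Status": [
    "Married", "Together", "Single", "Divorced", "Widow", "Alone", "Absurd", "YOLO",
]})

# Hypothetical rule: partnered labels map to In-Relationship, the rest to Single.
partnered = ["Married", "Together"]
df["Marital_Status"] = np.where(
    df["Marital_Status"].isin(partnered), "In-Relationship", "Single"
)
```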
Customers Income
The plot above shows an extreme outlier in the income boxplot, with an income of 666.666k; we should remove it later.
We can divide customer income into 4 categories: low income, low to medium income, medium to high income, and high income.
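One plausible way to form four income groups is quartile binning with `pd.qcut`. This is an assumption; the write-up does not say whether quartiles or fixed thresholds were used.

```python
import pandas as pd

# Toy incomes, evenly spaced so the quartiles are easy to follow.
df = pd.DataFrame({"Income": [10_000, 20_000, 30_000, 40_000,
                              50_000, 60_000, 70_000, 80_000]})

income_labels = [
    "Low income",
    "Low to medium income",
    "Medium to high income",
    "High income",
]
# Assumed approach: split at the 25th, 50th, and 75th percentiles.
df["Income_group"] = pd.qcut(df["Income"], q=4, labels=income_labels)
```

Quartile binning yields equally sized groups by construction, so if the real groups are unequal in size, fixed thresholds were probably used instead.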
Distribution of Customers by Children in Their Home
Now let's combine Kidhome and Teenhome into one column called 'Number of Child', which holds the number of children in the customer's household.
Now we know that the majority of customers have 1 child, followed by 0 children and 2 children.
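The combined column is a row-wise sum of the two source columns. The sketch below uses the Kidhome and Teenhome values from the two rows shown earlier.

```python
import pandas as pd

# Kidhome / Teenhome values from the two sample rows of the dataset.
df = pd.DataFrame({"Kidhome": [0, 1], "Teenhome": [0, 1]})

# Total number of children (kids plus teenagers) in the household.
df["Number of Child"] = df["Kidhome"] + df["Teenhome"]
```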
Customers Complaints
The pie chart above shows that 99.1% of customers never filed a complaint, while 0.938% of customers have filed one.
Now that we have looked at the customer profiles, let's create new columns that contain total monthly spend, total number of campaigns accepted, and total purchases.
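These aggregates can be sketched as row-wise sums over groups of columns. The new column names (`Total_Spend`, `Total_Accepted_Cmp`, `Total_Purchases`) and the exact column groupings are assumptions; for instance, whether Response counts as an accepted campaign is not stated in the write-up. The values are toy data.

```python
import pandas as pd

spend_cols = ["MntWines", "MntFruits", "MntMeatProducts",
              "MntFishProducts", "MntSweetProducts", "MntGoldProds"]
campaign_cols = ["AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3",
                 "AcceptedCmp4", "AcceptedCmp5", "Response"]  # Response included by assumption
purchase_cols = ["NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases"]

# Two toy customers (illustrative values only).
rows = [
    [100, 10, 50, 20, 10, 10, 0, 0, 1, 0, 0, 1, 8, 10, 4],
    [5, 1, 6, 2, 1, 5, 0, 0, 0, 0, 0, 0, 1, 1, 2],
]
df = pd.DataFrame(rows, columns=spend_cols + campaign_cols + purchase_cols)

# Hypothetical column names for the three aggregates.
df["Total_Spend"] = df[spend_cols].sum(axis=1)
df["Total_Accepted_Cmp"] = df[campaign_cols].sum(axis=1)
df["Total_Purchases"] = df[purchase_cols].sum(axis=1)
```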
We drop the Dt_Customer, Kidhome, Teenhome, Recency, ID, Year_Birth, Income, Age, Z_CostContact, and Z_Revenue columns because they are unnecessary for our clustering.
The model data contains categorical columns. We apply ordinal encoding to label these columns:
- Age_group = {'Young': 1, 'Adult': 2, 'Mature': 3, 'Senior': 4}
- Education = {'Undergraduate': 1, 'Postgraduate': 2}
- Marital_Status = {'Single': 1, 'In-Relationship': 2}
- Income_group = {'Low income': 1, 'Low to medium income': 2, 'Medium to high income': 3, 'High income': 4}
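The mappings above can be applied with `Series.map`. The sketch below uses the dictionaries exactly as listed; the two sample rows are made up for illustration.

```python
import pandas as pd

# Two illustrative customers with the grouped categorical columns.
df = pd.DataFrame({
    "Age_group": ["Mature", "Senior"],
    "Education": ["Postgraduate", "Undergraduate"],
    "Marital_Status": ["Single", "In-Relationship"],
    "Income_group": ["Medium to high income", "Low income"],
})

ordinal_maps = {
    "Age_group": {"Young": 1, "Adult": 2, "Mature": 3, "Senior": 4},
    "Education": {"Undergraduate": 1, "Postgraduate": 2},
    "Marital_Status": {"Single": 1, "In-Relationship": 2},
    "Income_group": {"Low income": 1, "Low to medium income": 2,
                     "Medium to high income": 3, "High income": 4},
}

# Replace each category with its ordinal label.
for column, mapping in ordinal_maps.items():
    df[column] = df[column].map(mapping)
```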
Before we feed the data into our model, we should normalize it. Normalization gives equal weight to each variable, so that no single variable steers the model in one direction just because its values are larger. We use Power Transformer to normalize the data.
Before Normalization :
After Normalization :
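The normalization step might look like the sketch below, assuming scikit-learn's `PowerTransformer` with its default Yeo-Johnson method. The feature names and the synthetic skewed data are placeholders, not the project's actual model data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic, right-skewed features standing in for the model data.
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "Total_Spend": rng.lognormal(mean=5.0, sigma=1.0, size=200),
    "Total_Purchases": rng.poisson(lam=12, size=200).astype(float),
})

# Yeo-Johnson makes each column more Gaussian-like; standardize=True then
# rescales it to zero mean and unit variance.
transformer = PowerTransformer(method="yeo-johnson", standardize=True)
scaled = pd.DataFrame(transformer.fit_transform(features), columns=features.columns)
```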
Before we cluster the data, we should find the ideal number of clusters, using the elbow method and the silhouette score for validation.
In cluster analysis, the elbow method is a heuristic for determining the number of clusters in a data set. The vertical line in the graph above shows that the ideal number of clusters for our data is 4.
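A minimal sketch of the elbow method: fit K-Means for a range of k and record the inertia (within-cluster sum of squares), then look for the k where the curve stops dropping sharply. Synthetic blobs stand in for the normalized customer data, and the range of k is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 underlying groups (placeholder for the real features).
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

k_values = list(range(1, 10))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # inertia_ is the sum of squared distances to the nearest centroid.
    inertias.append(model.inertia_)
```

Plotting `inertias` against `k_values` gives the elbow curve; the bend in the curve suggests the number of clusters.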
The silhouette coefficient, or silhouette score, is a metric for the quality of a clustering. Its value ranges from -1 to 1; a score close to 1 means the clusters are well separated and clearly distinguished, while a score close to 0 means they overlap.
Our silhouette score is poor because it is close to 0.
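The score can be computed with scikit-learn's `silhouette_score`. The sketch below again uses synthetic blobs in place of the customer data, so the resulting value says nothing about the project's actual score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic placeholder data with 4 underlying groups.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Cluster with k=4 and score the resulting assignment.
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
score = silhouette_score(X, labels)  # mean silhouette over all samples
```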
To improve it, we can apply principal component analysis (PCA) before clustering.
PCA is an unsupervised machine learning algorithm. It performs dimensionality reduction while trying to preserve as much of the original information (variance) as possible.
We will fit PCA to our data, cluster the result, and visualize it for customer segmentation.
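A sketch of the PCA-then-cluster step. The choice of 2 components is an assumption made here so the clusters can be plotted in 2D; the write-up does not state how many components were kept. Synthetic data stands in for the normalized features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic 8-dimensional data standing in for the normalized features.
X, _ = make_blobs(n_samples=300, n_features=8, centers=4, random_state=42)

# Project onto the first two principal components (assumed count).
pca = PCA(n_components=2, random_state=42)
X_reduced = pca.fit_transform(X)

# Fraction of the original variance kept by the two components.
explained = pca.explained_variance_ratio_.sum()

# Cluster in the reduced space with the k chosen by the elbow method.
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_reduced)
```

A scatter plot of `X_reduced` colored by `labels` gives the kind of cluster visualization described here.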
Based on the visualization of the customer clusters, we can describe the characteristics of each cluster.
Cluster 0
- Majority of customers belong to this cluster
- Have low to low-medium income
- Majority have 1-2 children at home
- Spend less money on products
- Buy fewer products
- Complain a lot
- Don't like to buy products via catalog
- Give negative responses to marketing campaigns
Cluster 1
- Have medium-high to high income
- Have 0-1 children at home
- Cluster with the second-highest amount of total purchases
- Spend more money on fruit and fish products
- Give negative responses to marketing campaigns
- Like to buy products in store
- Like discounts
Cluster 2
- Have low-medium to medium-high income
- Consist of customers in the adult to senior age groups
- Have 1-2 children at home
- Spend more money on wine and gold
- Really like discounts
- Give positive responses to marketing campaigns
Cluster 3
- Cluster with the fewest customers
- Have high income
- Majority are child-free
- Cluster with the highest amount of total purchases
- Spend more money on all kinds of products
- Less likely to complain
- Like to buy products via catalog
- Don't like discounts
- Give positive responses to marketing campaigns
After clustering, we have 4 customer segments. Cluster 3 is our best cluster: its customers spend more money on all kinds of products, are less likely to complain, don't like discounts, and give positive responses to marketing campaigns. Cluster 0 is our worst cluster: its customers spend less money on products, buy fewer products, complain a lot, and give negative responses to marketing campaigns.
For the next campaign, we can target cluster 2 and cluster 3, because both give positive responses to marketing campaigns. For cluster 3, we can use the catalog as the campaign medium, because these customers like to buy products via catalog. For cluster 2, we can offer discounts with a minimum spend on wine and gold products, because these customers like discounts and spend more money on wine and gold.