Customer-Segmentation

Applied the K-Means clustering algorithm to segment customers. With customer segments, we can identify customer behavior and create marketing campaigns tailored to each segment.

Understanding the Problem Statement and Business Case

Marketing is crucial for the growth and sustainability of any business. Marketers can help build the company's brand, engage customers, grow revenue, and increase sales. One of the key pain points for marketers is knowing their customers and identifying their needs. By understanding the customer, marketers can launch a targeted marketing campaign that is tailored to specific needs.

In this project, we perform customer segmentation based on customer data. By segmenting customers, we can identify their needs and behaviours. In this way, we can create effective marketing campaigns for them.

Importing Datasets

We used the Customer Personality Analysis dataset from Kaggle, which contains information on 2240 customers. The following are the first two rows of the dataset:

ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response
5524 1957 Graduation Single 58138.0 0 0 2012-04-09 58 635 ... 7 0 0 0 0 0 0 3 11 1
2174 1954 Graduation Single 46344.0 1 1 2012-08-03 38 11 ... 5 0 0 0 0 0 0 3 11 0

Attributes

People

  • ID: Customer's unique identifier

  • Year_Birth: Customer's birth year

  • Education: Customer's education level

  • Marital_Status: Customer's marital status

  • Income: Customer's yearly household income

  • Kidhome: Number of children in customer's household

  • Teenhome: Number of teenagers in customer's household

  • Dt_Customer: Date of customer's enrollment with the company

  • Recency: Number of days since customer's last purchase

  • Complain: 1 if the customer complained in the last 2 years, 0 otherwise

Products

  • MntWines: Amount spent on wine in last 2 years

  • MntFruits: Amount spent on fruits in last 2 years

  • MntMeatProducts: Amount spent on meat in last 2 years

  • MntFishProducts: Amount spent on fish in last 2 years

  • MntSweetProducts: Amount spent on sweets in last 2 years

  • MntGoldProds: Amount spent on gold in last 2 years

Promotion

  • NumDealsPurchases: Number of purchases made with a discount

  • AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise

  • AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise

  • AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise

  • AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise

  • AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise

  • Response: 1 if customer accepted the offer in the last campaign, 0 otherwise

Place

  • NumWebPurchases: Number of purchases made through the company’s website

  • NumCatalogPurchases: Number of purchases made using a catalogue

  • NumStorePurchases: Number of purchases made directly in stores

  • NumWebVisitsMonth: Number of visits to company’s website in the last month

After importing the dataset, we check it for missing values. The Income column has 24 missing values, so let's fill them with the mean of that column.
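The mean imputation described above could look roughly like the following. This is a minimal sketch on a toy DataFrame (the values are made up for illustration), not the exact notebook code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kaggle data; only the Income column matters here.
df = pd.DataFrame({"Income": [58138.0, 46344.0, np.nan, 26646.0]})

# Count missing values per column before imputing.
print(df.isna().sum())

# Replace missing incomes with the mean of the observed incomes.
df["Income"] = df["Income"].fillna(df["Income"].mean())
```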

Before visualizing the customer profile, let's create a new column that contains each customer's age. The following is the dataset with the new Age column:

ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response Age
5524 1957 Graduation Single 58138.0 0 0 2012-04-09 58 635 ... 7 0 0 0 0 0 0 3 11 1 65
2174 1954 Graduation Single 46344.0 1 1 2012-08-03 38 11 ... 5 0 0 0 0 0 0 3 11 0 68
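A minimal sketch of how the Age column might be derived. The reference year 2022 is an assumption inferred from the sample rows above (1957 gives 65, 1954 gives 68); the README does not state it explicitly.

```python
import pandas as pd

# Assumption: ages are computed relative to 2022, inferred from the sample rows.
REFERENCE_YEAR = 2022

df = pd.DataFrame({"Year_Birth": [1957, 1954]})

# Age is the reference year minus the birth year.
df["Age"] = REFERENCE_YEAR - df["Year_Birth"]
print(df)
```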

Customers Profile

Customers Age

Age Vis 1

Age Vis 2

The boxplot above shows extreme outliers in customer age; we should drop them later.

We can divide customer age into 4 categories: Young, Adult, Mature, and Senior.
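One way to build such age groups is `pd.cut`. The cut points below are hypothetical, since the README does not state the thresholds that were actually used; only the four labels and the Age_group column name come from this document.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [25, 40, 50, 70]})

# Hypothetical cut points: the real thresholds are not given in the README.
bins = [0, 30, 45, 65, np.inf]
labels = ["Young", "Adult", "Mature", "Senior"]

# right=False makes each bin include its lower edge: [0, 30), [30, 45), ...
df["Age_group"] = pd.cut(df["Age"], bins=bins, labels=labels, right=False)
print(df)
```

The same pattern would apply to the income groups, again with thresholds chosen by the analyst.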

Age Vis 3

Now we know that 54.1% of customers fall into the mature category, 25.8% into the adult category, 18.8% into the senior category, and 1.25% into the young category.

Customers Education

Edu Vis 1

Based on the chart above, the majority of customers hold a graduation (bachelor's) degree, followed by master's and PhD degrees. For our model, we can divide customer education into two categories: undergraduate and postgraduate.

Edu Vis 2

Now we know that 88.5% of customers have a postgraduate education and 11.5% have an undergraduate education.

Customers Marital Status

MS Vis 1

Based on the chart above, there are many marital statuses, including unusual values such as Absurd and YOLO. Let's divide them into two categories: single and in-relationship.
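The regrouping above could be sketched as follows. Which raw values count as "in a relationship" is an assumption here (Married and Together); the README only names Absurd and YOLO explicitly, and every other value is treated as single.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Marital_Status": ["Married", "Together", "Single", "Divorced", "Absurd", "YOLO"]}
)

# Assumption: only these raw values mean the customer is in a relationship.
in_relationship = ["Married", "Together"]

# Everything not in the list above falls back to "Single".
df["Marital_Status"] = np.where(
    df["Marital_Status"].isin(in_relationship), "In-Relationship", "Single"
)
print(df["Marital_Status"].value_counts(normalize=True))
```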

MS Vis 2

After dividing marital status into two categories, we know that 64.5% of customers are in a relationship and 35.5% are single.

Customers Income

Income Vis 1

Income Vis 2

The plot above shows a huge outlier in the income boxplot, with an income of 666.666k; we should remove it later.

We can divide customer income into 4 categories: low income, low to medium income, medium to high income, and high income.

Income Vis 3

Distribution of Customers by Children in Their Home

Kids vis

Teen Vis 1

Now let's combine Kidhome and Teenhome into one column called 'Number of Child', which contains the number of children in the customer's household.
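As a small sketch, the combined column is just the sum of the two source columns. The two rows mirror the Kidhome and Teenhome values of the sample rows shown earlier.

```python
import pandas as pd

df = pd.DataFrame({"Kidhome": [0, 1], "Teenhome": [0, 1]})

# Total children at home = small children + teenagers.
df["Number of Child"] = df["Kidhome"] + df["Teenhome"]
print(df)
```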

Children Vis

Now we know that the majority of customers have 1 child, followed by 0 children and 2 children.

Customers Complaints

Complain Vis 1

The pie chart above shows that 99.1% of customers never filed a complaint, while 0.938% have filed one.

Having looked at the customer profile, let's now create new columns that contain total monthly spend, total number of campaigns accepted, and total purchases.
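A possible way to build these aggregate columns is to sum related columns row by row. The new column names (`Total_Spend`, `Total_Accepted_Cmp`, `Total_Purchases`) and the exact set of columns summed are assumptions, and the toy values are made up. The Mnt columns cover the last 2 years, so any conversion to a monthly figure is left out because the README does not describe it.

```python
import pandas as pd

mnt_cols = ["MntWines", "MntFruits", "MntMeatProducts",
            "MntFishProducts", "MntSweetProducts", "MntGoldProds"]
cmp_cols = ["AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3",
            "AcceptedCmp4", "AcceptedCmp5", "Response"]
purchase_cols = ["NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases"]

# Two made-up customers, columns in the order mnt + cmp + purchases.
df = pd.DataFrame(
    [[10, 1, 5, 2, 1, 3, 0, 0, 1, 0, 0, 1, 4, 2, 6],
     [0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 3]],
    columns=mnt_cols + cmp_cols + purchase_cols,
)

# Row-wise sums over each group of related columns (hypothetical names).
df["Total_Spend"] = df[mnt_cols].sum(axis=1)
df["Total_Accepted_Cmp"] = df[cmp_cols].sum(axis=1)
df["Total_Purchases"] = df[purchase_cols].sum(axis=1)
print(df[["Total_Spend", "Total_Accepted_Cmp", "Total_Purchases"]])
```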

Data Pre-processing

We drop the Dt_Customer, Kidhome, Teenhome, Recency, ID, Year_Birth, Income, Age, Z_CostContact, and Z_Revenue columns because they are unnecessary for clustering.
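Dropping these columns could be sketched as below. The single-row DataFrame is a placeholder, and `model_df` is a hypothetical name for the resulting modelling table.

```python
import pandas as pd

drop_cols = ["Dt_Customer", "Kidhome", "Teenhome", "Recency", "ID",
             "Year_Birth", "Income", "Age", "Z_CostContact", "Z_Revenue"]

# Placeholder row: zeros for the dropped columns plus one column we keep.
df = pd.DataFrame([[0] * len(drop_cols) + ["Mature"]],
                  columns=drop_cols + ["Age_group"])

# Keep everything except the listed columns.
model_df = df.drop(columns=drop_cols)
print(model_df.columns.tolist())
```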

Data Encoding

The model data contains categorical columns, so we apply ordinal encoding to label them as follows:

  • Age_group = {'Young': 1, 'Adult': 2 , 'Mature': 3, 'Senior': 4}
  • Education = {'Undergraduate': 1, 'Postgraduate': 2}
  • Marital_Status = {'Single': 1, 'In-Relationship': 2}
  • Income_group = {'Low income': 1, 'Low to medium income': 2, 'Medium to high income': 3, 'High income': 4}
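The mappings listed above can be applied with `Series.map`. This is a minimal sketch on two made-up rows; the dictionaries are copied from the list above.

```python
import pandas as pd

df = pd.DataFrame({
    "Age_group": ["Mature", "Senior"],
    "Education": ["Postgraduate", "Undergraduate"],
    "Marital_Status": ["Single", "In-Relationship"],
    "Income_group": ["Medium to high income", "Low income"],
})

# Ordinal codes exactly as listed in the README.
encodings = {
    "Age_group": {"Young": 1, "Adult": 2, "Mature": 3, "Senior": 4},
    "Education": {"Undergraduate": 1, "Postgraduate": 2},
    "Marital_Status": {"Single": 1, "In-Relationship": 2},
    "Income_group": {"Low income": 1, "Low to medium income": 2,
                     "Medium to high income": 3, "High income": 4},
}

# Replace each label with its integer code, column by column.
for column, mapping in encodings.items():
    df[column] = df[column].map(mapping)
print(df)
```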

Data Normalization

Before we feed the data into our model, we should normalize it. Normalization gives equal weight to each variable so that no single variable steers model performance in one direction just because its values are larger. We use the Power Transformer to normalize the data.
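A minimal sketch of this step with scikit-learn's `PowerTransformer`, assuming its default settings (Yeo-Johnson transform followed by standardization); the README does not say which settings were used. The data is synthetic, and `Total_Spend` is a hypothetical column name.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)

# Synthetic, right-skewed data standing in for the real features.
X = pd.DataFrame({
    "Total_Spend": rng.lognormal(mean=5, sigma=1, size=200),
    "Number of Child": rng.integers(0, 4, size=200),
})

# Defaults: method="yeo-johnson", standardize=True (zero mean, unit variance).
pt = PowerTransformer()
X_norm = pd.DataFrame(pt.fit_transform(X), columns=X.columns)
print(X_norm.describe())
```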

Before Normalization :

Norm 1

After Normalization :

Norm 2

Clustering

Before clustering the data, we should find the ideal number of clusters using the Elbow Method, with the silhouette score for validation.

Elbow Method

Elbow Score

In cluster analysis, the elbow method is a heuristic used to determine the number of clusters in a data set. The vertical line in the graph above shows that the ideal number of clusters for our data is 4.
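The elbow method can be sketched by fitting K-Means for a range of cluster counts and recording the inertia (within-cluster sum of squares) for each. The example below uses synthetic blobs instead of the customer data, and the range of k is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated groups, standing in for the real features.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Fit K-Means for k = 1..8 and store the inertia of each fit.
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

# The "elbow" is the k after which inertia stops dropping sharply.
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

Plotting `inertias` against k gives the elbow curve; the bend in that curve is read off as the suggested number of clusters.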

Silhouette Score

The Silhouette Coefficient, or silhouette score, is a metric used to evaluate the quality of a clustering. Its value ranges from -1 to 1; a score close to 1 means the clusters are well separated and clearly distinguished.
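As a rough illustration, the silhouette score can be computed with `sklearn.metrics.silhouette_score` for several candidate cluster counts. The data here is synthetic, so the printed scores say nothing about the customer dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with 4 groups.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# The silhouette score needs at least 2 clusters, so start at k = 2.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

for k, score in scores.items():
    print(k, round(score, 3))
```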

Silhouette Score

Our silhouette score is poor because it is close to 0.

To improve it, we can apply principal component analysis (PCA) before clustering.

Principal Component Analysis

PCA is an unsupervised machine learning algorithm. It performs dimensionality reduction while attempting to preserve as much of the original information (variance) as possible.

PCA 1

PCA 2

PCA 3

PCA 4

PCA 5

PCA 6

PCA 7

PCA 8

We will apply PCA to our data before clustering and visualize the result for customer segmentation.
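A minimal sketch of clustering on PCA-reduced data. The number of components (2) is an assumption, since the README does not state how many were kept; the 4 clusters follow the elbow result above, and the data is synthetic.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Synthetic 10-dimensional data standing in for the normalized features.
X, _ = make_blobs(n_samples=300, n_features=10, centers=4, random_state=0)

# Assumption: keep 2 principal components (handy for 2D scatter plots).
pca = PCA(n_components=2, random_state=0)
X_pca = pca.fit_transform(X)

# Cluster in the reduced space and score the result.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_pca)
score = silhouette_score(X_pca, labels)

print("explained variance kept:", pca.explained_variance_ratio_.sum())
print("silhouette score:", score)
```

Checking `explained_variance_ratio_` shows how much of the original variance survives the reduction, which helps judge whether too much information was discarded.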

Customer Segmentation

Segmen 1

Segmen 2

Segmen 3

Segmen 4

Segmen 5

Segmen 6

Segmen 7

Segmen 8

Segmen 9

Segmen 10

Segmen 11

Segmen 12

Segmen 13

Segmen 14

Segmen 15

Segmen 16

Segmen 17

Summary

Based on the visualization of customer clusters, we can describe the characteristics of each cluster.

Cluster 0

  • The majority of customers belong to this cluster
  • Have low to low-medium income
  • Majority have 1-2 children at home
  • Spend less money on products
  • Buy fewer products
  • Complain a lot
  • Don't like to buy products via catalog
  • Give negative responses to marketing campaigns

Cluster 1

  • Have medium-high to high income
  • Have 0-1 children at home
  • Cluster with the second-highest number of total purchases
  • Spend more money on fruit and fish products
  • Give negative responses to marketing campaigns
  • Like to buy products in store
  • Like discounts

Cluster 2

  • Have low-medium to medium-high income
  • Consist of customers in the adult to senior age groups
  • Have 1-2 children at home
  • Spend more money on wine and gold
  • Really like discounts
  • Give positive responses to marketing campaigns

Cluster 3

  • Cluster with the fewest customers
  • Have high income
  • Majority are child-free
  • Cluster with the highest number of total purchases
  • Spend more money on all kinds of products
  • Less likely to complain
  • Like to buy products via catalog
  • Don't like discounts
  • Give positive responses to marketing campaigns
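Cluster characteristics like the ones listed above are typically read from per-cluster aggregates as well as from plots. The sketch below shows one way to compute cluster sizes and feature means; the `Cluster` and `Total_Spend` column names are hypothetical and the numbers are made up.

```python
import pandas as pd

# Made-up customers with an assigned cluster label.
df = pd.DataFrame({
    "Cluster": [0, 0, 0, 1, 2, 3],
    "Total_Spend": [50, 70, 60, 900, 700, 1800],
    "Number of Child": [1, 2, 1, 0, 1, 0],
})

# How many customers fall into each cluster.
sizes = df["Cluster"].value_counts().sort_index()

# Average value of every feature within each cluster.
profile = df.groupby("Cluster").mean()

print(sizes)
print(profile)
```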

Conclusions

After clustering, we have 4 customer segments. Cluster 3 is our best cluster because its customers spend more money on all kinds of products, are less likely to complain, don't like discounts, and give positive responses to marketing campaigns. Cluster 0 is our worst cluster because its customers spend less money on products, buy fewer products, complain a lot, and give negative responses to marketing campaigns.

For the next campaign, we can target cluster 2 and cluster 3 because they respond positively to marketing campaigns. For cluster 3, we can use the catalog as the campaign medium because these customers like to buy products via catalog. For cluster 2, we can offer discounts with a minimum spend on wine and gold products because these customers like discounts and spend more money on gold and wine products.