(This repo is still under development for further improvement)
This project applies clustering techniques to segment customers of an E-commerce platform based on their transaction history and demographic data. The goal is to group customers with similar behaviors for marketing and personalization purposes. We compare three clustering algorithms: K-Means, DBSCAN, and Hierarchical Clustering, and evaluate their performance using silhouette scores.
The dataset contains customer transaction records along with demographic information, which includes:
- customers: Information on customer IDs, gender, and city.
- transactions: Records of purchases, including transaction status and coupon usage.
- genders: Mapping between gender ID and gender name.
- cities: Mapping between city ID and city name.
Key Features:
- `gender_name`: Gender of the customer.
- `city_name`: City where the customer resides.
- `coupon_usage_frequency`: Number of times the customer used coupons.
- `total_transactions`: Total number of transactions made by the customer.
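The two behavioural features are per-customer aggregates, and the two name features come from the lookup sheets. A minimal sketch of how they might be derived with pandas is shown here. Apart from `gender_name`, `city_name`, `coupon_usage_frequency`, and `total_transactions`, every column name (`customer_id`, `gender_id`, `city_id`, `transaction_id`, `coupon_used`) and every value in the toy frames is an assumption made for illustration, not the dataset's actual schema.

```python
import pandas as pd

# Toy stand-ins for the four sheets. Column names other than gender_name and
# city_name are hypothetical; the values are made up for this sketch.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "gender_id": [1, 2, 1],
    "city_id": [10, 10, 20],
})
genders = pd.DataFrame({"gender_id": [1, 2], "gender_name": ["Male", "Female"]})
cities = pd.DataFrame({"city_id": [10, 20], "city_name": ["City A", "City B"]})
transactions = pd.DataFrame({
    "transaction_id": [100, 101, 102, 103],
    "customer_id": [1, 1, 2, 1],
    "coupon_used": [True, False, True, True],  # hypothetical flag column
})


def build_features(customers, genders, cities, transactions):
    """Return one row per customer with the four key features."""
    # Resolve the ID columns to readable names via the lookup sheets.
    demographics = (
        customers
        .merge(genders, on="gender_id", how="left")
        .merge(cities, on="city_id", how="left")
    )
    # Aggregate the transaction records per customer.
    aggregated = (
        transactions
        .groupby("customer_id")
        .agg(
            coupon_usage_frequency=("coupon_used", "sum"),
            total_transactions=("transaction_id", "count"),
        )
        .reset_index()
    )
    # A left join keeps customers without transactions; their counts become 0.
    out = demographics.merge(aggregated, on="customer_id", how="left")
    counts = ["coupon_usage_frequency", "total_transactions"]
    out[counts] = out[counts].fillna(0).astype(int)
    return out[["customer_id", "gender_name", "city_name"] + counts]


features = build_features(customers, genders, cities, transactions)
print(features)
```

Treating customers with no transactions as zero counts is a choice made for this sketch; the notebook may handle them differently.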
The project consists of the following key steps:
- Load data from an Excel file containing multiple sheets.
- Drop irrelevant columns such as `burn_date` from the transaction data.
- Handle missing data using SimpleImputer:
  - Categorical columns: Imputed with the most frequent value.
  - Numerical columns: Imputed with the mean value.
- One-Hot Encoding was applied to categorical features like `gender_name` and `city_name`.
- Merged customer demographic information with transaction data.
- Aggregated transactions by calculating `coupon_usage_frequency` and `total_transactions` for each customer.
- Selected features: `gender_name`, `city_name`, `coupon_usage_frequency`, and `total_transactions`.
- Scaled the data for Hierarchical Clustering only. K-Means and DBSCAN were run on unscaled features because scaling negatively impacted their results.
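The imputation, encoding, and optional scaling steps above could be wired together roughly as follows. This is a sketch under assumptions: the steps list names `SimpleImputer` and One-Hot Encoding, but the use of `ColumnTransformer`, `Pipeline`, and `StandardScaler` (the steps do not say which scaler was used), the decision to scale only the numerical columns, and the toy data are all illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

CATEGORICAL = ["gender_name", "city_name"]
NUMERICAL = ["coupon_usage_frequency", "total_transactions"]


def make_preprocessor(scale: bool) -> ColumnTransformer:
    """Impute, one-hot encode, and optionally scale the selected features."""
    numeric_steps = [("impute", SimpleImputer(strategy="mean"))]
    if scale:
        # Assumption: standardisation; the README only says "scaling".
        numeric_steps.append(("scale", StandardScaler()))
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer(
        [
            ("cat", categorical, CATEGORICAL),
            ("num", Pipeline(numeric_steps), NUMERICAL),
        ],
        sparse_threshold=0.0,  # always return a dense array
    )


# Toy data with gaps, standing in for the merged customer table.
df = pd.DataFrame({
    "gender_name": ["Male", "Female", np.nan, "Female"],
    "city_name": ["City A", "City B", "City A", np.nan],
    "coupon_usage_frequency": [2.0, np.nan, 0.0, 4.0],
    "total_transactions": [3.0, 1.0, 0.0, np.nan],
})

X_unscaled = make_preprocessor(scale=False).fit_transform(df)  # K-Means, DBSCAN
X_scaled = make_preprocessor(scale=True).fit_transform(df)     # Hierarchical
print(X_unscaled.shape, X_scaled.shape)
```

Building two preprocessors from one factory keeps the scaled and unscaled feature matrices otherwise identical, so any difference in clustering results can be attributed to scaling alone.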
We applied three clustering algorithms to the preprocessed data:
K-Means:
- Tuned the number of clusters (`k`) using the Elbow Method and Silhouette Scores.
- The optimal number of clusters was 4, with the highest silhouette score of 0.62.
DBSCAN:
- Applied grid search to optimize `eps` and `min_samples`.
- Best parameters: `eps=0.5`, `min_samples=5`.
- DBSCAN struggled with varying densities in the dataset, especially in high-dimensional spaces, making it less suitable for this particular problem.
Hierarchical Clustering:
- Tested different linkage methods (Ward, Complete, Average).
- Best performance was achieved with 5 clusters and Average Linkage.
- Hierarchical clustering produced clear groups but was more computationally expensive.
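The choice of `k` for K-Means and the linkage comparison for Hierarchical Clustering both come down to fitting several candidate models and comparing scores. The sketch below shows one way to do that with scikit-learn. It runs on synthetic blobs rather than the project's data, and the helper names `tune_kmeans` and `compare_linkages` are hypothetical, so the numbers it prints say nothing about the results reported here.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def tune_kmeans(X, k_values, random_state=0):
    """Return {k: (inertia, silhouette)} for each candidate k."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        # Inertia feeds the elbow plot; silhouette gives a second opinion.
        results[k] = (model.inertia_, silhouette_score(X, model.labels_))
    return results


def compare_linkages(X, n_clusters, linkages=("ward", "complete", "average")):
    """Return {linkage: silhouette} for agglomerative clustering."""
    scores = {}
    for linkage in linkages:
        model = AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage)
        scores[linkage] = silhouette_score(X, model.fit_predict(X))
    return scores


# Synthetic stand-in data: four well-separated blobs (not the project's dataset).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

kmeans_results = tune_kmeans(X, range(2, 8))
best_k = max(kmeans_results, key=lambda k: kmeans_results[k][1])
linkage_scores = compare_linkages(X, n_clusters=best_k)

for k, (inertia, sil) in kmeans_results.items():
    print(f"k={k}: inertia={inertia:.1f}, silhouette={sil:.3f}")
print("best k by silhouette:", best_k)
print("linkage scores:", linkage_scores)
```

Plotting inertia against `k` from `kmeans_results` gives the elbow curve, while the silhouette column offers a single number to rank candidates when the elbow is ambiguous.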
Model | Best Params | Silhouette Score |
---|---|---|
K-Means | `n_clusters=4` | 0.62 |
DBSCAN | `eps=0.5`, `min_samples=5` | 0.99 (unstable due to noise) |
Hierarchical | `n_clusters=5`, `linkage='average'` | 0.80 |
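The DBSCAN row comes from a grid search over `eps` and `min_samples`, and its score carries a caveat about noise. DBSCAN labels noise points as `-1`, and a silhouette score computed only on the remaining points can look very high when few points survive. Whether that explains the value in the table is not stated, so the sketch below is only an illustration: it runs on synthetic data, the helper `dbscan_grid_search` is hypothetical, and it reports the noise fraction next to each score so the two can be read together.

```python
from itertools import product

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def dbscan_grid_search(X, eps_values, min_samples_values):
    """Score each (eps, min_samples) pair; return a list of result dicts."""
    results = []
    for eps, min_samples in product(eps_values, min_samples_values):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        kept = labels != -1  # DBSCAN marks noise points with -1
        n_clusters = len(set(labels[kept]))
        # Silhouette needs at least 2 clusters and fewer clusters than points.
        if n_clusters < 2 or n_clusters >= kept.sum():
            continue
        results.append({
            "eps": eps,
            "min_samples": min_samples,
            "n_clusters": n_clusters,
            "noise_fraction": float(1 - kept.mean()),
            # Scored on non-noise points only, so read it with noise_fraction.
            "silhouette": float(silhouette_score(X[kept], labels[kept])),
        })
    return results


# Synthetic stand-in data: two dense blobs plus uniform background noise.
rng = np.random.default_rng(0)
blobs, _ = make_blobs(
    n_samples=200, centers=[[0, 0], [8, 8]], cluster_std=0.5, random_state=0
)
X = np.vstack([blobs, rng.uniform(-4, 12, size=(20, 2))])

results = dbscan_grid_search(
    X, eps_values=[0.3, 0.5, 1.0], min_samples_values=[3, 5, 10]
)
best = max(results, key=lambda r: r["silhouette"])
print(best)
```

Ranking by silhouette alone can favour settings that discard most of the data, so filtering on `noise_fraction` before taking the maximum is a reasonable extra step.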
Key Findings:
- K-Means gave the most usable segmentation: 4 distinct customer segments with a silhouette score of 0.62.
- DBSCAN had difficulties with noise and clusters of varying densities, making it less effective.
- Hierarchical Clustering was computationally expensive but produced reasonable clusters with an average silhouette score of 0.58.
PCA (Principal Component Analysis) was applied to reduce the dataset to 2 dimensions, and the clustering results were visualized in that reduced space:
- K-Means Clustering: 4 distinct clusters with good separation.
- DBSCAN: Struggled with noisy data and did not form clear clusters.
- Hierarchical Clustering: Produced 5 clusters with moderate separation.
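A 2-D PCA projection of this kind can be produced with a few lines of scikit-learn and matplotlib. The sketch below is an assumption about how such a plot might be drawn, not the notebook's actual plotting code. It uses synthetic 6-dimensional data, and the helper `plot_clusters_2d` is hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA


def plot_clusters_2d(X, labels, title, ax=None):
    """Project X onto its first two principal components and plot by label."""
    coords = PCA(n_components=2).fit_transform(X)
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 5))
    for label in np.unique(labels):
        mask = labels == label
        # DBSCAN uses -1 for noise; give it a readable legend entry.
        name = "noise" if label == -1 else f"cluster {label}"
        ax.scatter(coords[mask, 0], coords[mask, 1], s=12, label=name)
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    ax.set_title(title)
    ax.legend()
    return coords, ax


# Synthetic stand-in data with 6 features (not the project's dataset).
X, _ = make_blobs(n_samples=200, centers=4, n_features=6, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
coords, ax = plot_clusters_2d(X, labels, "K-Means (synthetic data)")
```

Because the clustering is fitted in the full feature space and PCA is used only for display, clusters that overlap in the plot may still be well separated in the original dimensions.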
- Clone the Repository:
git clone https://github.com/Assem-ElQersh/E-Commerce-Customers-Segmentation.git
cd E-Commerce-Customers-Segmentation
- Install the Required Dependencies:
pip install -r requirements.txt
- Run the Jupyter Notebook: Execute `src/Model.ipynb` to preprocess the data, apply the clustering algorithms, and generate the visualizations.
├── data/
│   └── E-commerce_data.xlsx
├── src/
│   └── Model.ipynb
├── README.md
└── requirements.txt
- K-Means was the most effective clustering method for this dataset, producing 4 customer segments with a silhouette score of 0.62.
- DBSCAN struggled with noise and varying densities, resulting in lower performance.
- Hierarchical Clustering could identify clear segments but was less scalable and more computationally expensive than K-Means.
This project demonstrates a comprehensive approach to customer segmentation using various clustering techniques. The analysis offers valuable insights into customer behavior, which can be utilized by the E-commerce platform for targeted marketing strategies and personalized recommendations.
- Please note that this project is part of the MLSC Data Science Graduation Project.