This repository contains a comprehensive analysis and modeling approach for predicting Customer Lifetime Value (CLV) and Retention Duration. It includes data preprocessing, exploratory data analysis (EDA), machine learning model training, hyperparameter tuning, and evaluation. Additionally, it features custom testing functions and survival analysis to support business decision-making.
- Cleaning:
- Handles missing values and removes invalid data (e.g., negative quantities or prices).
- Feature Engineering:
- Creates new features, such as `Revenue`, `Retention Duration`, and log-transformed values to handle skewed data.
- Scaling and Splitting:
- Standardizes numerical features using `StandardScaler`.
- Splits data into training and testing sets.
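The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the sample rows are made up, the helper name `preprocess` is hypothetical, and retention duration is assumed to mean the number of days between a customer's first and last purchase.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up sample using the column names from the dataset description.
raw = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3, 3, 4, None],
    "Quantity": [2, 1, 5, -1, 3, 4, 1, 2],
    "UnitPrice": [10.0, 20.0, 3.0, 3.0, -1.0, 2.5, 8.0, 4.0],
    "InvoiceDate": pd.to_datetime([
        "2023-01-01", "2023-03-01", "2023-01-15", "2023-02-01",
        "2023-02-10", "2023-04-10", "2023-05-05", "2023-05-06",
    ]),
})

def preprocess(df):
    """Clean transactions and build per-customer features (hypothetical helper)."""
    # Drop rows with missing values, then rows with negative quantity or price.
    df = df.dropna(subset=["CustomerID", "Quantity", "UnitPrice", "InvoiceDate"])
    df = df[(df["Quantity"] >= 0) & (df["UnitPrice"] >= 0)].copy()
    df["CustomerID"] = df["CustomerID"].astype(int)
    df["Revenue"] = df["Quantity"] * df["UnitPrice"]

    per_customer = df.groupby("CustomerID").agg(
        TotalRevenue=("Revenue", "sum"),
        FirstPurchase=("InvoiceDate", "min"),
        LastPurchase=("InvoiceDate", "max"),
    )
    # Assumed definition: days between first and last purchase.
    per_customer["RetentionDuration"] = (
        per_customer["LastPurchase"] - per_customer["FirstPurchase"]
    ).dt.days
    # log1p compresses the long right tail of revenue and is defined at zero.
    per_customer["LogTotalRevenue"] = np.log1p(per_customer["TotalRevenue"])
    return per_customer.drop(columns=["FirstPurchase", "LastPurchase"])

features = preprocess(raw)
X = features[["LogTotalRevenue"]]
y = features["RetentionDuration"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the training split alone keeps test-set statistics out of the model inputs.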
- Visualizations:
- Histograms for feature distributions (e.g., `Retention Duration`, `Total Revenue`).
- Scatter plots for pairwise feature relationships.
- Correlation Heatmap:
- Identifies relationships between key features (e.g., Recency, Frequency, Revenue).
- Insights:
- Examines patterns to support feature selection for modeling.
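As a rough illustration of the correlation step, the sketch below derives Recency, Frequency, and Revenue per customer from made-up transactions and computes their correlation matrix. It assumes recency is counted in days back from the most recent date in the data; the notebook may define it differently, and `plot_heatmap` is a hypothetical helper.

```python
import pandas as pd

# Made-up transactions; column names follow the dataset description.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 3],
    "InvoiceDate": pd.to_datetime([
        "2023-01-05", "2023-02-10", "2023-03-20",
        "2023-01-15", "2023-02-01", "2023-01-02",
    ]),
    "Revenue": [20.0, 35.0, 50.0, 15.0, 10.0, 5.0],
})

# Assumption: recency is measured from the last date present in the data.
snapshot = tx["InvoiceDate"].max()

rfm = tx.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    Frequency=("InvoiceDate", "count"),
    Revenue=("Revenue", "sum"),
)

corr = rfm.corr()  # Pearson correlation by default
print(corr.round(2))

def plot_heatmap(corr_matrix):
    """Draw the matrix as an annotated heatmap (needs seaborn and matplotlib)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
    plt.show()
```

Calling `plot_heatmap(corr)` renders the matrix. Fixing the color scale to the range -1 to 1 keeps heatmaps comparable across runs.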
- Implements two models for:
- Retention Duration Prediction
- Customer Lifetime Value (CLV) Prediction
- Grid Search with Cross-Validation:
- Hyperparameter tuning for models using `GridSearchCV`.
- Evaluation Metrics:
- R² Scores
- Root Mean Squared Error (RMSE)
- Residual analysis and visualizations
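The tuning and evaluation loop might look something like the sketch below. The README does not name the estimator, so `RandomForestRegressor` and the parameter grid here are assumptions chosen for illustration, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data: 200 customers, 3 features, noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grid; the notebook's actual estimator and grid may differ.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,          # 5-fold cross-validation on the training split
    scoring="r2",
)
search.fit(X_train, y_train)

# Evaluate the refit best estimator on the held-out test split.
y_pred = search.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
residuals = y_test - y_pred  # input for a residual histogram

print(search.best_params_, round(r2, 3), round(rmse, 3))
```

`GridSearchCV` refits the best configuration on the full training split by default, so `search.predict` uses the tuned model directly.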
- Predict Retention and CLV:
- Predicts metrics for individual customers.
- Visualize Customer Activity:
- Displays historical purchase behavior for specific customers.
- Kaplan-Meier Estimator:
- Plots customer retention survival curves to analyze retention trends over time.
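To make the estimator concrete, here is a small hand-rolled sketch of the Kaplan-Meier product-limit calculation in plain Python. It is for illustration only. Since `lifelines` is among the required libraries, the notebook presumably relies on its `KaplanMeierFitter` (fitted with `fit(durations, event_observed=...)` and drawn with `plot_survival_function()`) rather than code like this. The durations and churn flags are made up.

```python
from collections import Counter

def kaplan_meier(durations, observed):
    """Return [(t, S(t))] for each time t at which at least one churn occurred.

    durations: how long each customer was followed (e.g. in days).
    observed:  True if the customer churned at that time, False if censored
               (still active when observation ended).
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)  # every customer leaves the risk set at their time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving past t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]  # remove churned and censored customers alike
    return curve

# Made-up example: five customers, two of them censored.
durations = [30, 60, 60, 90, 120]
observed = [True, True, False, True, False]

curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(t, round(s, 2))
```

Censored customers never reduce the survival estimate themselves; they only shrink the risk set for later time points, which is what lets the curve use customers who are still active.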
- Retention Model:
- Evaluated with cross-validated R² scores and RMSE on test data.
- Insights into factors affecting retention duration.
- CLV Model:
- Predicts customer revenue potential with competitive performance metrics.
- Helps prioritize high-value customers for retention efforts.
- Survival Analysis:
- Demonstrates retention trends over time for strategic planning.
- Python 3.8+
- Required libraries:
pip install pandas numpy matplotlib seaborn scikit-learn lifelines
- Clone the repository:
git clone https://github.com/your_username/Customer-Retention-CLV.git
cd Customer-Retention-CLV
- Install dependencies:
pip install -r requirements.txt
- Run the Jupyter Notebook:
jupyter notebook final.ipynb
- final.ipynb: Main Jupyter Notebook containing the entire workflow.
- customer_segmentation.csv: Example dataset for analysis and modeling.
- README.md: Documentation for the repository.
The dataset, `customer_segmentation.csv`, contains transactional data, including:
- CustomerID: Unique identifier for each customer.
- Quantity: Number of items purchased.
- UnitPrice: Price per unit.
- InvoiceDate: Date of the transaction.
- Revenue: Calculated as `Quantity × UnitPrice`.
- Data Preprocessing:
- Load the dataset.
- Perform cleaning and feature engineering.
- Modeling:
- Train predictive models for retention and CLV.
- Evaluate models using R² and RMSE.
- Visualization:
- Analyze distributions, relationships, and residuals.
- Survival Analysis:
- Plot retention survival curves for insights into customer retention trends.
- Testing:
- Use custom functions to predict retention and CLV for specific customers.
A histogram showing the frequency distribution of retention durations.
Scatter plots analyzing how recent purchases relate to revenue generation.
Histograms showing residuals for both retention and CLV models.
Contributions are welcome! Please open an issue or submit a pull request for improvements or additional features.