Transform customer data into actionable business insights with modern RFM analysis and behavioral segmentation.
Customer segmentation divides your customer base into distinct groups based on shared characteristics and behaviors. This solution creates six customer segments:
- Champions - Top customers who purchase frequently and generate the highest revenue
- Loyal - High-value customers with consistent purchase patterns
- Regular - Typical customers with average purchasing patterns and revenue
- New Customers - Customers who have made only one purchase so far
- At Risk - Customers with no recent activity who are at risk of churning
- Churned - Customers who have already churned and need to be won back
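For illustration, the sketch below assigns these labels from simple RFM score thresholds. The segment names come from this solution, but the thresholds, column names, and helper function are hypothetical; the actual rules live in the segmentation notebooks.

```python
import pandas as pd

# Hypothetical thresholds for illustration only; the real segment rules are
# defined in 02a_Segmentation_Lakeflow.py / 02b_Segmentation_MLflow.py.
def assign_segment(row: pd.Series) -> str:
    r, f, m = row["r_score"], row["f_score"], row["m_score"]  # quintile scores, 1 (worst) to 5 (best)
    if r >= 4 and f >= 4 and m >= 4:
        return "Champions"
    if r >= 3 and f >= 3:
        return "Loyal"
    if f == 1 and r >= 4:
        return "New Customers"
    if r == 1:
        return "Churned"
    if r == 2:
        return "At Risk"
    return "Regular"

# Toy scores for three customers (columns assumed, not the notebooks' schema).
scores = pd.DataFrame(
    {"customer_id": [1, 2, 3], "r_score": [5, 2, 1], "f_score": [5, 3, 2], "m_score": [4, 3, 2]}
)
scores["segment"] = scores.apply(assign_segment, axis=1)
print(scores[["customer_id", "segment"]])
```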
This solution uses Databricks Asset Bundles for deployment:
# Clone the repository
git clone https://github.com/databricks-industry-solutions/customer-segmentation.git
cd customer-segmentation
# Deploy to Databricks
databricks bundle deploy
# Run the complete workflow
databricks bundle run customer_segmentation_demo_install
Prerequisites:
- Databricks workspace with Unity Catalog enabled
- Databricks CLI installed and configured
- Ability to use Serverless compute (or Cluster creation permissions)
customer-segmentation/
├── databricks.yml                        # Databricks Asset Bundle configuration
├── src/
│   └── customer_segmentation.lvdash.json # The AI/BI dashboard; change the catalog and schema names in this file to your catalog and schema
├── notebooks/
│   ├── 01_Data_Setup.py                  # Synthetic data generation
│   ├── 02a_Segmentation_Lakeflow.py      # Lakeflow Declarative Pipelines for segmentation
│   ├── 02b_Segmentation_MLflow.py        # Unsupervised clustering with MLflow (builds off of 02a_Segmentation_Lakeflow)
│   └── 03_Business_Insights.py           # Business visualizations
└── .github/workflows/                    # CI/CD automation
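The 02b notebook layers unsupervised clustering on top of the RFM features produced by 02a and tracks the experiment with MLflow. The snippet below is a minimal sketch of that general pattern; the feature values, cluster count, and run name are assumptions, not the notebook's actual code.

```python
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Assumed input: one row per customer with recency/frequency/monetary features.
rfm = pd.DataFrame(
    {
        "recency": [5, 40, 200, 10, 90],
        "frequency": [12, 4, 1, 9, 2],
        "monetary": [900.0, 250.0, 40.0, 700.0, 120.0],
    }
)

features = StandardScaler().fit_transform(rfm)

with mlflow.start_run(run_name="behavioral_clustering"):
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # cluster count is illustrative
    labels = kmeans.fit_predict(features)

    mlflow.log_param("n_clusters", 3)
    mlflow.log_metric("silhouette", silhouette_score(features, labels))
    mlflow.sklearn.log_model(kmeans, "kmeans_model")

rfm["cluster"] = labels
print(rfm)
```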
The solution implements a 3-stage customer segmentation pipeline:
1. Data Setup (01_Data_Setup.py)
   - Generates 1,000 synthetic customers with realistic demographics
   - Creates transaction history with seasonal patterns and behavioral variety
   - Stores data in Unity Catalog managed tables
2. Segmentation (02a_Segmentation_Lakeflow.py, 02b_Segmentation_MLflow.py)
   - RFM Analysis: Calculates Recency, Frequency, and Monetary scores (a minimal scoring sketch follows this list)
   - Behavioral Clustering: Groups customers by purchase patterns
   - Segment Profiles: Creates business-ready segment characteristics
3. Business Insights (03_Business_Insights.py)
   - AI/BI Dashboard: A dashboard for viewing RFM scores, trends, and customer demographics
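As a rough illustration of the RFM scoring step, the snippet below computes recency, frequency, and monetary values from a toy transactions table and converts them to 1-5 quintile scores. The column names, snapshot-date logic, and quintile choice are assumptions for this sketch, not the pipeline's actual implementation.

```python
import pandas as pd

# Toy transactions table; the real pipeline reads Unity Catalog managed tables,
# and these column names are assumptions for the sketch.
transactions = pd.DataFrame(
    {
        "customer_id": [1, 1, 1, 2, 2, 3],
        "order_date": pd.to_datetime(
            ["2025-01-05", "2025-02-10", "2025-03-01", "2024-06-15", "2024-09-20", "2023-11-30"]
        ),
        "amount": [120.0, 80.0, 200.0, 60.0, 45.0, 30.0],
    }
)

# Score customers as of the day after the last observed transaction.
snapshot_date = transactions["order_date"].max() + pd.Timedelta(days=1)

rfm = transactions.groupby("customer_id").agg(
    recency=("order_date", lambda d: (snapshot_date - d.max()).days),
    frequency=("order_date", "count"),
    monetary=("amount", "sum"),
)

# Quintile scores from 1 to 5; recency is reversed because more recent is better.
# rank(method="first") breaks ties so qcut always finds distinct bin edges.
rfm["r_score"] = pd.qcut(rfm["recency"].rank(method="first"), 5, labels=[5, 4, 3, 2, 1]).astype(int)
rfm["f_score"] = pd.qcut(rfm["frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)
rfm["m_score"] = pd.qcut(rfm["monetary"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)
print(rfm)
```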
Either:
1. Create a .env file based on .env.example:
# databricks.yml variables
variables:
  catalog_name: your_catalog_name
  schema_name: your_schema_name
  warehouse_id: your_warehouse_id
2. Or create a variable-overrides.json file under .databricks > bundle > {your target}:
// variable-overrides.json variables
{
  "catalog_name": "your_catalog_name",
  "schema_name": "your_schema_name",
  "warehouse_id": "your_warehouse_id"
}
Based on industry benchmarks, implementing this segmentation strategy can deliver:
- 20% average revenue lift through targeted campaigns
- 15-30% improvement in customer lifetime value
- 40% increase in marketing campaign effectiveness
- 25% reduction in customer acquisition costs
The solution includes 5 essential visualizations:
- Customer Distribution - Segment size analysis
- Revenue Distribution - Revenue concentration by segment
- Performance Metrics - Customer value benchmarks
- Lifetime Value - CLV projections by segment
- ROI Analysis - Business impact projections
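For example, the segment-size view behind the Customer Distribution visualization can be reproduced with a few lines of Plotly Express. The DataFrame below is a stand-in with illustrative counts; the dashboard and 03_Business_Insights.py read the real segment tables.

```python
import pandas as pd
import plotly.express as px

# Stand-in segment counts for illustration; the notebook reads these from Unity Catalog tables.
segment_counts = pd.DataFrame(
    {
        "segment": ["Champions", "Loyal", "Regular", "New Customers", "At Risk", "Churned"],
        "customers": [120, 210, 380, 90, 110, 90],
    }
)

fig = px.bar(
    segment_counts,
    x="segment",
    y="customers",
    title="Customer Distribution by Segment",
)
fig.show()
```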
- Unity Catalog: Data governance and managed tables
- Lakeflow Declarative Pipelines: Declarative data pipelines
- Serverless Compute: Cost-effective processing
- Plotly Express: Accessible, interactive visualizations
- Synthetic Data: Faker
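The synthetic customers in 01_Data_Setup.py are generated with Faker. The sketch below shows that general approach; the field names, row count, and seed are chosen for illustration rather than taken from the notebook.

```python
import pandas as pd
from faker import Faker

Faker.seed(42)  # reproducible demo data
fake = Faker()

# Illustrative customer fields; the actual notebook defines its own schema.
customers = pd.DataFrame(
    {
        "customer_id": range(1, 11),
        "name": [fake.name() for _ in range(10)],
        "city": [fake.city() for _ in range(10)],
        "signup_date": [fake.date_between(start_date="-2y", end_date="today") for _ in range(10)],
    }
)
print(customers.head())
```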
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
© 2025 Databricks, Inc. All rights reserved. The source in this project is provided subject to the Databricks License [https://databricks.com/db-license-source]. All included or referenced third party libraries are subject to the licenses set forth below.
| Package | License | Copyright |
|---|---|---|
| plotly>=5.15.0 | MIT | Copyright (c) 2016-2023 Plotly, Inc |
| numpy>=1.21.0 | BSD-3-Clause | Copyright (c) 2005-2023, NumPy Developers |
| pandas>=1.5.0 | BSD-3-Clause | Copyright (c) 2008-2023, AQR Capital Management, LLC |
| scikit-learn>=1.3.0 | BSD-3-Clause | Copyright (c) 2007-2023 The scikit-learn developers |
| Faker | MIT | Copyright (c) 2012-2023 joke2k |
This project is licensed under the Databricks License - see the LICENSE file for details.
Please note that the code in this project is provided for your exploration only and is not formally supported by Databricks with Service Level Agreements (SLAs). It is provided AS-IS, and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of this project.