RFM Segmentation (#680)
* init rfm_segments func

* TODOs

* docstrings and for loop

* docstrings and for loop

* WIP dev notebook debugging

* checkpoint commit for remote pull

* code testing in dev notebook

* unit tests added

* dev notebook cleanup

* clean up type hints

* comments and code cleanup

* docstrings

* move formatting to rfm_summary and quickstart edits

* fix rfm_train_test_split bug

* added test for rfm_quartile_labels

* added rfm score warning
ColtAllen authored and twiecki committed Sep 10, 2024

1 parent 690a79a commit 65f5eb3
Showing 5 changed files with 498 additions and 89 deletions.
8 changes: 5 additions & 3 deletions docs/source/notebooks/clv/clv_quickstart.ipynb
@@ -67,10 +67,10 @@
"* `customer_id` represents a unique identifier for each customer.\n",
"* `frequency` represents the number of _repeat_ purchases that a customer has made, i.e. one less than the total number of purchases.\n",
"* `T` represents a customer's \"age\", i.e. the duration between a customer's first purchase and the end of the period of study. In this example notebook, the units of time are in weeks.\n",
-"* `recency` represents the timepoint when a customer made their most recent purchase. This is also equal to the duration between a customer’s first non-repeat purchase (usually time 0) and last purchase. If a customer has made only 1 purchase, their recency is 0;\n",
+"* `recency` represents the time period when a customer made their most recent purchase. This is equal to the duration between a customer’s first and last purchase. If a customer has made only 1 purchase, their recency is 0.\n",
"* `monetary_value` represents the average value of a given customer’s repeat purchases. Customers who have only made a single purchase have monetary values of zero.\n",
"\n",
-"If working with raw transaction data, the `rfm_summary` function can be used to preprocess data for modeling:"
+"The `rfm_summary` function can be used to preprocess raw transaction data for modeling:"
]
},
{
@@ -339,6 +339,8 @@
"id": "514ee548",
"metadata": {},
"source": [
+"It is important to note that these definitions differ from those used in RFM segmentation, where the first purchase is included, `T` is not used, and `recency` is the number of time periods since a customer's most recent purchase.\n",
+"\n",
"To visualize data in RFM format, we can plot the recency and T of the customers with the `plot_customer_exposure` function. We see a large chunk (>60%) of customers haven't made another purchase in a while."
]
},
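Reviewer note: the two quickstart hunks above describe two different conventions for `frequency` and `recency`, one for modeling and one for RFM segmentation. The sketch below contrasts them on a handful of made-up transactions. It uses only the Python standard library and is not the `rfm_summary` implementation; the data, the `weeks_between` helper, and the `segment_*` key names are assumptions introduced for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw transactions as (customer_id, purchase_date) pairs.
transactions = [
    (1, date(2015, 1, 1)),
    (1, date(2015, 1, 15)),
    (1, date(2015, 2, 5)),
    (2, date(2015, 1, 8)),
    (3, date(2015, 1, 1)),
    (3, date(2015, 1, 8)),
]
period_end = date(2015, 2, 5)  # assumed end of the observation period


def weeks_between(start: date, end: date) -> float:
    """Elapsed time from start to end, in weeks (illustrative helper)."""
    return (end - start).days / 7


# Collect every purchase date per customer.
purchases = defaultdict(list)
for customer_id, purchase_date in transactions:
    purchases[customer_id].append(purchase_date)

summary = {}
for customer_id, dates in purchases.items():
    first, last = min(dates), max(dates)
    summary[customer_id] = {
        # Modeling convention: repeat purchases only, recency measured from the first purchase.
        "frequency": len(dates) - 1,
        "recency": weeks_between(first, last),
        "T": weeks_between(first, period_end),
        # Segmentation convention: first purchase counted, recency measured back from period end.
        "segment_frequency": len(dates),
        "segment_recency": weeks_between(last, period_end),
    }

for customer_id, row in sorted(summary.items()):
    print(customer_id, row)
```

Customer 2 made a single purchase, so under the modeling convention both `frequency` and `recency` are 0, while the segmentation-style recency is 4 weeks because that purchase happened four weeks before the period end.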
@@ -2579,7 +2581,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.9.18"
+"version": "3.10.14"
},
"toc": {
"base_numbering": 1,
186 changes: 151 additions & 35 deletions docs/source/notebooks/clv/dev/utilities_plotting.ipynb
@@ -5,15 +5,7 @@
"execution_count": 1,
"id": "435ed203-5c3c-4efc-93d1-abac66ce7187",
"metadata": {},
-"outputs": [
-{
-"name": "stderr",
-"output_type": "stream",
-"text": [
-"WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.\n"
-]
-}
-],
+"outputs": [],
"source": [
"from pymc_marketing.clv import utils\n",
"\n",
@@ -30,7 +22,7 @@
},
{
"cell_type": "code",
-"execution_count": 69,
+"execution_count": 2,
"id": "7de7f396-1d5b-4457-916b-c29ed90aa132",
"metadata": {},
"outputs": [],
@@ -66,7 +58,7 @@
},
{
"cell_type": "code",
-"execution_count": 70,
+"execution_count": 3,
"id": "932e8db6-78cf-49df-aa4a-83ee6584e5dd",
"metadata": {},
"outputs": [
@@ -196,7 +188,7 @@
"13 6 2015-02-02 True"
]
},
-"execution_count": 70,
+"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
@@ -223,7 +215,7 @@
},
{
"cell_type": "code",
-"execution_count": 74,
+"execution_count": 4,
"id": "4c0a7de5-8825-40af-84e5-6cd0ad26a0e3",
"metadata": {},
"outputs": [
@@ -259,57 +251,57 @@
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
-" <td>1.0</td>\n",
+" <td>2.0</td>\n",
" <td>5.0</td>\n",
" <td>5.0</td>\n",
-" <td>2.0</td>\n",
+" <td>1.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
-" <td>0.0</td>\n",
+" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>5.0</td>\n",
-" <td>0.0</td>\n",
+" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
-" <td>1.0</td>\n",
+" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>5.0</td>\n",
-" <td>5.0</td>\n",
+" <td>4.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
-" <td>1.0</td>\n",
+" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
-" <td>8.0</td>\n",
+" <td>7.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
-" <td>0.0</td>\n",
+" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>3.0</td>\n",
-" <td>0.0</td>\n",
+" <td>12.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" customer_id frequency recency T monetary_value\n",
-"0 1 1.0 5.0 5.0 2.0\n",
-"1 2 0.0 0.0 5.0 0.0\n",
-"2 3 1.0 1.0 5.0 5.0\n",
-"3 4 1.0 3.0 3.0 8.0\n",
-"4 5 0.0 0.0 3.0 0.0"
+"0 1 2.0 5.0 5.0 1.5\n",
+"1 2 1.0 0.0 5.0 2.0\n",
+"2 3 2.0 1.0 5.0 4.5\n",
+"3 4 2.0 3.0 3.0 7.0\n",
+"4 5 1.0 0.0 3.0 12.0"
]
},
-"execution_count": 74,
+"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
@@ -323,7 +315,7 @@
" observation_period_end = \"2015-02-06\",\n",
" datetime_format = \"%Y-%m-%d\",\n",
" time_unit = \"W\",\n",
-" include_first_transaction=False,\n",
+" include_first_transaction=True,\n",
")\n",
"\n",
"rfm_df.head()"
@@ -339,7 +331,7 @@
},
{
"cell_type": "code",
-"execution_count": 76,
+"execution_count": 5,
"id": "761edfe9-1b69-4966-83bf-4f1242eda2d5",
"metadata": {},
"outputs": [
@@ -450,7 +442,7 @@
"4 0.0 5.0 "
]
},
-"execution_count": 76,
+"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
@@ -467,13 +459,137 @@
"train_test.head()"
]
},
+{
+"cell_type": "markdown",
+"id": "73dc1b93-6a4f-4171-b838-30759b2c1e0e",
+"metadata": {},
+"source": [
+"`rfm_segments` assigns customers to segments based on their recency, frequency, and monetary value. It uses a quartile-based RFM score approach that is very computationally efficient, but defining custom segments is a rather subjective exercise. The returned dataframe also cannot be used for modeling because it does not zero out the initial transactions."
+]
+},
{
"cell_type": "code",
-"execution_count": null,
+"execution_count": 40,
"id": "c7b3f800-8dfb-4e5a-b939-5f908281563c",
"metadata": {},
"outputs": [],
-"source": []
+"source": [
+"segments = utils.rfm_segments(\n",
+" test_data, \n",
+" customer_id_col = \"id\", \n",
+" datetime_col = \"date\", \n",
+" monetary_value_col = \"monetary_value\",\n",
+" observation_period_end = \"2015-02-06\",\n",
+" datetime_format = \"%Y-%m-%d\",\n",
+" time_unit = \"W\",\n",
+")"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 17,
+"id": "932ac4e5-361e-42fa-97d3-d8e508128944",
+"metadata": {},
+"outputs": [
+{
+"data": {
+"text/html": [
+"<div>\n",
+"<style scoped>\n",
+" .dataframe tbody tr th:only-of-type {\n",
+" vertical-align: middle;\n",
+" }\n",
+"\n",
+" .dataframe tbody tr th {\n",
+" vertical-align: top;\n",
+" }\n",
+"\n",
+" .dataframe thead th {\n",
+" text-align: right;\n",
+" }\n",
+"</style>\n",
+"<table border=\"1\" class=\"dataframe\">\n",
+" <thead>\n",
+" <tr style=\"text-align: right;\">\n",
+" <th></th>\n",
+" <th>customer_id</th>\n",
+" <th>frequency</th>\n",
+" <th>recency</th>\n",
+" <th>monetary_value</th>\n",
+" <th>segment</th>\n",
+" </tr>\n",
+" </thead>\n",
+" <tbody>\n",
+" <tr>\n",
+" <th>0</th>\n",
+" <td>1</td>\n",
+" <td>2.0</td>\n",
+" <td>0.0</td>\n",
+" <td>1.5</td>\n",
+" <td>Other</td>\n",
+" </tr>\n",
+" <tr>\n",
+" <th>1</th>\n",
+" <td>2</td>\n",
+" <td>1.0</td>\n",
+" <td>5.0</td>\n",
+" <td>2.0</td>\n",
+" <td>Inactive Customer</td>\n",
+" </tr>\n",
+" <tr>\n",
+" <th>2</th>\n",
+" <td>3</td>\n",
+" <td>2.0</td>\n",
+" <td>4.0</td>\n",
+" <td>4.5</td>\n",
+" <td>At Risk Customer</td>\n",
+" </tr>\n",
+" <tr>\n",
+" <th>3</th>\n",
+" <td>4</td>\n",
+" <td>2.0</td>\n",
+" <td>0.0</td>\n",
+" <td>7.0</td>\n",
+" <td>Top Spender</td>\n",
+" </tr>\n",
+" <tr>\n",
+" <th>4</th>\n",
+" <td>5</td>\n",
+" <td>1.0</td>\n",
+" <td>3.0</td>\n",
+" <td>12.0</td>\n",
+" <td>At Risk Customer</td>\n",
+" </tr>\n",
+" <tr>\n",
+" <th>5</th>\n",
+" <td>6</td>\n",
+" <td>1.0</td>\n",
+" <td>0.0</td>\n",
+" <td>5.0</td>\n",
+" <td>Top Spender</td>\n",
+" </tr>\n",
+" </tbody>\n",
+"</table>\n",
+"</div>"
+],
+"text/plain": [
+" customer_id frequency recency monetary_value segment\n",
+"0 1 2.0 0.0 1.5 Other\n",
+"1 2 1.0 5.0 2.0 Inactive Customer\n",
+"2 3 2.0 4.0 4.5 At Risk Customer\n",
+"3 4 2.0 0.0 7.0 Top Spender\n",
+"4 5 1.0 3.0 12.0 At Risk Customer\n",
+"5 6 1.0 0.0 5.0 Top Spender"
+]
+},
+"execution_count": 17,
+"metadata": {},
+"output_type": "execute_result"
+}
+],
+"source": [
+"segments"
+]
}
],
"metadata": {
@@ -492,7 +608,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.9.18"
+"version": "3.10.14"
}
},
"nbformat": 4,
2 changes: 2 additions & 0 deletions pymc_marketing/clv/__init__.py
@@ -25,6 +25,7 @@
)
from pymc_marketing.clv.utils import (
customer_lifetime_value,
+rfm_segments,
rfm_summary,
rfm_train_test_split,
)
@@ -39,6 +40,7 @@
"plot_customer_exposure",
"plot_frequency_recency_matrix",
"plot_probability_alive_matrix",
+"rfm_segments",
"rfm_summary",
"rfm_train_test_split",
)
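
Reviewer note: the dev notebook text in this commit describes `rfm_segments` as a quartile-based RFM score approach. The sketch below shows one way such scoring can work, assuming each metric is cut at its quartile boundaries, scored 1 to 4, and the three scores are concatenated. It is a standard-library illustration, not the code added to `pymc_marketing/clv/utils.py`. The customer data, the `quartile_scores` and `label` helpers, and the segment rules are all hypothetical; only the segment names are borrowed from the notebook output.

```python
from bisect import bisect_left
from statistics import quantiles

# Hypothetical segmentation-style inputs: customer_id -> (recency, frequency, monetary_value).
# Recency is time since the most recent purchase, so smaller values are better.
customers = {
    1: (0, 8, 10.0),
    2: (1, 7, 80.0),
    3: (2, 6, 30.0),
    4: (3, 5, 40.0),
    5: (4, 4, 50.0),
    6: (5, 3, 60.0),
    7: (6, 2, 70.0),
    8: (7, 1, 20.0),
}


def quartile_scores(values, reverse=False):
    """Score each value 1-4 by quartile; reverse=True gives low values the high scores."""
    cuts = quantiles(values, n=4, method="inclusive")  # three quartile boundaries
    # A value equal to a boundary falls in the lower quartile (right-inclusive bins).
    scores = [1 + bisect_left(cuts, v) for v in values]
    return [5 - s for s in scores] if reverse else scores


def label(r, f, m):
    """Hypothetical segment rules; choosing them is a subjective exercise."""
    if m == 4 and r >= 3:
        return "Top Spender"
    if r == 1 and f <= 2:
        return "Inactive Customer"
    if r == 2:
        return "At Risk Customer"
    return "Other"


ids = list(customers)
recency, frequency, monetary = zip(*customers.values())
r_scores = quartile_scores(recency, reverse=True)  # recent buyers score high
f_scores = quartile_scores(frequency)
m_scores = quartile_scores(monetary)

# Map each customer to (three-digit RFM score, segment label).
segments = {
    cid: (f"{r}{f}{m}", label(r, f, m))
    for cid, r, f, m in zip(ids, r_scores, f_scores, m_scores)
}

for cid, (score, segment) in segments.items():
    print(cid, score, segment)
```

With few customers or many tied values, quartile boundaries can coincide and some scores may never be assigned, which is one reason a sketch like this should be checked against the real data before its segments are trusted.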