-
Notifications
You must be signed in to change notification settings - Fork 222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bug in clv.utils.rfm_train_test_split() #688
Comments
Hi @jdsemeyn, Thank you for raising this issue. Are you able to get a work around for this issue with the latest version of the package? Would you be interested in making this fix in a PR? It may be good to raise an error earlier on if the required columns don't exist! |
@wd60622 I'll look into doing a PR, it should be an easy fix. Not sure that there needs to be separate checking to see if the required columns don't exist since the function requires specifying the intended columns, so would throw an error anyway if those columns don't exist. |
Great! I just suggest that to fail earlier and to provide a less cryptic error message I am rereading your original message and see that you are using a 3rd party data set. Is there a renaming requirement for the datasets that pymc-marketing provides? reference |
The example dataset used for clv, cdnow_transactions, has the id column as "id" and the date column as "date", which is likely why the issue wasn't caught before; however, the function should be able to accept column names that don't exactly match "id" and "date" as it has you specify which column is the id column and which column is the date column and handles the renaming. A similar function existed and worked as expected with 3rd party data without renaming in both the lifetimes library and the btyd library. |
Are you able to use the |
Those parameters should accommodate my needs, and that's the crux of the issue. The function tries to rename a column explicitly named "id" and a column explicitly named "date" rather than the inputs |
Understood! thanks for catching me up to speed If you creating a PR, we will need a test case which seems to be missing at the moment |
When attempting to use the clv.utils.rfm_train_test_split() function, I ran into the following error:
KeyError: 'customer_id'
After testing it again with publicly available data (the online retail data from uci), I got the same error. Looking at the code, I believe it is being introduced by lines 565-567:
test_rfm_data = test_rfm_data.rename( columns={"id": "customer_id", "date": "test_frequency"} )
where "id" and "date" should instead be customer_id_col and datetime_col. As is, if your id col and date col are not named exactly "id" and "date", they will not be renamed properly, which causes the error when the code then attempts to merge with the monetary value (lines 579-584).
If a monetary_value is not provided, then the error occurs later in lines 589-591 when it tries to merge back to the training_rfm_data.
How to reproduce error:
use clv.utils.rfm_train_test_split() with dataset that does not have the customer_id column explicitly named "customer_id" and the date column not explicitly named "date".
The text was updated successfully, but these errors were encountered: