Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug in clv.utils.rfm_train_test_split() #688

Closed
jdsemeyn opened this issue May 17, 2024 · 7 comments · Fixed by #698
Closed

bug in clv.utils.rfm_train_test_split() #688

jdsemeyn opened this issue May 17, 2024 · 7 comments · Fixed by #698
Labels
bug Something isn't working CLV good first issue Good for newcomers . Doesn't require extensive knowledge of the repo and package

Comments

@jdsemeyn
Copy link

When attempting to use the clv.utils.rfm_train_test_split() function, I ran into the following error:
KeyError: 'customer_id'
After testing it again with publicly available data (the online retail data from uci), I got the same error. Looking at the code, I believe it is being introduced by lines 565-567:
test_rfm_data = test_rfm_data.rename( columns={"id": "customer_id", "date": "test_frequency"} )
where "id" and "date" should instead be customer_id_col and datetime_col. As is, if your id col and date col are not named exactly "id" and "date", they will not be renamed properly, which causes the error when the code then attempts to merge with the monetary value (lines 579-584).
If a monetary_value is not provided, then the error occurs later in lines 589-591 when it tries to merge back to the training_rfm_data.

How to reproduce error:
use clv.utils.rfm_train_test_split() with dataset that does not have the customer_id column explicitly named "customer_id" and the date column not explicitly named "date".

@wd60622
Copy link
Contributor

wd60622 commented May 17, 2024

Hi @jdsemeyn,

Thank you for raising this issue. Are you able to get a work around for this issue with the latest version of the package?

Would you be interested in making this fix in a PR? It may be good to raise an error earlier on if the required columns don't exist!

@wd60622 wd60622 added bug Something isn't working good first issue Good for newcomers . Doesn't require extensive knowledge of the repo and package CLV labels May 17, 2024
@jdsemeyn
Copy link
Author

@wd60622
I can work around it by just setting the column names in my data to the ones used by the function.

I'll look into doing a PR, it should be an easy fix. Not sure that there needs to be separate checking to see if the required columns don't exist since the function requires specifying the intended columns, so would throw an error anyway if those columns don't exist.

@wd60622
Copy link
Contributor

wd60622 commented May 18, 2024

I'll look into doing a PR, it should be an easy fix. Not sure that there needs to be separate checking to see if the required columns don't exist since the function requires specifying the intended columns, so would throw an error anyway if those columns don't exist.

Great! I just suggest that to fail earlier and to provide a less cryptic error message

I am rereading your original message and see that you are using a 3rd party data set. Is there a renaming requirement for the datasets that pymc-marketing provides? reference

@jdsemeyn
Copy link
Author

The example dataset used for clv, cdnow_transactions, has the id column as "id" and the date column as "date", which is likely why the issue wasn't caught before; however, the function should be able to accept column names that don't exactly match "id" and "date" as it has you specify which column is the id column and which column is the date column and handles the renaming. A similar function existed and worked as expected with 3rd party data without renaming in both the lifetimes library and the btyd library.
Additionally, the rfm_summary() function has similar inputs and works as expected with 3rd party data, again without having to rename the columns prior to using the function.

@wd60622
Copy link
Contributor

wd60622 commented May 20, 2024

Are you able to use the customer_id_col and datetime_col arguments of the rfm_train_test_split function? Don't these parameters accommodate your needs?

@jdsemeyn
Copy link
Author

Those parameters should accommodate my needs, and that's the crux of the issue. The function tries to rename a column explicitly named "id" and a column explicitly named "date" rather than the inputs customer_id_col and datetime_col, which then causes the KeyError.
An error won't be raised when trying to rename the nonexistent "id" and "date" columns because the pandas .rename doesn't care if the things it is trying to rename exist or not.
The error is caused by the following code, which is lines 565-567 in clv utils:
test_rfm_data = test_rfm_data.rename(columns={"id": "customer_id", "date": "test_frequency"})

@wd60622
Copy link
Contributor

wd60622 commented May 20, 2024

Understood! thanks for catching me up to speed

If you creating a PR, we will need a test case which seems to be missing at the moment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working CLV good first issue Good for newcomers . Doesn't require extensive knowledge of the repo and package
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants