[python-package] LightGBM predict_proba() corrupts pandas categorical columns with unseen values #6195

Closed
fingoldo opened this issue Nov 15, 2023 · 2 comments · Fixed by #6218

fingoldo commented Nov 15, 2023

Description

In predict_proba of LGBMClassifier (at least), if the input is a pandas DataFrame with a categorical column that contains a value not seen during fitting, the entire column is corrupted in place.

Some might argue it's not important, but this behaviour is undocumented and unexpected, and it took me a lot of time to detect. It led to nulls appearing out of nowhere in a chain of models making predictions on the same data. IMHO no model should change its inputs (and if there are performance reasons to do so, then at least not without a special flag explicitly set).

Reproducible example

import lightgbm
import pandas as pd, numpy as np
from lightgbm import LGBMClassifier

nsamples=50

X_train = pd.DataFrame(np.random.random(size=(nsamples, 4)))
X_train["cat"] = np.random.choice(["a", "b"], size=nsamples, replace=True)
X_train['cat']=X_train['cat'].astype('category')

est=LGBMClassifier(verbose=0)
est.fit(X_train, np.random.randint(0, 2, size=nsamples))

X_test = pd.DataFrame(np.random.random(size=(nsamples, 4)))
X_test["cat"] = np.random.choice(["a", "c"], size=nsamples, replace=True) # note that c is unseen before
X_test["cat"] = X_test["cat"].astype("category")
print(X_test["cat"].value_counts(dropna=False)) # and it's retained

#cat
#a    30
#c    20
#Name: count, dtype: int64

est.predict_proba(X_test) # but not after predict_proba
print(X_test["cat"].value_counts(dropna=False)) # note that the values in the cat column have been corrupted altogether

#cat
#a      25
#NaN    25
#b       0
#Name: count, dtype: int64
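
A check along the following lines could catch this kind of in-place change early. It is a minimal sketch that needs only pandas: `call_and_check_unchanged`, `well_behaved`, and `mutating` are hypothetical names used for illustration, and `mutating` only imitates the effect reported above rather than calling LightGBM.

```python
import pandas as pd


def call_and_check_unchanged(func, df):
    """Call func(df) and raise AssertionError if df was modified in place."""
    snapshot = df.copy(deep=True)  # independent copy taken before the call
    result = func(df)
    # Compares values, dtypes, and the categories of categorical columns.
    pd.testing.assert_frame_equal(df, snapshot)
    return result


def well_behaved(df):
    # Works on a copy, so the caller's frame is untouched.
    return len(df.copy())


def mutating(df):
    # Overwrites a column on the caller's frame, similar in effect to the bug.
    df["cat"] = df["cat"].cat.set_categories(["a", "b"])
    return len(df)


frame = pd.DataFrame({"cat": pd.Categorical(["a", "c", "a"])})
call_and_check_unchanged(well_behaved, frame)  # passes silently

try:
    call_and_check_unchanged(mutating, frame)
    mutation_detected = False
except AssertionError:
    mutation_detected = True
```

With the estimator from the example above, the equivalent call would be `call_and_check_unchanged(est.predict_proba, X_test)`.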

Environment info

print(np.version.version,pd.__version__,lightgbm.__version__)

1.24.4 2.0.3 4.1.0
OS=Windows

Command(s) you used to install LightGBM

pip install lightgbm

Additional Comments

@jameslamb jameslamb changed the title LightGBM corrupts categorical columns with unseen values on prediction [python-package] LightGBM predict_proba() corrupts pandas categorical columns with unseen values Nov 15, 2023
@jameslamb jameslamb added the bug label Nov 15, 2023
@jmoralez
Collaborator

Hey @fingoldo, thanks for using LightGBM and sorry for the troubles. We used to take a shallow copy there, but it wasn't obvious that the predict step depended on it, and a recent refactor removed it. We'll work on a fix.
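
To illustrate why a shallow copy is enough here, the sketch below uses pandas only. The LightGBM internals are not shown in this thread, so this is an assumption: the prediction-time preprocessing is modelled as re-aligning the categorical column to the training categories with `set_categories`, which turns unseen values into NaN. The functions `align_in_place` and `align_on_shallow_copy` are hypothetical stand-ins, not LightGBM functions.

```python
import pandas as pd

train_categories = ["a", "b"]  # categories seen while fitting


def align_in_place(df):
    # Hypothetical stand-in for the buggy path: writes to the caller's frame.
    df["cat"] = df["cat"].cat.set_categories(train_categories)
    return df


def align_on_shallow_copy(df):
    # Hypothetical stand-in for the fixed path: the new column is attached
    # to a shallow copy, so the caller's frame keeps its original column.
    df = df.copy(deep=False)
    df["cat"] = df["cat"].cat.set_categories(train_categories)
    return df


original = pd.DataFrame({"cat": pd.Categorical(["a", "c", "a", "c"])})

aligned = align_on_shallow_copy(original)
nans_in_original_after_copy_path = int(original["cat"].isna().sum())
nans_in_aligned = int(aligned["cat"].isna().sum())

align_in_place(original)
nans_in_original_after_in_place_path = int(original["cat"].isna().sum())
```

The shallow copy does not duplicate the column data; it only gives the function its own frame to attach the re-aligned column to, which is why it is cheap compared with a deep copy.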


fingoldo commented Nov 15, 2023

Thank you so much Jose, that's what I call a fast turnaround! ;-) For now I just pass a .copy() of the dataframe to LightGBM, so the other models in the ensemble are not affected.
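
The workaround could be wrapped once so that every call gets its own copy. This is only a sketch: `CopyingPredictor` is a hypothetical helper, and `MutatingModel` is a stand-in that imitates an estimator modifying its input, so the example runs without LightGBM installed.

```python
import pandas as pd


class CopyingPredictor:
    """Hypothetical wrapper: hands the wrapped model a private copy of the input."""

    def __init__(self, model):
        self.model = model

    def predict_proba(self, X):
        # X.copy() is a deep copy by default, so in-place edits made by the
        # wrapped model never reach the caller's DataFrame.
        return self.model.predict_proba(X.copy())


class MutatingModel:
    """Stand-in for an estimator that modifies its input while predicting."""

    def predict_proba(self, X):
        X["cat"] = X["cat"].cat.set_categories(["a", "b"])
        return [[0.5, 0.5]] * len(X)


shared_input = pd.DataFrame({"cat": pd.Categorical(["a", "c", "a"])})
probabilities = CopyingPredictor(MutatingModel()).predict_proba(shared_input)
```

With the estimator from the reproducible example this would be `CopyingPredictor(est).predict_proba(X_test)`. The price is one extra copy of the input per call.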
