[questions] How to properly deal with categorical variables #4932

Closed
HarryAtDelphia opened this issue Jan 6, 2022 · 5 comments · Fixed by #4959

@HarryAtDelphia

According to the official documentation (https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html), categorical features should be encoded as non-negative integers in LightGBM. I encoded the categorical features as non-negative integers using OrdinalEncoder, but when I convert the pandas DataFrame to a NumPy array, those features are converted to float. Can LightGBM properly treat categorical features that are stored as floats? What is the best way to deal with categorical features?

I am using the sklearn API. The versions are Python 3.8, LightGBM 3.3.1, pandas 1.1.5, numpy 1.19.5 and scikit-learn 1.0.1.

Here is a sample of my code. Thank you for your help.

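import numpy as np
from sklearn import preprocessing

# Encode each categorical column as integer codes; unseen categories map to 999.
oe = preprocessing.OrdinalEncoder(dtype=int, handle_unknown='use_encoded_value', unknown_value=999)
# fit_transform returns the encoded values; fit alone returns the encoder itself.
feature_df[categorical_features] = oe.fit_transform(feature_df[categorical_features].astype(str))

# Converting to a NumPy array can upcast the integer codes to float.
X = np.array(feature_df)
y = np.squeeze(np.array(target_df))
model.fit(X, y, sample_weight=weights,
          categorical_feature=categorical_features,
          feature_name=feature_features)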
@jmoralez
Collaborator

Hi @HarryAtDelphia. You don't have to worry about the categoricals being floats as long as you tell lightgbm that those features are meant to be treated as categoricals. Here's an example:

import lightgbm as lgb
import numpy as np


n_samples = 1_000
n_categoricals = 2
n_continuous = 2
categoricals = np.random.randint(0, 20, size=(n_samples, n_categoricals))
continuous = np.random.rand(n_samples, n_continuous)
X = np.hstack([categoricals, continuous])
print(X.dtype)  # float64
y = (X[:, 0] == 10) * X[:, -1]
model = lgb.LGBMRegressor(
    n_estimators=1,
    num_leaves=15,
    categorical_feature=np.arange(n_categoricals),
)
model.fit(X, y)
lgb.plot_tree(model)

which yields the following:
[image: plot of the fitted tree]

You can see here that the first split is asking whether the first feature is either 0, 10 or 12 (i.e. treating that feature as categorical).
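
As a side note on why the float conversion is harmless for the codes themselves, here is a minimal sketch using only NumPy. The array names and values are made up for illustration; the only thing it shows is that integer codes below the int32 limit survive a round trip through float64 unchanged. It does not call LightGBM.

```python
import numpy as np

# Hypothetical integer category codes, as an ordinal encoder might produce them.
codes = np.array([0, 3, 19, 999, 2_147_483_646], dtype=np.int64)
continuous = np.array([0.25, 0.5, 0.75, 1.5, 2.5])

# Stacking an integer column with a float column upcasts everything to float64.
X = np.column_stack([codes, continuous])
print(X.dtype)  # float64

# Casting the categorical column back to int recovers the original codes exactly.
recovered = X[:, 0].astype(np.int64)
print(np.array_equal(recovered, codes))  # True
```

float64 represents every integer up to 2**53 exactly, so any code that fits in an int32 is stored without rounding.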

@HarryAtDelphia
Author

Thank you so much @jmoralez. This really solves my puzzle.

@jmoralez
Collaborator

jmoralez commented Jan 11, 2022

Thanks for raising this @HarryAtDelphia, I believe we should clarify this in the docs. Do you think that changing

Categorical features must be encoded as non-negative integers (int) less than Int32.MaxValue (2147483647).

to

Categorical features will be converted to int so they must be encoded as non-negative integers (negative values will be treated as missing) less than the maximum int32 value (2147483647).

makes it a bit clearer?
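
To make the proposed wording concrete, here is a rough sketch of a pre-flight check on an encoded column. The helper `check_categorical_codes` is hypothetical and not part of LightGBM; it only counts values that fall outside the constraints quoted above (integer-valued, non-negative, below the maximum int32 value).

```python
import numpy as np

INT32_MAX = np.iinfo(np.int32).max  # 2147483647

def check_categorical_codes(column):
    """Hypothetical helper: summarize how a column fits the quoted constraints."""
    column = np.asarray(column, dtype=np.float64)
    finite = column[~np.isnan(column)]
    return {
        # Codes stored as floats should still be whole numbers.
        "all_integer_valued": bool(np.all(finite == np.floor(finite))),
        # Per the proposed wording, negative values are treated as missing.
        "n_negative": int(np.sum(finite < 0)),
        # Codes must be less than the maximum int32 value.
        "n_too_large": int(np.sum(finite >= INT32_MAX)),
        "n_nan": int(np.sum(np.isnan(column))),
    }

report = check_categorical_codes([0.0, 3.0, 19.0, -1.0, np.nan])
print(report)
# {'all_integer_valued': True, 'n_negative': 1, 'n_too_large': 0, 'n_nan': 1}
```

Running a check like this before fitting would flag, for example, a sentinel such as -1 that would otherwise be silently treated as missing.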

@HarryAtDelphia
Author

I agree with you @jmoralez. This is a clearer introduction to categorical features. I am sure it will help more people who are new to LightGBM.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023