[python] docs say categorical features must be encoded as int, but pandas categorical columns work #4676

Closed
leihuang opened this issue Oct 13, 2021 · 4 comments · Fixed by #5044
@leihuang
Currently the documentation on categorical feature support says that:

Categorical features must be encoded as non-negative integers

However, the code has been updated and now supports simply specifying a feature as having the pandas categorical dtype.

Reproducible example

LightGBM version or commit hash:

lightgbm 2.2.3

import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np

X, y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(X).iloc[:,:5]
X['x_'] = pd.Categorical(np.random.choice(['a','b'], size=X.shape[0]))

params = {
    'objective': 'binary',
    'use_missing': 'true',
    'is_unbalance': 'false',
    'boosting': 'gbdt',
    'random_state': None,
    'verbose': -1,
    'bagging_freq': 1,
    }

dat = lgb.Dataset(X, label=y, categorical_feature=['x_'])
clf = lgb.train(params=params, train_set=dat, 
                categorical_feature=['x_'], 
                num_boost_round=10)
@jameslamb added the doc label on Oct 13, 2021
@jameslamb
Collaborator

jameslamb commented Oct 13, 2021

Thanks very much @leihuang !

The document you've linked to is documentation for all of LightGBM, not only the Python package. lightgbm's Python package converts pandas categorical columns to an integer representation before passing the data to the C++ code used to create a lightgbm.Dataset object.

if len(cat_cols):  # cat_cols is list
    data = data.copy()  # not alter origin DataFrame
    data[cat_cols] = data[cat_cols].apply(lambda x: x.cat.codes).replace({-1: np.nan})

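pandas categorical columns expose an integer representation of their categories through the .cat.codes accessor (backed by the .codes attribute of pandas.Categorical): https://pandas.pydata.org/docs/reference/api/pandas.Categorical.codes.html. Missing values are encoded as -1, which is why the snippet above replaces -1 with np.nan.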
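
To make the conversion described above concrete, here is a minimal sketch that applies the same .cat.codes step to a toy DataFrame. It only uses pandas and NumPy and does not call LightGBM; the column names and values are made up for illustration, and the real package may differ in detail between versions.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative data): one categorical column with a missing value.
df = pd.DataFrame({
    "x_": pd.Categorical(["a", "b", None, "a"], categories=["a", "b"]),
    "num": [0.1, 0.2, 0.3, 0.4],
})

# Find the columns that use the pandas categorical dtype.
cat_cols = list(df.select_dtypes(include=["category"]).columns)

encoded = df.copy()  # leave the caller's DataFrame untouched
# .cat.codes gives each value's position in .cat.categories; missing values
# come back as -1, which is then replaced with NaN.
encoded[cat_cols] = (
    encoded[cat_cols].apply(lambda col: col.cat.codes).replace({-1: np.nan})
)

print(encoded)
```

After this step the "x_" column holds numeric codes instead of strings, with the missing entry kept as NaN rather than as a category of its own.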


I think it would be useful to add a note in a bullet point at https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html#categorical-feature-support that says something like this:

Some LightGBM wrappers in other languages may do this conversion for you. For example, the Python package converts pandas categorical columns into an integer representation.

Are you interested in contributing such a change?

@jameslamb changed the title from "Update the doc on categorical feature support" to "[python] docs say categorical features must be encoded as int, but pandas categorical columns work" on Oct 13, 2021
@StrikerRUS
Collaborator

@leihuang Thanks for noting this inconsistency!
We have already added a similar note to the categorical_feature parameter description:

Note: only supports categorical with int type (not applicable for data represented as pandas DataFrame in Python-package)
https://lightgbm.readthedocs.io/en/latest/Parameters.html#categorical_feature
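
For input that is not a pandas DataFrame (a plain NumPy array, for instance), the note above implies the integer encoding has to be done by the caller. The following is a rough sketch of one way to do that with np.unique; the helper name encode_categories and the sample data are hypothetical, and nothing here is taken from LightGBM's own code.

```python
import numpy as np

def encode_categories(values):
    """Hypothetical helper: map each distinct label to a non-negative integer code."""
    # np.unique returns the sorted distinct labels and, with return_inverse=True,
    # the index of each original value within that sorted array.
    categories, codes = np.unique(np.asarray(values), return_inverse=True)
    return codes, categories

raw = np.array(["b", "a", "b", "c"])       # illustrative string labels
numeric = np.array([0.5, 1.5, 2.5, 3.5])   # illustrative numeric feature

codes, categories = encode_categories(raw)

# Combine into one numeric matrix; the encoded column sits at index 1.
features = np.column_stack([numeric, codes])

# Assumption: this matrix could then be given to lgb.Dataset with
# categorical_feature=[1] so that column 1 is treated as categorical.
print(features)
print(categories)
```

Keeping the returned categories array around makes it possible to apply the same label-to-code mapping to new data at prediction time.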

@guolinke
Collaborator

guolinke commented Mar 2, 2022

I think this issue could be closed. Feel free to reopen if needed.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023