[python-package] Dataset construction hangs with high cardinality categorical features under 4.2.0/Pandas. #6273
Thanks for the excellent report! Are you interested in investigating this and trying to submit a bugfix? We'd be happy to answer any questions you have about how to develop on this repo. Also note... I've edited your post slightly to include Python syntax highlighting. See https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks if you're not familiar with how to do that.
Thanks. I do know the hang happens on this line.
Did you know that on GitHub, if you paste a raw link to a commit-anchored line in a file, it'll show a preview of the code? Check this out: LightGBM/python-package/lightgbm/basic.py, line 2218 at commit ef2a49c.
I've found that very useful.
Thanks for that link! It makes sense to me that the hang happens there. When you provide a DataFrame, the relevant handling is in LightGBM/python-package/lightgbm/basic.py, lines 1116 to 1121 at commit ef2a49c.
If you're seeing that the same data hangs when passed in as a DataFrame but not as a numpy array, try comparing the two objects right before Dataset construction. If they're identical, then the issue will probably be somewhere in the underlying Dataset construction. Are you interested in investigating that further?
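The comparison suggested above could be sketched like this (the array shapes, dtypes, and variable names are illustrative, not taken from the thread):

```python
import numpy as np
import pandas as pd

# Simulate the two input paths: a raw numpy array and the same
# data wrapped in a DataFrame, as in the bug report.
rng = np.random.default_rng(0)
X_np = rng.integers(0, 10_000, size=(1_000, 2)).astype(np.float64)
X_df = pd.DataFrame(X_np)

# If the values LightGBM receives from the DataFrame path are
# bitwise identical to the numpy path, the divergence must happen
# later, during native Dataset construction.
converted = X_df.to_numpy(dtype=np.float64)
print(np.array_equal(X_np, converted))
```

A mismatch here would point at the pandas conversion path instead.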
Yes, I'll dig some more based on your suggestions.
Thanks very much!
Thanks for the bug report. This was tricky to debug on our end: we had multiple pipelines running fine with 4.2 while others did not, and I didn't see a pattern 💡
In my case, the issue appears for both numpy and pandas inputs, and is reproducible under 4.2.0/4.3.0. The exact cardinality at which it triggers depends on many factors; I can't find a clear pattern.
Hi, I can see that this particular issue has been solved by #6394. However, I have run into the same issue on version 4.3 and was wondering when the next release of the package will be. If that's undetermined yet, would it be possible to apply that fix on my end?
Thanks for using LightGBM. You can subscribe to notifications on #6439 to be notified by GitHub when v4.4.0 is available, or "watch" this repo (GitHub docs) to be notified of every LightGBM release.
If you cannot wait for the release, you can pull the source code from GitHub and build the library yourself, following the directions in https://github.com/microsoft/LightGBM/blob/master/python-package/README.rst. If you encounter any issues building it yourself, please open a new issue and don't comment on this one. We'll be happy to help you there.
Thank you for the quick reply. This was helpful, and I was able to build the library on my end, which solved the problem.
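For reference, a from-source install like the one described above typically follows the python-package README; the commands are roughly along these lines (illustrative; check the README for platform-specific requirements such as CMake and a C++ compiler):

```shell
# Clone with submodules, then build and install the Python package
git clone --recursive https://github.com/microsoft/LightGBM.git
cd LightGBM
sh ./build-python.sh install
```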
I hit the hang on 4.3.0 too, after adding a stock code column as a categorical_feature. There is no log output at all, which makes it very hard to diagnose.
@eromoe thanks for using LightGBM. As I mentioned in #6273 (comment), there has not yet been a release containing this fix. You can subscribe to #6439 to be notified when that goes out.
Description
Under 4.2.0, Dataset construction hangs if X is a pandas DataFrame and there is a high-cardinality categorical feature (with enough rows for that to express itself). This does not occur under 4.1.0, nor under 4.2.0 if X is a plain numpy array.
Reproducible example
Under 4.1.0, the code completes with only warnings:
If X is not converted to a DataFrame, the code completes without warnings under either version.
Environment info
Python version: 3.11.6
LightGBM version or commit hash: 4.2.0
Command(s) you used to install LightGBM
Broken example:
Working example:
Additional Comments
I realize this is a contrived example and that high-cardinality categoricals are not necessarily best practice. But given the change in behavior, I wanted to raise the issue in case it points to an unexpected breaking change, and to ask whether there is an approach that would make this work with 4.2.0 and pandas.