[python-package] Dataset construction hangs with high cardinality categorical features under 4.2.0/Pandas. #6273
Thanks for the excellent report! Are you interested in investigating this and trying to submit a bugfix? We'd be happy to answer any questions you have about how to develop on this repo. Also note... I've edited your post slightly to include Python syntax highlighting. See https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks if you're not familiar with how to do that.
Thanks. I do know the hang happens on this line.
Did you know that on GitHub, if you paste a raw link to a commit-anchored line in a file, it'll show a preview of the code? Check this out: LightGBM/python-package/lightgbm/basic.py, line 2218 at commit ef2a49c.
I've found that very useful.
Thanks for that link! It makes sense to me that the hang happens there. When you provide a DataFrame, the relevant handling is in LightGBM/python-package/lightgbm/basic.py, lines 1116 to 1121 at commit ef2a49c.
If you're seeing that the same data hangs when passed in as a DataFrame but not as a numpy array, try comparing the two objects right before Dataset construction. If they're identical, then the issue will probably be somewhere in the underlying Dataset construction. Are you interested in investigating that further?
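The comparison suggested above could be sketched like this (the array shapes, dtypes, and variable names are illustrative, not taken from the thread):

```python
import numpy as np
import pandas as pd

# Simulate the two input paths: a raw numpy array and the same
# data wrapped in a DataFrame, as in the bug report.
rng = np.random.default_rng(0)
X_np = rng.integers(0, 10_000, size=(1_000, 2)).astype(np.float64)
X_df = pd.DataFrame(X_np)

# If the values LightGBM receives from the DataFrame path are
# bitwise identical to the numpy path, the divergence must happen
# later, during native Dataset construction.
converted = X_df.to_numpy(dtype=np.float64)
print(np.array_equal(X_np, converted))
```

A mismatch here would point at the pandas conversion path instead.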
Yes, I'll dig some more based on your suggestions.
Thanks very much!
Thanks for the bug report. This was tricky to debug on our end: we had multiple pipelines running fine with 4.2 while others did not, and I didn't see a pattern 💡
In my case, the issue appears for both numpy and pandas inputs, and is reproducible under 4.2.0/4.3.0. The exact cardinality at which it triggers depends on many factors; I can't find a clear pattern.
Hi, I can see that this particular issue has been solved by #6394. However, I have run into the same issue on version 4.3 and was wondering when the next release of the package will be. If that's undetermined yet, would it be possible to apply that fix on my end?
Thanks for using LightGBM. You can subscribe to notifications on #6439 to be notified by GitHub when v4.4.0 is available, or "watch" this repo (GitHub docs) to be notified of every LightGBM release.
If you cannot wait for the release, you can pull the source code from GitHub and build the library yourself, following the directions in https://github.com/microsoft/LightGBM/blob/master/python-package/README.rst. If you encounter any issues building it yourself, please open a new issue and don't comment on this one. We'll be happy to help you there.
Thank you for the quick reply. This was helpful, and I was able to build the library on my end, which solved the problem.
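For reference, a from-source install like the one described above typically follows the python-package README; the commands are roughly along these lines (illustrative; check the README for platform-specific requirements such as CMake and a C++ compiler):

```shell
# Clone with submodules, then build and install the Python package
git clone --recursive https://github.com/microsoft/LightGBM.git
cd LightGBM
sh ./build-python.sh install
```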
I hit the hang on 4.3.0 too, after adding a stock code column as a categorical_feature. There is no log output at all, which makes it very hard to diagnose.
@eromoe thanks for using LightGBM. As I mentioned in #6273 (comment), there has not yet been a release containing this fix. You can subscribe to #6439 to be notified when that goes out.
Description
Under 4.2.0, Dataset construction hangs if X is a pandas DataFrame and there is a high-cardinality categorical feature (with enough rows for that to express itself). This does not occur under 4.1.0, nor under 4.2.0 if X is a plain numpy array.
Reproducible example
Under 4.1.0, the code completes with only warnings:
If X is not converted to a DataFrame, the code completes without warnings under either version.
Environment info
Python version: 3.11.6
LightGBM version or commit hash: 4.2.0
Command(s) you used to install LightGBM
Broken example:
Working example:
Additional Comments
I realize this is a contrived example and that high-cardinality categoricals are not necessarily best practice. But given the change in behavior, I wanted to raise the issue in case it points to an unexpected breaking change, and to ask whether there is an approach that would make this work with 4.2.0 and pandas.