subset function cause corruption using 2.3.2 #2744

Closed
aprilffff opened this issue Feb 6, 2020 · 8 comments · Fixed by #2748 or #2754

@aprilffff

I have updated to version 2.3.2. During dataset construction, especially when I use dataset.subset, corruption occurs occasionally. It does not happen every time, but mostly on small datasets (fewer than 50k samples).

Also, dataset.add_features_from caused corruption at the same time.

Any idea on that?

@guolinke
Collaborator

guolinke commented Feb 6, 2020

Which code version did you use?

@aprilffff
Author

Which code version did you use?

2.3.2

@guolinke
Collaborator

guolinke commented Feb 7, 2020

I mean the git commit id, as we haven't officially released 2.3.2 yet.

@aprilffff
Author

dbb804f

@guolinke guolinke self-assigned this Feb 7, 2020
@guolinke
Collaborator

guolinke commented Feb 7, 2020

Thanks, I will investigate that. BTW, could you provide an example that causes the corruption? I haven't seen this corruption before, so it would be helpful for debugging.

@aprilffff
Author

Please use the attached files below.
testfile.zip

import numpy as np
import pandas as pd
import lightgbm as lgb

# Load the reference Dataset from its binary file
ref = lgb.Dataset('test_ref.bin')
data = pd.read_pickle('test_data.pkl').values
ds = lgb.Dataset(data, reference=ref).construct()

# The call below triggers the corruption
ds.subset(np.arange(10)).construct()

I thought it might be related to the reference. Without a reference, it won't crash.
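
Because the crash is intermittent, one way to measure how often it happens is to run the repro in a fresh child process many times and tally the exit codes. The following is only a minimal sketch under assumptions: the helper names `run_isolated` and `exit_code_histogram` are hypothetical, and the `REPRO` string assumes the attached `test_ref.bin` and `test_data.pkl` sit in the working directory. It uses only the standard library, so a native crash in the child does not take down the process doing the counting.

```python
import subprocess
import sys
from collections import Counter


def run_isolated(source: str, timeout: float = 120.0) -> int:
    """Run Python source in a fresh interpreter and return its exit code.

    On POSIX, a negative code -N means the child was killed by signal N
    (for example -11 for SIGSEGV); 0 means a clean exit.
    """
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        timeout=timeout,
    )
    return result.returncode


def exit_code_histogram(source: str, attempts: int = 20) -> Counter:
    """Run the same source repeatedly and count each exit code seen."""
    return Counter(run_isolated(source) for _ in range(attempts))


# The repro from this thread, as a string to hand to the child process.
# Assumption: the attached files are in the current working directory.
REPRO = """
import numpy as np
import pandas as pd
import lightgbm as lgb

ref = lgb.Dataset('test_ref.bin')
data = pd.read_pickle('test_data.pkl').values
ds = lgb.Dataset(data, reference=ref).construct()
ds.subset(np.arange(10)).construct()
"""
```

Calling `exit_code_histogram(REPRO)` would then show how many of the runs exited with 0 and how many ended abnormally. Swapping in a variant of `REPRO` that omits `reference=ref` gives a direct comparison for the with-reference versus without-reference observation.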

@guolinke
Collaborator

guolinke commented Feb 7, 2020

Was "test_ref.bin" generated by the same code version?
The binary format of Dataset was changed in recent commits.

@aprilffff
Author

Same version, for sure.

This was referenced Feb 7, 2020
@lock lock bot locked as resolved and limited conversation to collaborators Apr 15, 2020