Warn when passing labels with missing values #4483

Closed
ChristophAymannsQC opened this issue Jul 22, 2021 · 11 comments


@ChristophAymannsQC
Contributor

Summary

Currently (it seems to me) missing values (np.nan) in the label are silently replaced by 0s when objective=regression (I haven't checked other objective functions), possibly related to this. It would be nice if this were made transparent to the user, who might be passing missing values inadvertently. Alternatively, observations with missing values could be dropped.

Minimal example

import lightgbm as lgb
import numpy as np
import pandas as pd
# %%
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 3 * x + rng.normal(size=100)
X = pd.DataFrame(data={"x": x})
y[rng.uniform(size=100) < 0.1] = np.nan  # set roughly 10% of the labels to NaN
# %%
# Before the Dataset is constructed, get_label() still returns the NaN values
data = lgb.Dataset(X, label=y)
print(data.get_label())
print(y)
# %%
eval_res = {}
# Train on the labels that still contain NaN and record the training loss
booster = lgb.train(
    params={"objective": "regression", "verbosity": -1},
    train_set=data,
    valid_sets=[data],
    verbose_eval=False,
    evals_result=eval_res,
)
starting_loss = eval_res["training"]["l2"][0]
# %%
# After training (which triggers Dataset construction), the NaN labels are now 0.0
print(data.get_label())
# %%
# Train again on labels where NaN has been explicitly replaced by 0
y_dash = y.copy()
y_dash[np.isnan(y)] = 0
data_dash = lgb.Dataset(X, label=y_dash)
# %%
eval_res_dash = {}
booster = lgb.train(
    params={"objective": "regression", "verbosity": -1},
    train_set=data_dash,
    valid_sets=[data_dash],
    verbose_eval=False,
    evals_result=eval_res_dash,
)
starting_loss_dash = eval_res_dash["training"]["l2"][0]
# %%
# Identical initial loss: the NaN labels were effectively treated as 0
assert starting_loss_dash == starting_loss
@kurchi1205

Hi, I want to make my first contribution. Can I work on this?

@ChristophAymannsQC
Contributor Author

Sure. Please do give it a shot.

@jameslamb
Collaborator

Thanks for starting this discussion, @ChristophAymannsQC !

@kurchi1205 before you get too far into trying to implement this, I think we should get maintainers' opinions on this question. I have some initial thoughts.

  1. If any change is made for this feature request, I have a strong preference for raising a warning instead of automatically dropping samples. Dropping samples brings a lot of complexity with it. For example, you'd have to also drop the corresponding data in init_score and weight if they're provided, you'd have to adjust used_indices, and for ranking tasks you'd have to carefully edit group.
  2. If any change is made for this feature request, it should be in dataset construction on the C++ side, not just in the Python package, so that users who use the LightGBM CLI, R package, and other interfaces would also benefit.

I'm also not totally convinced that the cost of this check would be worth the value of the warning. Picking this up as described would make every Dataset construction in LightGBM a bit slower.

I'm curious to hear what @StrikerRUS and @shiyu1994 think.
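
Regarding point 1 above: for a sense of the bookkeeping involved, here is a rough, self-contained sketch (hypothetical user-side code, not LightGBM internals; all variable names are illustrative) of what consistently dropping NaN-labeled samples would require once weight, init_score, and group are also provided:

import numpy as np
import pandas as pd

# Hypothetical illustration: dropping NaN-labeled samples means filtering
# every array that is aligned with the labels, not just the labels themselves.
rng = np.random.default_rng(0)
n = 100
X = pd.DataFrame({"x": rng.normal(size=n)})
y = rng.normal(size=n)
y[rng.uniform(size=n) < 0.1] = np.nan
weight = rng.uniform(size=n)           # optional per-sample weights
init_score = np.zeros(n)               # optional initial scores
group = np.array([40, 35, 25])         # query sizes for a ranking task

keep = ~np.isnan(y)
X_kept, y_kept = X[keep], y[keep]
weight_kept, init_score_kept = weight[keep], init_score[keep]

# `group` stores query sizes rather than per-row ids, so it has to be rebuilt:
# expand the sizes into per-row query ids, apply the mask, re-count,
# and drop any query that became empty.
query_id = np.repeat(np.arange(len(group)), group)
group_kept = np.bincount(query_id[keep], minlength=len(group))
group_kept = group_kept[group_kept > 0]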

@ChristophAymannsQC
Contributor Author

@jameslamb Thanks a lot for your thoughts. Didn't want to step on anyone's toes here. :)

@shiyu1994
Collaborator

I agree that we should raise a warning instead of dropping samples silently. At least, I cannot figure out when it would make sense to pass a sample with a NaN label into a supervised learning algorithm. Raising a warning is enough to inform users about what is happening, so that they can double-check their data and remove the samples with NaN labels manually.
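
As a concrete sketch of that manual clean-up (reusing X and y from the minimal example in the issue description; data_clean is just an illustrative name), a user can drop the NaN-labeled rows before building the Dataset:

import numpy as np
import lightgbm as lgb

# Keep only rows whose label is not NaN, then build the Dataset from those.
mask = ~np.isnan(y)
data_clean = lgb.Dataset(X[mask], label=y[mask])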

@shiyu1994
Collaborator

@ChristophAymannsQC Thank you for pointing that out!

@jameslamb
Collaborator

I'm also not totally convinced that the cost of this check would be worth the value of the warning. Picking this up as described would make every Dataset construction in LightGBM a bit slower.

^ @shiyu1994 what do you think about this? Do you think it wouldn't be too costly in Dataset construction, and that we should add this as a feature request?

@shiyu1994
Collaborator

I think checking for NaN in labels has little effect on the speed of Dataset construction, since the heavier part is parsing features and constructing bins. We can add this as a feature request.
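
As a rough Python-level illustration of what such a check could look like (the actual feature would live in the C++ Dataset construction code; the helper name and warning text below are hypothetical):

import warnings
import numpy as np

def _warn_on_nan_labels(labels):
    # One pass over the label array, which is cheap compared to parsing
    # features and constructing bins during Dataset construction.
    labels = np.asarray(labels, dtype=np.float64)
    n_nan = int(np.isnan(labels).sum())
    if n_nan > 0:
        warnings.warn(
            f"Found {n_nan} NaN value(s) in label. "
            "Consider removing these samples before constructing the Dataset."
        )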

@jameslamb
Collaborator

Ok thanks! I've added this to #2302 with other features.

@kurchi1205 are you still interested in contributing this? We'd love the help!

If you pick this up, please consider the following:

  • this change should be implemented on the C++ side, not just in the Python package, so all LightGBM interfaces would benefit from it
  • this should work for all types of raw data sources supported in LightGBM (loading from a file, loading from a sparse matrix, loading from a dense matrix)

Let us know here if you need some additional guidance.

@no-response

no-response bot commented Nov 13, 2021

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!
