Warn when passing labels with missing values #4483

Closed
ChristophAymannsQC opened this issue Jul 22, 2021 · 11 comments


@ChristophAymannsQC
Contributor

Summary

Currently (it seems to me) missing values (np.nan) in the label are silently replaced by 0s when objective=regression (I haven't checked other objective functions), possibly related to this. It would be nice if this were made transparent to the user, who might be passing missing values inadvertently. Alternatively, observations with missing values could be dropped.

Minimal example

import lightgbm as lgb
import numpy as np
import pandas as pd
# %%
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 3 * x + rng.normal(size=100)
X = pd.DataFrame(data={"x": x})
y[rng.uniform(size=100) < 0.1] = np.nan  # set roughly 10% of the labels to NaN
# %%
# Before the Dataset is constructed, get_label() still returns the NaN values
data = lgb.Dataset(X, label=y)
print(data.get_label())
print(y)
# %%
eval_res = {}
# Train on the labels that still contain NaN and record the training loss
booster = lgb.train(
    params={"objective": "regression", "verbosity": -1},
    train_set=data,
    valid_sets=[data],
    verbose_eval=False,
    evals_result=eval_res,
)
starting_loss = eval_res["training"]["l2"][0]
# %%
# After training (which triggers Dataset construction), the NaN labels are now 0.0
print(data.get_label())
# %%
# Train again on labels where NaN has been explicitly replaced by 0
y_dash = y.copy()
y_dash[np.isnan(y)] = 0
data_dash = lgb.Dataset(X, label=y_dash)
# %%
eval_res_dash = {}
booster = lgb.train(
    params={"objective": "regression", "verbosity": -1},
    train_set=data_dash,
    valid_sets=[data_dash],
    verbose_eval=False,
    evals_result=eval_res_dash,
)
starting_loss_dash = eval_res_dash["training"]["l2"][0]
# %%
# Identical initial loss: the NaN labels were effectively treated as 0
assert starting_loss_dash == starting_loss
@kurchi1205

Hi, I want to make my first contribution. Can I work on this?

@ChristophAymannsQC
Contributor Author

Sure. Please do give it a shot.

@jameslamb
Collaborator

Thanks for starting this discussion, @ChristophAymannsQC !

@kurchi1205 before you get too far into trying to implement this, I think we should get maintainers' opinions on this question. I have some initial thoughts.

  1. If any change is made for this feature request, I have a strong preference for raising a warning instead of automatically dropping samples. Dropping samples brings a lot of complexity with it. For example, you'd have to also drop the corresponding data in init_score and weight if they're provided, you'd have to adjust used_indices, and for ranking tasks you'd have to carefully edit group.
  2. If any change is made for this feature request, it should be in dataset construction on the C++ side, not just in the Python package, so that users who use the LightGBM CLI, R package, and other interfaces would also benefit.

I'm also not totally convinced that the cost of this check would be worth the value of the warning. Picking this up as described would make every Dataset construction in LightGBM a bit slower.

I'm curious to hear what @StrikerRUS and @shiyu1994 think.
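
Regarding point 1 above: for a sense of the bookkeeping involved, here is a rough, self-contained sketch (hypothetical user-side code, not LightGBM internals; all variable names are illustrative) of what consistently dropping NaN-labeled samples would require once weight, init_score, and group are also provided:

import numpy as np
import pandas as pd

# Hypothetical illustration: dropping NaN-labeled samples means filtering
# every array that is aligned with the labels, not just the labels themselves.
rng = np.random.default_rng(0)
n = 100
X = pd.DataFrame({"x": rng.normal(size=n)})
y = rng.normal(size=n)
y[rng.uniform(size=n) < 0.1] = np.nan
weight = rng.uniform(size=n)           # optional per-sample weights
init_score = np.zeros(n)               # optional initial scores
group = np.array([40, 35, 25])         # query sizes for a ranking task

keep = ~np.isnan(y)
X_kept, y_kept = X[keep], y[keep]
weight_kept, init_score_kept = weight[keep], init_score[keep]

# `group` stores query sizes rather than per-row ids, so it has to be rebuilt:
# expand the sizes into per-row query ids, apply the mask, re-count,
# and drop any query that became empty.
query_id = np.repeat(np.arange(len(group)), group)
group_kept = np.bincount(query_id[keep], minlength=len(group))
group_kept = group_kept[group_kept > 0]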

@ChristophAymannsQC
Contributor Author

@jameslamb Thanks a lot for your thoughts. Didn't want to step on anyone's toes here. :)

@shiyu1994
Collaborator

I agree that we should raise a warning instead of dropping samples silently. At least, I cannot figure out when it would make sense to pass a sample with a NaN label into a supervised learning algorithm. Raising a warning is enough to inform users about what is happening, so that they can double-check their data and remove the samples with NaN labels manually.
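
As a concrete sketch of that manual clean-up (reusing X and y from the minimal example in the issue description; data_clean is just an illustrative name), a user can drop the NaN-labeled rows before building the Dataset:

import numpy as np
import lightgbm as lgb

# Keep only rows whose label is not NaN, then build the Dataset from those.
mask = ~np.isnan(y)
data_clean = lgb.Dataset(X[mask], label=y[mask])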

@shiyu1994
Collaborator

@ChristophAymannsQC Thank you for pointing that out!

@jameslamb
Collaborator

I'm also not totally convinced that the cost of this check would be worth the value of the warning. Picking this up as described would make every Dataset construction in LightGBM a bit slower.

^ @shiyu1994 what do you think about this? Do you think it wouldn't be too costly in Dataset construction, and that we should add this as a feature request?

@shiyu1994
Collaborator

I think checking for NaN in labels has little effect on the speed of Dataset construction, since the heavier part is parsing features and constructing bins. We can add this as a feature request.
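
As a rough Python-level illustration of what such a check could look like (the actual feature would live in the C++ Dataset construction code; the helper name and warning text below are hypothetical):

import warnings
import numpy as np

def _warn_on_nan_labels(labels):
    # One pass over the label array, which is cheap compared to parsing
    # features and constructing bins during Dataset construction.
    labels = np.asarray(labels, dtype=np.float64)
    n_nan = int(np.isnan(labels).sum())
    if n_nan > 0:
        warnings.warn(
            f"Found {n_nan} NaN value(s) in label. "
            "Consider removing these samples before constructing the Dataset."
        )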

@jameslamb
Collaborator

Ok thanks! I've added this to #2302 with other features.

@kurchi1205 are you still interested in contributing this? We'd love the help!

If you pick this up, please consider the following:

  • this change should be implemented on the C++ side, not just in the Python package, so all LightGBM interfaces would benefit from it
  • this should work for all types of raw data sources supported in LightGBM (loading from a file, loading from a sparse matrix, loading from a dense matrix)

Let us know here if you need some additional guidance.

@no-response

no-response bot commented Nov 13, 2021

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!
