
[docs] [python] custom metric function in lgb.train() interface should not refer to the passed dataset as train_data #4759

Closed
jameslamb opened this issue Nov 1, 2021 · 9 comments · Fixed by #5002 or #5011

@jameslamb
Collaborator

Summary

This issue proposes changing the documentation for feval in lgb.train(), in the Python package, replacing train_data with evaluation_dataset.

Motivation

lgb.train() allows providing custom evaluation metrics defined in Python functions. According to the docs for lgb.train() (here), such functions should have the following signature.

Should accept two parameters: preds, train_data, and return (eval_name, eval_result, is_higher_better) or list of such tuples.

    train_data : Dataset
        The training dataset.

I think it's confusing to refer to this as "The training dataset". In my opinion, that gives the impression that custom evaluation functions will have access to the data being used for training, which is not true. Instead, each custom evaluation function is provided a Dataset that evaluation has been requested for, which might be either the training data or a separate validation dataset, as the quoted code below shows (a short example after it illustrates this).

if valid_sets is not None:
    if is_valid_contain_train:
        evaluation_result_list.extend(booster.eval_train(feval))
    evaluation_result_list.extend(booster.eval_valid(feval))

def eval_valid(self, feval=None):
    """Evaluate for validation data.

    Parameters
    ----------
    feval : callable or None, optional (default=None)
        Customized evaluation function.
        Should accept two parameters: preds, valid_data,
        and return (eval_name, eval_result, is_higher_better) or list of such tuples.

            preds : list or numpy 1-D array
                The predicted values.
                If ``fobj`` is specified, predicted values are returned before any transformation,
                e.g. they are raw margin instead of probability of positive class for binary task in this case.
            valid_data : Dataset
                The validation dataset.

return [item for i in range(1, self.__num_dataset)
        for item in self.__inner_eval(self.name_valid_sets[i - 1], i, feval)]
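To make the distinction concrete, here is a minimal sketch of a custom metric (the function name and data are invented for illustration). Because feval receives whichever Dataset evaluation was requested for, it is called once per round for the training data and once for each validation set when both appear in valid_sets:

import numpy as np
import lightgbm as lgb

def mae(preds, eval_data):
    # eval_data is whichever Dataset evaluation was requested for
    # (training data or a validation set), not only the training data
    y_true = eval_data.get_label()
    return "mae", float(np.mean(np.abs(y_true - preds))), False

rng = np.random.default_rng(0)
dtrain = lgb.Dataset(rng.normal(size=(100, 5)), label=rng.normal(size=100))
dvalid = lgb.Dataset(rng.normal(size=(50, 5)), label=rng.normal(size=50), reference=dtrain)

booster = lgb.train(
    {"objective": "regression", "verbosity": -1},
    dtrain,
    num_boost_round=5,
    valid_sets=[dtrain, dvalid],  # mae() is called for BOTH datasets each round
    feval=mae,
)

Naming the second parameter something like eval_data in the docs would match what actually happens here.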

Description

Specifically, I'm proposing to rename train_data to evaluation_dataset in the documentation for feval in lgb.train() and in the corresponding Booster docstrings, and to describe it as the Dataset that evaluation has been requested for.

References

Noticed this while working on #4679 (comment).

@jameslamb jameslamb added the doc label Nov 1, 2021
@jameslamb
Collaborator Author

@StrikerRUS @jmoralez would you support a change like this?

@shiyu1994
Collaborator

I agree with the changes.
In fact, the description of feval seems to have been copied from the description of fobj, which is applied only to the training dataset. And this line should also be changed:

Each evaluation function should accept two parameters: preds, train_data,

@StrikerRUS
Collaborator

Generally, I support this change. But maybe use the name eval_data instead of evaluation_dataset, for consistency with other similar names like eval_valid, eval_train, eval_name, eval_result, etc.?

@akshitadixit
Contributor

Is this available to work on?

@jameslamb
Collaborator Author

@akshitadixit sure! We'd welcome the help!

Please see @StrikerRUS 's comment above about using the new name eval_data: #4759 (comment)

@akshitadixit
Contributor

Sure!

@akshitadixit
Contributor

akshitadixit commented Feb 4, 2022

Hi @jameslamb and @StrikerRUS, I noticed something I'd like to confirm with you.

I have been asked to change valid_data to eval_data in the eval_valid() function:

def eval_valid(self, feval=None):
    """Evaluate for validation data.

    Parameters
    ----------
    feval : callable or None, optional (default=None)
        Customized evaluation function.
        Should accept two parameters: preds, valid_data,
        and return (eval_name, eval_result, is_higher_better) or list of such tuples.

            preds : list or numpy 1-D array
                The predicted values.
                If ``fobj`` is specified, predicted values are returned before any transformation,
                e.g. they are raw margin instead of probability of positive class for binary task in this case.
            valid_data : Dataset
                The validation dataset.
            eval_name : str

whereas there is a similar instance with train_data in the eval_train() function too, beyond what was pointed out for engine.py:

def eval_train(self, feval=None):
    """Evaluate for training data.

    Parameters
    ----------
    feval : callable or None, optional (default=None)
        Customized evaluation function.
        Should accept two parameters: preds, train_data,
        and return (eval_name, eval_result, is_higher_better) or list of such tuples.

            preds : list or numpy 1-D array
                The predicted values.
                If ``fobj`` is specified, predicted values are returned before any transformation,
                e.g. they are raw margin instead of probability of positive class for binary task in this case.
            train_data : Dataset
                The training dataset.
            eval_name : str

Kindly clarify this.

@jameslamb
Collaborator Author

@akshitadixit sorry for the delayed response.

Yes, please also change those places to

Should accept two parameters: preds, eval_data
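For reference, the reworded docstring fragment might look roughly like this (a sketch of the proposed wording, not necessarily the exact text that landed via #5002 and #5011):

feval : callable or None, optional (default=None)
    Customized evaluation function.
    Should accept two parameters: preds, eval_data,
    and return (eval_name, eval_result, is_higher_better) or list of such tuples.

        eval_data : Dataset
            A ``Dataset`` to evaluate.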

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 16, 2023