
[python-package] custom objective function returns strange leaf node values #5114

Closed
ShaharKSegal opened this issue Mar 31, 2022 · 4 comments

ShaharKSegal commented Mar 31, 2022

Description

I get different leaf node values when using a custom objective function that should be identical to the built-in one (e.g. squared loss).

Additional Information

I've inspected the single-estimator (tree) case, and it seems that the two models perform the exact same splits; the only difference is the leaf node values.
I would like to note that I suspect the learning_rate has something to do with it. The learning rate doesn't affect the leaf node values of the built-in objective model for a single tree, but it greatly affects them in the custom-loss model (they grow with the learning rate). At learning_rate = 1.0 the two models return almost identical trees.
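
(Aside added for this write-up, not part of the original report: LightGBM computes each leaf value as -learning_rate · Σgrad/Σhess over the samples in the leaf, ignoring regularization terms. With the custom loss below, boosting starts from zero, so grad = -label and hess = 1, and each leaf value comes out as learning_rate · mean(label in leaf), which scales linearly with the learning rate and is consistent with the observation above.)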

Reproducible Example

A toy example (with learning_rate=0.1):

import numpy as np
import pandas as pd
import lightgbm as lgb
import sklearn
import sklearn.datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def l2_loss(y, data):
    # gradient and hessian of the squared error 0.5 * (y - t)^2
    t = data.get_label()
    grad = y - t
    hess = np.ones_like(y)
    return grad, hess

X, y = sklearn.datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
lgb_train = lgb.Dataset(X_train, label=y_train)

# Using built-in objective
lgbm_params = { 'learning_rate': 0.1, 'objective': 'l2', 'n_estimators': 1, 'random_seed': 0}
lgbm_params2 = lgbm_params.copy()
model = lgb.train(lgbm_params, lgb_train)
# Using custom objective
del lgbm_params2['objective']
model2 = lgb.train(lgbm_params2, lgb_train, fobj=l2_loss)
# Perform Inference
y_pred = model.predict(X_test)
y_pred2 = model2.predict(X_test)
print(mean_squared_error(y_test, y_pred))
print(mean_squared_error(y_test, y_pred2))

tree_df = model.trees_to_dataframe()
tree_df2 = model2.trees_to_dataframe()
# assert all columns besides the value column
pd.testing.assert_frame_equal(tree_df.drop('value', axis=1), tree_df2.drop('value', axis=1))
print('=' * 20)
# assert value column, which raises an error
pd.testing.assert_series_equal(tree_df.value, tree_df2.value)

Simulation output:

4601.705144789863
23340.304118379245
====================
...
AssertionError: Series are different
Series values are different (100.0 %)
[left]:  [151.921, 148.842, 147.042, 148.306, 145.35430527831252, 149.276, 150.4806565130858, 148.03763852621253, 145.291, 144.56173769593775, 146.1080930897291, 154.024, 151.42389976535026, 158.60549574740722, 158.459, 153.975, 156.14430521585183, 152.16680512135363, 161.641, 163.30562114757277, 159.0043053021544]
[right]: [0.0, 12.1124, 10.3126, 11.5763, 8.625, 12.5466, 13.751351351351353, 11.308333333333334, 8.56143, 7.832432432432433, 9.378787878787879, 17.2948, 14.694594594594594, 21.876190476190477, 21.7292, 17.2455, 19.415000000000003, 15.4375, 24.9113, 26.576315789473682, 22.275000000000002]

Environment Info

Tested on LightGBM 3.3.2 and 3.2.1 (installed via pip) on Windows 10 with Python 3.7.

jmoralez (Collaborator) commented Mar 31, 2022

Hi @ShaharKSegal, thank you for your interest in LightGBM. The difference you're seeing is due to the initial score. When you use the built-in objective, boosting starts from the mean of the label, i.e. the initial score for each sample is the average of the label. When you use a custom objective, boosting starts from zero by default, unless you explicitly set the initial scores in the dataset. I've modified your example to incorporate this:

import numpy as np
import pandas as pd
import lightgbm as lgb
import sklearn
import sklearn.datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def l2_loss(y, data):
    t = data.get_label()
    grad = y - t
    hess = np.ones_like(y)
    return grad, hess

X, y = sklearn.datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
lgb_train = lgb.Dataset(X_train, label=y_train)
avg_label = y_train.mean()
ds_with_init_score = lgb.Dataset(X_train, y_train, init_score=np.full_like(y_train, avg_label))

# Using built-in objective
lgbm_params = { 'learning_rate': 0.1, 'objective': 'l2', 'n_estimators': 1, 'random_seed': 0}
model = lgb.train(lgbm_params, lgb_train)
# Using custom objective
model2 = lgb.train(lgbm_params, ds_with_init_score, fobj=l2_loss)
# Perform Inference
y_pred = model.predict(X_test)
y_pred2 = model2.predict(X_test) + avg_label  # have to add back the init_score
print(mean_squared_error(y_test, y_pred))
print(mean_squared_error(y_test, y_pred2))

tree_df = model.trees_to_dataframe()
tree_df2 = model2.trees_to_dataframe()
# assert all columns besides the value column
pd.testing.assert_frame_equal(tree_df.drop('value', axis=1), tree_df2.drop('value', axis=1))
print('=' * 20)
# assert value column, doesn't raise an error anymore
pd.testing.assert_series_equal(tree_df.value, tree_df2.value + avg_label)

ShaharKSegal (Author) commented

Hi @jmoralez, thank you for your quick reply! That seems to do the trick, but it feels rather odd that you have to add the average back to the prediction for the custom objective but not for the built-in one.
I suppose all built-in objectives have an init_score. Is this behaviour described in the documentation where I can read about it further? I didn't find anything and would appreciate a reference.

jmoralez (Collaborator) commented Apr 1, 2022

I'd say a good reference is the boost_from_average parameter.
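
A minimal sketch of that parameter in action (added here for illustration, not part of the original thread; it reuses lgbm_params, lgb_train, and l2_loss from the examples above). With boost_from_average=False the built-in l2 objective should also start boosting from zero, so its trees should match the custom objective trained without any init_score:

# Sketch, not from the thread: disabling boost_from_average should make the
# built-in objective start boosting from zero, like the custom one does.
params_no_avg = dict(lgbm_params, boost_from_average=False)
model3 = lgb.train(params_no_avg, lgb_train)
# Same parameters minus 'objective', with the custom loss instead.
params_custom = {k: v for k, v in params_no_avg.items() if k != 'objective'}
model4 = lgb.train(params_custom, lgb_train, fobj=l2_loss)
# Leaf values should now agree without any init_score adjustment.
pd.testing.assert_series_equal(model3.trees_to_dataframe().value,
                               model4.trees_to_dataframe().value)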

github-actions (bot) commented
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023