Quantile LightGBM - inconsistent deciles #3447

Closed
TomekPro opened this issue Oct 9, 2020 · 14 comments

@TomekPro

TomekPro commented Oct 9, 2020

Hi, I'm training quantile LightGBM models, defined as follows:

QuantileEstimator(lgb.LGBMRegressor(n_jobs=-1,
                             seed=1234,
                             learning_rate=0.1,
                             reg_sqrt=True,
                             objective = 'quantile',
                             n_estimators=100))

Where QuantileEstimator is just:

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, RegressorMixin, clone

class QuantileEstimator(BaseEstimator, RegressorMixin):

    def __init__(self, model):
        """Fit one clone of `model` per decile (alpha = 0.1, ..., 0.9)."""
        self.alphas = [round(x, 1) for x in np.arange(0.1, 1, 0.1)]
        self.model_factory = []
        self.model = model
        super().__init__()

    def fit(self, X, y=None):
        # fit one clone of the base model per alpha
        for a in self.alphas:
            model_i = clone(self.model)
            model_i = model_i.set_params(**{'alpha': a})
            model_fitted = model_i.fit(X, y)
            self.model_factory.append(model_fitted)
        return self

    def predict(self, X):
        # one column of predictions per alpha
        predictions = pd.DataFrame()
        for a, m in zip(self.alphas, self.model_factory):
            predictions["y_pred_" + str(a)] = m.predict(X)
        return predictions

Sometimes the prediction for a decile is lower than the prediction for the previous (lower) decile of the same observation.
[screenshot: per-observation decile predictions where a higher decile has a lower value]

Any idea why that's happening? I understand that these are different models, but since I set the same seed, I would still expect the results for the different deciles to be monotonic.

@guolinke
Collaborator

I think it may be a bug; I will investigate when I have time.

@TomekPro
Author

Thanks, please let me know if you need any additional info. Maybe the distribution of the underlying data (y) could be a clue?
[histogram: distribution of the target y]

@guolinke
Collaborator

@TomekPro could you provide data for reproducibility? A small dataset or a randomly generated one would be best.

@StrikerRUS
Collaborator

Gently ping @TomekPro for an example of the data.

@mburaksayici

mburaksayici commented Feb 9, 2021

Hi, I'm having the same problem, but I didn't think it was a bug. Now that I see this topic, it seems worth questioning.
I calculate the 10th, 30th, 50th, 70th and 90th quantiles for a time-series regression problem. At some points, p30/p50/p70 get mixed up. Until I saw @TomekPro's suggestion I thought it was due to the randomness of the training process, because when I average an ensemble of the same model with the same features over 3-4 runs, the quantiles become properly ordered.
My data is not normally distributed, so that would make sense, yes.

[plot: predicted quantiles on the time series, with p30/p50/p70 crossing]

Diving into theory to see whether it's a bug: I have a feature, among the 10 most important, that can separate the data into two roughly normal distributions (open to discussion, but each part looks closer to normal). However, LightGBM doesn't guarantee that this feature is used at the top of the trees, so it doesn't guarantee that the final output is normal. I'm pointing this out because I wonder whether, even if the target is not normally distributed, the trees can find normally distributed subsets of the data, which might mitigate the problem of a non-normally distributed target. Is that the case for trees?


I sometimes catastrophically end up with quantiles such as this:

[plot: an example of badly crossed quantile predictions]

In general, the quantiles get mixed up. For the mix-up of p30/p50/p70, @TomekPro's suggestion seems plausible; however, in some cases p50 falls outside the p10-p90 range. Sometimes p50 gives more accurate results, but the p10-p90 interval doesn't capture the real value at all.

[plot: p10-p90 interval failing to capture the actual values]

Again, I'm not sure whether it's a bug or not, or maybe I'm modelling something wrong, or maybe it's just randomness, since we don't fit an actual normal distribution with (mean, std) parameters (NGBoost does that) but estimate the quantiles directly, or it's the large number of features, etc. I'll note again that when I run the same quantile with the same model/features several times and simply average the results, I get more consistent quantiles; a rough sketch of that averaging is below.
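
A minimal sketch of that averaging, assuming a scikit-learn-style workflow; the helper name and n_runs are illustrative, not part of any library API:

import numpy as np
import lightgbm as lgb

def average_quantile_predictions(X_train, y_train, X_test, alpha, n_runs=4):
    # Train the same quantile model several times with different seeds and
    # average the predictions; this smooths out seed-dependent crossings,
    # but it still does not guarantee monotone quantiles.
    preds = []
    for seed in range(n_runs):
        model = lgb.LGBMRegressor(objective="quantile", alpha=alpha, seed=seed)
        model.fit(X_train, y_train)
        preds.append(model.predict(X_test))
    return np.mean(preds, axis=0)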

@shiyu1994
Collaborator

I think with GBDT, the quantile objective does not guarantee that the prediction values for an instance (data point) are monotone across alphas. Just as we cannot guarantee that increasing the label of one instance in the training data will increase the prediction for that instance after retraining: the data partition in the trees can change, and the leaf prediction values can be very different.

Here's an example where the predicted decile is not monotone.

import lightgbm as lgb
from sklearn.base import BaseEstimator, RegressorMixin, clone
import pandas as pd
import numpy as np

class QuantileEstimator(BaseEstimator, RegressorMixin):

    def __init__(self, model):
        """Fit one clone of `model` per decile (alpha = 0.1, ..., 0.9)."""
        self.alphas = [round(x, 1) for x in np.arange(0.1, 1.0, 0.1)]
        self.model_factory = []
        self.model = model
        super().__init__()

    def fit(self, X, y=None):
        
        for a in self.alphas:
            model_i = clone(self.model)
            model_i = model_i.set_params(**{'alpha': a})
            model_fitted = model_i.fit(X, y)
            self.model_factory.append(model_fitted)
        return self

    def predict(self, X):
        predictions = pd.DataFrame()
        for a, m in zip(self.alphas, self.model_factory):
            predictions["y_pred_" + str(a)] = m.predict(X, raw_score=True)
        return predictions

qet = QuantileEstimator(lgb.LGBMRegressor(n_jobs=1,
                             seed=1234,
                             learning_rate=0.1,
                             reg_sqrt=True,
                             objective = 'quantile',
                             n_estimators=2,
                             min_data_in_leaf=1,
                             num_leaves=3,
                             boost_from_average=True,
                             verbose=2))

np.random.seed(2)
X = np.random.rand(10, 20)
y = np.random.rand(10)
qet.fit(X, y)
pred = qet.predict(X)
# flag rows where a higher decile is predicted a lower value than the previous one
for i in range(pred.shape[0]):
    for j in range(pred.shape[1] - 1):
        if pred.iloc[i, j] > pred.iloc[i, j + 1]:
            print(pred.iloc[i, :])
            print(i)

And the output is

y_pred_0.1    0.504354
y_pred_0.2    0.660557
y_pred_0.3    0.729929
y_pred_0.4    0.725802
y_pred_0.5    0.742658
y_pred_0.6    0.802490
y_pred_0.7    0.891505
y_pred_0.8    0.929288
y_pred_0.9    0.962373
Name: 1, dtype: float64
1
y_pred_0.1    0.504354
y_pred_0.2    0.662241
y_pred_0.3    0.726090
y_pred_0.4    0.710099
y_pred_0.5    0.732608
y_pred_0.6    0.802490
y_pred_0.7    0.891505
y_pred_0.8    0.929288
y_pred_0.9    0.962373
Name: 5, dtype: float64
5

I've strictly followed the computation of GBDT with the quantile objective and manually calculated the result. It is consistent with the tree output of LightGBM.

So in general, I don't think this is a bug. It would be better if @TomekPro could provide the data, so that we can check the tree structures in the model. Your help is really appreciated.

Restricting the deciles to be monotone is an interesting question and can be taken as a feature request.

If no further evidence shows that this is really a bug, I think this issue should be closed.
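
As a side note, a common post-hoc workaround for crossing quantiles (separate from the constrained boosting idea above, and only an illustration) is to rearrange, i.e. sort, the predicted quantiles within each row. A minimal sketch, assuming the DataFrame layout produced by QuantileEstimator.predict; the function name is illustrative:

import numpy as np
import pandas as pd

def rearrange_quantiles(pred_df):
    # Sort each row's quantile predictions in ascending order so that a
    # higher alpha never receives a lower value; this is pure post-processing
    # and does not change how the underlying models are trained.
    sorted_values = np.sort(pred_df.values, axis=1)
    return pd.DataFrame(sorted_values, columns=pred_df.columns, index=pred_df.index)

For example, pred_monotone = rearrange_quantiles(pred) applied to the predictions above.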

@TomekPro
Author

@shiyu1994 I'm sorry, but I cannot provide the data as it is confidential; however, you can simulate it from the distribution I posted above. I suspect that this case is most common for distributions with a large group of outliers.

@shiyu1994
Collaborator

shiyu1994 commented Mar 17, 2021

Ok, thanks for the information. I can try to build another example with labels similar to your distribution. But in general, the output for a data point with the quantile objective is not guaranteed to be monotone in the alpha hyperparameter, as the example above shows.

Is the monotonicity essential for your application?

@TomekPro
Author

In the end I didn't use LightGBM, but in general I believe this is a really common case: you want not only a point estimate, but also some kind of confidence interval for your prediction.

@shiyu1994
Collaborator

I think implementing such a constraint requires fitting the models for different alphas together. We would need to pass all the alphas to the boosting process, maintain boosting models for the different alphas simultaneously, and set appropriate restrictions so that the prediction values for each data point grow monotonically with alpha.
This is a complicated task and requires a new boosting algorithm design. We may investigate this in the future.
For now, let me put this into the feature request & voting hub.

@Bougeant

Not sure how to vote for this feature request to be prioritised, but it is obviously critical in most applications that the quantiles be ranked correctly.

@lorentzenchr
Contributor

My 5 cents:
I would also not consider this a bug.

What most people do not consider: if different quantiles clash / are not monotone, then the uncertainty of that prediction is veeeeery likely high, i.e. the crossing is within the estimation uncertainty.

It is much easier to enforce such monotonicity constraints with linear models (as R's quantreg does). For tree-based models, one would need the same tree split points to make such a constraint possible, very much like quantile regression forests do.

@RektPunk
Contributor

RektPunk commented Mar 8, 2023

Hey, I also suffered from the problem of quantiles not maintaining monotonicity, known as the crossing problem.
To solve this, I suggested a method described in the issue. I hope you check it out, and feel free to discuss it.

@github-actions

This issue has been automatically locked since there has not been any recent activity after it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023