
feature request: Add parameter to control maximum group size for Lambdarank #5053

Closed · antaradas94 opened this issue Mar 4, 2022 · 16 comments

@antaradas94

Description

I have around 12 groups in a dataset of over ~1 million rows, and several groups easily have over 10,000 rows.
It would be really helpful if the quota for each query were increased, or if we could additionally set the number of rows we wish.
I tried to search for a way to increase the upper limit, but haven't really come across any. If one exists, please do let me know.
Thanks :)

Traceback

    model(
  File "/mnt/batch/tasks/shared/LS_root/jobs/mlw-kundenscore/azureml/db031a99-93b4-4d83-9280-24da972050e8/wd/azureml/db031a99-93b4-4d83-9280-24da972050e8/kundenscore/train/model_classes/base_model.py", line 366, in train_register_and_evaluate_model
    model = self._fit_model(pipeline, hyperparameters)
  File "/mnt/batch/tasks/shared/LS_root/jobs/mlw-kundenscore/azureml/db031a99-93b4-4d83-9280-24da972050e8/wd/azureml/db031a99-93b4-4d83-9280-24da972050e8/kundenscore/train/model_classes/lgbmranker.py", line 106, in _fit_model
    ranker = pipeline.fit(x_train, y_train, **fit_params)
  File "/usr/local/lib/python3.8/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/usr/local/lib/python3.8/site-packages/lightgbm/sklearn.py", line 1067, in fit
    super().fit(X, y, sample_weight=sample_weight, init_score=init_score, group=group,
  File "/usr/local/lib/python3.8/site-packages/lightgbm/sklearn.py", line 748, in fit
    self._Booster = train(
  File "/usr/local/lib/python3.8/site-packages/lightgbm/engine.py", line 271, in train
    booster = Booster(params=params, train_set=train_set)
  File "/usr/local/lib/python3.8/site-packages/lightgbm/basic.py", line 2610, in __init__
    _safe_call(_LIB.LGBM_BoosterCreate(
  File "/usr/local/lib/python3.8/site-packages/lightgbm/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Number of rows 632751 exceeds upper limit of 10000 for a query
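
A minimal sketch for locating the offending queries before fitting. It assumes the sklearn-style group array of per-query row counts passed to LGBMRanker.fit; the sizes below are hypothetical:

import numpy as np

# group[i] is the number of rows in query i, as passed to
# LGBMRanker.fit(..., group=group). The 10,000 cap is LightGBM's
# hard-coded kMaxPosition (discussed below).
group = np.asarray([632751, 5000, 120000])  # hypothetical query sizes
K_MAX_POSITION = 10_000

for i in np.flatnonzero(group > K_MAX_POSITION):
    print(f"query {i}: {group[i]} rows exceeds the upper limit of {K_MAX_POSITION}")
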
@jameslamb (Collaborator)

Thanks for using LightGBM and for this report.

This error message comes from the following place in the code:

void DCGCalculator::CheckMetadata(const Metadata& metadata, data_size_t num_queries) {
  const data_size_t* query_boundaries = metadata.query_boundaries();
  if (num_queries > 0 && query_boundaries != nullptr) {
    for (data_size_t i = 0; i < num_queries; i++) {
      data_size_t num_rows = query_boundaries[i + 1] - query_boundaries[i];
      if (num_rows > kMaxPosition) {
        Log::Fatal("Number of rows %i exceeds upper limit of %i for a query", static_cast<int>(num_rows), static_cast<int>(kMaxPosition));
      }
    }
  }
}

I found that by running

git grep 'exceeds upper limit'

Note that the threshold is hard-coded here:

const data_size_t DCGCalculator::kMaxPosition = 10000;

Using git blame, I can see that the limit has been set to 10,000 since the very first commit of LightGBM, 6 years ago.

https://github.com/microsoft/LightGBM/blame/9a4e70687d5c0732ca895959f418c3f923f2e85a/src/metric/dcg_calculator.cpp#L17

I think this might be hard-coded instead of being determined by label_gain or the maximum group size in the input Dataset because Objective objects in this project's C++ library are initialized without any knowledge of the input data.

return new LambdarankNDCG(config);

@shiyu1994 @guolinke do you think LightGBM should allow increasing this limit via a new parameter? I'm not that familiar with Lambdarank, so I'm not sure whether (for example) the use of such large query groups in LightGBM should be discouraged.

@antaradas94 (Author)

@jameslamb, yes, it is hard-coded.
I am curious whether there was a specific reason for setting a limit per query.
The data I wish to train on has many queries, and some queries have around 100,000 rows.

@shiyu1994 (Collaborator)

shiyu1994 commented Mar 18, 2022

@antaradas94 Thanks for using LightGBM. The limit is hard-coded here:

const data_size_t DCGCalculator::kMaxPosition = 10000;

You may try enlarging the number in the source code to meet your needs, and then recompile the Python package from source.

Guidelines for compiling the Python package:

  1. First compile the source code according to https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#linux (you will also find instructions there for platforms other than Linux).
  2. Install the Python package by cd-ing into LightGBM/python-package and running python setup.py install --precompile.

The number of documents per query is limited mainly because the complexity of the per-query gradient computation in Lambdarank is O(n^2), where n is the number of documents in the query. Datasets with too many documents per query are simply a poor fit for Lambdarank.

Sorry for the late response. If you have any further questions, please feel free to post here.
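
If recompiling is not an option, one possible workaround (an illustrative sketch only, not an official recommendation; the helper below is hypothetical) is to subsample any oversized query down to the cap before building the Dataset. Note that dropping documents changes the training distribution:

import numpy as np

def cap_query_sizes(X, y, group, max_docs=10_000, seed=0):
    """Randomly subsample any query larger than max_docs.

    group holds per-query row counts, as passed to
    LGBMRanker.fit(..., group=group). Hypothetical helper for
    illustration; dropping rows changes the training distribution.
    """
    rng = np.random.default_rng(seed)
    keep, new_group, start = [], [], 0
    for size in group:
        idx = np.arange(start, start + size)
        if size > max_docs:
            # keep a random subset of this query's rows, in order
            idx = np.sort(rng.choice(idx, size=max_docs, replace=False))
        keep.append(idx)
        new_group.append(min(size, max_docs))
        start += size
    keep = np.concatenate(keep)
    return X[keep], y[keep], np.asarray(new_group)
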

@shiyu1994 (Collaborator)

@jameslamb I think we can change kMaxPosition into a parameter. WDYT @guolinke @StrikerRUS
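
For illustration only: no such parameter exists in LightGBM today, and the name below is purely hypothetical, sketching how the proposal might look from the Python side:

import lightgbm as lgb

# HYPOTHETICAL: 'lambdarank_max_position' is not a real LightGBM
# parameter; it only sketches the proposal discussed above.
ranker = lgb.LGBMRanker(
    objective="lambdarank",
    # lambdarank_max_position=100_000,  # would replace the fixed kMaxPosition
)
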

@StrikerRUS (Collaborator)

I think we can turn that constant into a parameter so that users aren't forced to re-compile LightGBM. Although I guess it's quite a rare case when users need to increase the default value.

@jameslamb (Collaborator)

Ok, thanks @StrikerRUS and @shiyu1994 .

I've changed the title of this issue and added it to #2302.

Per this repo's policy on feature requests, I'm going to close this issue for now. @antaradas94 if you are interested in contributing this feature, please comment here and we can answer any questions you have. If not, anyone else reading this is encouraged to comment here if you're interested in contributing this feature.

Otherwise, you can change the source code and recompile LightGBM yourself.

@jameslamb jameslamb changed the title Exceed number of rows for a query feature request: Add parameter to control maximum group size for Lambdarank Mar 19, 2022
@octatour

@jameslamb, hi.
I encountered this exception because one of my queries exceeded 10,000 docs.
But I'm wondering: if I'm using NDCG as a metric and I'm not explicitly limiting query size via the parameter lambdarank_truncation_level (whose default is 30), does that mean all queries are truncated at the 30th document?
The documentation says that lambdarank_truncation_level "controls the number of top-results to focus on during training", yet the lightgbm.basic.LightGBMError exception appears before training (at the stage of initializing the Booster object), when queries have not been truncated yet.

@jameslamb (Collaborator)

I'm not sure about the relationship between those two configuration values, sorry. @guolinke can you answer this question?

I think the question is:

> If lambdarank_truncation_level is already limiting the number of documents per query considered, then why does LightGBM raise an error during Lambdarank training when it encounters queries with many more than lambdarank_truncation_level documents? Why doesn't it just ignore all the documents after lambdarank_truncation_level?

I'll re-open this since it's being discussed.

@guolinke (Collaborator)

@octatour No, all documents are used; truncation_level applies to the loss calculation. It ensures that at least one document in each pair (in the pair-wise loss accumulation) is above the truncation_level.
For example, if there are 100 documents for a query and truncation_level is set to 30, then there will be 30 * 100 = 3000 pairs used for the loss calculation.
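
A rough illustration of that count (it mirrors the back-of-envelope estimate above, not LightGBM's exact pair enumeration):

def approx_pairs(num_docs, truncation_level):
    # Back-of-envelope count from the example above: each of the top
    # truncation_level positions is paired with every document in the query.
    return truncation_level * num_docs

print(approx_pairs(100, 30))  # 3000, matching the example above
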

@octatour

@guolinke okay, got it. Thank you for the explanation. It wasn't clear to me from the documentation and the paper that truncation_level controls the number of pairs used in the loss.

@github-actions

This comment was marked as off-topic.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023
@jameslamb (Collaborator)

Sorry, this was locked accidentally. Just unlocked it. We'd still love help with this feature!

@microsoft microsoft unlocked this conversation Aug 18, 2023
@NigamSomya

This comment was marked as off-topic.

@jameslamb (Collaborator)

jameslamb commented May 1, 2024

@NigamSomya thanks for using LightGBM.

Your question seems to be generally "how do I build the Python package from source" and not specific to Lambdarank, so I've created #6437 and hidden your comment here. Let's please discuss over there.

> If there's any other way to increase 'kMaxPosition', please share.

There is not. At the moment, you'll have to change it in the code and recompile LightGBM.

@adisomani2003

> Sorry, this was locked accidentally. Just unlocked it. We'd still love help with this feature!

Hi, would love to contribute if this is still open!

@jameslamb (Collaborator)

Sure! We'd welcome the contribution.

You can ask any questions here, and tag @shiyu1994 and @metpavel

@jameslamb jameslamb reopened this May 9, 2024