
feature request: Add parameter to control maximum group size for Lambdarank #5053

Closed · antaradas94 opened this issue Mar 4, 2022 · 16 comments

@antaradas94

Description

I have around 12 groups in a dataset of over ~1 million rows, and several groups easily have over 10,000 rows.
It would be really helpful if the quota for each query were increased, or if we could additionally set the number of rows we wish.
I tried to search for a way to increase the upper limit, but haven't really come across any. If one exists, please do let me know.
Thanks :)

Traceback

    model(
  File "/mnt/batch/tasks/shared/LS_root/jobs/mlw-kundenscore/azureml/db031a99-93b4-4d83-9280-24da972050e8/wd/azureml/db031a99-93b4-4d83-9280-24da972050e8/kundenscore/train/model_classes/base_model.py", line 366, in train_register_and_evaluate_model
    model = self._fit_model(pipeline, hyperparameters)
  File "/mnt/batch/tasks/shared/LS_root/jobs/mlw-kundenscore/azureml/db031a99-93b4-4d83-9280-24da972050e8/wd/azureml/db031a99-93b4-4d83-9280-24da972050e8/kundenscore/train/model_classes/lgbmranker.py", line 106, in _fit_model
    ranker = pipeline.fit(x_train, y_train, **fit_params)
  File "/usr/local/lib/python3.8/site-packages/sklearn/pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/usr/local/lib/python3.8/site-packages/lightgbm/sklearn.py", line 1067, in fit
    super().fit(X, y, sample_weight=sample_weight, init_score=init_score, group=group,
  File "/usr/local/lib/python3.8/site-packages/lightgbm/sklearn.py", line 748, in fit
    self._Booster = train(
  File "/usr/local/lib/python3.8/site-packages/lightgbm/engine.py", line 271, in train
    booster = Booster(params=params, train_set=train_set)
  File "/usr/local/lib/python3.8/site-packages/lightgbm/basic.py", line 2610, in __init__
    _safe_call(_LIB.LGBM_BoosterCreate(
  File "/usr/local/lib/python3.8/site-packages/lightgbm/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Number of rows 632751 exceeds upper limit of 10000 for a query
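
A minimal sketch for locating the offending queries before fitting. It assumes the sklearn-style group array of per-query row counts passed to LGBMRanker.fit; the sizes below are hypothetical:

import numpy as np

# group[i] is the number of rows in query i, as passed to
# LGBMRanker.fit(..., group=group). The 10,000 cap is LightGBM's
# hard-coded kMaxPosition (discussed below).
group = np.asarray([632751, 5000, 120000])  # hypothetical query sizes
K_MAX_POSITION = 10_000

for i in np.flatnonzero(group > K_MAX_POSITION):
    print(f"query {i}: {group[i]} rows exceeds the upper limit of {K_MAX_POSITION}")
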
@jameslamb (Collaborator)

Thanks for using LightGBM and for this report.

This error message comes from the following place in the code:

void DCGCalculator::CheckMetadata(const Metadata& metadata, data_size_t num_queries) {
  const data_size_t* query_boundaries = metadata.query_boundaries();
  if (num_queries > 0 && query_boundaries != nullptr) {
    for (data_size_t i = 0; i < num_queries; i++) {
      data_size_t num_rows = query_boundaries[i + 1] - query_boundaries[i];
      if (num_rows > kMaxPosition) {
        Log::Fatal("Number of rows %i exceeds upper limit of %i for a query", static_cast<int>(num_rows), static_cast<int>(kMaxPosition));
      }
    }
  }
}

I found that by running

git grep 'exceeds upper limit'

Note that the threshold is hard-coded here:

const data_size_t DCGCalculator::kMaxPosition = 10000;

Using git blame, I can see that the limit has been set to 10,000 since the very first commit of LightGBM, 6 years ago.

https://github.com/microsoft/LightGBM/blame/9a4e70687d5c0732ca895959f418c3f923f2e85a/src/metric/dcg_calculator.cpp#L17

I think this might be hard-coded instead of being determined by label_gain or the maximum group size in the input Dataset because Objective objects in this project's C++ library are initialized without any knowledge of the input data.

return new LambdarankNDCG(config);

@shiyu1994 @guolinke do you think LightGBM should allow increasing this limit via a new parameter? I'm not that familiar with Lambdarank, so I'm not sure whether (for example) the use of such large query groups in LightGBM should be discouraged.

@antaradas94 (Author)

@jameslamb, yes, it is hard-coded.
I am curious whether there was a specific reason for setting a limit per query.
The data I wish to train on has many queries, and some queries have around 100,000 rows.

@shiyu1994 (Collaborator)

shiyu1994 commented Mar 18, 2022

@antaradas94 Thanks for using LightGBM. The limit is hard-coded here:

const data_size_t DCGCalculator::kMaxPosition = 10000;

You may try enlarging the number in the source code to meet your needs, and then recompile the Python package from source.

Guidelines for compiling the Python package:

  1. First compile the source code according to https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#linux (you will also find instructions there for platforms other than Linux).
  2. Install the Python package by cd-ing into LightGBM/python-package and running python setup.py install --precompile.

The number of documents per query is limited mainly because the complexity of the per-query gradient computation in Lambdarank is O(n^2), where n is the number of documents in the query. Datasets with too many documents per query are simply a poor fit for Lambdarank.

Sorry for the late response. If you have any further questions, please feel free to post here.
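
If recompiling is not an option, one possible workaround (an illustrative sketch only, not an official recommendation; the helper below is hypothetical) is to subsample any oversized query down to the cap before building the Dataset. Note that dropping documents changes the training distribution:

import numpy as np

def cap_query_sizes(X, y, group, max_docs=10_000, seed=0):
    """Randomly subsample any query larger than max_docs.

    group holds per-query row counts, as passed to
    LGBMRanker.fit(..., group=group). Hypothetical helper for
    illustration; dropping rows changes the training distribution.
    """
    rng = np.random.default_rng(seed)
    keep, new_group, start = [], [], 0
    for size in group:
        idx = np.arange(start, start + size)
        if size > max_docs:
            # keep a random subset of this query's rows, in order
            idx = np.sort(rng.choice(idx, size=max_docs, replace=False))
        keep.append(idx)
        new_group.append(min(size, max_docs))
        start += size
    keep = np.concatenate(keep)
    return X[keep], y[keep], np.asarray(new_group)
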

@shiyu1994 (Collaborator)

@jameslamb I think we can change kMaxPosition into a parameter. WDYT @guolinke @StrikerRUS
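
For illustration only: no such parameter exists in LightGBM today, and the name below is purely hypothetical, sketching how the proposal might look from the Python side:

import lightgbm as lgb

# HYPOTHETICAL: 'lambdarank_max_position' is not a real LightGBM
# parameter; it only sketches the proposal discussed above.
ranker = lgb.LGBMRanker(
    objective="lambdarank",
    # lambdarank_max_position=100_000,  # would replace the fixed kMaxPosition
)
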

@StrikerRUS (Collaborator)

I think we can turn that constant into a parameter so that users aren't forced to re-compile LightGBM. Although I guess it's quite a rare case when users need to increase the default value.

@jameslamb (Collaborator)

Ok, thanks @StrikerRUS and @shiyu1994 .

I've changed the title of this issue and added it to #2302.

Per this repo's policy on feature requests, I'm going to close this issue for now. @antaradas94 if you are interested in contributing this feature, please comment here and we can answer any questions you have. If not, anyone else reading this is encouraged to comment here if you're interested in contributing this feature.

Otherwise, you can change the source code and recompile LightGBM yourself.

@jameslamb jameslamb changed the title Exceed number of rows for a query feature request: Add parameter to control maximum group size for Lambdarank Mar 19, 2022
@octatour

@jameslamb, hi.
I encountered this exception because one of my queries exceeded 10,000 docs.
But I'm wondering: if I'm using NDCG as a metric and I'm not explicitly limiting query size via the parameter lambdarank_truncation_level (whose default is 30), does that mean all queries are truncated at the 30th document?
The documentation says that lambdarank_truncation_level "controls the number of top-results to focus on during training", yet the lightgbm.basic.LightGBMError exception appears before training (at the stage of initializing the Booster object), when queries have not been truncated yet.

@jameslamb (Collaborator)

I'm not sure about the relationship between those two configuration values, sorry. @guolinke can you answer this question?

I think the question is:

> If lambdarank_truncation_level is already limiting the number of documents per query considered, then why does LightGBM raise an error during Lambdarank training when it encounters queries with many more than lambdarank_truncation_level documents? Why doesn't it just ignore all the documents after lambdarank_truncation_level?

I'll re-open this since it's being discussed.

@guolinke (Collaborator)

@octatour No, all documents are used; truncation_level applies to the loss calculation. It ensures that at least one document in each pair (in the pair-wise loss accumulation) is above the truncation_level.
For example, if there are 100 documents for a query and truncation_level is set to 30, then there will be 30 * 100 = 3000 pairs used for the loss calculation.
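
A rough illustration of that count (it mirrors the back-of-envelope estimate above, not LightGBM's exact pair enumeration):

def approx_pairs(num_docs, truncation_level):
    # Back-of-envelope count from the example above: each of the top
    # truncation_level positions is paired with every document in the query.
    return truncation_level * num_docs

print(approx_pairs(100, 30))  # 3000, matching the example above
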

@octatour

@guolinke okay, got it. Thank you for the explanation. It wasn't clear to me from the documentation and the paper that truncation_level controls the number of pairs used in the loss.

@github-actions

This comment was marked as off-topic.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023
@jameslamb (Collaborator)

Sorry, this was locked accidentally. Just unlocked it. We'd still love help with this feature!

@microsoft microsoft unlocked this conversation Aug 18, 2023
@NigamSomya

This comment was marked as off-topic.

@jameslamb (Collaborator)

jameslamb commented May 1, 2024

@NigamSomya thanks for using LightGBM.

Your question seems to be generally "how do I build the Python package from source" and not specific to Lambdarank, so I've created #6437 and hidden your comment here. Let's please discuss over there.

> If there's any other way to increase 'kMaxPosition', please share.

There is not. At the moment, you'll have to change it in the code and recompile LightGBM.

@adisomani2003

> Sorry, this was locked accidentally. Just unlocked it. We'd still love help with this feature!

Hi, would love to contribute if this is still open!

@jameslamb (Collaborator)

Sure! We'd welcome the contribution.

You can ask any questions here, and tag @shiyu1994 and @metpavel

@jameslamb jameslamb reopened this May 9, 2024