
Drop 'not evaluated' placeholder from dask.py #4393

Closed
ffineis opened this issue Jun 20, 2021 · 1 comment

Comments

@ffineis
Contributor

ffineis commented Jun 20, 2021

Summary

Drop the 'not evaluated' placeholder string used when the rank 0 Dask worker happens not to have received data for particular components of eval_set.

Motivation

This is a request to improve the handling of multiple workers' eval_set data attributes (e.g. evals_result_) once #4392 is resolved, as opposed to using 'not evaluated' as a placeholder for the missing training history of unevaluated eval_sets.

Description

When a user is evaluating model training progress on multiple validation sets (meaning len(eval_set) > 1) and those validation sets comprise Dask collections with varying numbers of partitions, it is possible for some worker(s) to be distributed chunks of only particular eval_sets. Put another way, when the individual eval_sets are not the same size, there is no guarantee that every worker will be allocated parts from every individual eval_set contained within eval_set.

The implementation of eval_set support in #4101 breaks up eval_set into smaller eval_sets that each worker reconstructs from its allocated list_of_parts. To ensure that each worker is aware that there are len(eval_set)-many original individual eval_sets, we use a None-padding technique. Here is an illustration:

[Illustration: lightgbm_eval_sets — distribution of eval_set chunks across workers, with None padding for eval_sets a worker did not receive]
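In code, the padding idea looks roughly like the following (a minimal sketch; the helper name and signature are illustrative, not the actual lightgbm.dask internals):

```python
# Hypothetical sketch of the None-padding described above; `pad_eval_set_parts`
# is illustrative and is not a function in lightgbm.dask.
def pad_eval_set_parts(worker_parts_by_eval_set, n_eval_sets):
    """Return a list of length n_eval_sets where slot i holds this worker's
    chunks for eval_set i, or None if the worker received no chunks for it."""
    padded = [None] * n_eval_sets
    for i, parts in worker_parts_by_eval_set.items():
        if parts:  # the worker actually holds chunks for eval_set i
            padded[i] = parts
    return padded

# Example: a worker that only received chunks of the second of two eval_sets.
padded = pad_eval_set_parts({1: [("X_chunk", "y_chunk")]}, n_eval_sets=2)
print(padded)  # [None, [('X_chunk', 'y_chunk')]]
```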

Therefore, when there is variance in the list [X[0].npartitions for X, y in eval_set], it is possible that an individual eval_set is missing on a particular worker. When a worker receives all Nones for a particular component of eval_set, the corresponding value for this eval_set within best_score_ and evals_result_ is the string 'not evaluated'.
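For concreteness, here is a minimal setup (shapes and chunk sizes are made up for illustration) in which the two validation sets have different partition counts and therefore cannot be co-located on every worker:

```python
import dask.array as da

# Two validation sets with different chunking, so their partition counts differ.
X_valid_0 = da.random.random((1_000, 10), chunks=(250, 10))  # 4 partitions
y_valid_0 = da.random.random((1_000,), chunks=(250,))
X_valid_1 = da.random.random((1_000, 10), chunks=(500, 10))  # 2 partitions
y_valid_1 = da.random.random((1_000,), chunks=(500,))

eval_set = [(X_valid_0, y_valid_0), (X_valid_1, y_valid_1)]
print([X.npartitions for X, _ in eval_set])  # [4, 2]
```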

This informs the user that the "rank 0" worker whose LightGBM estimator is selected during _train was not distributed any chunks corresponding to that particular eval_set. In the illustration, mdl.best_score_['valid_0'] == 'not evaluated'.
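Because of that, user code that inspects the per-validation-set results currently has to special-case the placeholder string, roughly like this (a sketch; it assumes `mdl` is a fitted Dask estimator whose fit received eval_set, and 'valid_0' is LightGBM's default validation-set naming):

```python
# Sketch of how the placeholder surfaces to users today.
for name, score in mdl.best_score_.items():
    if score == 'not evaluated':
        # The rank 0 worker held no chunks of this eval_set, so no history exists.
        print(f"{name}: not evaluated on the selected worker")
    else:
        print(f"{name}: {dict(score)}")
```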

This issue is closely related to #4392; if #4392 is resolved, then 'not evaluated' may no longer apply (depending on the path of resolution), in which case it can (and should!) be dropped from both dask.py and test_dask.py.

References

#4101
#4392

@StrikerRUS
Collaborator

Closed in favor of being in #2302. We decided to keep all feature requests in one place.

You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
