
Support for PBT (Population Based Training) #1382

Closed
romeokienzler opened this issue Nov 9, 2020 · 24 comments

@romeokienzler

romeokienzler commented Nov 9, 2020

/kind feature

Support for PBT (Population Based Training)
Ray Tune recently came out with support for PBT, and DeepMind has shown exceptional performance with it. Are we considering supporting PBT in Katib as well?

https://arxiv.org/abs/1711.09846
https://deepmind.com/blog/article/population-based-training-neural-networks
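
For context, the core loop from the paper is roughly the exploit/explore cycle sketched below. This is only a minimal sketch; evaluate_fn and perturb_fn are hypothetical callables, and this is not Katib or Ray Tune code.

import copy
import random

def pbt_generation(population, evaluate_fn, perturb_fn):
    """One PBT generation over a population of {"weights", "hparams"} dicts:
    evaluate every member, then let the bottom quartile exploit (copy) and
    explore (perturb) members from the top quartile."""
    for member in population:
        member["score"] = evaluate_fn(member)  # partial training + evaluation
    population.sort(key=lambda m: m["score"], reverse=True)
    cutoff = max(1, len(population) // 4)
    for loser in population[-cutoff:]:
        winner = random.choice(population[:cutoff])
        loser["weights"] = copy.deepcopy(winner["weights"])  # exploit: copy the better member
        loser["hparams"] = {k: perturb_fn(v) for k, v in winner["hparams"].items()}  # explore: perturb
    return population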

@andreyvelich
Member

Thank you for this information @romeokienzler, I think it's very exciting.

We definitely should investigate it and try to adopt it in Katib.
What do you think, @gaocegege @johnugeorge?

@gaocegege
Member

My colleague @ezioliao is working on our internal AutoML system and he is interested in this issue.

@ezioliao

ezioliao commented Dec 28, 2020

> My colleague @ezioliao is working on our internal AutoML system and he is interested in this issue.

With pleasure. PBT is implemented in our own AutoML system, and I'm willing to work on this.

@andreyvelich
Member

That would be great!
Thank you @ezioliao, let us know if you need any help.

@andreyvelich
Member

@ezioliao Do you have the time and resources to work on this in 2021?
We can include PBT support in the 2021 Roadmap.

@stale

stale bot commented Jun 16, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@andreyvelich
Member

/lifecycle frozen

@andreyvelich
Member

/help

@google-oss-robot

@andreyvelich:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-robot added the help wanted label Jun 17, 2021
@johnugeorge
Member

/cc @richardsliu

@SunnyGhj

@andreyvelich Hello, I am very interested in this proposal. May I ask when it will be launched and how I can participate?

@romeokienzler
Author

@ezioliao

@johnugeorge
Member

@hongjunGu2019 Please see #1833

@SunnyGhj

> @hongjunGu2019 Please see #1833

Thanks

@david-thrower

david-thrower commented Oct 19, 2022

One enhancement I would propose on top of this is a parameter that lets a user configure the training task to filter out invalid permutations of parameter values from each generation's trials, given rules like minimum_skip_connection_depth < maximum_skip_connection_depth. This would remove planned trials that would throw ValueErrors before they are executed, and would make more elaborate algorithms practical to train with the training op.
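
For illustration, the kind of rule I have in mind is just a predicate applied to each candidate assignment before a trial is created. The is_valid helper and the parameter names below are hypothetical, not an existing Katib API.

def is_valid(assignment):
    # Drop any candidate where the min/max pair is inconsistent and would
    # raise a ValueError inside the training code.
    return assignment["minimum_skip_connection_depth"] < assignment["maximum_skip_connection_depth"]

candidates = [
    {"minimum_skip_connection_depth": 3, "maximum_skip_connection_depth": 7},
    {"minimum_skip_connection_depth": 8, "maximum_skip_connection_depth": 2},
]
planned_trials = [a for a in candidates if is_valid(a)]  # only the first candidate survives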

@johnugeorge
Member

@david-thrower We have an interface to add validation checks per suggestion algorithm. Ref: #1924

@david-thrower

@johnugeorge, I really appreciate you pointing that out, especially at this time of night in your time zone. If I am ever in town, I owe you a coffee drink. Sorry I missed that recent change; I'm glad to see it. I may get more sleep this week than I thought. I will look into it. To clarify, this removes suggestions rather than throwing exceptions, right?

@johnugeorge
Member

The ValidateAlgorithmSettings function is a common interface that can be implemented for any suggestion algorithm. If ValidateAlgorithmSettings fails, the Katib controller changes the Suggestion and Experiment status to Failed.

#1126
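
Roughly, a Python suggestion service implements the check like the sketch below. This is simplified: the class name and the population_size check are only illustrative, the import path is the repo-internal v1beta1 path, and the exact field names should be confirmed against the proto and the implementations linked from #1126.

import grpc
from pkg.apis.manager.v1beta1.python import api_pb2, api_pb2_grpc

class ExampleSuggestionService(api_pb2_grpc.SuggestionServicer):
    def ValidateAlgorithmSettings(self, request, context):
        # Reject the Experiment up front if its algorithm settings are unusable;
        # on failure the Katib controller marks the Suggestion and Experiment as Failed.
        for setting in request.experiment.spec.algorithm.algorithm_settings:
            if setting.name == "population_size" and int(setting.value) < 1:  # illustrative check
                context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
                context.set_details("population_size must be >= 1")
                break
        return api_pb2.ValidateAlgorithmSettingsReply()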

@david-thrower

david-thrower commented Oct 19, 2022

@johnugeorge I see. This is where the problem I am faced with lies. When I have a massive parameter space where as many as 1/4 of the permutations of values are invalid, and I may be running 1000+ trials on a distributed cluster with IPUs, TPUs, or A100s, we will get an error status from 250 of the 1000 trials. That forces me to set a very large maxFailedTrialCount to keep the run from failing out, and if I do that, I have to ignore any other unforeseen errors that may arise (errors that should prompt me to abort the run and de-provision the cluster until I have the issue debugged).

What I had in mind was more of a pre-screening that precludes these foreseen invalid suggestions from being included in the oracles / metadata to begin with, or assigns them a separate status that isn't seen by Katib / Training as an error, so that the trial does not count towards maxFailedTrialCount. These are foreseen invalid trials, and they will inherently clutter the suggestions created by any algorithm that has a permutation / mutation / random-selection step. I hope my clarification makes sense.

@johnugeorge
Member

I see. But what is the real reason for this issue? Is it that the user configures an invalid parameter range for an algorithm?

@david-thrower

david-thrower commented Oct 19, 2022

Example:

minimum_skip_connections: [1:10]
maximum_skip_connections: [1:10]

There are both valid and invalid permutations of these in this range: any permutation where minimum < maximum is valid, and any other is invalid. Any single range I could set to preclude invalid permutations (e.g. setting minimum to [1:5] and maximum to [6:10]) would also eliminate many, if not most, valid permutations (e.g. minimum = 7, maximum = 9 in this case). The NAS I am developing has many min_/max_ pairs in its params, and each trial intrinsically needs both, because of an ensemble-like setup that needs both the minimum and the maximum to deduce the optimal solution(s) in each of the many ranges to be separately studied. There can be an exponential number of model-architecture-parameter plus traditional-hyperparameter permutations to sample from within the same narrow parameter range, so there is no practical way to make the trainer take a single straight number in lieu of the range. The only other workaround is to make the valid permutations of the pairs a list (within the train function) and do this:
import numpy as np

# Enumerate candidate (minimum, maximum) pairs and keep only the valid ones,
# then expose a single fused hyperparameter that indexes into the valid pairs.
# options[i][0] is the minimum, options[i][1] is the maximum.

options = np.random.randint(1, 11, size=(100, 2))             # random (min, max) pairs in [1, 10]
options = options[np.less(options[:, 0], options[:, 1]), :]   # keep only rows where min < max

# options now looks like:
# [[1, 2],
#  [1, 3], ...,
#  [9, 10]]

# hp is the tuner's hyperparameter handle, as in the original sketch;
# the choice is over indices into the valid pairs.
i = hp.Choice("min_and_max_skip_connections", list(range(len(options))))

min_skip_conn = options[i][0]
max_skip_conn = options[i][1]

# ... on to the code that parses the model from these params and numerous others ...

The problem with the approach above (using the engineered fused parameter "min_and_max_skip_connections") is that a [Bayesian | Hyperband | genetic] algorithm can't extract as strong a mathematical meaning about which values are likely optimal for the two individual parameters, nor predict where to best sample from next, as it could if it were sampling the two individual parameters separately (with the invalid options dropped and not influencing it). A human can't easily do so either: looking at a rectangular-coordinates plot of the single fused parameter, I can't see a meaningful pattern of which values of the two individual parameters are optimal and deduce a range without flipping back and forth between a printout of the list and whichever index numbers are saturated on the plot.

One unrealistic workaround would be to hard-code the trainer function to return infinity for the loss of invalid trials, but this would create a huge problem for every strategy other than grid and random search. Suppose the optimal solution is (max_skip_connections = 9, min_skip_connections = 7): if an early iteration / generation tried the nearby invalid combination (max_skip_connections = 7, min_skip_connections = 9) and got an infinite loss, any "smart" algorithm (population based, Hyperband, etc.) would be led to drop the surrounding space, including valid neighbours such as (max_skip_connections = 7, min_skip_connections = 5) and the optimum itself, unless it did an inordinately large number of random trials in the first iteration so that several nearby trials weighed against the failed one. That defeats the purpose of this step of the study, which is to quickly find promising ranges to investigate in more detail, and to eliminate dead ones, without great computational expense.

I hope the need for this is making sense now. It is a very complex issue that takes a verbose explanation to articulate, but it is nonetheless a real problem that is encumbering a lot of experiments; I have seen similar issues raised on the repos of various tuners. It is motivating me to write my own tuner, which I hope not to need to do, but I have accepted that this may be the de facto option, unless I just fall back on setting maxFailedTrialCount to a large number.

@johnugeorge
Member

Yes, it makes sense. The only option I can think of is to add these checks within the algorithm and skip if the parameters are invalid. But as you said, that should not be considered failed, so we will need a new status to indicate that the trial was skipped.

We will take this up in the next WG meeting.
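
To sketch the kind of check I mean (nothing here is existing Katib code; sample_fn and is_valid are hypothetical):

def propose_valid(sample_fn, is_valid, max_attempts=100):
    """Resample inside the suggestion algorithm until the assignment satisfies
    the user-declared constraint, so invalid combinations never become trials
    and never count towards maxFailedTrialCount."""
    for _ in range(max_attempts):
        candidate = sample_fn()
        if is_valid(candidate):
            return candidate
    raise RuntimeError("no valid assignment found; the constraint may be unsatisfiable")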

@david-thrower

@johnugeorge, I appreciate the attention to the issue. This is, no doubt, a weird issue, but I can foresee it becoming a more common requirement as more complex algorithms become "production material".

@andreyvelich
Member

The initial version of PBT was implemented by @a9p!
You can find the docs here: https://www.kubeflow.org/docs/components/katib/experiment/#population-based-training-pbt.
Thank you @a9p for the great contribution!
