
Support for PBT (Population Based Training) #1382

Closed
romeokienzler opened this issue Nov 9, 2020 · 24 comments

@romeokienzler

romeokienzler commented Nov 9, 2020

/kind feature

Support for PBT (Population Based Training)
Ray Tune recently came out with support for PBT, and DeepMind has shown exceptional performance with it. Are we considering supporting PBT in Katib as well?

https://arxiv.org/abs/1711.09846
https://deepmind.com/blog/article/population-based-training-neural-networks
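
For context, the core loop from the paper is roughly the exploit/explore cycle sketched below. This is only a minimal sketch; evaluate_fn and perturb_fn are hypothetical callables, and this is not Katib or Ray Tune code.

import copy
import random

def pbt_generation(population, evaluate_fn, perturb_fn):
    """One PBT generation over a population of {"weights", "hparams"} dicts:
    evaluate every member, then let the bottom quartile exploit (copy) and
    explore (perturb) members from the top quartile."""
    for member in population:
        member["score"] = evaluate_fn(member)  # partial training + evaluation
    population.sort(key=lambda m: m["score"], reverse=True)
    cutoff = max(1, len(population) // 4)
    for loser in population[-cutoff:]:
        winner = random.choice(population[:cutoff])
        loser["weights"] = copy.deepcopy(winner["weights"])  # exploit: copy the better member
        loser["hparams"] = {k: perturb_fn(v) for k, v in winner["hparams"].items()}  # explore: perturb
    return population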

@andreyvelich
Member

Thank you for this information @romeokienzler, I think it's very exciting.

We definitely should investigate it and try to adopt it in Katib.
What do you think, @gaocegege @johnugeorge?

@gaocegege
Member

My colleague @ezioliao is working on our internal AutoML system and he is interested in this issue.

@ezioliao

ezioliao commented Dec 28, 2020

> My colleague @ezioliao is working on our internal AutoML system and he is interested in this issue.

With pleasure. PBT is implemented in our own AutoML system, and I'm willing to work on this.

@andreyvelich
Member

That would be great!
Thank you @ezioliao, let us know if you need any help.

@andreyvelich
Member

@ezioliao Do you have the time and resources to work on this in 2021?
We can include PBT support in the 2021 Roadmap.

@stale

stale bot commented Jun 16, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@andreyvelich
Member

/lifecycle frozen

@andreyvelich
Member

/help

@google-oss-robot

@andreyvelich:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-robot added the help wanted label Jun 17, 2021
@johnugeorge
Member

/cc @richardsliu

@SunnyGhj

@andreyvelich Hello, I am very interested in this proposal. May I ask when it will be launched and how I can participate?

@romeokienzler
Author

@ezioliao

@johnugeorge
Member

@hongjunGu2019 Please see #1833

@SunnyGhj

> @hongjunGu2019 Please see #1833

Thanks

@david-thrower

david-thrower commented Oct 19, 2022

One enhancement I would propose on top of this is a parameter that lets a user configure the training task to filter out invalid permutations of parameter values from each generation's trials, given rules like minimum_skip_connection_depth < maximum_skip_connection_depth. This would remove planned trials that would throw ValueErrors before they are executed, and would make more elaborate algorithms practical to train with the training op.
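
For illustration, the kind of rule I have in mind is just a predicate applied to each candidate assignment before a trial is created. The is_valid helper and the parameter names below are hypothetical, not an existing Katib API.

def is_valid(assignment):
    # Drop any candidate where the min/max pair is inconsistent and would
    # raise a ValueError inside the training code.
    return assignment["minimum_skip_connection_depth"] < assignment["maximum_skip_connection_depth"]

candidates = [
    {"minimum_skip_connection_depth": 3, "maximum_skip_connection_depth": 7},
    {"minimum_skip_connection_depth": 8, "maximum_skip_connection_depth": 2},
]
planned_trials = [a for a in candidates if is_valid(a)]  # only the first candidate survives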

@johnugeorge
Member

@david-thrower We have an interface to add validation checks per suggestion algorithm. Ref: #1924

@david-thrower

@johnugeorge, I really appreciate you pointing that out, especially at this time of night in your time zone. If I am ever in town, I owe you a coffee drink. Sorry I missed that recent change; I'm glad to see it. I may get more sleep this week than I thought. I will look into it. To clarify, this removes suggestions rather than throwing exceptions, right?

@johnugeorge
Member

The ValidateAlgorithmSettings function is a common interface that can be implemented for any suggestion algorithm. If ValidateAlgorithmSettings fails, the Katib controller changes the Suggestion and Experiment status to Failed.

#1126
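
Roughly, a Python suggestion service implements the check like the sketch below. This is simplified: the class name and the population_size check are only illustrative, the import path is the repo-internal v1beta1 path, and the exact field names should be confirmed against the proto and the implementations linked from #1126.

import grpc
from pkg.apis.manager.v1beta1.python import api_pb2, api_pb2_grpc

class ExampleSuggestionService(api_pb2_grpc.SuggestionServicer):
    def ValidateAlgorithmSettings(self, request, context):
        # Reject the Experiment up front if its algorithm settings are unusable;
        # on failure the Katib controller marks the Suggestion and Experiment as Failed.
        for setting in request.experiment.spec.algorithm.algorithm_settings:
            if setting.name == "population_size" and int(setting.value) < 1:  # illustrative check
                context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
                context.set_details("population_size must be >= 1")
                break
        return api_pb2.ValidateAlgorithmSettingsReply()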

@david-thrower

david-thrower commented Oct 19, 2022

@johnugeorge I see. This is where the problem I am faced with lies. When I have a massive parameter space where as many as 1/4 of the permutations of values are invalid, and I may be running 1000+ trials on a distributed cluster with IPUs, TPUs, or A100s, we will get an error status from 250 of the 1000 trials. That forces me to set a very large maxFailedTrialCount to keep the run from failing out, and if I do that, I have to ignore any other unforeseen errors that may arise (errors that should prompt me to abort the run and de-provision the cluster until I have the issue debugged).

What I had in mind was more of a pre-screening that precludes these foreseen invalid suggestions from being included in the oracles / metadata to begin with, or assigns them a separate status that isn't seen by Katib / Training as an error, so that the trial does not count towards maxFailedTrialCount. These are foreseen invalid trials, and they will inherently clutter the suggestions created by any algorithm that has a permutation / mutation / random-selection step. I hope my clarification makes sense.

@johnugeorge
Member

I see. But what is the real reason for this issue? Is it that the user configures an invalid parameter range for an algorithm?

@david-thrower

david-thrower commented Oct 19, 2022

Example:

minimum_skip_connections: [1:10]
maximum_skip_connections: [1:10]

There are both valid and invalid permutations of these in this range: any permutation where minimum < maximum is valid, and any other is invalid. Any single range I could set to preclude invalid permutations (e.g. setting minimum to [1:5] and maximum to [6:10]) would also eliminate many, if not most, valid permutations (e.g. minimum = 7, maximum = 9 in this case). The NAS I am developing has many min_/max_ pairs in its params, and each trial intrinsically needs both, because of an ensemble-like setup that needs both the minimum and the maximum to deduce the optimal solution(s) in each of the many ranges to be separately studied. There can be an exponential number of model-architecture-parameter plus traditional-hyperparameter permutations to sample from within the same narrow parameter range, so there is no practical way to make the trainer take a single straight number in lieu of the range. The only other workaround is to make the valid permutations of the pairs a list (within the train function) and do this:
import numpy as np

# Enumerate candidate (minimum, maximum) pairs and keep only the valid ones,
# then expose a single fused hyperparameter that indexes into the valid pairs.
# options[i][0] is the minimum, options[i][1] is the maximum.

options = np.random.randint(1, 11, size=(100, 2))             # random (min, max) pairs in [1, 10]
options = options[np.less(options[:, 0], options[:, 1]), :]   # keep only rows where min < max

# options now looks like:
# [[1, 2],
#  [1, 3], ...,
#  [9, 10]]

# hp is the tuner's hyperparameter handle, as in the original sketch;
# the choice is over indices into the valid pairs.
i = hp.Choice("min_and_max_skip_connections", list(range(len(options))))

min_skip_conn = options[i][0]
max_skip_conn = options[i][1]

# ... on to the code that parses the model from these params and numerous others ...

The problem with the approach above (using the engineered fused parameter "min_and_max_skip_connections") is that a [Bayesian | Hyperband | genetic] algorithm can't extract as strong a mathematical meaning about which values are likely optimal for the two individual parameters, nor predict where to best sample from next, as it could if it were sampling the two individual parameters separately (with the invalid options dropped and not influencing it). A human can't easily do so either: looking at a rectangular-coordinates plot of the single fused parameter, I can't see a meaningful pattern of which values of the two individual parameters are optimal and deduce a range without flipping back and forth between a printout of the list and whichever index numbers are saturated on the plot.

One unrealistic workaround would be to hard-code the trainer function to return infinity for the loss of invalid trials, but this would create a huge problem for every strategy other than grid and random search. Suppose the optimal solution is (max_skip_connections = 9, min_skip_connections = 7): if an early iteration / generation tried the nearby invalid combination (max_skip_connections = 7, min_skip_connections = 9) and got an infinite loss, any "smart" algorithm (population based, Hyperband, etc.) would be led to drop the surrounding space, including valid neighbours such as (max_skip_connections = 7, min_skip_connections = 5) and the optimum itself, unless it did an inordinately large number of random trials in the first iteration so that several nearby trials weighed against the failed one. That defeats the purpose of this step of the study, which is to quickly find promising ranges to investigate in more detail, and to eliminate dead ones, without great computational expense.

I hope the need for this is making sense now. It is a very complex issue that takes a verbose explanation to articulate, but it is nonetheless a real problem that is encumbering a lot of experiments; I have seen similar issues raised on the repos of various tuners. It is motivating me to write my own tuner, which I hope not to need to do, but I have accepted that this may be the de facto option, unless I just fall back on setting maxFailedTrialCount to a large number.

@johnugeorge
Member

Yes, it makes sense. The only option I can think of is to add these checks within the algorithm and skip if the parameters are invalid. But as you said, that should not be considered failed, so we will need a new status to indicate that the trial was skipped.

We will take this up in the next WG meeting.
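
To sketch the kind of check I mean (nothing here is existing Katib code; sample_fn and is_valid are hypothetical):

def propose_valid(sample_fn, is_valid, max_attempts=100):
    """Resample inside the suggestion algorithm until the assignment satisfies
    the user-declared constraint, so invalid combinations never become trials
    and never count towards maxFailedTrialCount."""
    for _ in range(max_attempts):
        candidate = sample_fn()
        if is_valid(candidate):
            return candidate
    raise RuntimeError("no valid assignment found; the constraint may be unsatisfiable")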

@david-thrower

@johnugeorge, I appreciate the attention to the issue. This is, no doubt, a weird issue, but I can foresee it becoming a more common requirement as more complex algorithms become "production material".

@andreyvelich
Member

The initial version of PBT was implemented by @a9p!
You can find the docs here: https://www.kubeflow.org/docs/components/katib/experiment/#population-based-training-pbt.
Thank you @a9p for the great contribution!
