[tune] Limit maximum number of pending trials. Add convergence test. #14835
Conversation
```diff
@@ -160,7 +160,8 @@ def __init__(self,

         self._avail_resources = Resources(cpu=0, gpu=0)
         self._committed_resources = Resources(cpu=0, gpu=0)
-        self._pg_manager = PlacementGroupManager()
+        self._pg_manager = PlacementGroupManager(
+            prefix=os.getenv("TUNE_PLACEMENT_GROUP_PREFIX", "__tune__"))
```
should we add a hex to differentiate among different Tune runs?
This is just the prefix here - it's used to identify leftover placement groups to remove before starting a new Tune run. It therefore has to be constant across runs, otherwise removal wouldn't make sense. The trial PGs themselves already use unique hex identifiers.
I guess one possibility would be to create a hex, store it in a global variable, and re-use it for sequential runs. Effectively this would mean the auto-removal process is only triggered for sequential runs (such as in our tests). Parallel trials in different remote functions would work out of the box. Parallel trials using shared global state (threads?) would still interfere, but they already do today.
Hm, this might be a good idea - I'll think about it a bit more.
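For illustration, a rough sketch of that idea - a constant, overridable base prefix plus a per-process hex cached in a module-level variable. The helper name and prefix format below are made up for this sketch, not the actual Tune internals:

```python
import os
import uuid

# Hypothetical module-level cache, reused by sequential runs in the same process.
_tune_pg_prefix = None


def get_tune_pg_prefix():
    """Return a placement group prefix that is stable within this process.

    The base prefix stays constant, so leftover placement groups from earlier
    sequential runs in the same process can be found and removed. The hex part
    distinguishes runs started from different processes, so parallel runs in
    separate remote functions don't remove each other's placement groups.
    """
    global _tune_pg_prefix
    if _tune_pg_prefix is None:
        base = os.getenv("TUNE_PLACEMENT_GROUP_PREFIX", "__tune__")
        _tune_pg_prefix = f"{base}{uuid.uuid4().hex[:8]}__"
    return _tune_pg_prefix
```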
I ended up implementing this - I'll see if the tests pass, but the examples in the issues run for me without problems (other than setting a separate logdir).
Why are these changes needed?
Currently we're generating all trials at once if the searcher (or a concurrency limiter, as sketched below) doesn't enforce limits. For most search algorithms this effectively degrades to random search (see #14770).
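As a hedged illustration of the concurrency limiter mentioned above (module paths follow the Ray 1.x layout current at the time of this PR, hyperopt is assumed to be installed, and all values are arbitrary):

```python
from ray import tune
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.suggest.hyperopt import HyperOptSearch


def objective(config):
    # Toy objective with its minimum at x = 2.
    tune.report(loss=(config["x"] - 2) ** 2)


# Without an explicit limit (or this PR's default cap on pending trials), an
# unconstrained searcher may be asked for all suggestions up front, which
# effectively degrades to random search.
search_alg = ConcurrencyLimiter(HyperOptSearch(), max_concurrent=4)

tune.run(
    objective,
    config={"x": tune.uniform(-10.0, 10.0)},
    metric="loss",
    mode="min",
    search_alg=search_alg,
    num_samples=20)
```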
This PR sets the default maximum number of pending trials to 1 for all search algorithms except random/grid search. To trigger more aggressive autoscaling behavior, the `TUNE_MAX_PENDING_TRIALS_PG` environment variable has to be set. To make sure resource-limited parallelism still leads to convergence, we added a specific convergence test for all searchers.
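For reference, a minimal sketch of opting into the more aggressive behavior; only the variable name comes from this PR, the value is an arbitrary example:

```python
import os

# Allow up to 8 pending trials so the autoscaler sees additional resource
# demand ("8" is purely illustrative, not a recommendation from this PR).
os.environ["TUNE_MAX_PENDING_TRIALS_PG"] = "8"

# ... then call tune.run(...) as usual.
```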
This PR also piggybacks a couple of minor changes, e.g. enabling running multiple Ray Tune trials in parallel with placement groups (#14557), and adds some minor searcher improvements.
Related issue number
Closes #14770
Closes #14568 (see also #14559)
Addresses #13817
Addresses #14557
Checks
I've run scripts/format.sh to lint the changes in this PR.