[tune] Limit maximum number of pending trials. Add convergence test. #14835
Conversation
```diff
@@ -160,7 +160,8 @@ def __init__(self,

         self._avail_resources = Resources(cpu=0, gpu=0)
         self._committed_resources = Resources(cpu=0, gpu=0)
-        self._pg_manager = PlacementGroupManager()
+        self._pg_manager = PlacementGroupManager(
+            prefix=os.getenv("TUNE_PLACEMENT_GROUP_PREFIX", "__tune__"))
```
should we add a hex to differentiate among different Tune runs?
This is just the prefix here - it's used to identify leftover placement groups to remove before starting a new Tune run. It therefore has to be constant across runs, otherwise removal wouldn't make sense. The trial PGs themselves already use unique hex identifiers.
I guess one possibility would be to create a hex, store it in a global variable, and re-use it for sequential runs. Effectively this would mean the auto-removal process is only triggered for sequential runs (such as in our tests). Parallel trials in different remote functions would work out of the box. Parallel trials using shared global state (threads?) would still interfere, but they already do today.
Hm, this might be a good idea - I'll think about it a bit more.
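For illustration, a rough sketch of that idea - a constant, overridable base prefix plus a per-process hex cached in a module-level variable. The helper name and prefix format below are made up for this sketch, not the actual Tune internals:

```python
import os
import uuid

# Hypothetical module-level cache, reused by sequential runs in the same process.
_tune_pg_prefix = None


def get_tune_pg_prefix():
    """Return a placement group prefix that is stable within this process.

    The base prefix stays constant, so leftover placement groups from earlier
    sequential runs in the same process can be found and removed. The hex part
    distinguishes runs started from different processes, so parallel runs in
    separate remote functions don't remove each other's placement groups.
    """
    global _tune_pg_prefix
    if _tune_pg_prefix is None:
        base = os.getenv("TUNE_PLACEMENT_GROUP_PREFIX", "__tune__")
        _tune_pg_prefix = f"{base}{uuid.uuid4().hex[:8]}__"
    return _tune_pg_prefix
```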
I ended up implementing this - I'll see if the tests pass, but the examples in the issues run for me without problems (other than setting a separate logdir).
Why are these changes needed?
Currently we're generating all trials at once if the searcher (or a concurrency limiter, as sketched below) doesn't enforce limits. For most search algorithms this effectively degrades to random search (see #14770).
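As a hedged illustration of the concurrency limiter mentioned above (module paths follow the Ray 1.x layout current at the time of this PR, hyperopt is assumed to be installed, and all values are arbitrary):

```python
from ray import tune
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.suggest.hyperopt import HyperOptSearch


def objective(config):
    # Toy objective with its minimum at x = 2.
    tune.report(loss=(config["x"] - 2) ** 2)


# Without an explicit limit (or this PR's default cap on pending trials), an
# unconstrained searcher may be asked for all suggestions up front, which
# effectively degrades to random search.
search_alg = ConcurrencyLimiter(HyperOptSearch(), max_concurrent=4)

tune.run(
    objective,
    config={"x": tune.uniform(-10.0, 10.0)},
    metric="loss",
    mode="min",
    search_alg=search_alg,
    num_samples=20)
```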
This PR sets the default maximum number of pending trials to 1 for all search algorithms except random/grid search. To trigger more aggressive autoscaling behavior, the `TUNE_MAX_PENDING_TRIALS_PG` environment variable has to be set. To make sure resource-limited parallelism still leads to convergence, we added a specific convergence test for all searchers.
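For reference, a minimal sketch of opting into the more aggressive behavior; only the variable name comes from this PR, the value is an arbitrary example:

```python
import os

# Allow up to 8 pending trials so the autoscaler sees additional resource
# demand ("8" is purely illustrative, not a recommendation from this PR).
os.environ["TUNE_MAX_PENDING_TRIALS_PG"] = "8"

# ... then call tune.run(...) as usual.
```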
This PR also piggybacks a couple of minor changes, e.g. enabling running multiple Ray Tune trials in parallel with placement groups (#14557), and adds some minor searcher improvements.
Related issue number
Closes #14770
Closes #14568 (see also #14559)
Addresses #13817
Addresses #14557
Checks
I've run scripts/format.sh to lint the changes in this PR.