
[tune] Limit maximum number of pending trials. Add convergence test. #14835

Merged · 8 commits · Mar 24, 2021

Conversation

krfricke (Contributor)

Why are these changes needed?

Currently we generate all trials at once if the searcher (or a concurrency limiter) doesn't enforce limits. This effectively turns most search algorithms into random search, because suggestions are made before any trial results are available to inform them (see #14770).
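
For reference, the existing workaround is to cap concurrency explicitly; a minimal sketch assuming the `ray.tune.suggest` API at the time of this PR (the searcher choice and `my_trainable` are illustrative placeholders):

```python
from ray import tune
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.suggest.hyperopt import HyperOptSearch  # placeholder searcher

# Without an explicit limit, the searcher may be asked for all suggestions
# up front, before any finished trials can inform later suggestions.
searcher = ConcurrencyLimiter(
    HyperOptSearch(metric="loss", mode="min"), max_concurrent=4)

tune.run(my_trainable, search_alg=searcher, num_samples=50)
```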

This PR sets the default maximum number of pending trials to 1 for all search algorithms except random and grid search. To trigger more aggressive autoscaling behavior, the TUNE_MAX_PENDING_TRIALS_PG environment variable has to be set explicitly.
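
For example, a user who wants more pending trials (and therefore more autoscaler demand up front) could set the variable before starting the run; this is a rough sketch, where the value 16 and `my_trainable` are only placeholders:

```python
import os

# Allow up to 16 pending trials instead of the new default of 1, so the
# autoscaler sees more resource demand up front.
os.environ["TUNE_MAX_PENDING_TRIALS_PG"] = "16"

from ray import tune

tune.run(my_trainable, num_samples=100)
```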

To make sure that resource-limited parallelism still leads to convergence, this PR adds a dedicated convergence test that runs for all searchers.
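
The general shape of such a test is roughly the following; this is a hedged sketch rather than the exact test added in this PR, and the objective, searcher, sample count, and tolerance are placeholders:

```python
from ray import tune
from ray.tune.suggest.hyperopt import HyperOptSearch  # placeholder searcher

def objective(config):
    # Simple convex objective with a known optimum at x = 3.
    tune.report(loss=(config["x"] - 3) ** 2)

def test_searcher_converges():
    analysis = tune.run(
        objective,
        config={"x": tune.uniform(0.0, 10.0)},
        search_alg=HyperOptSearch(metric="loss", mode="min"),
        num_samples=30,
    )
    best = analysis.get_best_config(metric="loss", mode="min")
    # With limited pending trials, suggestions can use earlier results,
    # so the searcher should home in on the optimum.
    assert abs(best["x"] - 3) < 1.0
```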

This PR also piggybacks a couple of minor changes, e.g. enabling multiple Ray Tune trials to run in parallel with placement groups (#14557), and adds some minor searcher improvements.

Related issue number

Closes #14770
Closes #14568 (see also #14559)
Addresses #13817
Addresses #14557

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -160,7 +160,8 @@ def __init__(self,

         self._avail_resources = Resources(cpu=0, gpu=0)
         self._committed_resources = Resources(cpu=0, gpu=0)
-        self._pg_manager = PlacementGroupManager()
+        self._pg_manager = PlacementGroupManager(
+            prefix=os.getenv("TUNE_PLACEMENT_GROUP_PREFIX", "__tune__"))
Contributor

Should we add a hex to differentiate among different Tune runs?

Contributor Author

This is just the prefix; it's used to identify leftover placement groups to remove before starting a new Tune run, so it has to be constant across runs, otherwise removal wouldn't make sense. The trial PGs themselves already use unique hex identifiers.

I guess one possibility would be to create a hex, store it in a global variable, and re-use it for sequential runs. Effectively this would mean the auto-removal process is only triggered for sequential runs (such as in our tests). Parallel trials in different remote functions would work out of the box. Parallel trials using shared global state (threads?) would still interfere, but they already do that today.

Hm, this might be a good idea, I'll think about it a bit more.

Contributor Author

I ended up implementing this. I'll see if the tests pass, but the examples from the issues run for me without problems (other than having to set a separate logdir).
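
For readers following along, the idea discussed above could look roughly like this; this is a hypothetical illustration rather than the code merged in this PR, and the function and variable names are made up:

```python
import os
import uuid

# Hypothetical sketch: a prefix that stays constant within one driver process
# (so sequential Tune runs can still find and remove leftover placement
# groups) but differs between independent drivers running in parallel.
_tune_pg_prefix = None

def get_tune_pg_prefix():
    global _tune_pg_prefix
    if _tune_pg_prefix is None:
        # An explicit env var override wins; otherwise generate a
        # per-process hex suffix once and reuse it for later runs.
        _tune_pg_prefix = os.getenv(
            "TUNE_PLACEMENT_GROUP_PREFIX",
            "__tune_{}__".format(uuid.uuid4().hex[:8]))
    return _tune_pg_prefix
```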
