[tune/placement groups] multiple parallel tune runs don't work with placement groups #14557

Closed · richardliaw opened this issue Mar 9, 2021 · 6 comments
Labels: bug (Something that is supposed to be working; but isn't), P2 (Important issue, but not time-critical), tune (Tune-related issues)

richardliaw (Contributor) commented Mar 9, 2021

What is the problem?

The following configuration will fail on master:

```python
import ray
from ray import tune

ray.init(address="auto")

def f(cfg):
    return {}

@ray.remote
def experiment():
    tune.run(f, resources_per_trial={"cpu": 1, "gpu": 1})
    return True

ray.get([experiment.remote() for i in range(10)])
```

The error message:

```python
Traceback (most recent call last):
  File "test.py", line 18, in <module>
    ray.get([experiment.remote() for i in range(10)])
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1428, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::experiment() (pid=421, ip=10.92.1.6)
  File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
  File "test.py", line 15, in experiment
    tune.run(f, num_samples=10, resources_per_trial={"cpu": 1, "gpu": 1})
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 520, in run
    runner.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 486, in step
    self.trial_executor.stage_and_update_status(self._trials)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 208, in stage_and_update_status
    self._pg_manager.cleanup_existing_pg()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/placement_groups.py", line 279, in cleanup_existing_pg
    pg = get_placement_group(info["name"])
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/placement_group.py", line 231, in get_placement_group
    f"Failed to look up actor with name: {placement_group_name}")
ValueError: Failed to look up actor with name: _tune__6d2dcca7
```

cc @krfricke @rkooo567

Originally posted by @ANarayan

richardliaw added the bug and triage labels on Mar 9, 2021
ANarayan commented Mar 9, 2021

The same error is encountered when running the following code on the head node of a Ray cluster:

```python
import ray
from ray import tune
import socket
import os

ray.init(address="auto")
hostname = socket.gethostbyname(socket.gethostname())

def f(cfg):
    while True:
        import time; time.sleep(1)
        tune.report(1)

@ray.remote(num_cpus=0, resources={f"node:{hostname}": 0.001})
def experiment():
    tune.run(f, num_samples=10, resources_per_trial={"cpu": 1, "gpu": 1})
    return True

ray.get([experiment.remote() for i in range(10)])
```

krfricke (Contributor) commented Mar 9, 2021

The command fails because Tune tries to clean up existing Tune placement groups before creating new ones. This is desirable for sequential runs, but breaks parallel execution.

Setting the TUNE_PLACEMENT_GROUP_CLEANUP_DISABLED environment variable disables this cleanup, and the code should then work:

```python
import ray
import os
from ray import tune

ray.init()


def f(cfg):
    return {}


@ray.remote
def experiment():
    os.environ["TUNE_PLACEMENT_GROUP_CLEANUP_DISABLED"] = "1"
    tune.run(f, resources_per_trial={"cpu": 1, "gpu": 0})
    return True


ray.get([experiment.remote() for i in range(10)])
```

(Please note that the environment variable has to be set within the remote function, or before `ray start` is called on the worker nodes.)

We also theoretically support different name prefixes for different runs, but there is currently no way to set these. I can address this soon, but for now the above solution should unblock you.

richardliaw added the P2 and tune labels and removed the triage label on Mar 19, 2021
richardliaw (Contributor, Author) commented

@krfricke assigning this to you now, but feel free to set the priority of the task :)

krfricke (Contributor) commented

I'll try to get to it next week; it should be a pretty short fix.

krfricke (Contributor) commented

Hey @ANarayan, this issue should be fixed by #14835. Can you try it out and see if it works for you?

You should probably pass a distinct name or logdir to tune.run() to make sure the logs of the different runs don't interfere with each other.
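
For example, a minimal sketch of what that could look like (the specific `name` and `local_dir` values below are placeholders, not taken from this issue):

```python
import ray
from ray import tune


def f(cfg):
    return {}


@ray.remote
def experiment(i):
    # Give each parallel run a unique experiment name (and, optionally,
    # its own results directory) so the trial logs do not collide.
    tune.run(
        f,
        name=f"parallel_run_{i}",     # placeholder: unique name per run
        local_dir="~/ray_results",    # placeholder: base results directory
        resources_per_trial={"cpu": 1},
    )
    return True


ray.init(address="auto")
ray.get([experiment.remote(i) for i in range(10)])
```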

krfricke (Contributor) commented

Closing this for now, as it should have been fixed by #14835.
