[tune/placement groups] multiple parallel tune runs don't work with placement groups #14557

Closed · richardliaw opened this issue Mar 9, 2021 · 6 comments
Labels: bug (Something that is supposed to be working; but isn't), P2 (Important issue, but not time-critical), tune (Tune-related issues)

richardliaw (Contributor) commented Mar 9, 2021

What is the problem?

The following configuration will fail on master:

```python
import ray
from ray import tune

ray.init(address="auto")

def f(cfg):
    return {}

@ray.remote
def experiment():
    tune.run(f, resources_per_trial={"cpu": 1, "gpu": 1})
    return True

ray.get([experiment.remote() for i in range(10)])
```

The error message:

```python
Traceback (most recent call last):
  File "test.py", line 18, in <module>
    ray.get([experiment.remote() for i in range(10)])
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1428, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::experiment() (pid=421, ip=10.92.1.6)
  File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
  File "test.py", line 15, in experiment
    tune.run(f, num_samples=10, resources_per_trial={"cpu": 1, "gpu": 1})
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 520, in run
    runner.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 486, in step
    self.trial_executor.stage_and_update_status(self._trials)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 208, in stage_and_update_status
    self._pg_manager.cleanup_existing_pg()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/placement_groups.py", line 279, in cleanup_existing_pg
    pg = get_placement_group(info["name"])
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/placement_group.py", line 231, in get_placement_group
    f"Failed to look up actor with name: {placement_group_name}")
ValueError: Failed to look up actor with name: _tune__6d2dcca7
```

cc @krfricke @rkooo567

Originally posted by @ANarayan

richardliaw added the bug and triage labels on Mar 9, 2021
ANarayan commented Mar 9, 2021

The same error is encountered when running the following code on the head node of a Ray cluster:

```python
import ray
from ray import tune
import socket
import os

ray.init(address="auto")
hostname = socket.gethostbyname(socket.gethostname())

def f(cfg):
    while True:
        import time; time.sleep(1)
        tune.report(1)

@ray.remote(num_cpus=0, resources={f"node:{hostname}": 0.001})
def experiment():
    tune.run(f, num_samples=10, resources_per_trial={"cpu": 1, "gpu": 1})
    return True

ray.get([experiment.remote() for i in range(10)])
```

krfricke (Contributor) commented Mar 9, 2021

The command fails because Tune tries to clean up existing Tune placement groups before creating new ones. This is desirable for sequential runs, but breaks parallel execution.

Setting the TUNE_PLACEMENT_GROUP_CLEANUP_DISABLED environment variable disables this cleanup, and the code should then work:

```python
import ray
import os
from ray import tune

ray.init()


def f(cfg):
    return {}


@ray.remote
def experiment():
    os.environ["TUNE_PLACEMENT_GROUP_CLEANUP_DISABLED"] = "1"
    tune.run(f, resources_per_trial={"cpu": 1, "gpu": 0})
    return True


ray.get([experiment.remote() for i in range(10)])
```

(Please note that the environment variable has to be set within the remote function, or before `ray start` is called on the worker nodes.)

We also theoretically support different name prefixes for different runs, but there is currently no way to set these. I can address this soon, but for now the above solution should unblock you.

richardliaw added the P2 and tune labels and removed the triage label on Mar 19, 2021
richardliaw (Contributor, Author) commented

@krfricke assigning this to you now, but feel free to set the priority of the task :)

krfricke (Contributor) commented

I'll try to get to it next week; it should be a pretty short fix.

krfricke (Contributor) commented

Hey @ANarayan, this issue should be fixed by #14835. Can you try it out and see if it works for you?

You should probably pass a distinct name or logdir to tune.run() to make sure the logs of the different runs don't interfere with each other.
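
For example, a minimal sketch of what that could look like (the specific `name` and `local_dir` values below are placeholders, not taken from this issue):

```python
import ray
from ray import tune


def f(cfg):
    return {}


@ray.remote
def experiment(i):
    # Give each parallel run a unique experiment name (and, optionally,
    # its own results directory) so the trial logs do not collide.
    tune.run(
        f,
        name=f"parallel_run_{i}",     # placeholder: unique name per run
        local_dir="~/ray_results",    # placeholder: base results directory
        resources_per_trial={"cpu": 1},
    )
    return True


ray.init(address="auto")
ray.get([experiment.remote(i) for i in range(10)])
```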

krfricke (Contributor) commented

Closing this for now, as it should have been fixed by #14835.
