
User Ray and Sky Ray version conflicts #671

Closed
gmittal opened this issue Mar 30, 2022 · 7 comments · Fixed by #1790

Comments

@gmittal
Collaborator

gmittal commented Mar 30, 2022

Justin’s requirements.txt will install ray. If the installed version of ray does not match our ray version, it will cause problems. Maybe we should encourage the user to use a conda environment?
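
A minimal sketch of that suggestion, assuming the user's requirements.txt pins its own ray (the env name and Python version below are placeholders, not from this thread):

    # Keep the user's deps out of the env that runs Sky's Ray runtime.
    conda create -y -n task-env python=3.9     # hypothetical env name / version
    conda activate task-env
    pip install -r requirements.txt            # its ray pin stays isolated from Sky's Ray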

@michaelzhiluo
Collaborator

It looks like for Justin, the version of Ray doesn't matter.

@concretevitamin
Member

Running into this for Balsa.

(sky-cmd pid=11365) RuntimeError: Version mismatch: The cluster was started with:
(sky-cmd pid=11365)     Ray: 1.10.0
(sky-cmd pid=11365)     Python: 3.9.4
(sky-cmd pid=11365) This process on node 172.31.65.254 was started with:
(sky-cmd pid=11365)     Ray: 1.9.2
(sky-cmd pid=11365)     Python: 3.7.12

The former is Sky's remote Ray version (controlled by us) plus the Python version (uncontrolled by us; it depends on the AMI!). The latter comes from this task's activated conda environment.

For some reason, I didn't see this error on this same cluster before today.
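
A quick way to see which versions the task actually picks up (a sketch; run inside the activated task conda env on the cluster):

    # Should print the versions the cluster was started with (here: 1.10.0 and 3.9.4).
    python -c "import ray, sys; print(ray.__version__, sys.version.split()[0])"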

@concretevitamin
Member

concretevitamin commented Apr 23, 2022

Update: To get around the above, I had to manually make the Sky task's conda env use the cluster's Ray/Python versions.

Scratch that. This remains a problem. After I made the Sky task's conda env use Python 3.9.4, installing requirements.txt failed due to an old dep, torch==1.4.0, not being supported.

Update: manually got the task running by installing Ray + Sky inside the Sky task's conda env.
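
Roughly, that workaround looks like the following (the env name is a placeholder, the versions are the ones from the error above, and the thread does not spell out how Sky itself gets installed into the env):

    conda activate task-env          # hypothetical task env name
    conda install -y python=3.9.4    # match the cluster's Python
    pip install ray==1.10.0          # match the cluster's Ray
    # ...then install Sky into this same env via its usual install instructions.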

@concretevitamin
Member

This error

RuntimeError: Version mismatch: The cluster was started with:
    Ray: 1.10.0
    Python: 3.9.4
This process on node 192.168.15.204 was started with:
    Ray: 1.10.0
    Python: 3.9.12

was hit again by @pounde. After a sky.exec() that ran sudo conda install --file <file>, the system Python changed from 3.9.4 (the version used to launch the Sky/Ray runtime) to 3.9.12.

@pounde
Collaborator

pounde commented Jul 18, 2022

I can confirm that installing the packages into a new environment negates the problem, i.e.,
conda create -n <env-name> --file <req-file.txt>
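
Spelled out a bit more (the env name and Python pin below are assumptions; conda create needs a name after -n):

    conda create -y -n user-env python=3.9.4 --file req-file.txt
    conda activate user-env          # run the task's commands from this env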

@Michaelvll
Collaborator

Another user requested to launch their own Ray cluster on the machines launched by SkyPilot:

The rough setup is as follows:

  • I need to launch our own copy of the ray clusters inside singularity inside bazel, with all our dependencies available and loaded.
  • For each job submitted by sky, I launch a separate ray cluster, since each job may have their own bazel-built dependencies that need to be made available to the ray workers.
  • Ideally, I’d like to stop any existing ray clusters (launched by us, not sky) before a job starts, and after a job is finished.
  • Because singularity has less isolation than docker, I can’t just call ray stop since it also stops the ray cluster launched by sky running on the host VMs directly.

As suggested by the user, we can launch our Ray cluster on non-default ports, so that a user starting their own Ray cluster does not run into ports already taken by ours. It may need some investigation to make sure that multiple Ray clusters can work well with each other (we also need instructions in the FAQ).
https://docs.ray.io/en/latest/ray-core/configure.html#ports-configurations
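
Following that doc, a rough sketch of what a user-owned cluster on non-default ports could look like (port numbers are arbitrary examples, not a tested configuration):

    # Start a second, user-owned head node beside the Sky-launched Ray cluster.
    ray start --head --port=6380 --dashboard-port=8266
    # Workers join the user's cluster explicitly by address:port.
    ray start --address=<head-ip>:6380
    # Caveat from the user above: a plain `ray stop` on the node still stops every
    # Ray process, including Sky's, so the user cluster needs to be torn down another way.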

@Michaelvll Michaelvll added the P1 label Dec 28, 2022
@Michaelvll Michaelvll added the P0 label Mar 16, 2023
@Michaelvll
Collaborator

Michaelvll commented Mar 16, 2023

Another potential user is blocked by this, as their program depends on a specific version of Ray (2.2).
