[Core] Install SkyPilot runtime in separate env #2801
Conversation
Thanks for the fix! I just briefly went through the PR (& the related issue) and have two discussion points first:
- Will our ray (v2.7.0) use the same ray address & port as the user-installed ray? If so, will there be any compatibility issues when two different versions of ray act on the same interface? (I assume it is okay since ray should be backward compatible, but just want to make sure.)
- Can we consider adding constants.ACTIVATE_PYTHON_ENV to SSHCommandRunner.run? It seems like it would make the code cleaner.
We are using the same version of ray in the new venv, and the ray address and port were already changed to non-default values in a previous PR (#1790).
This is a good point! However, the activation command needs to be added case by case, and some commands are not directly invoked by SSHCommandRunner.run.
Makes sense! Let's keep the current implementation 🫡
Is there a way to disable this feature (via the Python API) to get the existing behavior? We use SkyPilot within the code running on the launched cluster, so this isolation is not needed for us and it even increases our start time further (because we then need to install SkyPilot and start Ray again in our own environment).
Thanks for the request @dongreenberg! Is it possible to activate the environment
That would work! Will the command runner allow specifying a conda environment to run in?
Currently, we have decided not to add such an argument to the command runner, to keep the API cleaner and the commands self-contained. Do you think it would be possible to do something like the following instead?

def custom_run(runner, cmd, *args, **kwargs):
    cmd = f'source ~/skypilot-runtime/bin/activate; {cmd}'
    return runner.run(cmd, *args, **kwargs)
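For example, a hedged usage sketch (assuming the runner is an SSHCommandRunner and ~/skypilot-runtime is the runtime venv path; the constructor arguments shown are illustrative and may differ by version):

# Illustrative usage of the wrapper above; the IP, SSH user and key are examples.
from sky.utils.command_runner import SSHCommandRunner

runner = SSHCommandRunner('1.2.3.4', ssh_user='ubuntu',
                          ssh_private_key='~/.ssh/sky-key')
# Runs `ray status` inside the SkyPilot runtime venv on the remote VM.
returncode = custom_run(runner, 'ray status')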
Another user has requested this for the PR:
Thanks for adding this feature! Besides those comments, is it possible to automatically add the activation commands? It feels like this PR will introduce a lot of hazards for follow-up PRs. Also, could we have a smoke/unit test for it?
@@ -66,19 +74,38 @@
    DOCKER_SERVER_ENV_VAR,
}

ACTIVATE_PYTHON_ENV = (f'[ -d {SKY_REMOTE_PYTHON_ENV} ] && '
In what case will the remote python env not exist? Should we print some warning here?
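For reference, a minimal sketch of what the full constant could look like, assuming the venv lives at ~/skypilot-runtime (the exact string in the PR may differ):

# Sketch only; the PR's actual constant may differ.
SKY_REMOTE_PYTHON_ENV_NAME = 'skypilot-runtime'  # assumed name
SKY_REMOTE_PYTHON_ENV = f'~/{SKY_REMOTE_PYTHON_ENV_NAME}'
# Activate the runtime venv if it exists; on clusters provisioned before this
# change the directory is absent, and the `[ -d ... ]` guard lets the command
# fall through to the system Python instead of failing.
ACTIVATE_PYTHON_ENV = (f'[ -d {SKY_REMOTE_PYTHON_ENV} ] && '
                       f'source {SKY_REMOTE_PYTHON_ENV}/bin/activate;')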
@@ -677,6 +677,7 @@ def spot_launch(
    # Note: actual spot cluster name will be <task.name>-<spot job ID>
    'dag_name': dag.name,
    'retry_until_up': retry_until_up,
    'skypilot_runtime_env': constants.SKY_REMOTE_PYTHON_ENV,
Should we put this variable into controller_utils.shared_controller_vars_to_fill, since it is used by both the serve and spot controllers?
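A rough sketch of that refactor, assuming controller_utils exposes a helper returning the template variables shared by both controllers (names and other entries here are illustrative, not taken from this PR):

# Illustrative sketch of the suggested refactor.
from typing import Any, Dict

from sky.skylet import constants


def shared_controller_vars_to_fill() -> Dict[str, Any]:
    """Template variables common to the spot and serve controllers (sketch)."""
    return {
        # Filled in one place instead of separately in spot_launch() and the
        # serve controller launch path.
        'skypilot_runtime_env': constants.SKY_REMOTE_PYTHON_ENV,
    }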
@@ -13,12 +13,13 @@ def restart_skylet():
    # TODO(zhwu): make the killing graceful, e.g., use a signal to tell
    # skylet to exit, instead of directly killing it.
    subprocess.run(
        'ps aux | grep "sky.skylet.skylet" | grep "python3 -m"'
        'ps aux | grep "sky.skylet.skylet" | grep "python -m"'
Just curious, is there any reason to make this change?
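For context, a hedged sketch of what the surrounding restart_skylet() logic could look like after this change; the kill-and-relaunch details below are assumed rather than taken from the diff:

# Sketch only; the actual restart logic in the PR may differ.
import subprocess

from sky.skylet import constants  # ACTIVATE_PYTHON_ENV is added by this PR.


def restart_skylet():
    # The grep now matches "python -m", presumably because the skylet is
    # launched with the runtime venv's `python` instead of the system
    # `python3`.
    subprocess.run(
        'ps aux | grep "sky.skylet.skylet" | grep "python -m" '
        '| awk \'{print $2}\' | xargs kill >> ~/.sky/skylet.log 2>&1; '
        f'{constants.ACTIVATE_PYTHON_ENV} '
        'nohup python -m sky.skylet.skylet >> ~/.sky/skylet.log 2>&1 &',
        shell=True,
        check=False)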
# TODO(mluo): Make explicit `sky launch -c <name> ''` optional.
UNINITIALIZED_ONPREM_CLUSTER_MESSAGE = (
    'Found uninitialized local cluster {cluster}. Run this '
    'command to initialize it locally: sky launch -c {cluster} \'\'')
IIUC we have already deprecated local clusters? Should we remove this instead?
@@ -121,4 +122,5 @@ def is_autostopping(cls) -> str:
    def _build(cls, code: List[str]) -> str:
        code = cls._PREFIX + code
        code = ';'.join(code)
        return f'python3 -u -c {shlex.quote(code)}'
        return (f'{constants.ACTIVATE_PYTHON_ENV} '
Actually, why not use sky/skylet/constants.py::run_in_python_env to replace all of these constants.ACTIVATE_PYTHON_ENV occurrences?
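The suggested helper is not shown in this diff; a minimal sketch of how it could work, assuming it sits next to ACTIVATE_PYTHON_ENV in sky/skylet/constants.py:

# Hypothetical helper reflecting the reviewer's suggestion, not code from this PR.
def run_in_python_env(cmd: str) -> str:
    """Return `cmd` prefixed so it executes inside the SkyPilot runtime env."""
    return f'{ACTIVATE_PYTHON_ENV} {cmd}'

With that, _build could simply return run_in_python_env(f'python3 -u -c {shlex.quote(code)}').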
'grep "# >>> conda initialize >>>" ~/.bashrc || conda init;' | ||
# Create a separate conda environment for SkyPilot dependencies. | ||
f'[ -d {SKY_REMOTE_PYTHON_ENV} ] || ' | ||
f'python -m venv {SKY_REMOTE_PYTHON_ENV}; ' |
Have we considered using a conda env instead? What are the pros & cons of the two implementations?
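For comparison, a conda-based variant of the same step might look roughly like this (illustrative only; the names and Python version are assumptions, and the PR itself uses `python -m venv` as shown above):

# Illustrative conda-based alternative to the venv creation above.
SKY_REMOTE_PYTHON_ENV_NAME = 'skypilot-runtime'  # assumed env name
CREATE_SKY_CONDA_ENV = (
    f'conda env list | grep -q "^{SKY_REMOTE_PYTHON_ENV_NAME} " || '
    f'conda create -y -n {SKY_REMOTE_PYTHON_ENV_NAME} python=3.10; ')

One trade-off: a venv only relies on the Python already on the image, while a conda env can pin its own Python version but requires conda to be installed and initialized first.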
We should add this for newly added clouds as well, e.g. runpod.
Closing this, as it has been moved to #3575.
Fixes #2722
Background
Installing the SkyPilot runtime in the base environment can cause issues when users also try to install their own dependencies in the base environment on the remote VM; it is very easy for the VM to end up in an unexpected state. We now move the entire SkyPilot runtime into a separate Python environment for better isolation from the user's own Python environment.
Profiling
AWS:
- sky launch -c test-aws --cloud aws --cpus 2 (average across three runs)
- for i in {1..5}; do time sky launch -c test-aws-$i --cloud aws --cpus 2 -y; done (average across 5 runs)
GCP:
- sky launch -c test-gcp --cloud gcp --cpus 2
- for i in {1..5}; do time sky launch -c test-gcp-$i --cloud gcp --cpus 2 -y; done (average across 5 runs)
TODOs
Tested (run the relevant ones):
bash format.sh
sky launch -c test-launch --cloud aws --num-nodes 2 --cpus 2+ echo hi
sky launch -c test-launch --cloud gcp --num-nodes 2 --cpus 2+ echo hi
sky launch -c test-launch --cloud gcp --gpus tpu-v2-8 echo hi
sky launch -c test-launch --cloud gcp test.yaml
(TPU VM: tpu-v3-32 for TPU pod test)
sky launch -c test-launch --cloud azure --num-nodes 2 --cpus 2+ echo hi
sky launch -c test-launch --cloud ibm --num-nodes 2 --cpus 2+ echo hi
sky launch -c test-launch --cloud lambda --num-nodes 2 --cpus 2+ echo hi
pytest tests/test_smoke.py
pytest tests/test_smoke.py --aws
bash tests/backward_comaptibility_tests.sh