
[Core] Install SkyPilot runtime in separate env #2801

Closed
wants to merge 34 commits

Conversation

Michaelvll
Collaborator

@Michaelvll Michaelvll commented Nov 17, 2023

Fixes #2722

Background

Installing the entire SkyPilot runtime in the base environment can cause issues when users also try to install their own dependencies in the base environment on the remote VM; it is very easy for the VM to end up in an unexpected state. We now move the SkyPilot runtime into a separate Python environment for better isolation from the users' own Python environment.
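
A minimal sketch of the idea, assuming the runtime env lives at ~/skypilot-runtime (the path mentioned later in this thread); the helper name wrap_runtime_cmd is hypothetical and not part of this PR:

# Sketch only: each SkyPilot runtime command is prefixed with an activation
# snippet so it runs inside a dedicated env rather than the base environment.
SKY_REMOTE_PYTHON_ENV = '~/skypilot-runtime'  # assumed path
ACTIVATE_PYTHON_ENV = (f'[ -d {SKY_REMOTE_PYTHON_ENV} ] && '
                       f'source {SKY_REMOTE_PYTHON_ENV}/bin/activate; ')

def wrap_runtime_cmd(cmd: str) -> str:
    """Prepend the env activation so `cmd` uses the isolated runtime."""
    return f'{ACTIVATE_PYTHON_ENV}{cmd}'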

Profiling

(average across three runs)
sky launch -c test-aws --cloud aws --cpus 2

  • Original: 3m2s
  • conda implementation: 3m16s
  • venv implementation (current implementation): 2m42s

(average across 5 runs)
for i in {1..5}; do time sky launch -c test-aws-$i --cloud aws --cpus 2 -y; done

  • master (conda base, 7c514ba): 2m23.9s
    2m34.249s
    2m30.460s
    2m29.660s
    2m14.271s
    2m11.077s
    
  • venv implementation (current, 841f1ff): 2m39.8s (16s longer)
    3m0.389s
    2m56.428s
    2m37.248s
    2m16.727s
    2m28.398s
    

sky launch -c test-gcp --cloud gcp --cpus 2

  • Original: 2m50s
  • conda implementation: 3m38s
  • venv implementation (current implementation): 3m3s

(average across 5 runs)
for i in {1..5}; do time sky launch -c test-gcp-$i --cloud gcp --cpus 2 -y; done

  • master (conda base, 7c514ba): 2m43.3s
    2m41.862s
    2m56.141s
    2m39.850s
    2m41.099s
    2m37.994s
    
  • venv implementation (current, 841f1ff): 3m10.4s (27s longer)
    3m20.109s
    3m9.852s
    3m2.220s
    3m17.485s
    3m2.733s
    

TODOs

  • Make other clouds work with the new separate venv
  • Backward compatibility test
  • Reduce the time for provisioning

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch -c test-launch --cloud aws --num-nodes 2 --cpus 2+ echo hi
    • sky launch -c test-launch --cloud gcp --num-nodes 2 --cpus 2+ echo hi
    • sky launch -c test-launch --cloud gcp --gpus tpu-v2-8 echo hi
    • sky launch -c test-launch --cloud gcp test.yaml (TPU VM: tpu-v3-32 for TPU pod test)
    • sky launch -c test-launch --cloud azure --num-nodes 2 --cpus 2+ echo hi
    • sky launch -c test-launch --cloud ibm --num-nodes 2 --cpus 2+ echo hi
    • sky launch -c test-launch --cloud lambda --num-nodes 2 --cpus 2+ echo hi
  • All smoke tests: pytest tests/test_smoke.py
  • All smoke tests: pytest tests/test_smoke.py --aws
Backward compatibility tests: bash tests/backward_compatibility_tests.sh
    • AWS
    • GCP
    • existing spot controller
    • existing serve controller

@Michaelvll Michaelvll changed the title from "[Remote] Install SkyPilot runtime in separate env" to "[Dependency] Install SkyPilot runtime in separate env" on Nov 17, 2023
@Michaelvll Michaelvll marked this pull request as ready for review November 20, 2023 02:55
@Michaelvll Michaelvll requested review from cblmemo and concretevitamin and removed request for cblmemo November 20, 2023 04:37
Collaborator

@cblmemo cblmemo left a comment


Thanks for the fix! I just briefly went through the PR (& related issue) and have two discussion points:

  • Will our ray (v2.7.0) use the same ray address & port as the user-installed ray? If so, will there be any compatibility issues when two different versions of ray act on the same interface? (I assume it is okay since ray should be backward compatible, but just want to make sure.)
  • Can we consider adding constants.ACTIVATE_PYTHON_ENV to SSHCommandRunner.run? It seems like it will make the code cleaner.

@Michaelvll
Collaborator Author

Will our ray (v2.7.0) use the same ray address & port as the user-installed ray? If so, will there be any compatibility issues when two different versions of ray act on the same interface? (I assume it is okay since ray should be backward compatible, but just want to make sure.)

We are using the same version of ray in the new venv, and the ray address and port were already changed to non-default ones in a previous PR (#1790).

Can we consider adding constants.ACTIVATE_PYTHON_ENV to SSHCommandRunner.run? It seems like it will make the code cleaner.

This is a good point! However, the activation command needs to be added case by case, and some commands are not directly invoked by SSHCommandRunner.run, e.g.:

  1. job_submit_cmd = (
     f'{constants.ACTIVATE_PYTHON_ENV}'
  2. f'{constants.ACTIVATE_PYTHON_ENV} '

It might not be worth increasing the complexity of SSHCommandRunner.run for this. Wdyt?
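
For reference, the suggestion being discussed would roughly amount to a wrapper like the sketch below (illustrative only, not code from this PR; the class name and the activate_runtime_env flag are hypothetical):

from sky.skylet import constants
from sky.utils.command_runner import SSHCommandRunner

class EnvAwareSSHCommandRunner(SSHCommandRunner):
    """Hypothetical runner that prepends the runtime-env activation itself."""

    def run(self, cmd, *args, activate_runtime_env: bool = True, **kwargs):
        if activate_runtime_env and isinstance(cmd, str):
            # Prefix the command so it executes inside the SkyPilot runtime env.
            cmd = f'{constants.ACTIVATE_PYTHON_ENV}{cmd}'
        return super().run(cmd, *args, **kwargs)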

@cblmemo
Collaborator

cblmemo commented Nov 20, 2023


Makes sense! Let's keep the current implementation 🫡

@Michaelvll Michaelvll changed the title from "[Dependency] Install SkyPilot runtime in separate env" to "[Core] Install SkyPilot runtime in separate env" on Nov 20, 2023
@dongreenberg
Contributor

Is there a way to disable this feature (via the Python API) to produce the existing behavior? We use SkyPilot within the code running on the launched cluster, so this isolation is not needed for us and actually increases our start time further (because we then need to install SkyPilot and start Ray again in our own environment).

@Michaelvll
Collaborator Author

Is there a way to disable this feature (via the Python API) to produce the existing behavior? We use SkyPilot within the code running on the launched cluster, so this isolation is not needed for us and actually increases our start time further (because we then need to install SkyPilot and start Ray again in our own environment).

Thanks for the request @dongreenberg! Is it possible to activate the environment (source ~/skypilot-runtime/bin/activate) in your code as well? I suppose this will not increase the start time : )

@dongreenberg
Contributor

That would work! Will the command runner allow specifying a conda environment to run in?

@Michaelvll
Collaborator Author

That would work! Will the command runner allow specifying a conda environment to run in?

Currently, we have decided not to add such an argument to the command runner, to keep the API cleaner and the commands self-contained. Do you think it would be possible to do something like the following instead?

def custom_run(runner, cmd, *args, **kwargs):
    cmd = f'source ~/skypilot-runtime/bin/activate; {cmd}'
    return runner.run(cmd, *args, **kwargs)
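
For example, a hypothetical call site could look like this (the runner construction and its arguments are illustrative assumptions, not taken from this PR):

from sky.utils.command_runner import SSHCommandRunner

# Illustrative only: build a runner for the remote VM.
runner = SSHCommandRunner(ip='1.2.3.4',
                          ssh_user='ubuntu',
                          ssh_private_key='~/.ssh/sky-key')
# The command now runs inside the SkyPilot runtime env on the remote VM.
custom_run(runner, 'ray --version')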

@Michaelvll Michaelvll mentioned this pull request Nov 29, 2023
@Michaelvll
Collaborator Author

Another user also requested this:

it would be great if sky's core setup is less easily interfered by user's actions

@Michaelvll
Collaborator Author

master (823999a)

2m2.152s
1m54.330s
1m57.132s
1m56.134s
1m49.895s

mean: 1m55.9286s

ae84211:
for i in {1..5}; do time sky launch -c test-gcp-$i --cloud gcp --cpus 2 -y; done

2m27.360s
2m19.898s
2m27.360s
2m33.147s
2m22.630s

mean: 2m26.079s (overhead: 30s)

Collaborator

@cblmemo cblmemo left a comment


Thanks for adding this feature! Besides those comments, is it possible to automatically add the activation commands? It feels like this PR will introduce a lot of hazards for follow-up PRs. Also, could we have a smoke/unit test for it?

@@ -66,19 +74,38 @@
DOCKER_SERVER_ENV_VAR,
}

ACTIVATE_PYTHON_ENV = (f'[ -d {SKY_REMOTE_PYTHON_ENV} ] && '

In what case will the remote python env not exist? Should we print some warning here?

@@ -677,6 +677,7 @@ def spot_launch(
# Note: actual spot cluster name will be <task.name>-<spot job ID>
'dag_name': dag.name,
'retry_until_up': retry_until_up,
'skypilot_runtime_env': constants.SKY_REMOTE_PYTHON_ENV,

Should we put this variable into controller_utils.shared_controller_vars_to_fill, since it is used by both the serve & spot controllers?

@@ -13,12 +13,13 @@ def restart_skylet():
# TODO(zhwu): make the killing graceful, e.g., use a signal to tell
# skylet to exit, instead of directly killing it.
subprocess.run(
- 'ps aux | grep "sky.skylet.skylet" | grep "python3 -m"'
+ 'ps aux | grep "sky.skylet.skylet" | grep "python -m"'

Just curious, is there any reason to make this change?

Comment on lines +29 to +32
# TODO(mluo): Make explicit `sky launch -c <name> ''` optional.
UNINITIALIZED_ONPREM_CLUSTER_MESSAGE = (
'Found uninitialized local cluster {cluster}. Run this '
'command to initialize it locally: sky launch -c {cluster} \'\'')

IIUC we have already deprecated local clusters? Should we remove this?

@@ -121,4 +122,5 @@ def is_autostopping(cls) -> str:
def _build(cls, code: List[str]) -> str:
code = cls._PREFIX + code
code = ';'.join(code)
- return f'python3 -u -c {shlex.quote(code)}'
+ return (f'{constants.ACTIVATE_PYTHON_ENV} '

Actually, why not use sky/skylet/constants.py::run_in_python_env to replace all of these constants.ACTIVATE_PYTHON_ENV occurrences?

'grep "# >>> conda initialize >>>" ~/.bashrc || conda init;'
# Create a separate conda environment for SkyPilot dependencies.
f'[ -d {SKY_REMOTE_PYTHON_ENV} ] || '
f'python -m venv {SKY_REMOTE_PYTHON_ENV}; '

Have we considered using a conda env instead? What are the pros & cons of the two implementations?

Collaborator

We should add this for newly added clouds as well, e.g. runpod.

@Michaelvll
Collaborator Author

Closing this, as it is moved to #3575

Labels: do not merge (do not merge this PR now)

Successfully merging this pull request may close these issues.

[core] Ray version change breaks SkyPilot cluster