
[Core] Install SkyPilot runtime in separate env #2801

Closed
wants to merge 34 commits

Conversation

Michaelvll
Collaborator

@Michaelvll Michaelvll commented Nov 17, 2023

Fixes #2722

Background

Installing the entire SkyPilot runtime in the base environment can cause issues when users also try to install their own dependencies in the base environment on the remote VM; it is very easy for the VM to end up in an unexpected state. We now move the SkyPilot runtime into a separate Python environment for better isolation from the users' own Python environment.
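
A minimal sketch of the idea, assuming the runtime env lives at ~/skypilot-runtime (the path mentioned later in this thread); the helper name wrap_runtime_cmd is hypothetical and not part of this PR:

# Sketch only: each SkyPilot runtime command is prefixed with an activation
# snippet so it runs inside a dedicated env rather than the base environment.
SKY_REMOTE_PYTHON_ENV = '~/skypilot-runtime'  # assumed path
ACTIVATE_PYTHON_ENV = (f'[ -d {SKY_REMOTE_PYTHON_ENV} ] && '
                       f'source {SKY_REMOTE_PYTHON_ENV}/bin/activate; ')

def wrap_runtime_cmd(cmd: str) -> str:
    """Prepend the env activation so `cmd` uses the isolated runtime."""
    return f'{ACTIVATE_PYTHON_ENV}{cmd}'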

Profiling

(average across three runs)
sky launch -c test-aws --cloud aws --cpus 2

  • Original: 3m2s
  • conda implementation: 3m16s
  • venv implementation (current implementation): 2m42s

(average across 5 runs)
for i in {1..5}; do time sky launch -c test-aws-$i --cloud aws --cpus 2 -y; done

  • master (conda base, 7c514ba): 2m23.9s
    2m34.249s
    2m30.460s
    2m29.660s
    2m14.271s
    2m11.077s
    
  • venv implementation (current, 841f1ff): 2m39.8s (16s longer)
    3m0.389s
    2m56.428s
    2m37.248s
    2m16.727s
    2m28.398s
    

sky launch -c test-gcp --cloud gcp --cpus 2

  • Original: 2m50s
  • conda implementation: 3m38s
  • venv implementation (current implementation): 3m3s

(average across 5 runs)
for i in {1..5}; do time sky launch -c test-gcp-$i --cloud gcp --cpus 2 -y; done

  • master (conda base, 7c514ba): 2m43.3s
    2m41.862s
    2m56.141s
    2m39.850s
    2m41.099s
    2m37.994s
    
  • venv implementation (current, 841f1ff): 3m10.4s (27s longer)
    3m20.109s
    3m9.852s
    3m2.220s
    3m17.485s
    3m2.733s
    

TODOs

  • Make other clouds work with the new separate venv
  • Backward compatibility test
  • Reduce the time for provisioning

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch -c test-launch --cloud aws --num-nodes 2 --cpus 2+ echo hi
    • sky launch -c test-launch --cloud gcp --num-nodes 2 --cpus 2+ echo hi
    • sky launch -c test-launch --cloud gcp --gpus tpu-v2-8 echo hi
    • sky launch -c test-launch --cloud gcp test.yaml (TPU VM: tpu-v3-32 for TPU pod test)
    • sky launch -c test-launch --cloud azure --num-nodes 2 --cpus 2+ echo hi
    • sky launch -c test-launch --cloud ibm --num-nodes 2 --cpus 2+ echo hi
    • sky launch -c test-launch --cloud lambda --num-nodes 2 --cpus 2+ echo hi
  • All smoke tests: pytest tests/test_smoke.py
  • All smoke tests: pytest tests/test_smoke.py --aws
Backward compatibility tests: bash tests/backward_compatibility_tests.sh
    • AWS
    • GCP
    • existing spot controller
    • existing serve controller

@Michaelvll Michaelvll changed the title from "[Remote] Install SkyPilot runtime in separate env" to "[Dependency] Install SkyPilot runtime in separate env" on Nov 17, 2023
@Michaelvll Michaelvll marked this pull request as ready for review November 20, 2023 02:55
@Michaelvll Michaelvll requested review from cblmemo and concretevitamin and removed request for cblmemo November 20, 2023 04:37
Collaborator

@cblmemo cblmemo left a comment


Thanks for the fix! I just briefly went through the PR (& related issue) and have two discussion points:

  • Will our ray (v2.7.0) use the same ray address & port as the user-installed ray? If so, will there be any compatibility issues when two different versions of ray act on the same interface? (I assume it is okay since ray should be backward compatible, but just want to make sure.)
  • Can we consider adding constants.ACTIVATE_PYTHON_ENV to SSHCommandRunner.run? It seems like it will make the code cleaner.

@Michaelvll
Collaborator Author

Will our ray (v2.7.0) use the same ray address & port as the user-installed ray? If so, will there be any compatibility issues when two different versions of ray act on the same interface? (I assume it is okay since ray should be backward compatible, but just want to make sure.)

We are using the same version of ray in the new venv, and the ray address and port were already changed to non-default ones in a previous PR (#1790).

Can we consider adding constants.ACTIVATE_PYTHON_ENV to SSHCommandRunner.run? It seems like it will make the code cleaner.

This is a good point! However, the activation command needs to be added case by case, and some commands are not directly invoked by SSHCommandRunner.run, e.g.:

  1. job_submit_cmd = (
     f'{constants.ACTIVATE_PYTHON_ENV}'
  2. f'{constants.ACTIVATE_PYTHON_ENV} '

It might not be worth increasing the complexity of SSHCommandRunner.run for this. Wdyt?
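
For reference, the suggestion being discussed would roughly amount to a wrapper like the sketch below (illustrative only, not code from this PR; the class name and the activate_runtime_env flag are hypothetical):

from sky.skylet import constants
from sky.utils.command_runner import SSHCommandRunner

class EnvAwareSSHCommandRunner(SSHCommandRunner):
    """Hypothetical runner that prepends the runtime-env activation itself."""

    def run(self, cmd, *args, activate_runtime_env: bool = True, **kwargs):
        if activate_runtime_env and isinstance(cmd, str):
            # Prefix the command so it executes inside the SkyPilot runtime env.
            cmd = f'{constants.ACTIVATE_PYTHON_ENV}{cmd}'
        return super().run(cmd, *args, **kwargs)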

@cblmemo
Collaborator

cblmemo commented Nov 20, 2023


Makes sense! Let's keep the current implementation 🫡

@Michaelvll Michaelvll changed the title from "[Dependency] Install SkyPilot runtime in separate env" to "[Core] Install SkyPilot runtime in separate env" on Nov 20, 2023
@dongreenberg
Contributor

Is there a way to disable this feature (via the Python API) to produce the existing behavior? We use SkyPilot within the code running on the launched cluster, so this isolation is not needed for us and actually increases our start time further (because we then need to install SkyPilot and start Ray again in our own environment).

@Michaelvll
Collaborator Author

Is there a way to disable this feature (via the Python API) to produce the existing behavior? We use SkyPilot within the code running on the launched cluster, so this isolation is not needed for us and actually increases our start time further (because we then need to install SkyPilot and start Ray again in our own environment).

Thanks for the request @dongreenberg! Is it possible to activate the environment (source ~/skypilot-runtime/bin/activate) in your code as well? I suppose this will not increase the start time : )

@dongreenberg
Contributor

That would work! Will the command runner allow specifying a conda environment to run in?

@Michaelvll
Collaborator Author

That would work! Will the command runner allow specifying a conda environment to run in?

Currently, we have decided not to add such an argument to the command runner, to keep the API cleaner and the commands self-contained. Do you think it would be possible to do something like the following instead?

def custom_run(runner, cmd, *args, **kwargs):
    cmd = f'source ~/skypilot-runtime/bin/activate; {cmd}'
    return runner.run(cmd, *args, **kwargs)
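
For example, a hypothetical call site could look like this (the runner construction and its arguments are illustrative assumptions, not taken from this PR):

from sky.utils.command_runner import SSHCommandRunner

# Illustrative only: build a runner for the remote VM.
runner = SSHCommandRunner(ip='1.2.3.4',
                          ssh_user='ubuntu',
                          ssh_private_key='~/.ssh/sky-key')
# The command now runs inside the SkyPilot runtime env on the remote VM.
custom_run(runner, 'ray --version')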

@Michaelvll Michaelvll mentioned this pull request Nov 29, 2023
@Michaelvll
Collaborator Author

Another user also requested this:

it would be great if sky's core setup is less easily interfered by user's actions

@Michaelvll
Collaborator Author

master (823999a)

2m2.152s
1m54.330s
1m57.132s
1m56.134s
1m49.895s

mean: 1m55.9286s

ae84211:
for i in {1..5}; do time sky launch -c test-gcp-$i --cloud gcp --cpus 2 -y; done

2m27.360s
2m19.898s
2m27.360s
2m33.147s
2m22.630s

mean: 2m26.079s (overhead: 30s)

Collaborator

@cblmemo cblmemo left a comment


Thanks for adding this feature! Besides those comments, is it possible to automatically add the activation commands? It feels like this PR will introduce a lot of hazards for follow-up PRs. Also, could we have a smoke/unit test for it?

@@ -66,19 +74,38 @@
DOCKER_SERVER_ENV_VAR,
}

ACTIVATE_PYTHON_ENV = (f'[ -d {SKY_REMOTE_PYTHON_ENV} ] && '

In what case will the remote python env not exist? Should we print some warning here?

@@ -677,6 +677,7 @@ def spot_launch(
# Note: actual spot cluster name will be <task.name>-<spot job ID>
'dag_name': dag.name,
'retry_until_up': retry_until_up,
'skypilot_runtime_env': constants.SKY_REMOTE_PYTHON_ENV,

Should we put this variable into controller_utils.shared_controller_vars_to_fill, since it is used by both the serve & spot controllers?

@@ -13,12 +13,13 @@ def restart_skylet():
# TODO(zhwu): make the killing graceful, e.g., use a signal to tell
# skylet to exit, instead of directly killing it.
subprocess.run(
- 'ps aux | grep "sky.skylet.skylet" | grep "python3 -m"'
+ 'ps aux | grep "sky.skylet.skylet" | grep "python -m"'

Just curious, is there any reason to make this change?

Comment on lines +29 to +32
# TODO(mluo): Make explicit `sky launch -c <name> ''` optional.
UNINITIALIZED_ONPREM_CLUSTER_MESSAGE = (
'Found uninitialized local cluster {cluster}. Run this '
'command to initialize it locally: sky launch -c {cluster} \'\'')

IIUC we have already deprecated local clusters? Should we remove this?

@@ -121,4 +122,5 @@ def is_autostopping(cls) -> str:
def _build(cls, code: List[str]) -> str:
code = cls._PREFIX + code
code = ';'.join(code)
- return f'python3 -u -c {shlex.quote(code)}'
+ return (f'{constants.ACTIVATE_PYTHON_ENV} '

Actually, why not use sky/skylet/constants.py::run_in_python_env to replace all of these constants.ACTIVATE_PYTHON_ENV occurrences?

'grep "# >>> conda initialize >>>" ~/.bashrc || conda init;'
# Create a separate conda environment for SkyPilot dependencies.
f'[ -d {SKY_REMOTE_PYTHON_ENV} ] || '
f'python -m venv {SKY_REMOTE_PYTHON_ENV}; '

Have we considered using a conda env instead? What are the pros & cons of the two implementations?

Collaborator

We should add this for newly added clouds as well, e.g. runpod.

@Michaelvll
Collaborator Author

Closing this, as it is moved to #3575

Labels: do not merge (do not merge this PR now)

Successfully merging this pull request may close these issues.

[core] Ray version change breaks SkyPilot cluster