Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Avoid uploading public key #2119

Merged
merged 12 commits into from
Jul 12, 2023
Merged

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Jun 22, 2023

Fixes #2025

Uploading the sky-key.pub to the ~/.ssh/sky-key.pub will cause problem for spot controller. This PR gets rid of the public_key upload, except for the Lambda, as it is hardcoded and hard to modify at the moment.

This also reduces the permission for the GCP account to launch a VM, as we no longer need to set the metadata for the key pairs.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch --cloud aws -c test-aws --num-nodes 2 -i 0 echo hi; sleep 90; sky status -r test-aws
    • sky launch --cloud gcp -c test-gcp --num-nodes 2 -i 0 echo hi; sleep 90; sky status -r test-gcp with a new pair of keys
    • sky launch --cloud gcp -c test-gcp -i 0; sky status -r without the setCommonInstanceMetadata permission.
    • sky launch --cloud azure --num-nodes 2 -c test-azure -i 0 echo hi; sleep 120; sky status -r test-azure
    • sky launch --cloud oci --num-nodes 2 --cpus 2 -c test-oci -i 0 echo hi; sleep 90; sky status -r test-oci
    • sky launch --cloud lambda -c test-lambda -i 0 --num-nodes 2 --down echo hi; sleep 90; sky status -r test-lambda
    • sky launch --cloud ibm --num-nodes 2 -c test-ibm -i 0 echo hi; sleep 90; sky status -r test-ibm
    • sky spot launch --cpus 2 --cloud gcp echo hi with Azure to be the spot controller [Spot/Auth] Public key should not be uploaded to the remote VM  #2025
    • sky launch --cloud scp -c test-scp -i 0 echo hi; sleep 90; sky status -r test-scp (Failed due to test account not able to launch new instance)
  • All smoke tests: pytest tests/test_smoke.py (except [Examples] TF training code for ResNet no longer working #2207)

@Michaelvll Michaelvll force-pushed the remove-public-key-uploading branch 3 times, most recently from 2fe9617 to 07ee384 Compare June 22, 2023 08:59
author Zhanghao Wu <zhanghao.wu@outlook.com> 1687057727 -0700
committer Zhanghao Wu <zhanghao.wu@outlook.com> 1687488369 -0700

parent ca9fed1
author Zhanghao Wu <zhanghao.wu@outlook.com> 1687057727 -0700
committer Zhanghao Wu <zhanghao.wu@outlook.com> 1687488294 -0700

Add document for minimal permission

start permission

Fix permission check

note about the service account setup

Add missing permissions

Add missing permissions

add permissions for tpu

Update docs/source/cloud-setup/cloud-permissions.rst

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

Update docs/source/cloud-setup/cloud-permissions.rst

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

address comments
fix lambda

Use ssh-key metadata for instance

fix ssh for lambda

format

revert lambda

remove unused permission
@Michaelvll Michaelvll force-pushed the remove-public-key-uploading branch from f71e185 to c33d3f5 Compare June 23, 2023 02:47
@Michaelvll Michaelvll mentioned this pull request Jul 4, 2023
2 tasks
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Michaelvll, left some questions. We should probably do back compat + smoke.

@@ -50,7 +50,7 @@ def bootstrap_instance_config(self, node_config):
instance_config['serverType'] = node_config['InstanceType']
instance_config['contractId'] = "None"
instance_config['initialScript'] = self._get_vm_init_script(
node_config['auth']['ssh_public_key'])
node_config['AuthorizedKey'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we pick AuthorizedKey on our own, or is it a convention?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we log into a remote linux server, it will look for its local ~/.ssh/authorized_keys file. I guess it is more accurate to use AuthorizedKey here, instead of the ssh_public_key?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess my question is it seems a bit weird that the different *.j2 use different field names for this.

One option is to conform to AWS's j2, i.e., cloud-init's "ssh_authorized_keys" field?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem is that we want to keep the naming style the same as the other keys in the yaml file. For example, in oci-ray.yml.j2, all the keys are camel-case. Having a snake-case ssh_authorized_keys among them might be a bit weird?

@Michaelvll
Copy link
Collaborator Author

Michaelvll commented Jul 10, 2023

Thanks for the review @concretevitamin! Just tested it and it passed the smoke tests. For backward compatibility, the changes in this PR will not affect the existing clusters, as the config restoring in write_cluster_config will not apply any of the new changes to the existing cluster.

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, some minor questions.

"""Module to enable a single SkyPilot key for all VMs in each cloud."""
"""Module to enable a single SkyPilot key for all VMs in each cloud.

The `setup_<cloud>_authentication` functions will be called after the ray
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The `setup_<cloud>_authentication` functions will be called after the ray
The `setup_<cloud>_authentication` functions will be called on every new cluster provision request.
Specifically, after the ray yaml template file `<cloud>-ray.yml.j2` is filled in with resource specific
information, these functions are called with the filled in ray yaml config as input,

Please double check if "new cluster provision request" is correct, e.g., what about launch -c existing or start.

Copy link
Collaborator Author

@Michaelvll Michaelvll Jul 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The functions will be called no matter whether the cluster exists or not. I applied the change, and changed the wording a bit. We can probably fix when the functions should be called in a future PR. : )

@@ -50,7 +50,7 @@ def bootstrap_instance_config(self, node_config):
instance_config['serverType'] = node_config['InstanceType']
instance_config['contractId'] = "None"
instance_config['initialScript'] = self._get_vm_init_script(
node_config['auth']['ssh_public_key'])
node_config['AuthorizedKey'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess my question is it seems a bit weird that the different *.j2 use different field names for this.

One option is to conform to AWS's j2, i.e., cloud-init's "ssh_authorized_keys" field?

Michaelvll and others added 4 commits July 10, 2023 15:14
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @Michaelvll!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Spot/Auth] Public key should not be uploaded to the remote VM
2 participants