Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[k8s_cloud_beta1] Adding support for ssh using kubectl port-forward to access k8s instance #2412

Merged
Show file tree
Hide file tree
Changes from 20 commits
Commits
Show all changes
49 commits
Select commit Hold shift + click to select a range
3021576
Add sshjump support.
aviweit Aug 8, 2023
8afe74a
Update lcm script.
aviweit Aug 9, 2023
e87bcd6
Set imagePullPolicy to IfNotPresent.
aviweit Aug 9, 2023
be18925
add support for port-forward
landscapepainter Aug 12, 2023
82e1dd6
Merge branch 'master' into ssh-port-forward-beta1
landscapepainter Aug 12, 2023
94e56f5
remove unused
landscapepainter Aug 12, 2023
cd970b2
comments
landscapepainter Aug 12, 2023
99edb40
Disable ControlMaster for ssh_options_list
landscapepainter Aug 12, 2023
d44a153
nit
landscapepainter Aug 12, 2023
714cce0
update to disable rest of the ControlMaster
landscapepainter Aug 15, 2023
d475914
command runner rsync update
landscapepainter Aug 15, 2023
8434636
relocating run_on_k8s
landscapepainter Aug 17, 2023
2760922
relocate run_on_k8s
landscapepainter Aug 17, 2023
1ece560
Merge branch 'k8s_cloud_beta1' into ssh-port-forward-beta1
landscapepainter Aug 17, 2023
ae96627
Make Kubernetes specific env variables available when joining a clust…
hemildesai Aug 17, 2023
9ae2f27
merge k8s_cloud_beta1
landscapepainter Aug 17, 2023
102ec10
Merge branch 'k8s_cloud_beta1' into ssh-port-forward-beta1
landscapepainter Aug 17, 2023
0385dda
format
landscapepainter Aug 17, 2023
526ec35
remove redundant utils.py
landscapepainter Aug 17, 2023
6b54887
format and comments
landscapepainter Aug 18, 2023
f229080
update with proxy_to_k8s
landscapepainter Aug 19, 2023
1e1a201
Update sky/authentication.py
landscapepainter Aug 19, 2023
82b7182
resolving comments on structures
landscapepainter Aug 20, 2023
d0a4461
Merge branch 'ssh-port-forward-beta1' of https://github.com/landscape…
landscapepainter Aug 20, 2023
60ee3c7
Update sky/utils/command_runner.py
landscapepainter Aug 20, 2023
21b2d18
document on nodeport/port-forward proxycommand
landscapepainter Aug 20, 2023
760b226
error handling when socat is not installed
landscapepainter Aug 21, 2023
fa1136f
removing KUBECONFIG from port-forward shell script
landscapepainter Aug 21, 2023
72ed006
Merge branch 'k8s_cloud_beta1' into ssh-port-forward-beta1
landscapepainter Aug 21, 2023
12cd002
nit
landscapepainter Aug 22, 2023
efef2ab
nit
landscapepainter Aug 23, 2023
da940ff
Add suport for nodeport
hemildesai Aug 24, 2023
c333b11
Update sky/utils/kubernetes_utils.py
landscapepainter Aug 24, 2023
826430d
update
landscapepainter Aug 24, 2023
276f310
switch svc when conflicting jump pod svc exist
landscapepainter Aug 25, 2023
01bd78b
format
landscapepainter Aug 25, 2023
b017ec5
Merge branch 'k8s_cloud_beta1' into ssh-port-forward-beta1
landscapepainter Aug 25, 2023
f225c38
Update sky/utils/kubernetes_utils.py
landscapepainter Aug 27, 2023
0944ce9
refactoring check for socat
landscapepainter Aug 27, 2023
beebedf
Merge branch 'ssh-port-forward-beta1' of https://github.com/landscape…
landscapepainter Aug 27, 2023
b61e824
resolve comments
landscapepainter Aug 27, 2023
eef84f0
add ServiceType enum and port-forward proxy script
landscapepainter Aug 29, 2023
3d977a0
Merge branch 'hd/k8s-env-vars-for-ssh'
landscapepainter Aug 30, 2023
84abe99
update k8s env var access
landscapepainter Aug 30, 2023
c0d7ecf
add check for container status remove unused func
landscapepainter Aug 30, 2023
cb37ac1
nit
landscapepainter Aug 30, 2023
2a4cc0d
update get_external_ip for portforward mode
landscapepainter Aug 30, 2023
05b3fb8
Merge branch 'k8s_cloud_beta1' into ssh-port-forward-beta1
landscapepainter Aug 30, 2023
6e7c511
conditionally use sudo and quote values of env var
landscapepainter Aug 31, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions sky/adaptors/kubernetes.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,12 @@
kubernetes = None
urllib3 = None

LOCAL_PORT_FOR_PORT_FORWARD = 23100
PORT_FORWARD_PROXY_CMD_TEMPLATE = \
'kubernetes-port-forward-proxy-command.yaml.j2'
PORT_FORWARD_PROXY_CMD_PATH = '~/.sky/port-forward-proxy-cmd.sh'
KUBE_CONFIG_DEFAULT_PATH = '~/.kube/config'

_configured = False
_core_api = None
_auth_api = None
Expand Down
80 changes: 67 additions & 13 deletions sky/authentication.py
Original file line number Diff line number Diff line change
Expand Up @@ -37,8 +37,11 @@

from sky import clouds
from sky import sky_logging
from sky import skypilot_config
from sky.adaptors import gcp
from sky.adaptors import ibm
from sky.adaptors import kubernetes
from sky.backends import backend_utils
from sky.skylet.providers.lambda_cloud import lambda_utils
from sky.utils import common_utils
from sky.utils import kubernetes_utils
Expand Down Expand Up @@ -377,9 +380,57 @@ def setup_scp_authentication(config: Dict[str, Any]) -> Dict[str, Any]:
return _replace_ssh_info_in_config(config, public_key)


def _get_kubernetes_proxy_command(ingress: int, ipaddress: str,
ssh_setup_mode: str):
""" returns Proxycommand to use when establishing ssh connection
to the k8s instance through the jump pod.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this method, I would err on the side on over-documenting. It would be good to add details here on:

  • why we use a proxycommand
  • what does the proxycommand do behind the scenes

Copy link
Collaborator Author

@landscapepainter landscapepainter Aug 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romilbhardwaj I wrote the doc that resolves all two bullet points. Please take a look!

Args:
ingress: int; the port number host machine is listening to
ipaddress: str; ip address of the host machine
ssh_setup_mode: str; networking mode for ssh session. It is either
'nodeport' or 'port-forward'
"""
if ssh_setup_mode == 'nodeport':
proxy_command = (f'ssh -tt -i {PRIVATE_SSH_KEY_PATH} '
'-o StrictHostKeyChecking=no '
'-o UserKnownHostsFile=/dev/null '
f'-o IdentitiesOnly=yes -p {ingress} '
f'-W %h:%p sky@{ipaddress}')
# Setting kubectl port-forward/socat to establish ssh session using
# ClusterIP service to disallow any ports opened
else:
ssh_jump_name = clouds.Kubernetes.SKY_SSH_JUMP_NAME
kube_config_path = os.path.expanduser(
kubernetes.KUBE_CONFIG_DEFAULT_PATH)
vars_to_fill = {
'ssh_jump_name': ssh_jump_name,
'ipaddress': ipaddress,
'local_port': ingress,
'kube_config_path': kube_config_path
}
port_forward_proxy_cmd_path = os.path.expanduser(
kubernetes.PORT_FORWARD_PROXY_CMD_PATH)
backend_utils.fill_template(kubernetes.PORT_FORWARD_PROXY_CMD_TEMPLATE,
vars_to_fill,
output_path=port_forward_proxy_cmd_path)
os.chmod(port_forward_proxy_cmd_path,
os.stat(port_forward_proxy_cmd_path).st_mode | 0o111)
proxy_command = (f'ssh -tt -i {PRIVATE_SSH_KEY_PATH} '
f'-o ProxyCommand=\'{port_forward_proxy_cmd_path}\' '
'-o StrictHostKeyChecking=no '
'-o UserKnownHostsFile=/dev/null '
f'-o IdentitiesOnly=yes -p {ingress} '
f'-W %h:%p sky@{ipaddress}')
return proxy_command


def setup_kubernetes_authentication(config: Dict[str, Any]) -> Dict[str, Any]:
# Default ssh session is established with kubectl port-forwarding with
# ClusterIP service
ssh_setup_mode = skypilot_config.get_nested(('kubernetes', 'networking'),
'port-forward').lower()
get_or_generate_keys()

# Run kubectl command to add the public key to the cluster.
public_key_path = os.path.expanduser(PUBLIC_SSH_KEY_PATH)
key_label = clouds.Kubernetes.SKY_SSH_KEY_SECRET_NAME
Expand Down Expand Up @@ -407,19 +458,22 @@ def setup_kubernetes_authentication(config: Dict[str, Any]) -> Dict[str, Any]:
sshjump_name = clouds.Kubernetes.SKY_SSH_JUMP_NAME
sshjump_image = clouds.Kubernetes.IMAGE_CPU
namespace = kubernetes_utils.get_current_kube_config_context_namespace()
ssh_jump_ip = clouds.Kubernetes.get_external_ip()
if ssh_setup_mode == 'nodeport':
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use the enum KubernetesNetworkingMode everywhere to avoid hardcoding strings 'nodeport' and 'portforward'? We can also ask the user to use the same string in the config file and then read directly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romilbhardwaj I'm wondering if we should create an enum for service types as well since there are currently some places(setup_kubernetes_authentication, setup_sshjump_svc) using NodePort and ClusterIP as a hardcoded string. What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it'll be good to have KubernetesServiceType enum

service_type = 'NodePort'
# If ssh connection is establisehd with kubectl port-forward, the jump pod
# will run on ClusterIP service. This enables to establish ssh session
# without opening any ports on the Kubernetes cluster.
else:
service_type = 'ClusterIP'

kubernetes_utils.setup_sshjump(sshjump_name, sshjump_image, key_label,
namespace)

ssh_jump_port = clouds.Kubernetes.get_port(sshjump_name)
namespace, service_type)
ssh_jump_ip = clouds.Kubernetes.get_external_ip()

ssh_jump_proxy_command = (f'ssh -tt -i {PRIVATE_SSH_KEY_PATH} '
'-o StrictHostKeyChecking=no '
'-o UserKnownHostsFile=/dev/null '
'-o IdentitiesOnly=yes '
f'-p {ssh_jump_port} -W %h:%p sky@{ssh_jump_ip}')

config['auth']['ssh_proxy_command'] = ssh_jump_proxy_command

if ssh_setup_mode == 'nodeport':
ssh_jump_port = clouds.Kubernetes.get_port(sshjump_name)
else:
ssh_jump_port = kubernetes.LOCAL_PORT_FOR_PORT_FORWARD
config['auth']['ssh_proxy_command'] = _get_kubernetes_proxy_command(
ssh_jump_port, ssh_jump_ip, ssh_setup_mode)
return config
6 changes: 5 additions & 1 deletion sky/backends/backend_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -1341,7 +1341,7 @@ def wait_until_ray_cluster_ready(

def ssh_credential_from_yaml(cluster_yaml: str,
docker_user: Optional[str] = None
) -> Dict[str, str]:
) -> Dict[str, Any]:
"""Returns ssh_user, ssh_private_key and ssh_control name."""
config = common_utils.read_yaml(cluster_yaml)
auth_section = config['auth']
Expand All @@ -1357,6 +1357,10 @@ def ssh_credential_from_yaml(cluster_yaml: str,
}
if docker_user is not None:
credentials['docker_user'] = docker_user
ssh_provider_module = config['provider']['module']
# If we are running ssh command on kubernetes node.
if 'kubernetes' in ssh_provider_module:
credentials['proxy_to_k8s'] = True
return credentials


Expand Down
2 changes: 0 additions & 2 deletions sky/backends/cloud_vm_ray_backend.py
Original file line number Diff line number Diff line change
Expand Up @@ -2824,7 +2824,6 @@ def _sync_workdir(self, handle: CloudVmRayResourceHandle,
f'down rsync.{style.RESET_ALL}')

log_path = os.path.join(self.log_dir, 'workdir_sync.log')

ssh_credentials = backend_utils.ssh_credential_from_yaml(
handle.cluster_yaml, handle.docker_user)

Expand Down Expand Up @@ -4068,7 +4067,6 @@ def _set_tpu_name(self, handle: CloudVmRayResourceHandle,
assert ip_list is not None, 'external_ips is not cached in handle'
ssh_credentials = backend_utils.ssh_credential_from_yaml(
handle.cluster_yaml, handle.docker_user)

runners = command_runner.SSHCommandRunner.make_runner_list(
ip_list, port_list=None, **ssh_credentials)

Expand Down
25 changes: 25 additions & 0 deletions sky/templates/kubernetes-port-forward-proxy-command.yaml.j2
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
#!/usr/bin/env bash
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you run pytest tests/test_smoke.py --kubernetes -k "not TestStorageWithCredentials" to make sure everything, including file_mounts, work correctly? I have manually verified, but going forward we want to run Kubernetes smoke tests for k8s PRs :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romilbhardwaj Currently, passing all the tests besides the ones requiring GPUs for this branch.

set -uo pipefail

# Specifies which config file kubectl should use.
export KUBECONFIG="{{ kube_config_path }}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this? We ideally want to operate with whatever KUBECONFIG var is set by the user. I know some parts of our code won't support custom KUBECONFIG, but we should avoid adding new code that puts strict dependency on a kubeconfig path set by us.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. Got them removed.


# If the port-forward command is already running on the jump pod, kill it
pgrep -f "kubectl port-forward svc/{{ ssh_jump_name }} {{ local_port }}:22" > /dev/null
if [ $? -eq 0 ]; then
pkill -f "kubectl port-forward svc/{{ ssh_jump_name }} {{ local_port }}:22"
fi

# Establishes connection between local port and the ssh jump pod
kubectl port-forward svc/{{ ssh_jump_name }} {{ local_port }}:22 &
K8S_PID=$!
trap "kill -9 $K8S_PID" EXIT

# checks if a conection to local_port of ipaddress is established
while ! nc -z {{ ipaddress }} {{ local_port }}; do
sleep 0.1
done

# Estalbishes two directional byte streams to handle stdin/stdout between
# terminal and the jump pod
socat - tcp:{{ ipaddress }}:{{ local_port }}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying this our and it silently failed to ssh for a long time before I realized I don't have socat installed.

Is it possible to check if socat is installed at the start of the script, raise an error if its not and propagate this error cleanly up to the user? Otherwise, we may want to add a check if socat is installed elsewhere in our code...

Copy link
Collaborator Author

@landscapepainter landscapepainter Aug 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romilbhardwaj I added a check for socat installation and displays error msg and exit at the beginning of the script if it's not installed so that it shows the msg when users attempts to ssh <k8s-instance-name> without socat installed.

But it doesn't seem like there's a clean way to handle this exit and raise an error msg for every possible ssh session runs within skypilot. So I added another check for socat installation in authentication.py/setup_kubernetes_authentication.py when 'port-forward' mode is being setup.

Running sky launch:

$ sky launch -y
I 08-21 00:29:50 optimizer.py:652] == Optimizer ==
I 08-21 00:29:50 optimizer.py:663] Target: minimizing cost
I 08-21 00:29:50 optimizer.py:675] Estimated cost: $0.0 / hour
I 08-21 00:29:50 optimizer.py:675] 
I 08-21 00:29:50 optimizer.py:748] Considered resources (1 node):
I 08-21 00:29:50 optimizer.py:797] ---------------------------------------------------------------------------------------------------
I 08-21 00:29:50 optimizer.py:797]  CLOUD        INSTANCE        vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN   
I 08-21 00:29:50 optimizer.py:797] ---------------------------------------------------------------------------------------------------
I 08-21 00:29:50 optimizer.py:797]  Kubernetes   2CPU--2GB       2       2         -              kubernetes      0.00          ✔     
I 08-21 00:29:50 optimizer.py:797]  AWS          m6i.2xlarge     8       32        -              us-east-1       0.38                
I 08-21 00:29:50 optimizer.py:797]  GCP          n2-standard-8   8       32        -              us-central1-a   0.39                
I 08-21 00:29:50 optimizer.py:797] ---------------------------------------------------------------------------------------------------
I 08-21 00:29:50 optimizer.py:797] 
Running task on cluster sky-6c5a-gcpuser...
I 08-21 00:29:50 cloud_vm_ray_backend.py:4052] Creating a new cluster: "sky-6c5a-gcpuser" [1x Kubernetes(2CPU--2GB)].
I 08-21 00:29:50 cloud_vm_ray_backend.py:4052] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 08-21 00:29:51 cloud_vm_ray_backend.py:1418] To view detailed progress: tail -n100 -f /home/gcpuser/sky_logs/sky-2023-08-21-00-29-46-434429/provision.log
Clusters
NAME              LAUNCHED     RESOURCES                 STATUS  AUTOSTOP  COMMAND                       
sky-2208-gcpuser  2 hrs ago    1x Kubernetes(2CPU--2GB)  UP      -         sky exec sky-2208-gcpuser...  

RuntimeError: `socat` is required to setup Kubernetes cloud with `port-forward` default networking mode and it is not installed. For Debian/Ubuntu system, install it with:
  $ sudo apt install socat

Running ssh <k8s-instance-name>:

$ ssh sky-2208-gcpuser
Using 'port-forward' mode to ssh into Kubernetes instances requires 'socat' to be installed. Please install 'socat'
ssh_exchange_identification: Connection closed by remote host
ssh_exchange_identification: Connection closed by remote host

Running sky exec:

$ sky exec sky-2208-gcpuser printenv
Task from command: printenv
Executing task on cluster sky-2208-gcpuser...
E 08-21 00:32:07 subprocess_utils.py:73] Using 'port-forward' mode to ssh into Kubernetes instances requires 'socat' to be installed. Please install 'socat'
E 08-21 00:32:07 subprocess_utils.py:73] ssh_exchange_identification: Connection closed by remote host
E 08-21 00:32:07 subprocess_utils.py:73] ssh_exchange_identification: Connection closed by remote host
E 08-21 00:32:07 subprocess_utils.py:73] 
I 08-21 00:32:07 cloud_vm_ray_backend.py:3242] 
I 08-21 00:32:07 cloud_vm_ray_backend.py:3242] Cluster name: sky-2208-gcpuser
I 08-21 00:32:07 cloud_vm_ray_backend.py:3242] To log into the head VM:	ssh sky-2208-gcpuser
I 08-21 00:32:07 cloud_vm_ray_backend.py:3242] To submit a job:		sky exec sky-2208-gcpuser yaml_file
I 08-21 00:32:07 cloud_vm_ray_backend.py:3242] To stop the cluster:	sky stop sky-2208-gcpuser
I 08-21 00:32:07 cloud_vm_ray_backend.py:3242] To teardown the cluster:	sky down sky-2208-gcpuser
eClusters
NAME              LAUNCHED     RESOURCES                 STATUS  AUTOSTOP  COMMAND                       
sky-2208-gcpuser  2 hrs ago    1x Kubernetes(2CPU--2GB)  UP      -         sky exec sky-2208-gcpuser...  
ee
eesky.exceptions.CommandError: Command python3 -u -c 'import os;from sky.skylet import job_lib, log_lib;job_id = job_lib.add_job('"'"'sky-cmd'"'"', '"'"'gcpuser'"'"', '"'"'sky-2023-08-21-00-32-07-240600'"'"', '"'"'1x [CPU:0.5]'"'"');print("Job ID: " + str(job_id), flush=True)' failed with return code 255.
Failed to fetch job id.

3 changes: 1 addition & 2 deletions sky/templates/kubernetes-sshjump.yml.j2
Original file line number Diff line number Diff line change
Expand Up @@ -50,14 +50,13 @@ service_spec:
name: {{ name }}
app: skypilot
spec:
type: NodePort
type: {{ service_type }}
selector:
component: {{ name }}
ports:
- protocol: TCP
port: 22
targetPort: 22

# The following ServiceAccount/Role/RoleBinding sets up an RBAC for life cycle
# management of the jump pod/service
service_account:
Expand Down
46 changes: 29 additions & 17 deletions sky/utils/command_runner.py
Original file line number Diff line number Diff line change
Expand Up @@ -42,13 +42,16 @@ def _ssh_control_path(ssh_control_filename: Optional[str]) -> Optional[str]:
return path


def ssh_options_list(ssh_private_key: Optional[str],
ssh_control_name: Optional[str],
*,
ssh_proxy_command: Optional[str] = None,
docker_ssh_proxy_command: Optional[str] = None,
timeout: int = 30,
port: int = 22) -> List[str]:
def ssh_options_list(
ssh_private_key: Optional[str],
ssh_control_name: Optional[str],
proxy_to_k8s: Optional[bool] = False,
*,
ssh_proxy_command: Optional[str] = None,
docker_ssh_proxy_command: Optional[str] = None,
timeout: int = 30,
port: int = 22,
) -> List[str]:
"""Returns a list of sane options for 'ssh'."""
# Forked from Ray SSHOptions:
# https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/command_runner.py
Expand Down Expand Up @@ -79,7 +82,12 @@ def ssh_options_list(ssh_private_key: Optional[str],
}
# SSH Control will have a severe delay when using docker_ssh_proxy_command.
# TODO(tian): Investigate why.
if ssh_control_name is not None and docker_ssh_proxy_command is None:
# k8s instances are accessed with an ssh session using Proxycommand. The
# process running Proxycommand is kept running as long as the ssh session
# is running and the ControlMaster keeps the session, which results in
# 'ControlPersist' number of seconds delay per ssh commands ran.
if ssh_control_name is not None and docker_ssh_proxy_command is None \
and not proxy_to_k8s:
arg_dict.update({
# Control path: important optimization as we do multiple ssh in one
# sky.launch().
Expand Down Expand Up @@ -136,6 +144,7 @@ def __init__(
ssh_proxy_command: Optional[str] = None,
port: int = 22,
docker_user: Optional[str] = None,
proxy_to_k8s: Optional[bool] = False,
):
"""Initialize SSHCommandRunner.

Expand All @@ -159,12 +168,15 @@ def __init__(
docker_user: The docker user to use for ssh. If specified, the
command will be run inside a docker container which have a ssh
server running at port sky.skylet.constants.DEFAULT_DOCKER_PORT.
proxy_to_k8s: bool; specifies either or not the ssh command will be
ran on k8s instance through a jump pod with proxy command.
"""
self.ssh_private_key = ssh_private_key
self.ssh_control_name = (
None if ssh_control_name is None else hashlib.md5(
ssh_control_name.encode()).hexdigest()[:_HASH_MAX_LENGTH])
self._ssh_proxy_command = ssh_proxy_command
self.proxy_to_k8s = proxy_to_k8s
if docker_user is not None:
assert port is None or port == 22, (
f'port must be None or 22 for docker_user, got {port}.')
Expand All @@ -190,6 +202,7 @@ def make_runner_list(
ssh_private_key: str,
ssh_control_name: Optional[str] = None,
ssh_proxy_command: Optional[str] = None,
proxy_to_k8s: Optional[bool] = False,
port_list: Optional[List[int]] = None,
docker_user: Optional[str] = None,
) -> List['SSHCommandRunner']:
Expand All @@ -198,7 +211,7 @@ def make_runner_list(
port_list = [22] * len(ip_list)
return [
SSHCommandRunner(ip, ssh_user, ssh_private_key, ssh_control_name,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On GKE clusters, This PR is currently using the external IP of nodes (sky@34.16.78.81):

2023-08-29 17:53:01,300	VVINFO command_runner.py:367 -- Full command is `�[1mssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8d2ba49c09/098f6bcd46/%C -o ControlPersist=10s -o ProxyCommand=ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -p 23100 -W %h:%p sky@34.16.78.81 -o ProxyCommand='/Users/romilb/.sky/port-forward-proxy-cmd.sh'  -o Port=22 -o ConnectTimeout=120s sky@10.72.0.17 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p ~/.sky/.runtime_files)'�[22m�[26m`
building file list ... done

This will not work if the nodes are behind a firewall. It should instead of be connecting to sky@127.0.0.1 since the the port-forward is running locally.

You may need to update get_external_ip to accept an arg with the KubernetesNetworkingMode, which would return 127.0.0.1 if the mode is port-forward, else use the existing logic if the mode is nodeport.

ssh_proxy_command, port, docker_user)
ssh_proxy_command, port, docker_user, proxy_to_k8s)
for ip, port in zip(ip_list, port_list)
]

Expand Down Expand Up @@ -228,7 +241,7 @@ def _ssh_base_command(self, *, ssh_mode: SshMode,
ssh_proxy_command=self._ssh_proxy_command,
docker_ssh_proxy_command=docker_ssh_proxy_command,
port=self.port,
) + [f'{self.ssh_user}@{self.ip}']
proxy_to_k8s=self.proxy_to_k8s) + [f'{self.ssh_user}@{self.ip}']

def run(
self,
Expand Down Expand Up @@ -382,13 +395,12 @@ def rsync(
else:
docker_ssh_proxy_command = None
ssh_options = ' '.join(
ssh_options_list(
self.ssh_private_key,
self.ssh_control_name,
ssh_proxy_command=self._ssh_proxy_command,
docker_ssh_proxy_command=docker_ssh_proxy_command,
port=self.port,
))
ssh_options_list(self.ssh_private_key,
self.ssh_control_name,
ssh_proxy_command=self._ssh_proxy_command,
docker_ssh_proxy_command=docker_ssh_proxy_command,
port=self.port,
proxy_to_k8s=self.proxy_to_k8s))
rsync_command.append(f'-e "ssh {ssh_options}"')
# To support spaces in the path, we need to quote source and target.
# rsync doesn't support '~' in a quoted local path, but it is ok to
Expand Down
30 changes: 19 additions & 11 deletions sky/utils/command_runner.pyi
Original file line number Diff line number Diff line change
Expand Up @@ -20,10 +20,13 @@ RSYNC_FILTER_OPTION: str
RSYNC_EXCLUDE_OPTION: str


def ssh_options_list(ssh_private_key: Optional[str],
ssh_control_name: Optional[str],
*,
timeout: int = ...) -> List[str]:
def ssh_options_list(
ssh_private_key: Optional[str],
ssh_control_name: Optional[str],
proxy_to_k8s: Optional[bool] = False,
*,
timeout: int = ...,
) -> List[str]:
...


Expand All @@ -40,14 +43,18 @@ class SSHCommandRunner:
ssh_control_name: Optional[str]
docker_user: str
port: int
proxy_to_k8s: Optional[bool]

def __init__(self,
ip: str,
ssh_user: str,
ssh_private_key: str,
ssh_control_name: Optional[str] = ...,
port: str = ...,
docker_user: Optional[str] = ...) -> None:
def __init__(
self,
ip: str,
ssh_user: str,
ssh_private_key: str,
ssh_control_name: Optional[str] = ...,
port: str = ...,
docker_user: Optional[str] = ...,
proxy_to_k8s: Optional[bool] = ...,
) -> None:
...

@staticmethod
Expand All @@ -59,6 +66,7 @@ class SSHCommandRunner:
ssh_proxy_command: Optional[str] = ...,
port_list: Optional[List[int]] = ...,
docker_user: Optional[str] = ...,
proxy_to_k8s: Optional[bool] = ...,
) -> List['SSHCommandRunner']:
...

Expand Down
Loading