[k8s_cloud_beta1] Adding support for ssh using kubectl port-forward to access k8s instance #2412
Conversation
- add comments - rename variables - typo
Thanks @landscapepainter! I merged both the latest master and …
@romilbhardwaj merged the updated … One thing that is not resolved is how to have the users install …
Thanks @landscapepainter! Works nicely. Left some comments.
I'm planning to refactor the SSH jump pod creation to node_provider in k8s_cloud_beta1 branch, so we may need to update this branch after that's done.
sky/authentication.py
Outdated
ssh_setup_mode: str):
    """Returns the ProxyCommand to use when establishing an ssh connection
    to the k8s instance through the jump pod.
For this method, I would err on the side of over-documenting. It would be good to add details here on:
- why we use a proxycommand
- what does the proxycommand do behind the scenes
@romilbhardwaj I wrote the doc that addresses both bullet points. Please take a look!
# Establishes bidirectional byte streams to handle stdin/stdout between
# the terminal and the jump pod
socat - tcp:{{ ipaddress }}:{{ local_port }}
I was trying this out and it silently failed to ssh for a long time before I realized I didn't have socat installed.
Is it possible to check if socat is installed at the start of the script, raise an error if it's not, and propagate this error cleanly up to the user? Otherwise, we may want to add a check for socat elsewhere in our code...
@romilbhardwaj I added a check for socat installation at the beginning of the script; it displays an error message and exits if socat is not installed, so users see the message when they attempt to ssh <k8s-instance-name> without socat installed.
But there doesn't seem to be a clean way to handle this exit and raise an error message for every possible ssh session run within SkyPilot. So I added another check for socat installation in setup_kubernetes_authentication in authentication.py when 'port-forward' mode is being set up.
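A minimal sketch of what such an early client-side dependency check could look like (the function name and message text are illustrative, not SkyPilot's actual API):

```python
import shutil


def check_port_forward_dependencies(deps=('socat',), which=shutil.which):
    """Raise early if any client-side tool needed for port-forward is missing."""
    missing = [dep for dep in deps if which(dep) is None]
    if missing:
        raise RuntimeError(
            f"Using 'port-forward' networking mode requires "
            f"{', '.join(missing)} to be installed on the client.")
```

Calling a check like this once during authentication setup surfaces one clean error, instead of a confusing ssh_exchange_identification failure inside every ssh session.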
Running sky launch:
$ sky launch -y
I 08-21 00:29:50 optimizer.py:652] == Optimizer ==
I 08-21 00:29:50 optimizer.py:663] Target: minimizing cost
I 08-21 00:29:50 optimizer.py:675] Estimated cost: $0.0 / hour
I 08-21 00:29:50 optimizer.py:675]
I 08-21 00:29:50 optimizer.py:748] Considered resources (1 node):
I 08-21 00:29:50 optimizer.py:797] ---------------------------------------------------------------------------------------------------
I 08-21 00:29:50 optimizer.py:797] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 08-21 00:29:50 optimizer.py:797] ---------------------------------------------------------------------------------------------------
I 08-21 00:29:50 optimizer.py:797] Kubernetes 2CPU--2GB 2 2 - kubernetes 0.00 ✔
I 08-21 00:29:50 optimizer.py:797] AWS m6i.2xlarge 8 32 - us-east-1 0.38
I 08-21 00:29:50 optimizer.py:797] GCP n2-standard-8 8 32 - us-central1-a 0.39
I 08-21 00:29:50 optimizer.py:797] ---------------------------------------------------------------------------------------------------
I 08-21 00:29:50 optimizer.py:797]
Running task on cluster sky-6c5a-gcpuser...
I 08-21 00:29:50 cloud_vm_ray_backend.py:4052] Creating a new cluster: "sky-6c5a-gcpuser" [1x Kubernetes(2CPU--2GB)].
I 08-21 00:29:50 cloud_vm_ray_backend.py:4052] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 08-21 00:29:51 cloud_vm_ray_backend.py:1418] To view detailed progress: tail -n100 -f /home/gcpuser/sky_logs/sky-2023-08-21-00-29-46-434429/provision.log
Clusters
NAME LAUNCHED RESOURCES STATUS AUTOSTOP COMMAND
sky-2208-gcpuser 2 hrs ago 1x Kubernetes(2CPU--2GB) UP - sky exec sky-2208-gcpuser...
RuntimeError: `socat` is required to setup Kubernetes cloud with `port-forward` default networking mode and it is not installed. For Debian/Ubuntu system, install it with:
$ sudo apt install socat
Running ssh <k8s-instance-name>:
$ ssh sky-2208-gcpuser
Using 'port-forward' mode to ssh into Kubernetes instances requires 'socat' to be installed. Please install 'socat'
ssh_exchange_identification: Connection closed by remote host
ssh_exchange_identification: Connection closed by remote host
Running sky exec:
$ sky exec sky-2208-gcpuser printenv
Task from command: printenv
Executing task on cluster sky-2208-gcpuser...
E 08-21 00:32:07 subprocess_utils.py:73] Using 'port-forward' mode to ssh into Kubernetes instances requires 'socat' to be installed. Please install 'socat'
E 08-21 00:32:07 subprocess_utils.py:73] ssh_exchange_identification: Connection closed by remote host
E 08-21 00:32:07 subprocess_utils.py:73] ssh_exchange_identification: Connection closed by remote host
E 08-21 00:32:07 subprocess_utils.py:73]
I 08-21 00:32:07 cloud_vm_ray_backend.py:3242]
I 08-21 00:32:07 cloud_vm_ray_backend.py:3242] Cluster name: sky-2208-gcpuser
I 08-21 00:32:07 cloud_vm_ray_backend.py:3242] To log into the head VM: ssh sky-2208-gcpuser
I 08-21 00:32:07 cloud_vm_ray_backend.py:3242] To submit a job: sky exec sky-2208-gcpuser yaml_file
I 08-21 00:32:07 cloud_vm_ray_backend.py:3242] To stop the cluster: sky stop sky-2208-gcpuser
I 08-21 00:32:07 cloud_vm_ray_backend.py:3242] To teardown the cluster: sky down sky-2208-gcpuser
Clusters
NAME LAUNCHED RESOURCES STATUS AUTOSTOP COMMAND
sky-2208-gcpuser 2 hrs ago 1x Kubernetes(2CPU--2GB) UP - sky exec sky-2208-gcpuser...
sky.exceptions.CommandError: Command python3 -u -c 'import os;from sky.skylet import job_lib, log_lib;job_id = job_lib.add_job('"'"'sky-cmd'"'"', '"'"'gcpuser'"'"', '"'"'sky-2023-08-21-00-32-07-240600'"'"', '"'"'1x [CPU:0.5]'"'"');print("Job ID: " + str(job_id), flush=True)' failed with return code 255.
Failed to fetch job id.
@@ -0,0 +1,25 @@
#!/usr/bin/env bash
Can you run pytest tests/test_smoke.py --kubernetes -k "not TestStorageWithCredentials" to make sure everything, including file_mounts, works correctly? I have manually verified, but going forward we want to run Kubernetes smoke tests for k8s PRs :)
@romilbhardwaj Currently passing all the tests for this branch besides the ones requiring GPUs.
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
…painter/skypilot into ssh-port-forward-beta1
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
@landscapepainter - I have done some refactoring in k8s_cloud_beta1. Can you merge those changes?
@romilbhardwaj This is ready for another look!
Thanks @landscapepainter. After #2328 has been merged, I updated …
@romilbhardwaj Merged with the updated … Also, I added functionality to switch the ssh jump pod's service if the already existing service does not match the user's networking configuration. Solely recreating the service for the ssh jump pod allows users to switch between different networking modes and access any k8s instance, but this is currently only allowed by running …
Thanks @landscapepainter. Left some comments, still reading code and yet to try.
sky/authentication.py
Outdated
@@ -404,16 +408,49 @@ def setup_kubernetes_authentication(config: Dict[str, Any]) -> Dict[str, Any]:
        logger.error(suffix)
        raise

    ssh_jump_name = clouds.Kubernetes.SKY_SSH_JUMP_NAME
    if ssh_setup_mode == 'nodeport':
Can we use the enum KubernetesNetworkingMode everywhere to avoid hardcoding the strings 'nodeport' and 'portforward'? We can also ask the user to use the same string in the config file and then read it directly.
@romilbhardwaj I'm wondering if we should create an enum for service types as well, since there are currently some places (setup_kubernetes_authentication, setup_sshjump_svc) using NodePort and ClusterIP as hardcoded strings. What do you think?
Yes, I think it'll be good to have a KubernetesServiceType enum.
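A sketch of what the two enums discussed here might look like (the class names follow the review comments; the values and the from_str helper are illustrative, not SkyPilot's actual definitions):

```python
import enum


class KubernetesNetworkingMode(enum.Enum):
    """How SkyPilot reaches the ssh jump pod."""
    NODEPORT = 'nodeport'
    PORTFORWARD = 'portforward'

    @classmethod
    def from_str(cls, mode: str) -> 'KubernetesNetworkingMode':
        # Accept the same string the user writes in the config file.
        try:
            return cls(mode.lower())
        except ValueError:
            raise ValueError(
                f'Unsupported networking mode: {mode!r}. '
                f'Valid values: {[m.value for m in cls]}') from None


class KubernetesServiceType(enum.Enum):
    """Kubernetes Service types used for the ssh jump pod."""
    NODEPORT = 'NodePort'
    CLUSTERIP = 'ClusterIP'
```

Reading the user's config through from_str keeps a single source of truth for the accepted strings and gives one clear error message for typos.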
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
…painter/skypilot into ssh-port-forward-beta1 merge
Thanks @romilbhardwaj! This is ready for another look.
# Establishes bidirectional byte streams to handle stdin/stdout between
# the terminal and the jump pod.
# The socat process terminates when the port-forward terminates.
socat - tcp:127.0.0.1:{{ local_port }}
On running any command (sky launch or ssh), stdout prints:
Connection to 127.0.0.1 port 23100 [tcp/*] succeeded!
Is there some way to suppress this message?
There doesn't seem to be a clear way to go about this, especially since we are utilizing socat's stdout/stderr to interact with the node. I wasn't able to find a clear way from the man page either. Also, it looks like the message is OS specific? I'm using Debian GNU/Linux 10 (buster) and am not seeing the message:
$ ssh sky-1aad-gcpuser
Warning: Permanently added '[127.0.0.1]:23100' (ECDSA) to the list of known hosts.
Warning: Permanently added 'xx.xxx.x.x' (ECDSA) to the list of known hosts.
Linux sky-1aad-gcpuser-ray-head 4.19.0-22-cloud-amd64 #1 SMP Debian 4.19.260-1 (2022-09-29) x86_64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Tue Aug 29 04:19:04 2023 from xx.xxx.x.x
pgrep -f "kubectl port-forward svc/{{ ssh_jump_name }} {{ local_port }}:22" > /dev/null
if [ $? -eq 0 ]; then
    pkill -f "kubectl port-forward svc/{{ ssh_jump_name }} {{ local_port }}:22"
fi
I ran into a bug where creating multiple ssh connections in parallel (e.g., multiple ssh windows or parallel sky launches) would allow only one connection to live at a time. I.e., the previous connection gets abruptly terminated.
To replicate:
1. sky launch -c test
2. Open a terminal and run ssh test
3. While the previous terminal is running, open a new terminal and run ssh test
4. The ssh connection created in step 2 gets broken.
(base) sky@test-ray-head:~$ /Users/romilb/.sky/port-forward-proxy-cmd.sh: line 30: 32784 Terminated: 15 kubectl port-forward svc/sky-sshjump-2ea485ef 23100:22
/Users/romilb/.sky/port-forward-proxy-cmd.sh: line 31: kill: (32784) - No such process
client_loop: send disconnect: Broken pipe
client_loop: send disconnect: Broken pipe
We may need to:
a) find a better way to handle open kubectl port-forward processes instead of killing them. Perhaps check if they are in use or not.
b) if the port is in use, use another local_port to create the new connection.
@landscapepainter - here's a script that works and handles multiple SSH connections:
#!/usr/bin/env bash
set -uo pipefail
# Checks if socat is installed
if ! command -v socat > /dev/null; then
echo "Using 'port-forward' mode to run ssh session on Kubernetes instances requires 'socat' to be installed. Please install 'socat'" >&2
exit 1
fi
# Checks if lsof is installed
if ! command -v lsof > /dev/null; then
echo "Checking port availability requires 'lsof' to be installed. Please install 'lsof'" >&2
exit 1
fi
# Function to check if port is in use
is_port_in_use() {
local port="$1"
lsof -i :${port} > /dev/null 2>&1
}
# Start from a fixed local port and increment if in use
local_port={{ local_port }}
while is_port_in_use "${local_port}"; do
local_port=$((local_port + 1))
done
# Establishes connection between local port and the ssh jump pod
kubectl port-forward svc/{{ ssh_jump_name }} "${local_port}":22 &
# Terminate the port-forward process when this script exits.
K8S_PORT_FWD_PID=$!
trap "kill $K8S_PORT_FWD_PID" EXIT
# checks if a connection to local_port of 127.0.0.1:[local_port] is established
while ! nc -z 127.0.0.1 "${local_port}"; do
sleep 0.1
done
# Establishes two directional byte streams to handle stdin/stdout between
# terminal and the jump pod.
# socat process terminates when port-forward terminates.
socat - tcp:127.0.0.1:"${local_port}"
@romilbhardwaj Updated the script to the version above and added a check for lsof installation in authentication.py. Reason why we need to check for installations in both authentication.py and port-forward-proxy-cmd.sh: #2412 (comment)
@romilbhardwaj Do you think it's necessary to set an upper bound on possible local_port usage? It can limit the number of concurrent ssh sessions, but I was wondering if we should set a range on available local_port values for this purpose.
@romilbhardwaj Ready for another look! Updated the proxy script with what you left in the comment and created another …
@@ -198,7 +213,8 @@ def make_runner_list(
    port_list = [22] * len(ip_list)
    return [
        SSHCommandRunner(ip, ssh_user, ssh_private_key, ssh_control_name,
On GKE clusters, this PR is currently using the external IP of nodes (sky@34.16.78.81):
2023-08-29 17:53:01,300 VVINFO command_runner.py:367 -- Full command is `ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8d2ba49c09/098f6bcd46/%C -o ControlPersist=10s -o ProxyCommand=ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -p 23100 -W %h:%p sky@34.16.78.81 -o ProxyCommand='/Users/romilb/.sky/port-forward-proxy-cmd.sh' -o Port=22 -o ConnectTimeout=120s sky@10.72.0.17 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p ~/.sky/.runtime_files)'`
building file list ... done
This will not work if the nodes are behind a firewall. It should instead be connecting to sky@127.0.0.1 since the port-forward is running locally.
You may need to update get_external_ip to accept an arg with the KubernetesNetworkingMode, which would return 127.0.0.1 if the mode is port-forward, else use the existing logic if the mode is nodeport.
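A hedged sketch of that suggested change (the enum and signature are simplified here; the real function takes cluster-specific arguments, and the nodeport lookup is stubbed):

```python
import enum


class KubernetesNetworkingMode(enum.Enum):
    NODEPORT = 'nodeport'
    PORTFORWARD = 'portforward'


def get_external_ip(mode: KubernetesNetworkingMode) -> str:
    if mode == KubernetesNetworkingMode.PORTFORWARD:
        # The kubectl port-forward tunnel terminates on the local machine,
        # so ssh should connect to localhost rather than a node IP.
        return '127.0.0.1'
    # nodeport mode: keep the existing node external-IP lookup (stub).
    return _query_node_external_ip()


def _query_node_external_ip() -> str:
    raise NotImplementedError('placeholder for the existing nodeport logic')
```

This keeps firewalled GKE nodes reachable in port-forward mode, since the only address the client ever dials is its own loopback interface.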
@romilbhardwaj This is ready for another look.
- merged @hemildesai's branch on allowing k8s-specific env vars to be accessible within an ssh session and updated the implementation
- updated get_external_ip
- removed query_env_vars and _update_envs_for_k8s
Confirmed successful provision on a k8s instance with GPU
# shell sessions.
set_k8s_env_var_cmd = [
    '/bin/sh', '-c',
    'printenv | awk \'{print "export " $0}\' > ~/k8s_env_var.sh && sudo mv ~/k8s_env_var.sh /etc/profile.d/k8s_env_var.sh'
Bug: After the first run, bash: 419: No such file or directory gets printed every time I SSH or exec a task.
Repro:
1. sky launch -c test --gpus T4:1 -- nvidia-smi
2. sky exec test --gpus T4:1 -- nvidia-smi (this prints the above error). Logs.
3. ssh test also prints the above error.
Cause: /etc/profile.d/k8s_env_var.sh contains a line like:
export NVIDIA_REQUIRE_CUDA=cuda>=11.6 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471
The shell tries to interpret special characters > and <, causing errors.
Solution - wrap the exported value in single quotes:
export NVIDIA_REQUIRE_CUDA='cuda>=11.6 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471'
Catch - if the env var value itself contains single quotes, we'll need to be careful in handling single quotes (') appearing in the string. See some suggestions here.
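In Python terms, the suggested fix is essentially what the standard library's shlex.quote does: wrap the value in single quotes and rewrite embedded single quotes, so < and > are never treated as redirections when the file is sourced (the helper name below is illustrative):

```python
import shlex


def export_line(name: str, value: str) -> str:
    """Build an export line that is safe to write into /etc/profile.d/*.sh."""
    # shlex.quote() single-quotes any value containing shell metacharacters
    # and escapes embedded single quotes for us.
    return f'export {name}={shlex.quote(value)}'


print(export_line('NVIDIA_REQUIRE_CUDA', 'cuda>=11.6 brand=tesla,driver<419'))
print(export_line('MSG', "it's fine"))  # embedded single quote is handled too
```

Generating the lines with a quoting helper like this (rather than a bare awk '{print "export " $0}') sidesteps both the redirection bug and the embedded-quote catch in one place.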
# shell sessions.
set_k8s_env_var_cmd = [
    '/bin/sh', '-c',
    'printenv | awk \'{print "export " $0}\' > ~/k8s_env_var.sh && sudo mv ~/k8s_env_var.sh /etc/profile.d/k8s_env_var.sh'
sudo may not be present on all images. Can we try mv without sudo first, and use sudo if it fails?
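One way to sketch that fallback when building the command (the paths mirror the snippet above; the exact command SkyPilot ends up using may differ):

```python
# Try `mv` without sudo first; the `|| sudo mv ...` branch only runs if the
# plain mv fails (e.g., no write access to /etc/profile.d on this image).
mv_fallback = ('mv ~/k8s_env_var.sh /etc/profile.d/k8s_env_var.sh '
               '2> /dev/null || '
               'sudo mv ~/k8s_env_var.sh /etc/profile.d/k8s_env_var.sh')
set_k8s_env_var_cmd = [
    '/bin/sh', '-c',
    ('printenv | awk \'{print "export " $0}\' > ~/k8s_env_var.sh && ' +
     mv_fallback)
]
```

On images that run the container as root, the first mv succeeds and sudo is never invoked, so the command also works where sudo is absent.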
Thanks @landscapepainter - can you also run …
* Working Ray K8s node provider based on SSH * wip * working provisioning with SkyPilot and ssh config * working provisioning with SkyPilot and ssh config * Updates to master * ray2.3 * Clean up docs * multiarch build * hacking around ray start * more port fixes * fix up default instance selection * fix resource selection * Add provisioning timeout by checking if pods are ready * Working mounting * Remove catalog * fixes * fixes * Fix ssh-key auth to create unique secrets * Fix for ContainerCreating timeout * Fix head node ssh port caching * mypy * lint * fix ports * typo * cleanup * cleanup * wip * Update setup * readme updates * lint * Fix failover * Fix failover * optimize setup * Fix sync down logs for k8s * test wip * instance name parsing wip * Fix instance name parsing * Merge fixes for query_status * [k8s_cloud] Delete k8s service resources. (#2105) Delete k8s service resources. - 'sky down' for Kubernetes cloud to remove cluster service resources. * Status refresh WIP * refactor to kubernetes adaptor * tests wip * clean up auth * wip tests * cli * cli * sky local up/down cli * cli * lint * lint * lint * Speed up kind cluster creation * tests * lint * tests * handling for non-reachable clusters * Invalid kubeconfig handling * Timeout for sky check * code cleanup * lint * Do not raise error if GPUs requested, return empty list * Address comments * comments * lint * Remove public key upload * GPU support init * wip * add shebang * comments * change permissions * remove chmod * merge 2241 * add todo * Handle kube config management for sky local commands (#2253) * Set current-context (if availablee) after sky local down and remove incorrect prompt in sky local up * Warn user of kubeconfig context switch during sky local up * Use Optional instead of Union * Switch context in create_cluster if cluster already exists. 
* fix typo * update sky check error msg after sky local down * lint * update timeout check * fix import error * Fix kube API access from within cluster (load_incluster_auth) * lint * lint * working autodown and sky status -r * lint * add test_kubernetes_autodown * lint * address comments * address comments * lint * deletion timeouts wip * [k8s_cloud] Ray pod not created under current context namespace. (#2302) 'namespace' exists under 'context' key. * head ssh port namespace fix * [k8s-cloud] Typo in sky local --help. (#2308) Typo. * [k8s-cloud] Set build_image.sh to be executable. (#2307) * Set build_image.sh to be executable. * Use TAG to easily switch between registries. * remove ingress * remove debug statements * UX and readme updates * lint * fix logging for 409 retry * lint * lint * Debug dockerfile * wip * Fix GPU image * Query cloud specific env vars in task setup (#2347) * Query cloud specific env vars in task setup * Make query_env_vars specific to Kubernetes cloud * Address PR comments * working GPU type selection for GKE and EKS. GFD needs work. * TODO for auto-detection * Add image toggling for CPU/GPU * Add image toggling for CPU/GPU * Fix none acce_type * remove memory from j2 * Make resnet examples run again * lint * v100 readme * dockerfile and smoketest * fractional cpu and mem * nits * refactor utils * lint and cleanup * lint and cleanup * lint and cleanup * lint and cleanup * lint and cleanup * lint and cleanup * lint * lint * manual lint * manual isort * test readme update * Remove EKS * lint * add gpu labeler * updates * lint * update script * ux * fix formatter * test update * test update * fix test_optimizer_dryruns * docs * cleanup * test readme update * lint * lint * [k8s_cloud_beta1] Add sshjump host support. 
(#2369) * Update build image * fix image path * fix merge * cleanup * lint * fix utils ref * typo * refactor pod creation * lint * merge fixes * portfix * merge fixes * [k8s_cloud_beta1] Sky down for a cluster deployed in Kubernetes to possibly remove sshjump pod. (#2425) * Sky down for a kubernetes cluster to possibly terminate sshjump pod. - If the related sshjump pod is being reported as its main container not have been started, then remove its pod and service. This is to minimize the chances for remaining with dangling sshjump pod. * Remove sshjump service in case of an failure to analyze sshjump. - remove _request_timeout as it might not be needed due to terminationGracePeriodSeconds being set in sshjump template. * Move sshjump analysis to kubernetes_utils. * Apply changes per ./format.sh. * Minor comment rephrase. * Use sshjump_name from ray pod label. - rather than from clouds.Kubernetes * cleanup * Add networking benchmarks * comment * comment * lint * autodown fixes * lint * fix label * [k8s_cloud_beta1] Adding support for ssh using kubectl port-forward to access k8s instance (#2412) * Add sshjump support. * Update lcm script. - add comments - rename variables - typo * Set imagePullPolicy to IfNotPresent. 
* add support for port-forward * remove unused * comments * Disable ControlMaster for ssh_options_list * nit * update to disable rest of the ControlMaster * command runner rsync update * relocating run_on_k8s * relocate run_on_k8s * Make Kubernetes specific env variables available when joining a cluster via SSH * merge k8s_cloud_beta1 * format * remove redundant utils.py * format and comments * update with proxy_to_k8s * Update sky/authentication.py Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * resolving comments on structures * Update sky/utils/command_runner.py Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * document on nodeport/port-forward proxycommand * error handling when socat is not installed * removing KUBECONFIG from port-forward shell script * nit * nit * Add suport for nodeport * Update sky/utils/kubernetes_utils.py Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * update * switch svc when conflicting jump pod svc exist * format * Update sky/utils/kubernetes_utils.py Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * refactoring check for socat * resolve comments * add ServiceType enum and port-forward proxy script * update k8s env var access * add check for container status remove unused func * nit * update get_external_ip for portforward mode * conditionally use sudo and quote values of env var --------- Co-authored-by: Avi Weit <weit@il.ibm.com> Co-authored-by: hemildesai <hemil.desai10@gmail.com> Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * refactor * fix * updates * lint * Update sky/skylet/providers/kubernetes/node_provider.py * fix test * [k8s] Showing reasons for provisioning failure in K8s (#2422) * surface provision failure message * nit * nit * format * nit * CPU message fix * update Insufficient memory handling * nit * nit * Update sky/skylet/providers/kubernetes/node_provider.py Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * Update 
sky/skylet/providers/kubernetes/node_provider.py Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * Update sky/skylet/providers/kubernetes/node_provider.py Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * Update sky/skylet/providers/kubernetes/node_provider.py Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * format * update gpu failure message and condition * fix GPU handling cases * fix * comment * nit * add try except block with general error handling --------- Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * cleanup * lint * fix for ssh jump image_id * comments * ssh jump refactor * lint * image build fixes --------- Co-authored-by: Avi Weit <weit@il.ibm.com> Co-authored-by: Hemil Desai <hemil.desai10@gmail.com> Co-authored-by: Doyoung Kim <34902420+landscapepainter@users.noreply.github.com>
This PR allows setting up an ssh session to access a k8s instance through a jump pod, using kubectl port-forward and socat as a ProxyCommand. The default ssh method is to use kubectl port-forward and socat, and if the user wants to open up a NodePort service to access the jump pod instead, the following can be written in ~/.sky/config.yaml:

Note: Allowing the use of ControlMaster that is set from ssh_options_list() causes ControlPersist seconds of stalling whenever an ssh command is run while running sky commands on some Linux distributions (Ubuntu 22.04; Debian 11 seems to be okay, but Debian 10 has the stalling issue). This is resolved in this PR by disallowing the use of ControlMaster for k8s instances.

Tested (run the relevant ones):
- bash format.sh
- sky launch, sky exec, ssh <cluster-name>
- pytest tests/test_smoke.py
- pytest tests/test_smoke.py::test_fill_in_the_name
- bash tests/backward_compatibility_tests.sh