
[k8s_cloud_beta1] Adding support for ssh using kubectl port-forward to access k8s instance #2412

Conversation

@landscapepainter (Collaborator) commented Aug 17, 2023

This PR allows setting up an SSH session to access a k8s instance through a jump pod, using kubectl port-forward and socat as the ProxyCommand. The default SSH method uses kubectl port-forward with socat; if the user instead wants to open a NodePort service to access the jump pod, the following can be added to ~/.sky/config.yaml:

kubernetes:
  networking: nodeport

Note: Allowing the ControlMaster option set by ssh_options_list() causes stalling for the ControlPersist duration whenever an ssh command is run as part of sky commands on some Linux distributions (Ubuntu 22.04 and Debian 11 seem to be okay, but Debian 10 has the stalling issue). This is resolved in this PR by disallowing the use of ControlMaster for k8s instances.
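For illustration, a minimal sketch (not the PR's actual ssh_options_list(); names and structure here are assumed) of what skipping the ControlMaster-related options for k8s instances could look like:

from typing import Dict, List

def ssh_option_flags(options: Dict[str, str], is_kubernetes: bool) -> List[str]:
    """Build '-o Key=Value' flags, dropping ControlMaster options for k8s."""
    control_opts = ('ControlMaster', 'ControlPath', 'ControlPersist')
    flags: List[str] = []
    for key, value in options.items():
        if is_kubernetes and key in control_opts:
            continue  # avoids the ControlPersist stalling described above
        flags.extend(['-o', f'{key}={value}'])
    return flags

print(ssh_option_flags(
    {'StrictHostKeyChecking': 'no', 'ControlMaster': 'auto', 'ControlPersist': '10s'},
    is_kubernetes=True))
# -> ['-o', 'StrictHostKeyChecking=no']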

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Manual tests running sky launch, sky exec, ssh <cluster-name>
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_compatibility_tests.sh

@romilbhardwaj (Collaborator) commented

Thanks @landscapepainter! I merged both the latest master and k8s_cloud_gpu into k8s_cloud_beta1. Can you update this PR by merging the latest k8s_cloud_beta1 into your branch?

@landscapepainter (Collaborator, Author) commented

@romilbhardwaj merged the updated k8s_cloud_beta1 and added more comments.

One thing that is not resolved is how to have users install socat. It is not installed by default and is not available on PyPI. I was wondering if we should tell users to install it in the documentation.

@romilbhardwaj (Collaborator) left a comment:

Thanks @landscapepainter! Works nicely. Left some comments.

I'm planning to refactor the SSH jump pod creation to node_provider in k8s_cloud_beta1 branch, so we may need to update this branch after that's done.

ssh_setup_mode: str):
""" returns Proxycommand to use when establishing ssh connection
to the k8s instance through the jump pod.

Review comment (Collaborator):

For this method, I would err on the side of over-documenting. It would be good to add details here on:

  • why we use a proxycommand
  • what the proxycommand does behind the scenes

@landscapepainter (Collaborator, Author) commented Aug 20, 2023:

@romilbhardwaj I wrote the documentation that addresses both bullet points. Please take a look!
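For readers of this thread, a hypothetical sketch (function name, signature, and strings are assumed, not the PR's exact code) of the kind of documentation being discussed:

def get_ssh_proxy_command(ssh_jump_name: str, networking_mode: str) -> str:
    """Returns the ProxyCommand used to reach pods behind the SSH jump pod.

    Why a ProxyCommand: pods in a Kubernetes cluster usually have no publicly
    reachable address, so ssh cannot connect to them directly; ssh is instead
    told to tunnel through a jump pod running inside the cluster.

    What it does behind the scenes:
      - 'nodeport' mode: ssh hops through the jump pod's NodePort service
        exposed on a cluster node (a plain `ssh -W %h:%p` jump).
      - 'port-forward' mode: a local helper script runs `kubectl port-forward`
        to the jump pod's service and bridges stdin/stdout to the forwarded
        local port with socat, so no ports need to be opened on the cluster.
    """
    if networking_mode == 'nodeport':
        # Placeholder; the real command embeds the jump node's IP and port.
        return 'ssh -W %h:%p -p <node_port> sky@<node_ip>'
    return 'bash ~/.sky/port-forward-proxy-cmd.sh'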


# Establishes two directional byte streams to handle stdin/stdout between
# terminal and the jump pod
socat - tcp:{{ ipaddress }}:{{ local_port }}
Review comment (Collaborator):

I was trying this out and it silently failed to ssh for a long time before I realized I didn't have socat installed.

Is it possible to check whether socat is installed at the start of the script, raise an error if it's not, and propagate this error cleanly up to the user? Otherwise, we may want to add a check for socat elsewhere in our code...

@landscapepainter (Collaborator, Author) commented Aug 21, 2023:

@romilbhardwaj I added a check for socat installation at the beginning of the script; it displays an error message and exits if socat is not installed, so the message shows up when a user attempts ssh <k8s-instance-name> without socat installed.

But there doesn't seem to be a clean way to handle this exit and raise an error message for every possible ssh session run within SkyPilot. So I added another check for socat installation in setup_kubernetes_authentication() in authentication.py when 'port-forward' mode is being set up.
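A minimal sketch of what such a check could look like (using shutil.which; not necessarily the exact code added to authentication.py):

import shutil

def check_port_forward_dependencies() -> None:
    """Raise a clear error if the 'port-forward' mode dependency is missing."""
    if shutil.which('socat') is None:
        raise RuntimeError(
            '`socat` is required to set up the Kubernetes cloud with the '
            '`port-forward` networking mode and it is not installed. '
            'For Debian/Ubuntu systems, install it with:\n'
            '  $ sudo apt install socat')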

Running sky launch:

$ sky launch -y
I 08-21 00:29:50 optimizer.py:652] == Optimizer ==
I 08-21 00:29:50 optimizer.py:663] Target: minimizing cost
I 08-21 00:29:50 optimizer.py:675] Estimated cost: $0.0 / hour
I 08-21 00:29:50 optimizer.py:675] 
I 08-21 00:29:50 optimizer.py:748] Considered resources (1 node):
I 08-21 00:29:50 optimizer.py:797] ---------------------------------------------------------------------------------------------------
I 08-21 00:29:50 optimizer.py:797]  CLOUD        INSTANCE        vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN   
I 08-21 00:29:50 optimizer.py:797] ---------------------------------------------------------------------------------------------------
I 08-21 00:29:50 optimizer.py:797]  Kubernetes   2CPU--2GB       2       2         -              kubernetes      0.00          ✔     
I 08-21 00:29:50 optimizer.py:797]  AWS          m6i.2xlarge     8       32        -              us-east-1       0.38                
I 08-21 00:29:50 optimizer.py:797]  GCP          n2-standard-8   8       32        -              us-central1-a   0.39                
I 08-21 00:29:50 optimizer.py:797] ---------------------------------------------------------------------------------------------------
I 08-21 00:29:50 optimizer.py:797] 
Running task on cluster sky-6c5a-gcpuser...
I 08-21 00:29:50 cloud_vm_ray_backend.py:4052] Creating a new cluster: "sky-6c5a-gcpuser" [1x Kubernetes(2CPU--2GB)].
I 08-21 00:29:50 cloud_vm_ray_backend.py:4052] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 08-21 00:29:51 cloud_vm_ray_backend.py:1418] To view detailed progress: tail -n100 -f /home/gcpuser/sky_logs/sky-2023-08-21-00-29-46-434429/provision.log
Clusters
NAME              LAUNCHED     RESOURCES                 STATUS  AUTOSTOP  COMMAND                       
sky-2208-gcpuser  2 hrs ago    1x Kubernetes(2CPU--2GB)  UP      -         sky exec sky-2208-gcpuser...  

RuntimeError: `socat` is required to setup Kubernetes cloud with `port-forward` default networking mode and it is not installed. For Debian/Ubuntu system, install it with:
  $ sudo apt install socat

Running ssh <k8s-instance-name>:

$ ssh sky-2208-gcpuser
Using 'port-forward' mode to ssh into Kubernetes instances requires 'socat' to be installed. Please install 'socat'
ssh_exchange_identification: Connection closed by remote host
ssh_exchange_identification: Connection closed by remote host

Running sky exec:

$ sky exec sky-2208-gcpuser printenv
Task from command: printenv
Executing task on cluster sky-2208-gcpuser...
E 08-21 00:32:07 subprocess_utils.py:73] Using 'port-forward' mode to ssh into Kubernetes instances requires 'socat' to be installed. Please install 'socat'
E 08-21 00:32:07 subprocess_utils.py:73] ssh_exchange_identification: Connection closed by remote host
E 08-21 00:32:07 subprocess_utils.py:73] ssh_exchange_identification: Connection closed by remote host
E 08-21 00:32:07 subprocess_utils.py:73] 
I 08-21 00:32:07 cloud_vm_ray_backend.py:3242] 
I 08-21 00:32:07 cloud_vm_ray_backend.py:3242] Cluster name: sky-2208-gcpuser
I 08-21 00:32:07 cloud_vm_ray_backend.py:3242] To log into the head VM:	ssh sky-2208-gcpuser
I 08-21 00:32:07 cloud_vm_ray_backend.py:3242] To submit a job:		sky exec sky-2208-gcpuser yaml_file
I 08-21 00:32:07 cloud_vm_ray_backend.py:3242] To stop the cluster:	sky stop sky-2208-gcpuser
I 08-21 00:32:07 cloud_vm_ray_backend.py:3242] To teardown the cluster:	sky down sky-2208-gcpuser
Clusters
NAME              LAUNCHED     RESOURCES                 STATUS  AUTOSTOP  COMMAND                       
sky-2208-gcpuser  2 hrs ago    1x Kubernetes(2CPU--2GB)  UP      -         sky exec sky-2208-gcpuser...  

sky.exceptions.CommandError: Command python3 -u -c 'import os;from sky.skylet import job_lib, log_lib;job_id = job_lib.add_job('"'"'sky-cmd'"'"', '"'"'gcpuser'"'"', '"'"'sky-2023-08-21-00-32-07-240600'"'"', '"'"'1x [CPU:0.5]'"'"');print("Job ID: " + str(job_id), flush=True)' failed with return code 255.
Failed to fetch job id.

@@ -0,0 +1,25 @@
#!/usr/bin/env bash
Review comment (Collaborator):

Can you run pytest tests/test_smoke.py --kubernetes -k "not TestStorageWithCredentials" to make sure everything, including file_mounts, work correctly? I have manually verified, but going forward we want to run Kubernetes smoke tests for k8s PRs :)

Review comment (Collaborator Author):

@romilbhardwaj Currently passing all the tests on this branch besides the ones requiring GPUs.

landscapepainter and others added 4 commits August 19, 2023 15:56
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
@romilbhardwaj (Collaborator) left a comment:

@landscapepainter - I have done some refactoring in k8s_cloud_beta1. Can you merge those changes?

@landscapepainter (Collaborator, Author) commented:

@romilbhardwaj This is ready for another look!

@romilbhardwaj (Collaborator) commented:

Thanks @landscapepainter. After #2328 was merged, I updated k8s_cloud_beta1 and merged it with the latest master. Before I try this out, can you please update this branch to be up to date with k8s_cloud_beta1?

@landscapepainter (Collaborator, Author) commented Aug 25, 2023:

@romilbhardwaj Merged with updated k8s_cloud_beta1.

Also, I added functionality to switch the ssh jump pod's service when sky launch is run if the existing service does not match the user's networking configuration in ~/.sky/config.yaml. It deletes the existing service and recreates a new one based on the config.

Recreating only the ssh jump pod's service lets users switch between the networking modes and access any k8s instance, but this currently happens only when running sky launch. I'm wondering if we should think about a way to let users switch modes directly, e.g. by adding another CLI option.

@romilbhardwaj (Collaborator) left a comment:

Thanks @landscapepainter. Left some comments; still reading the code and yet to try it out.

@@ -404,16 +408,49 @@ def setup_kubernetes_authentication(config: Dict[str, Any]) -> Dict[str, Any]:
logger.error(suffix)
raise

ssh_jump_name = clouds.Kubernetes.SKY_SSH_JUMP_NAME
if ssh_setup_mode == 'nodeport':
Review comment (Collaborator):

Can we use the enum KubernetesNetworkingMode everywhere to avoid hardcoding strings 'nodeport' and 'portforward'? We can also ask the user to use the same string in the config file and then read directly.

Review comment (Collaborator Author):

@romilbhardwaj I'm wondering if we should create an enum for service types as well, since there are currently some places (setup_kubernetes_authentication, setup_sshjump_svc) using NodePort and ClusterIP as hardcoded strings. What do you think?

Review comment (Collaborator):

Yes, I think it'll be good to have a KubernetesServiceType enum
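A minimal sketch of the two enums under discussion (names follow the comments above; the actual definitions in the PR may differ):

import enum

class KubernetesNetworkingMode(enum.Enum):
    """How SkyPilot reaches the SSH jump pod ('networking' in ~/.sky/config.yaml)."""
    NODEPORT = 'nodeport'
    PORTFORWARD = 'portforward'

    @classmethod
    def from_str(cls, mode: str) -> 'KubernetesNetworkingMode':
        # Accept the same string the user writes in the config file.
        return cls(mode.lower())

class KubernetesServiceType(enum.Enum):
    """Kubernetes Service types used for the SSH jump pod's service."""
    NODEPORT = 'NodePort'
    CLUSTERIP = 'ClusterIP'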

@landscapepainter (Collaborator, Author) commented:

Thanks @romilbhardwaj! This is ready for another look.

# Establishes two directional byte streams to handle stdin/stdout between
# terminal and the jump pod.
# socat process terminates when port-forward terminates.
socat - tcp:127.0.0.1:{{ local_port }}
Review comment (Collaborator):

On running any command (sky launch or ssh), stdout prints:

Connection to 127.0.0.1 port 23100 [tcp/*] succeeded!

Is there some way to suppress this message?

Review comment (Collaborator Author):

There doesn't seem to be a clear way to go about this, especially since we are utilizing socat's stdout/stderr to interact with the node. I wasn't able to find a clear way from the man page either. Also, it looks like the message is OS specific? I'm using Debian GNU/Linux 10 (buster) and am not seeing the message:

$ ssh sky-1aad-gcpuser
Warning: Permanently added '[127.0.0.1]:23100' (ECDSA) to the list of known hosts.
Warning: Permanently added 'xx.xxx.x.x' (ECDSA) to the list of known hosts.
Linux sky-1aad-gcpuser-ray-head 4.19.0-22-cloud-amd64 #1 SMP Debian 4.19.260-1 (2022-09-29) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Tue Aug 29 04:19:04 2023 from xx.xxx.x.x

Comment on lines 11 to 14
pgrep -f "kubectl port-forward svc/{{ ssh_jump_name }} {{ local_port }}:22" > /dev/null
if [ $? -eq 0 ]; then
pkill -f "kubectl port-forward svc/{{ ssh_jump_name }} {{ local_port }}:22"
fi
Review comment (Collaborator):

I ran into a bug where creating multiple ssh connections in parallel (e.g., multiple ssh windows or parallel sky launches) would allow only one connection to live at a time. I.e., the previous connection gets abruptly terminated.

To replicate:

  1. sky launch -c test
  2. Open a terminal and run ssh test
  3. While the previous terminal is running, open a new terminal and run ssh test
  4. The ssh connection created in step 2 gets broken.
(base) sky@test-ray-head:~$ /Users/romilb/.sky/port-forward-proxy-cmd.sh: line 30: 32784 Terminated: 15          kubectl port-forward svc/sky-sshjump-2ea485ef 23100:22
                                           /Users/romilb/.sky/port-forward-proxy-cmd.sh: line 31: kill: (32784) - No such process
     client_loop: send disconnect: Broken pipe
client_loop: send disconnect: Broken pipe

We may need to:
a) find a better way to handle open kubectl port-forward processes instead of killing them; perhaps check whether they are in use or not
b) if the port is in use, use another local_port to create the new connection

Review comment (Collaborator):

@landscapepainter - here's a script that works and handles multiple SSH connections:

#!/usr/bin/env bash
set -uo pipefail

# Checks if socat is installed
if ! command -v socat > /dev/null; then
  echo "Using 'port-forward' mode to run ssh session on Kubernetes instances requires 'socat' to be installed. Please install 'socat'" >&2
  exit
fi

# Checks if lsof is installed
if ! command -v lsof > /dev/null; then
  echo "Checking port availability requires 'lsof' to be installed. Please install 'lsof'" >&2
  exit 1
fi

# Function to check if port is in use
is_port_in_use() {
    local port="$1"
    lsof -i :${port} > /dev/null 2>&1
}

# Start from a fixed local port and increment if in use
local_port={{ local_port }}
while is_port_in_use "${local_port}"; do
    local_port=$((local_port + 1))
done

# Establishes connection between local port and the ssh jump pod
kubectl port-forward svc/{{ ssh_jump_name }} "${local_port}":22 &

# Terminate the port-forward process when this script exits.
K8S_PORT_FWD_PID=$!
trap "kill $K8S_PORT_FWD_PID" EXIT

# checks if a connection to local_port of 127.0.0.1:[local_port] is established
while ! nc -z 127.0.0.1 "${local_port}"; do
    sleep 0.1
done

# Establishes two directional byte streams to handle stdin/stdout between
# terminal and the jump pod.
# socat process terminates when port-forward terminates.
socat - tcp:127.0.0.1:"${local_port}"

@landscapepainter (Collaborator, Author) commented Aug 29, 2023:

@romilbhardwaj Updated the script to the version above and added a check for lsof installation in authentication.py. The reason we need to check for installations in both authentication.py and port-forward-proxy-cmd.sh: #2412 (comment)

Review comment (Collaborator Author):

@romilbhardwaj Do you think it's necessary to set an upper bound on the local_port search? It would limit the number of concurrent ssh sessions, but I was wondering if we should set a range of available local ports for this purpose.

@landscapepainter (Collaborator, Author) commented Aug 29, 2023:

@romilbhardwaj Ready for another look! Updated the proxy script with the version you left in the comment and created another enum for service types. Also added another check for lsof installation in kubernetes_utils.py.

@@ -198,7 +213,8 @@ def make_runner_list(
port_list = [22] * len(ip_list)
return [
SSHCommandRunner(ip, ssh_user, ssh_private_key, ssh_control_name,
Review comment (Collaborator):

On GKE clusters, this PR is currently using the external IP of nodes (sky@34.16.78.81):

2023-08-29 17:53:01,300 VVINFO command_runner.py:367 -- Full command is `ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8d2ba49c09/098f6bcd46/%C -o ControlPersist=10s -o ProxyCommand=ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -p 23100 -W %h:%p sky@34.16.78.81 -o ProxyCommand='/Users/romilb/.sky/port-forward-proxy-cmd.sh'  -o Port=22 -o ConnectTimeout=120s sky@10.72.0.17 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p ~/.sky/.runtime_files)'`
building file list ... done

This will not work if the nodes are behind a firewall. It should instead be connecting to sky@127.0.0.1, since the port-forward is running locally.

You may need to update get_external_ip to accept an arg with the KubernetesNetworkingMode, which would return 127.0.0.1 if the mode is port-forward, else use the existing logic if the mode is nodeport.
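A hedged sketch of the suggested change, using the KubernetesNetworkingMode enum sketched earlier (signature assumed; _get_node_external_ip is a hypothetical placeholder for the existing nodeport logic):

def get_external_ip(network_mode: KubernetesNetworkingMode) -> str:
    """Return the IP that ssh should connect to.

    In 'port-forward' mode the tunnel terminates on the local machine, so the
    endpoint is always 127.0.0.1. In 'nodeport' mode, fall back to the
    existing logic that looks up a node's external IP.
    """
    if network_mode == KubernetesNetworkingMode.PORTFORWARD:
        return '127.0.0.1'
    return _get_node_external_ip()  # hypothetical: existing nodeport lookup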

@landscapepainter (Collaborator, Author) left a comment:

@romilbhardwaj This is ready for another look.

  • merged @hemildesai's branch that makes k8s-specific env vars accessible within ssh sessions, and updated the implementation
  • updated get_external_ip
  • removed query_env_vars and _update_envs_for_k8s

Confirmed successful provisioning of a k8s instance with a GPU.

# shell sessions.
set_k8s_env_var_cmd = [
'/bin/sh', '-c',
'printenv | awk \'{print "export " $0}\' > ~/k8s_env_var.sh && sudo mv ~/k8s_env_var.sh /etc/profile.d/k8s_env_var.sh'
Review comment (Collaborator):

Bug: After the first run, bash: 419: No such file or directory gets printed every time I SSH or exec a task.

Repro:

  1. sky launch -c test --gpus T4:1 -- nvidia-smi
  2. sky exec test --gpus T4:1 -- nvidia-smi (This prints the above error). Logs.
  3. ssh test also prints the above error.

Cause:
/etc/profile.d/k8s_env_var.sh contains a line like:

export NVIDIA_REQUIRE_CUDA=cuda>=11.6 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471

The shell tries to interpret special characters > and <, causing errors.

Solution - wrap the exported value in single quotes:

export NVIDIA_REQUIRE_CUDA='cuda>=11.6 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471'

Catch - if the env var value itself contains single quotes, we'll need to be careful in handling single quotes (') appearing in the string. See some suggestions here.
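One way to handle this safely, sketched with Python's shlex.quote instead of awk (illustrative; not the PR's final command):

import shlex

def make_env_export_script(env: dict) -> str:
    """Render `export KEY='VALUE'` lines with each value safely single-quoted.

    shlex.quote() escapes embedded single quotes, so values containing >, <,
    spaces, or quotes survive being sourced from /etc/profile.d.
    """
    return '\n'.join(f'export {key}={shlex.quote(value)}'
                     for key, value in env.items()) + '\n'

# Example: a value with <, > and an embedded single quote round-trips safely.
print(make_env_export_script({'NVIDIA_REQUIRE_CUDA': "cuda>=11.6 driver<419, it's fine"}))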

# shell sessions.
set_k8s_env_var_cmd = [
'/bin/sh', '-c',
'printenv | awk \'{print "export " $0}\' > ~/k8s_env_var.sh && sudo mv ~/k8s_env_var.sh /etc/profile.d/k8s_env_var.sh'
Review comment (Collaborator):

sudo may not be present on all images. Can we try mv without sudo first, and use sudo if it fails?
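A sketch of the suggested fallback (the command string is illustrative, not the PR's final version; the quoting issue from the previous comment is left aside here):

# Try a plain mv first; only fall back to sudo if it fails, since some images
# do not ship sudo while others need it to write to /etc/profile.d.
set_k8s_env_var_cmd = [
    '/bin/sh', '-c',
    'printenv | awk \'{print "export " $0}\' > ~/k8s_env_var.sh && '
    '(mv ~/k8s_env_var.sh /etc/profile.d/k8s_env_var.sh 2>/dev/null || '
    'sudo mv ~/k8s_env_var.sh /etc/profile.d/k8s_env_var.sh)'
]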

@romilbhardwaj (Collaborator) commented:

Thanks @landscapepainter - can you also run pytest tests/test_smoke.py --kubernetes -k "not TestStorageWithCredentials" on a test GKE cluster with all resources required for smoke tests?

@romilbhardwaj merged commit fb09398 into skypilot-org:k8s_cloud_beta1 Aug 31, 2023
romilbhardwaj added a commit that referenced this pull request Sep 16, 2023
* Working Ray K8s node provider based on SSH

* wip

* working provisioning with SkyPilot and ssh config

* working provisioning with SkyPilot and ssh config

* Updates to master

* ray2.3

* Clean up docs

* multiarch build

* hacking around ray start

* more port fixes

* fix up default instance selection

* fix resource selection

* Add provisioning timeout by checking if pods are ready

* Working mounting

* Remove catalog

* fixes

* fixes

* Fix ssh-key auth to create unique secrets

* Fix for ContainerCreating timeout

* Fix head node ssh port caching

* mypy

* lint

* fix ports

* typo

* cleanup

* cleanup

* wip

* Update setup

* readme updates

* lint

* Fix failover

* Fix failover

* optimize setup

* Fix sync down logs for k8s

* test wip

* instance name parsing wip

* Fix instance name parsing

* Merge fixes for query_status

* [k8s_cloud] Delete k8s service resources. (#2105)

Delete k8s service resources.

- 'sky down' for Kubernetes cloud to remove cluster service resources.

* Status refresh WIP

* refactor to kubernetes adaptor

* tests wip

* clean up auth

* wip tests

* cli

* cli

* sky local up/down cli

* cli

* lint

* lint

* lint

* Speed up kind cluster creation

* tests

* lint

* tests

* handling for non-reachable clusters

* Invalid kubeconfig handling

* Timeout for sky check

* code cleanup

* lint

* Do not raise error if GPUs requested, return empty list

* Address comments

* comments

* lint

* Remove public key upload

* GPU support init

* wip

* add shebang

* comments

* change permissions

* remove chmod

* merge 2241

* add todo

* Handle kube config management for sky local commands (#2253)

* Set current-context (if available) after sky local down and remove incorrect prompt in sky local up

* Warn user of kubeconfig context switch during sky local up

* Use Optional instead of Union

* Switch context in create_cluster if cluster already exists.

* fix typo

* update sky check error msg after sky local down

* lint

* update timeout check

* fix import error

* Fix kube API access from within cluster (load_incluster_auth)

* lint

* lint

* working autodown and sky status -r

* lint

* add test_kubernetes_autodown

* lint

* address comments

* address comments

* lint

* deletion timeouts wip

* [k8s_cloud] Ray pod not created under current context namespace. (#2302)

'namespace' exists under 'context' key.

* head ssh port namespace fix

* [k8s-cloud] Typo in sky local --help. (#2308)

Typo.

* [k8s-cloud] Set build_image.sh to be executable. (#2307)

* Set build_image.sh to be executable.

* Use TAG to easily switch between registries.

* remove ingress

* remove debug statements

* UX and readme updates

* lint

* fix logging for 409 retry

* lint

* lint

* Debug dockerfile

* wip

* Fix GPU image

* Query cloud specific env vars in task setup (#2347)

* Query cloud specific env vars in task setup

* Make query_env_vars specific to Kubernetes cloud

* Address PR comments

* working GPU type selection for GKE and EKS. GFD needs work.

* TODO for auto-detection

* Add image toggling for CPU/GPU

* Add image toggling for CPU/GPU

* Fix none acce_type

* remove memory from j2

* Make resnet examples run again

* lint

* v100 readme

* dockerfile and smoketest

* fractional cpu and mem

* nits

* refactor utils

* lint and cleanup

* lint and cleanup

* lint and cleanup

* lint and cleanup

* lint and cleanup

* lint and cleanup

* lint

* lint

* manual lint

* manual isort

* test readme update

* Remove EKS

* lint

* add gpu labeler

* updates

* lint

* update script

* ux

* fix formatter

* test update

* test update

* fix test_optimizer_dryruns

* docs

* cleanup

* test readme update

* lint

* lint

* [k8s_cloud_beta1] Add sshjump host support. (#2369)

* Update build image

* fix image path

* fix merge

* cleanup

* lint

* fix utils ref

* typo

* refactor pod creation

* lint

* merge fixes

* portfix

* merge fixes

* [k8s_cloud_beta1] Sky down for a cluster deployed in Kubernetes to possibly remove sshjump pod. (#2425)

* Sky down for a kubernetes cluster to possibly terminate sshjump pod.

- If the related sshjump pod's main container is reported as not having
  started, then remove its pod and service. This is to minimize the chances
  of being left with a dangling sshjump pod.

* Remove sshjump service in case of a failure to analyze sshjump.

- remove _request_timeout as it might not be needed due to
  terminationGracePeriodSeconds being set in sshjump template.

* Move sshjump analysis to kubernetes_utils.

* Apply changes per ./format.sh.

* Minor comment rephrase.

* Use sshjump_name from ray pod label.

- rather than from clouds.Kubernetes

* cleanup

* Add networking benchmarks

* comment

* comment

* lint

* autodown fixes

* lint

* fix label

* [k8s_cloud_beta1] Adding support for ssh using kubectl port-forward to access k8s instance (#2412)

* Add sshjump support.

* Update lcm script.

- add comments
- rename variables
- typo

* Set imagePullPolicy to IfNotPresent.

* add support for port-forward

* remove unused

* comments

* Disable ControlMaster for ssh_options_list

* nit

* update to disable rest of the ControlMaster

* command runner rsync update

* relocating run_on_k8s

* relocate run_on_k8s

* Make Kubernetes specific env variables available when joining a cluster via SSH

* merge k8s_cloud_beta1

* format

* remove redundant utils.py

* format and comments

* update with proxy_to_k8s

* Update sky/authentication.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* resolving comments on structures

* Update sky/utils/command_runner.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* document on nodeport/port-forward proxycommand

* error handling when socat is not installed

* removing KUBECONFIG from port-forward shell script

* nit

* nit

* Add support for nodeport

* Update sky/utils/kubernetes_utils.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* update

* switch svc when conflicting jump pod svc exist

* format

* Update sky/utils/kubernetes_utils.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* refactoring check for socat

* resolve comments

* add ServiceType enum and port-forward proxy script

* update k8s env var access

* add check for container status remove unused func

* nit

* update get_external_ip for portforward mode

* conditionally use sudo and quote values of env var

---------

Co-authored-by: Avi Weit <weit@il.ibm.com>
Co-authored-by: hemildesai <hemil.desai10@gmail.com>
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* refactor

* fix

* updates

* lint

* Update sky/skylet/providers/kubernetes/node_provider.py

* fix test

* [k8s] Showing reasons for provisioning failure in K8s (#2422)

* surface provision failure message

* nit

* nit

* format

* nit

* CPU message fix

* update Insufficient memory handling

* nit

* nit

* Update sky/skylet/providers/kubernetes/node_provider.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* Update sky/skylet/providers/kubernetes/node_provider.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* Update sky/skylet/providers/kubernetes/node_provider.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* Update sky/skylet/providers/kubernetes/node_provider.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* format

* update gpu failure message and condition

* fix GPU handling cases

* fix

* comment

* nit

* add try except block with general error handling

---------

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* cleanup

* lint

* fix for ssh jump image_id

* comments

* ssh jump refactor

* lint

* image build fixes

---------

Co-authored-by: Avi Weit <weit@il.ibm.com>
Co-authored-by: Hemil Desai <hemil.desai10@gmail.com>
Co-authored-by: Doyoung Kim <34902420+landscapepainter@users.noreply.github.com>