
[k8s_cloud_beta1] Adding support for ssh using kubectl port-forward to access k8s instance #2412

Conversation

@landscapepainter (Collaborator) commented Aug 17, 2023

This PR allows setting up an SSH session to access a k8s instance through a jump pod, using kubectl port-forward and socat as the ProxyCommand. The default SSH method uses kubectl port-forward with socat; if the user instead wants to open a NodePort service to access the jump pod, the following can be added to ~/.sky/config.yaml:

kubernetes:
  networking: nodeport

Note: Allowing the ControlMaster option set by ssh_options_list() causes stalling for the ControlPersist duration whenever an ssh command is run as part of sky commands on some Linux distributions (Ubuntu 22.04 and Debian 11 seem to be okay, but Debian 10 has the stalling issue). This is resolved in this PR by disallowing the use of ControlMaster for k8s instances.
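For illustration, a minimal sketch (not the PR's actual ssh_options_list(); names and structure here are assumed) of what skipping the ControlMaster-related options for k8s instances could look like:

from typing import Dict, List

def ssh_option_flags(options: Dict[str, str], is_kubernetes: bool) -> List[str]:
    """Build '-o Key=Value' flags, dropping ControlMaster options for k8s."""
    control_opts = ('ControlMaster', 'ControlPath', 'ControlPersist')
    flags: List[str] = []
    for key, value in options.items():
        if is_kubernetes and key in control_opts:
            continue  # avoids the ControlPersist stalling described above
        flags.extend(['-o', f'{key}={value}'])
    return flags

print(ssh_option_flags(
    {'StrictHostKeyChecking': 'no', 'ControlMaster': 'auto', 'ControlPersist': '10s'},
    is_kubernetes=True))
# -> ['-o', 'StrictHostKeyChecking=no']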

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Manual tests running sky launch, sky exec, ssh <cluster-name>
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_compatibility_tests.sh

@romilbhardwaj (Collaborator) commented

Thanks @landscapepainter! I merged both the latest master and k8s_cloud_gpu into k8s_cloud_beta1. Can you update this PR by merging the latest k8s_cloud_beta1 into your branch?

@landscapepainter (Collaborator, Author) commented

@romilbhardwaj merged the updated k8s_cloud_beta1 and added more comments.

One thing that is not resolved is how to have users install socat. It is not installed by default and is not available on PyPI. I was wondering if we should tell users to install it in the documentation.

@romilbhardwaj (Collaborator) left a comment:

Thanks @landscapepainter! Works nicely. Left some comments.

I'm planning to refactor the SSH jump pod creation to node_provider in k8s_cloud_beta1 branch, so we may need to update this branch after that's done.

ssh_setup_mode: str):
""" returns Proxycommand to use when establishing ssh connection
to the k8s instance through the jump pod.

Review comment (Collaborator):

For this method, I would err on the side of over-documenting. It would be good to add details here on:

  • why we use a proxycommand
  • what the proxycommand does behind the scenes

@landscapepainter (Collaborator, Author) commented Aug 20, 2023:

@romilbhardwaj I wrote the documentation that addresses both bullet points. Please take a look!
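For readers of this thread, a hypothetical sketch (function name, signature, and strings are assumed, not the PR's exact code) of the kind of documentation being discussed:

def get_ssh_proxy_command(ssh_jump_name: str, networking_mode: str) -> str:
    """Returns the ProxyCommand used to reach pods behind the SSH jump pod.

    Why a ProxyCommand: pods in a Kubernetes cluster usually have no publicly
    reachable address, so ssh cannot connect to them directly; ssh is instead
    told to tunnel through a jump pod running inside the cluster.

    What it does behind the scenes:
      - 'nodeport' mode: ssh hops through the jump pod's NodePort service
        exposed on a cluster node (a plain `ssh -W %h:%p` jump).
      - 'port-forward' mode: a local helper script runs `kubectl port-forward`
        to the jump pod's service and bridges stdin/stdout to the forwarded
        local port with socat, so no ports need to be opened on the cluster.
    """
    if networking_mode == 'nodeport':
        # Placeholder; the real command embeds the jump node's IP and port.
        return 'ssh -W %h:%p -p <node_port> sky@<node_ip>'
    return 'bash ~/.sky/port-forward-proxy-cmd.sh'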


# Establishes two directional byte streams to handle stdin/stdout between
# terminal and the jump pod
socat - tcp:{{ ipaddress }}:{{ local_port }}
Review comment (Collaborator):

I was trying this out and it silently failed to ssh for a long time before I realized I didn't have socat installed.

Is it possible to check whether socat is installed at the start of the script, raise an error if it's not, and propagate this error cleanly up to the user? Otherwise, we may want to add a check for socat elsewhere in our code...

@landscapepainter (Collaborator, Author) commented Aug 21, 2023:

@romilbhardwaj I added a check for socat installation at the beginning of the script; it displays an error message and exits if socat is not installed, so the message shows up when a user attempts ssh <k8s-instance-name> without socat installed.

But there doesn't seem to be a clean way to handle this exit and raise an error message for every possible ssh session run within SkyPilot. So I added another check for socat installation in setup_kubernetes_authentication() in authentication.py when 'port-forward' mode is being set up.
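A minimal sketch of what such a check could look like (using shutil.which; not necessarily the exact code added to authentication.py):

import shutil

def check_port_forward_dependencies() -> None:
    """Raise a clear error if the 'port-forward' mode dependency is missing."""
    if shutil.which('socat') is None:
        raise RuntimeError(
            '`socat` is required to set up the Kubernetes cloud with the '
            '`port-forward` networking mode and it is not installed. '
            'For Debian/Ubuntu systems, install it with:\n'
            '  $ sudo apt install socat')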

Running sky launch:

$ sky launch -y
I 08-21 00:29:50 optimizer.py:652] == Optimizer ==
I 08-21 00:29:50 optimizer.py:663] Target: minimizing cost
I 08-21 00:29:50 optimizer.py:675] Estimated cost: $0.0 / hour
I 08-21 00:29:50 optimizer.py:675] 
I 08-21 00:29:50 optimizer.py:748] Considered resources (1 node):
I 08-21 00:29:50 optimizer.py:797] ---------------------------------------------------------------------------------------------------
I 08-21 00:29:50 optimizer.py:797]  CLOUD        INSTANCE        vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN   
I 08-21 00:29:50 optimizer.py:797] ---------------------------------------------------------------------------------------------------
I 08-21 00:29:50 optimizer.py:797]  Kubernetes   2CPU--2GB       2       2         -              kubernetes      0.00          ✔     
I 08-21 00:29:50 optimizer.py:797]  AWS          m6i.2xlarge     8       32        -              us-east-1       0.38                
I 08-21 00:29:50 optimizer.py:797]  GCP          n2-standard-8   8       32        -              us-central1-a   0.39                
I 08-21 00:29:50 optimizer.py:797] ---------------------------------------------------------------------------------------------------
I 08-21 00:29:50 optimizer.py:797] 
Running task on cluster sky-6c5a-gcpuser...
I 08-21 00:29:50 cloud_vm_ray_backend.py:4052] Creating a new cluster: "sky-6c5a-gcpuser" [1x Kubernetes(2CPU--2GB)].
I 08-21 00:29:50 cloud_vm_ray_backend.py:4052] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 08-21 00:29:51 cloud_vm_ray_backend.py:1418] To view detailed progress: tail -n100 -f /home/gcpuser/sky_logs/sky-2023-08-21-00-29-46-434429/provision.log
Clusters
NAME              LAUNCHED     RESOURCES                 STATUS  AUTOSTOP  COMMAND                       
sky-2208-gcpuser  2 hrs ago    1x Kubernetes(2CPU--2GB)  UP      -         sky exec sky-2208-gcpuser...  

RuntimeError: `socat` is required to setup Kubernetes cloud with `port-forward` default networking mode and it is not installed. For Debian/Ubuntu system, install it with:
  $ sudo apt install socat

Running ssh <k8s-instance-name>:

$ ssh sky-2208-gcpuser
Using 'port-forward' mode to ssh into Kubernetes instances requires 'socat' to be installed. Please install 'socat'
ssh_exchange_identification: Connection closed by remote host
ssh_exchange_identification: Connection closed by remote host

Running sky exec:

$ sky exec sky-2208-gcpuser printenv
Task from command: printenv
Executing task on cluster sky-2208-gcpuser...
E 08-21 00:32:07 subprocess_utils.py:73] Using 'port-forward' mode to ssh into Kubernetes instances requires 'socat' to be installed. Please install 'socat'
E 08-21 00:32:07 subprocess_utils.py:73] ssh_exchange_identification: Connection closed by remote host
E 08-21 00:32:07 subprocess_utils.py:73] ssh_exchange_identification: Connection closed by remote host
E 08-21 00:32:07 subprocess_utils.py:73] 
I 08-21 00:32:07 cloud_vm_ray_backend.py:3242] 
I 08-21 00:32:07 cloud_vm_ray_backend.py:3242] Cluster name: sky-2208-gcpuser
I 08-21 00:32:07 cloud_vm_ray_backend.py:3242] To log into the head VM:	ssh sky-2208-gcpuser
I 08-21 00:32:07 cloud_vm_ray_backend.py:3242] To submit a job:		sky exec sky-2208-gcpuser yaml_file
I 08-21 00:32:07 cloud_vm_ray_backend.py:3242] To stop the cluster:	sky stop sky-2208-gcpuser
I 08-21 00:32:07 cloud_vm_ray_backend.py:3242] To teardown the cluster:	sky down sky-2208-gcpuser
Clusters
NAME              LAUNCHED     RESOURCES                 STATUS  AUTOSTOP  COMMAND                       
sky-2208-gcpuser  2 hrs ago    1x Kubernetes(2CPU--2GB)  UP      -         sky exec sky-2208-gcpuser...  

sky.exceptions.CommandError: Command python3 -u -c 'import os;from sky.skylet import job_lib, log_lib;job_id = job_lib.add_job('"'"'sky-cmd'"'"', '"'"'gcpuser'"'"', '"'"'sky-2023-08-21-00-32-07-240600'"'"', '"'"'1x [CPU:0.5]'"'"');print("Job ID: " + str(job_id), flush=True)' failed with return code 255.
Failed to fetch job id.

@@ -0,0 +1,25 @@
#!/usr/bin/env bash
Review comment (Collaborator):

Can you run pytest tests/test_smoke.py --kubernetes -k "not TestStorageWithCredentials" to make sure everything, including file_mounts, work correctly? I have manually verified, but going forward we want to run Kubernetes smoke tests for k8s PRs :)

Review comment (Collaborator Author):

@romilbhardwaj Currently passing all the tests on this branch besides the ones requiring GPUs.

landscapepainter and others added 4 commits August 19, 2023 15:56
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
@romilbhardwaj (Collaborator) left a comment:

@landscapepainter - I have done some refactoring in k8s_cloud_beta1. Can you merge those changes?

@landscapepainter (Collaborator, Author) commented:

@romilbhardwaj This is ready for another look!

@romilbhardwaj (Collaborator) commented:

Thanks @landscapepainter. After #2328 was merged, I updated k8s_cloud_beta1 and merged it with the latest master. Before I try this out, can you please update this branch to be up to date with k8s_cloud_beta1?

@landscapepainter (Collaborator, Author) commented Aug 25, 2023:

@romilbhardwaj Merged with updated k8s_cloud_beta1.

Also, I added functionality to switch the ssh jump pod's service when sky launch is run if the existing service does not match the user's networking configuration in ~/.sky/config.yaml. It deletes the existing service and recreates a new one based on the config.

Recreating only the ssh jump pod's service lets users switch between the networking modes and access any k8s instance, but this currently happens only when running sky launch. I'm wondering if we should think about a way to let users switch modes directly, e.g. by adding another CLI option.

@romilbhardwaj (Collaborator) left a comment:

Thanks @landscapepainter. Left some comments; still reading the code and yet to try it out.

@@ -404,16 +408,49 @@ def setup_kubernetes_authentication(config: Dict[str, Any]) -> Dict[str, Any]:
logger.error(suffix)
raise

ssh_jump_name = clouds.Kubernetes.SKY_SSH_JUMP_NAME
if ssh_setup_mode == 'nodeport':
Review comment (Collaborator):

Can we use the enum KubernetesNetworkingMode everywhere to avoid hardcoding strings 'nodeport' and 'portforward'? We can also ask the user to use the same string in the config file and then read directly.

Review comment (Collaborator Author):

@romilbhardwaj I'm wondering if we should create an enum for service types as well, since there are currently some places (setup_kubernetes_authentication, setup_sshjump_svc) using NodePort and ClusterIP as hardcoded strings. What do you think?

Review comment (Collaborator):

Yes, I think it'll be good to have a KubernetesServiceType enum
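A minimal sketch of the two enums under discussion (names follow the comments above; the actual definitions in the PR may differ):

import enum

class KubernetesNetworkingMode(enum.Enum):
    """How SkyPilot reaches the SSH jump pod ('networking' in ~/.sky/config.yaml)."""
    NODEPORT = 'nodeport'
    PORTFORWARD = 'portforward'

    @classmethod
    def from_str(cls, mode: str) -> 'KubernetesNetworkingMode':
        # Accept the same string the user writes in the config file.
        return cls(mode.lower())

class KubernetesServiceType(enum.Enum):
    """Kubernetes Service types used for the SSH jump pod's service."""
    NODEPORT = 'NodePort'
    CLUSTERIP = 'ClusterIP'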

@landscapepainter (Collaborator, Author) commented:

Thanks @romilbhardwaj! This is ready for another look.

# Establishes two directional byte streams to handle stdin/stdout between
# terminal and the jump pod.
# socat process terminates when port-forward terminates.
socat - tcp:127.0.0.1:{{ local_port }}
Review comment (Collaborator):

On running any command (sky launch or ssh), stdout prints:

Connection to 127.0.0.1 port 23100 [tcp/*] succeeded!

Is there some way to suppress this message?

Review comment (Collaborator Author):

There doesn't seem to be a clear way to go about this, especially since we are utilizing socat's stdout/stderr to interact with the node. I wasn't able to find a clear way from the man page either. Also, it looks like the message is OS specific? I'm using Debian GNU/Linux 10 (buster) and am not seeing the message:

$ ssh sky-1aad-gcpuser
Warning: Permanently added '[127.0.0.1]:23100' (ECDSA) to the list of known hosts.
Warning: Permanently added 'xx.xxx.x.x' (ECDSA) to the list of known hosts.
Linux sky-1aad-gcpuser-ray-head 4.19.0-22-cloud-amd64 #1 SMP Debian 4.19.260-1 (2022-09-29) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Tue Aug 29 04:19:04 2023 from xx.xxx.x.x

Comment on lines 11 to 14
pgrep -f "kubectl port-forward svc/{{ ssh_jump_name }} {{ local_port }}:22" > /dev/null
if [ $? -eq 0 ]; then
pkill -f "kubectl port-forward svc/{{ ssh_jump_name }} {{ local_port }}:22"
fi
Review comment (Collaborator):

I ran into a bug where creating multiple ssh connections in parallel (e.g., multiple ssh windows or parallel sky launches) would allow only one connection to live at a time. I.e., the previous connection gets abruptly terminated.

To replicate:

  1. sky launch -c test
  2. Open a terminal and run ssh test
  3. While the previous terminal is running, open a new terminal and run ssh test
  4. The ssh connection created in step 2 gets broken.
(base) sky@test-ray-head:~$ /Users/romilb/.sky/port-forward-proxy-cmd.sh: line 30: 32784 Terminated: 15          kubectl port-forward svc/sky-sshjump-2ea485ef 23100:22
                                           /Users/romilb/.sky/port-forward-proxy-cmd.sh: line 31: kill: (32784) - No such process
     client_loop: send disconnect: Broken pipe
client_loop: send disconnect: Broken pipe

We may need to:
a) find a better way to handle open kubectl port-forward processes instead of killing them; perhaps check whether they are in use or not
b) if the port is in use, use another local_port to create the new connection

Review comment (Collaborator):

@landscapepainter - here's a script that works and handles multiple SSH connections:

#!/usr/bin/env bash
set -uo pipefail

# Checks if socat is installed
if ! command -v socat > /dev/null; then
  echo "Using 'port-forward' mode to run ssh session on Kubernetes instances requires 'socat' to be installed. Please install 'socat'" >&2
  exit
fi

# Checks if lsof is installed
if ! command -v lsof > /dev/null; then
  echo "Checking port availability requires 'lsof' to be installed. Please install 'lsof'" >&2
  exit 1
fi

# Function to check if port is in use
is_port_in_use() {
    local port="$1"
    lsof -i :${port} > /dev/null 2>&1
}

# Start from a fixed local port and increment if in use
local_port={{ local_port }}
while is_port_in_use "${local_port}"; do
    local_port=$((local_port + 1))
done

# Establishes connection between local port and the ssh jump pod
kubectl port-forward svc/{{ ssh_jump_name }} "${local_port}":22 &

# Terminate the port-forward process when this script exits.
K8S_PORT_FWD_PID=$!
trap "kill $K8S_PORT_FWD_PID" EXIT

# checks if a connection to local_port of 127.0.0.1:[local_port] is established
while ! nc -z 127.0.0.1 "${local_port}"; do
    sleep 0.1
done

# Establishes two directional byte streams to handle stdin/stdout between
# terminal and the jump pod.
# socat process terminates when port-forward terminates.
socat - tcp:127.0.0.1:"${local_port}"

@landscapepainter (Collaborator, Author) commented Aug 29, 2023:

@romilbhardwaj Updated the script to the version above and added a check for lsof installation in authentication.py. The reason we need to check for installations in both authentication.py and port-forward-proxy-cmd.sh: #2412 (comment)

Review comment (Collaborator Author):

@romilbhardwaj Do you think it's necessary to set an upper bound on the local_port search? It would limit the number of concurrent ssh sessions, but I was wondering if we should set a range of available local ports for this purpose.

@landscapepainter (Collaborator, Author) commented Aug 29, 2023:

@romilbhardwaj Ready for another look! Updated the proxy script with the version you left in the comment and created another enum for service types. Also added another check for lsof installation in kubernetes_utils.py.

@@ -198,7 +213,8 @@ def make_runner_list(
port_list = [22] * len(ip_list)
return [
SSHCommandRunner(ip, ssh_user, ssh_private_key, ssh_control_name,
Review comment (Collaborator):

On GKE clusters, this PR is currently using the external IP of nodes (sky@34.16.78.81):

2023-08-29 17:53:01,300 VVINFO command_runner.py:367 -- Full command is `ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8d2ba49c09/098f6bcd46/%C -o ControlPersist=10s -o ProxyCommand=ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -p 23100 -W %h:%p sky@34.16.78.81 -o ProxyCommand='/Users/romilb/.sky/port-forward-proxy-cmd.sh'  -o Port=22 -o ConnectTimeout=120s sky@10.72.0.17 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p ~/.sky/.runtime_files)'`
building file list ... done

This will not work if the nodes are behind a firewall. It should instead be connecting to sky@127.0.0.1, since the port-forward is running locally.

You may need to update get_external_ip to accept an arg with the KubernetesNetworkingMode, which would return 127.0.0.1 if the mode is port-forward, else use the existing logic if the mode is nodeport.
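A hedged sketch of the suggested change, using the KubernetesNetworkingMode enum sketched earlier (signature assumed; _get_node_external_ip is a hypothetical placeholder for the existing nodeport logic):

def get_external_ip(network_mode: KubernetesNetworkingMode) -> str:
    """Return the IP that ssh should connect to.

    In 'port-forward' mode the tunnel terminates on the local machine, so the
    endpoint is always 127.0.0.1. In 'nodeport' mode, fall back to the
    existing logic that looks up a node's external IP.
    """
    if network_mode == KubernetesNetworkingMode.PORTFORWARD:
        return '127.0.0.1'
    return _get_node_external_ip()  # hypothetical: existing nodeport lookup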

@landscapepainter (Collaborator, Author) left a comment:

@romilbhardwaj This is ready for another look.

  • merged @hemildesai's branch that makes k8s-specific env vars accessible within ssh sessions, and updated the implementation
  • updated get_external_ip
  • removed query_env_vars and _update_envs_for_k8s

Confirmed successful provisioning of a k8s instance with a GPU.

# shell sessions.
set_k8s_env_var_cmd = [
'/bin/sh', '-c',
'printenv | awk \'{print "export " $0}\' > ~/k8s_env_var.sh && sudo mv ~/k8s_env_var.sh /etc/profile.d/k8s_env_var.sh'
Review comment (Collaborator):

Bug: After the first run, bash: 419: No such file or directory gets printed every time I SSH or exec a task.

Repro:

  1. sky launch -c test --gpus T4:1 -- nvidia-smi
  2. sky exec test --gpus T4:1 -- nvidia-smi (This prints the above error). Logs.
  3. ssh test also prints the above error.

Cause:
/etc/profile.d/k8s_env_var.sh contains a line like:

export NVIDIA_REQUIRE_CUDA=cuda>=11.6 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471

The shell tries to interpret special characters > and <, causing errors.

Solution - wrap the exported value in single quotes:

export NVIDIA_REQUIRE_CUDA='cuda>=11.6 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471'

Catch - if the env var value itself contains single quotes, we'll need to be careful in handling single quotes (') appearing in the string. See some suggestions here.
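One way to handle this safely, sketched with Python's shlex.quote instead of awk (illustrative; not the PR's final command):

import shlex

def make_env_export_script(env: dict) -> str:
    """Render `export KEY='VALUE'` lines with each value safely single-quoted.

    shlex.quote() escapes embedded single quotes, so values containing >, <,
    spaces, or quotes survive being sourced from /etc/profile.d.
    """
    return '\n'.join(f'export {key}={shlex.quote(value)}'
                     for key, value in env.items()) + '\n'

# Example: a value with <, > and an embedded single quote round-trips safely.
print(make_env_export_script({'NVIDIA_REQUIRE_CUDA': "cuda>=11.6 driver<419, it's fine"}))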

# shell sessions.
set_k8s_env_var_cmd = [
'/bin/sh', '-c',
'printenv | awk \'{print "export " $0}\' > ~/k8s_env_var.sh && sudo mv ~/k8s_env_var.sh /etc/profile.d/k8s_env_var.sh'
Review comment (Collaborator):

sudo may not be present on all images. Can we try mv without sudo first, and use sudo if it fails?
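A sketch of the suggested fallback (the command string is illustrative, not the PR's final version; the quoting issue from the previous comment is left aside here):

# Try a plain mv first; only fall back to sudo if it fails, since some images
# do not ship sudo while others need it to write to /etc/profile.d.
set_k8s_env_var_cmd = [
    '/bin/sh', '-c',
    'printenv | awk \'{print "export " $0}\' > ~/k8s_env_var.sh && '
    '(mv ~/k8s_env_var.sh /etc/profile.d/k8s_env_var.sh 2>/dev/null || '
    'sudo mv ~/k8s_env_var.sh /etc/profile.d/k8s_env_var.sh)'
]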

@romilbhardwaj (Collaborator) commented:

Thanks @landscapepainter - can you also run pytest tests/test_smoke.py --kubernetes -k "not TestStorageWithCredentials" on a test GKE cluster with all resources required for smoke tests?

@romilbhardwaj merged commit fb09398 into skypilot-org:k8s_cloud_beta1 Aug 31, 2023
romilbhardwaj added a commit that referenced this pull request Sep 16, 2023
* Working Ray K8s node provider based on SSH

* wip

* working provisioning with SkyPilot and ssh config

* working provisioning with SkyPilot and ssh config

* Updates to master

* ray2.3

* Clean up docs

* multiarch build

* hacking around ray start

* more port fixes

* fix up default instance selection

* fix resource selection

* Add provisioning timeout by checking if pods are ready

* Working mounting

* Remove catalog

* fixes

* fixes

* Fix ssh-key auth to create unique secrets

* Fix for ContainerCreating timeout

* Fix head node ssh port caching

* mypy

* lint

* fix ports

* typo

* cleanup

* cleanup

* wip

* Update setup

* readme updates

* lint

* Fix failover

* Fix failover

* optimize setup

* Fix sync down logs for k8s

* test wip

* instance name parsing wip

* Fix instance name parsing

* Merge fixes for query_status

* [k8s_cloud] Delete k8s service resources. (#2105)

Delete k8s service resources.

- 'sky down' for Kubernetes cloud to remove cluster service resources.

* Status refresh WIP

* refactor to kubernetes adaptor

* tests wip

* clean up auth

* wip tests

* cli

* cli

* sky local up/down cli

* cli

* lint

* lint

* lint

* Speed up kind cluster creation

* tests

* lint

* tests

* handling for non-reachable clusters

* Invalid kubeconfig handling

* Timeout for sky check

* code cleanup

* lint

* Do not raise error if GPUs requested, return empty list

* Address comments

* comments

* lint

* Remove public key upload

* GPU support init

* wip

* add shebang

* comments

* change permissions

* remove chmod

* merge 2241

* add todo

* Handle kube config management for sky local commands (#2253)

* Set current-context (if available) after sky local down and remove incorrect prompt in sky local up

* Warn user of kubeconfig context switch during sky local up

* Use Optional instead of Union

* Switch context in create_cluster if cluster already exists.

* fix typo

* update sky check error msg after sky local down

* lint

* update timeout check

* fix import error

* Fix kube API access from within cluster (load_incluster_auth)

* lint

* lint

* working autodown and sky status -r

* lint

* add test_kubernetes_autodown

* lint

* address comments

* address comments

* lint

* deletion timeouts wip

* [k8s_cloud] Ray pod not created under current context namespace. (#2302)

'namespace' exists under 'context' key.

* head ssh port namespace fix

* [k8s-cloud] Typo in sky local --help. (#2308)

Typo.

* [k8s-cloud] Set build_image.sh to be executable. (#2307)

* Set build_image.sh to be executable.

* Use TAG to easily switch between registries.

* remove ingress

* remove debug statements

* UX and readme updates

* lint

* fix logging for 409 retry

* lint

* lint

* Debug dockerfile

* wip

* Fix GPU image

* Query cloud specific env vars in task setup (#2347)

* Query cloud specific env vars in task setup

* Make query_env_vars specific to Kubernetes cloud

* Address PR comments

* working GPU type selection for GKE and EKS. GFD needs work.

* TODO for auto-detection

* Add image toggling for CPU/GPU

* Add image toggling for CPU/GPU

* Fix none acce_type

* remove memory from j2

* Make resnet examples run again

* lint

* v100 readme

* dockerfile and smoketest

* fractional cpu and mem

* nits

* refactor utils

* lint and cleanup

* lint and cleanup

* lint and cleanup

* lint and cleanup

* lint and cleanup

* lint and cleanup

* lint

* lint

* manual lint

* manual isort

* test readme update

* Remove EKS

* lint

* add gpu labeler

* updates

* lint

* update script

* ux

* fix formatter

* test update

* test update

* fix test_optimizer_dryruns

* docs

* cleanup

* test readme update

* lint

* lint

* [k8s_cloud_beta1] Add sshjump host support. (#2369)

* Update build image

* fix image path

* fix merge

* cleanup

* lint

* fix utils ref

* typo

* refactor pod creation

* lint

* merge fixes

* portfix

* merge fixes

* [k8s_cloud_beta1] Sky down for a cluster deployed in Kubernetes to possibly remove sshjump pod. (#2425)

* Sky down for a kubernetes cluster to possibly terminate sshjump pod.

- If the related sshjump pod's main container is reported as not having
  started, then remove its pod and service. This is to minimize the chances
  of being left with a dangling sshjump pod.

* Remove sshjump service in case of a failure to analyze sshjump.

- remove _request_timeout as it might not be needed due to
  terminationGracePeriodSeconds being set in sshjump template.

* Move sshjump analysis to kubernetes_utils.

* Apply changes per ./format.sh.

* Minor comment rephrase.

* Use sshjump_name from ray pod label.

- rather than from clouds.Kubernetes

* cleanup

* Add networking benchmarks

* comment

* comment

* lint

* autodown fixes

* lint

* fix label

* [k8s_cloud_beta1] Adding support for ssh using kubectl port-forward to access k8s instance (#2412)

* Add sshjump support.

* Update lcm script.

- add comments
- rename variables
- typo

* Set imagePullPolicy to IfNotPresent.

* add support for port-forward

* remove unused

* comments

* Disable ControlMaster for ssh_options_list

* nit

* update to disable rest of the ControlMaster

* command runner rsync update

* relocating run_on_k8s

* relocate run_on_k8s

* Make Kubernetes specific env variables available when joining a cluster via SSH

* merge k8s_cloud_beta1

* format

* remove redundant utils.py

* format and comments

* update with proxy_to_k8s

* Update sky/authentication.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* resolving comments on structures

* Update sky/utils/command_runner.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* document on nodeport/port-forward proxycommand

* error handling when socat is not installed

* removing KUBECONFIG from port-forward shell script

* nit

* nit

* Add support for nodeport

* Update sky/utils/kubernetes_utils.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* update

* switch svc when conflicting jump pod svc exist

* format

* Update sky/utils/kubernetes_utils.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* refactoring check for socat

* resolve comments

* add ServiceType enum and port-forward proxy script

* update k8s env var access

* add check for container status remove unused func

* nit

* update get_external_ip for portforward mode

* conditionally use sudo and quote values of env var

---------

Co-authored-by: Avi Weit <weit@il.ibm.com>
Co-authored-by: hemildesai <hemil.desai10@gmail.com>
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* refactor

* fix

* updates

* lint

* Update sky/skylet/providers/kubernetes/node_provider.py

* fix test

* [k8s] Showing reasons for provisioning failure in K8s (#2422)

* surface provision failure message

* nit

* nit

* format

* nit

* CPU message fix

* update Insufficient memory handling

* nit

* nit

* Update sky/skylet/providers/kubernetes/node_provider.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* Update sky/skylet/providers/kubernetes/node_provider.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* Update sky/skylet/providers/kubernetes/node_provider.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* Update sky/skylet/providers/kubernetes/node_provider.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* format

* update gpu failure message and condition

* fix GPU handling cases

* fix

* comment

* nit

* add try except block with general error handling

---------

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* cleanup

* lint

* fix for ssh jump image_id

* comments

* ssh jump refactor

* lint

* image build fixes

---------

Co-authored-by: Avi Weit <weit@il.ibm.com>
Co-authored-by: Hemil Desai <hemil.desai10@gmail.com>
Co-authored-by: Doyoung Kim <34902420+landscapepainter@users.noreply.github.com>