[k8s_cloud] Ray pod not created under current context namespace. #2302

Merged (1 commit) on Jul 26, 2023

aviweit (Contributor) commented Jul 25, 2023

Hi @romilbhardwaj ,

I encountered an issue where Ray pods and services are not created in the namespace defined in the current context (set via kubectl config set-context --current --namespace=<namespace name>).

This PR seems to fix the issue under my environment. Can you please review it?
Thanks.

Please find below what I did:

sky local up # wait for kind to create skypilot cluster
kubectl config set-context --current --namespace=test
kubectl create ns test
sky launch ../tasks/hi.yaml # very simple task that echoes

The Ray pod and services are created in the default namespace rather than in test. Provisioning then appears to hang because the secret is created in the test namespace.

Looking into it further, it seems that the namespace of the current context is stored under the 'context' key of the context entry:

e.g. in my kubeadm cluster

{'context': {'cluster': 'kubernetes', 'namespace': 'test', 'user': 'kubernetes-admin'}, 'name': 'kubernetes-admin@kubernetes'}

in my kind cluster

 {'context': {'cluster': 'kind-skypilot', 'namespace': 'test', 'user': 'kind-skypilot'}, 'name': 'kind-skypilot'}
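The two context entries above suggest a lookup along the following lines. This is a minimal sketch, not the code from this PR: `namespace_from_context` and `current_namespace` are hypothetical helper names, and falling back to `default` when the context sets no namespace is an assumption. `config.list_kube_config_contexts()` is the Kubernetes Python client call that returns the list of contexts together with the active one.

```python
from typing import Any, Dict, Optional

DEFAULT_NAMESPACE = 'default'  # assumed fallback when the context sets none


def namespace_from_context(active_context: Optional[Dict[str, Any]]) -> str:
    """Return the namespace nested under the 'context' key of a context entry."""
    if not active_context:
        return DEFAULT_NAMESPACE
    # The namespace lives under 'context', not at the top level of the entry.
    inner = active_context.get('context') or {}
    return inner.get('namespace') or DEFAULT_NAMESPACE


def current_namespace() -> str:
    """Read the active kubeconfig context (needs the kubernetes package)."""
    # Imported lazily so the pure helper above works without the package.
    from kubernetes import config
    _, active_context = config.list_kube_config_contexts()
    return namespace_from_context(active_context)


# Context entry copied from the kind cluster example above.
kind_context = {
    'context': {
        'cluster': 'kind-skypilot',
        'namespace': 'test',
        'user': 'kind-skypilot',
    },
    'name': 'kind-skypilot',
}
print(namespace_from_context(kind_context))  # test
```

Looking for 'namespace' at the top level of the entry would miss it and fall back to `default`, which matches the behaviour described above.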

romilbhardwaj (Collaborator) commented Jul 26, 2023

Thanks for catching this @aviweit! This requires one more change in get_head_ssh_port in backend_utils.py to work correctly (namespace is hardcoded to default there), but I'll go ahead and merge this and fix directly in k8s_cloud. Thanks again for finding this bug!

(base) ➜  ~ sky launch -c clus
I 07-25 16:54:54 optimizer.py:636] == Optimizer ==
I 07-25 16:54:54 optimizer.py:647] Target: minimizing cost
I 07-25 16:54:54 optimizer.py:659] Estimated cost: $0.0 / hour
I 07-25 16:54:54 optimizer.py:659]
I 07-25 16:54:54 optimizer.py:732] Considered resources (1 node):
I 07-25 16:54:54 optimizer.py:781] ---------------------------------------------------------------------------------------------------
I 07-25 16:54:54 optimizer.py:781]  CLOUD        INSTANCE          vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN
I 07-25 16:54:54 optimizer.py:781] ---------------------------------------------------------------------------------------------------
I 07-25 16:54:54 optimizer.py:781]  Kubernetes   2CPU--2GB         2       2         -              kubernetes    0.00          ✔
I 07-25 16:54:54 optimizer.py:781]  AWS          m6i.2xlarge       8       32        -              us-east-1     0.38
I 07-25 16:54:54 optimizer.py:781]  Azure        Standard_D8s_v5   8       32        -              eastus        0.38
I 07-25 16:54:54 optimizer.py:781]  IBM          bx2-8x32          8       32        -              us-east       0.38
I 07-25 16:54:54 optimizer.py:781]  GCP          n2-standard-8     8       32        -              us-central1   0.39
I 07-25 16:54:54 optimizer.py:781]  Lambda       gpu_1x_a10        30      200       A10:1          us-east-1     0.60
I 07-25 16:54:54 optimizer.py:781] ---------------------------------------------------------------------------------------------------
I 07-25 16:54:54 optimizer.py:781]
Launching a new cluster 'clus'. Proceed? [Y/n]: Y
I 07-25 16:54:55 cloud_vm_ray_backend.py:3962] Creating a new cluster: "clus" [1x Kubernetes(2CPU--2GB)].
I 07-25 16:54:55 cloud_vm_ray_backend.py:3962] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 07-25 16:54:55 cloud_vm_ray_backend.py:1409] To view detailed progress: tail -n100 -f /Users/romilb/sky_logs/sky-2023-07-25-16-54-53-205112/provision.log
I 07-25 16:54:55 cloud_vm_ray_backend.py:1762] Launching on Kubernetes
I 07-25 16:55:13 cloud_vm_ray_backend.py:1575] Successfully provisioned or found existing VM.
Clusters
NAME  LAUNCHED     RESOURCES                 STATUS  AUTOSTOP  COMMAND
clus  21 secs ago  1x Kubernetes(2CPU--2GB)  INIT    -         sky launch -c clus

Traceback (most recent call last):
  File "/Users/romilb/tools/anaconda3/bin/sky", line 8, in <module>
    sys.exit(cli())
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 239, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/cli.py", line 1112, in invoke
    return super().invoke(ctx)
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 260, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/cli.py", line 1380, in launch
    _launch_with_confirm(task,
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/cli.py", line 796, in _launch_with_confirm
    sky.launch(
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 260, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 260, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/execution.py", line 487, in launch
    _execute(
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/execution.py", line 324, in _execute
    handle = backend.provision(task,
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 260, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 239, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/backends/backend.py", line 56, in provision
    return self._provision(task, to_provision, dryrun, stream_logs,
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/backends/cloud_vm_ray_backend.py", line 2619, in _provision
    ssh_port_list = handle.external_ssh_ports(
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/backends/cloud_vm_ray_backend.py", line 2310, in external_ssh_ports
    self._update_stable_ssh_ports(max_attempts=max_attempts)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/backends/cloud_vm_ray_backend.py", line 2229, in _update_stable_ssh_ports
    head_port = backend_utils.get_head_ssh_port(
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/common_utils.py", line 260, in _record
    return f(*args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/backends/backend_utils.py", line 1622, in get_head_ssh_port
    head_ssh_port = clouds.Kubernetes.get_port(svc_name, 'default')
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/clouds/kubernetes.py", line 217, in get_port
    return kubernetes_utils.get_port(svc_name, namespace)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/skylet/providers/kubernetes/utils.py", line 15, in get_port
    head_service = kubernetes.core_api().read_namespaced_service(
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/kubernetes/client/api/core_v1_api.py", line 24931, in read_namespaced_service
    return self.read_namespaced_service_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/kubernetes/client/api/core_v1_api.py", line 25018, in read_namespaced_service_with_http_info
    return self.api_client.call_api(
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 373, in request
    return self.rest_client.GET(url,
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/kubernetes/client/rest.py", line 241, in GET
    return self.request("GET", url,
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/kubernetes/client/rest.py", line 235, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': '4dde30ba-2773-442c-a2ad-c8b76836a3d1', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'f7320e7d-cba7-43fd-a67f-d02390250a65', 'X-Kubernetes-Pf-Prioritylevel-Uid': '3b49a09f-6452-4859-99bc-ffe73df162b2', 'Date': 'Tue, 25 Jul 2023 23:55:15 GMT', 'Content-Length': '210'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"services \"clus-ray-head-ssh\" not found","reason":"NotFound","details":{"name":"clus-ray-head-ssh","kind":"services"},"code":404}
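The 404 in the traceback comes from looking up the head SSH service in a hardcoded 'default' namespace. The sketch below passes the namespace through instead. It is an illustration under stated assumptions, not the actual fix: `FakeCoreApi` is a hypothetical stand-in for `kubernetes.client.CoreV1Api`, the port number 30022 is made up, reading `node_port` from the first service port is an assumption, and the service name pattern is inferred from `clus-ray-head-ssh` in the log.

```python
from types import SimpleNamespace


class FakeCoreApi:
    """Stand-in for kubernetes.client.CoreV1Api, for illustration only."""

    def __init__(self, services):
        # Maps (namespace, service name) -> NodePort number.
        self._services = services

    def read_namespaced_service(self, name, namespace):
        try:
            node_port = self._services[(namespace, name)]
        except KeyError:
            # The real client raises ApiException with status 404 here.
            raise LookupError(
                f'services "{name}" not found in {namespace}') from None
        port = SimpleNamespace(node_port=node_port)
        return SimpleNamespace(spec=SimpleNamespace(ports=[port]))


def get_head_ssh_port(core_api, cluster_name, namespace):
    """Look up the head SSH service in the caller-supplied namespace."""
    svc_name = f'{cluster_name}-ray-head-ssh'
    service = core_api.read_namespaced_service(svc_name, namespace)
    return service.spec.ports[0].node_port


# The service exists only in the 'test' namespace, as in the report above.
api = FakeCoreApi({('test', 'clus-ray-head-ssh'): 30022})
print(get_head_ssh_port(api, 'clus', 'test'))  # 30022
```

Calling `get_head_ssh_port(api, 'clus', 'default')` raises the stand-in `LookupError`, mirroring the "not found" response in the traceback.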

romilbhardwaj merged commit b36fba4 into skypilot-org:k8s_cloud Jul 26, 2023
romilbhardwaj added a commit that referenced this pull request Aug 2, 2023
* Working Ray K8s node provider based on SSH

* wip

* working provisioning with SkyPilot and ssh config

* working provisioning with SkyPilot and ssh config

* Updates to master

* ray2.3

* Clean up docs

* multiarch build

* hacking around ray start

* more port fixes

* fix up default instance selection

* fix resource selection

* Add provisioning timeout by checking if pods are ready

* Working mounting

* Remove catalog

* fixes

* fixes

* Fix ssh-key auth to create unique secrets

* Fix for ContainerCreating timeout

* Fix head node ssh port caching

* mypy

* lint

* fix ports

* typo

* cleanup

* cleanup

* wip

* Update setup

* readme updates

* lint

* Fix failover

* Fix failover

* optimize setup

* Fix sync down logs for k8s

* test wip

* instance name parsing wip

* Fix instance name parsing

* Merge fixes for query_status

* [k8s_cloud] Delete k8s service resources. (#2105)

Delete k8s service resources.

- 'sky down' for Kubernetes cloud to remove cluster service resources.

* Status refresh WIP

* refactor to kubernetes adaptor

* tests wip

* clean up auth

* wip tests

* cli

* cli

* sky local up/down cli

* cli

* lint

* lint

* lint

* Speed up kind cluster creation

* tests

* lint

* tests

* handling for non-reachable clusters

* Invalid kubeconfig handling

* Timeout for sky check

* code cleanup

* lint

* Do not raise error if GPUs requested, return empty list

* Address comments

* comments

* lint

* Remove public key upload

* add shebang

* comments

* change permissions

* remove chmod

* merge 2241

* add todo

* Handle kube config management for sky local commands (#2253)

* Set current-context (if available) after sky local down and remove incorrect prompt in sky local up

* Warn user of kubeconfig context switch during sky local up

* Use Optional instead of Union

* Switch context in create_cluster if cluster already exists.

* fix typo

* update sky check error msg after sky local down

* lint

* update timeout check

* fix import error

* Fix kube API access from within cluster (load_incluster_auth)

* lint

* lint

* working autodown and sky status -r

* lint

* add test_kubernetes_autodown

* lint

* address comments

* address comments

* lint

* deletion timeouts wip

* [k8s_cloud] Ray pod not created under current context namespace. (#2302)

'namespace' exists under 'context' key.

* head ssh port namespace fix

* [k8s-cloud] Typo in sky local --help. (#2308)

Typo.

* [k8s-cloud] Set build_image.sh to be executable. (#2307)

* Set build_image.sh to be executable.

* Use TAG to easily switch between registries.

* remove ingress

* remove debug statements

* UX and readme updates

* lint

* fix logging for 409 retry

* lint

* lint

* comments

* remove k8s from default clouds to run

---------

Co-authored-by: Avi Weit <weit@il.ibm.com>
Co-authored-by: Hemil Desai <hemil.desai10@gmail.com>
romilbhardwaj added a commit that referenced this pull request Aug 25, 2023
* Working Ray K8s node provider based on SSH

* wip

* working provisioning with SkyPilot and ssh config

* working provisioning with SkyPilot and ssh config

* Updates to master

* ray2.3

* Clean up docs

* multiarch build

* hacking around ray start

* more port fixes

* fix up default instance selection

* fix resource selection

* Add provisioning timeout by checking if pods are ready

* Working mounting

* Remove catalog

* fixes

* fixes

* Fix ssh-key auth to create unique secrets

* Fix for ContainerCreating timeout

* Fix head node ssh port caching

* mypy

* lint

* fix ports

* typo

* cleanup

* cleanup

* wip

* Update setup

* readme updates

* lint

* Fix failover

* Fix failover

* optimize setup

* Fix sync down logs for k8s

* test wip

* instance name parsing wip

* Fix instance name parsing

* Merge fixes for query_status

* [k8s_cloud] Delete k8s service resources. (#2105)

Delete k8s service resources.

- 'sky down' for Kubernetes cloud to remove cluster service resources.

* Status refresh WIP

* refactor to kubernetes adaptor

* tests wip

* clean up auth

* wip tests

* cli

* cli

* sky local up/down cli

* cli

* lint

* lint

* lint

* Speed up kind cluster creation

* tests

* lint

* tests

* handling for non-reachable clusters

* Invalid kubeconfig handling

* Timeout for sky check

* code cleanup

* lint

* Do not raise error if GPUs requested, return empty list

* Address comments

* comments

* lint

* Remove public key upload

* GPU support init

* wip

* add shebang

* comments

* change permissions

* remove chmod

* merge 2241

* add todo

* Handle kube config management for sky local commands (#2253)

* Set current-context (if available) after sky local down and remove incorrect prompt in sky local up

* Warn user of kubeconfig context switch during sky local up

* Use Optional instead of Union

* Switch context in create_cluster if cluster already exists.

* fix typo

* update sky check error msg after sky local down

* lint

* update timeout check

* fix import error

* Fix kube API access from within cluster (load_incluster_auth)

* lint

* lint

* working autodown and sky status -r

* lint

* add test_kubernetes_autodown

* lint

* address comments

* address comments

* lint

* deletion timeouts wip

* [k8s_cloud] Ray pod not created under current context namespace. (#2302)

'namespace' exists under 'context' key.

* head ssh port namespace fix

* [k8s-cloud] Typo in sky local --help. (#2308)

Typo.

* [k8s-cloud] Set build_image.sh to be executable. (#2307)

* Set build_image.sh to be executable.

* Use TAG to easily switch between registries.

* remove ingress

* remove debug statements

* UX and readme updates

* lint

* fix logging for 409 retry

* lint

* lint

* Debug dockerfile

* wip

* Fix GPU image

* Query cloud specific env vars in task setup (#2347)

* Query cloud specific env vars in task setup

* Make query_env_vars specific to Kubernetes cloud

* Address PR comments

* working GPU type selection for GKE and EKS. GFD needs work.

* TODO for auto-detection

* Add image toggling for CPU/GPU

* Add image toggling for CPU/GPU

* Fix none acce_type

* remove memory from j2

* Make resnet examples run again

* lint

* v100 readme

* dockerfile and smoketest

* fractional cpu and mem

* nits

* refactor utils

* lint and cleanup

* lint and cleanup

* lint and cleanup

* lint and cleanup

* lint and cleanup

* lint and cleanup

* lint

* lint

* manual lint

* manual isort

* test readme update

* Remove EKS

* lint

* add gpu labeler

* updates

* lint

* update script

* ux

* fix formatter

* test update

* test update

* fix test_optimizer_dryruns

* docs

* cleanup

* test readme update

* lint

* lint

* Update imagepullpolicy to always

* update image build

* typing hints

* update docstr

* some comments

* refactor

* refactor

* lint

* lint

* update gke cmd

* update monkeypatch

* yapf

* comments

* increase default mem when GPU task

* increase default mem when GPU task

* fix test_optimize_speed

* lint

* Add CPU+Mem based early filtering and better debug logging

* lint

* fix test_optimizer

* fixes

* fix k8s port fetch logic

* clean up instance make logic

* increase default mem when GPU task

* lint

* clean up fit check logic

* add catalog todo

* eksctl update

* update readme and gpu_labeler with comments

* monkeypatch in enable_all_clouds

* change to T4

---------

Co-authored-by: Avi Weit <weit@il.ibm.com>
Co-authored-by: Hemil Desai <hemil.desai10@gmail.com>
romilbhardwaj added a commit that referenced this pull request Sep 16, 2023
* Working Ray K8s node provider based on SSH

* wip

* working provisioning with SkyPilot and ssh config

* working provisioning with SkyPilot and ssh config

* Updates to master

* ray2.3

* Clean up docs

* multiarch build

* hacking around ray start

* more port fixes

* fix up default instance selection

* fix resource selection

* Add provisioning timeout by checking if pods are ready

* Working mounting

* Remove catalog

* fixes

* fixes

* Fix ssh-key auth to create unique secrets

* Fix for ContainerCreating timeout

* Fix head node ssh port caching

* mypy

* lint

* fix ports

* typo

* cleanup

* cleanup

* wip

* Update setup

* readme updates

* lint

* Fix failover

* Fix failover

* optimize setup

* Fix sync down logs for k8s

* test wip

* instance name parsing wip

* Fix instance name parsing

* Merge fixes for query_status

* [k8s_cloud] Delete k8s service resources. (#2105)

Delete k8s service resources.

- 'sky down' for Kubernetes cloud to remove cluster service resources.

* Status refresh WIP

* refactor to kubernetes adaptor

* tests wip

* clean up auth

* wip tests

* cli

* cli

* sky local up/down cli

* cli

* lint

* lint

* lint

* Speed up kind cluster creation

* tests

* lint

* tests

* handling for non-reachable clusters

* Invalid kubeconfig handling

* Timeout for sky check

* code cleanup

* lint

* Do not raise error if GPUs requested, return empty list

* Address comments

* comments

* lint

* Remove public key upload

* GPU support init

* wip

* add shebang

* comments

* change permissions

* remove chmod

* merge 2241

* add todo

* Handle kube config management for sky local commands (#2253)

* Set current-context (if available) after sky local down and remove incorrect prompt in sky local up

* Warn user of kubeconfig context switch during sky local up

* Use Optional instead of Union

* Switch context in create_cluster if cluster already exists.

* fix typo

* update sky check error msg after sky local down

* lint

* update timeout check

* fix import error

* Fix kube API access from within cluster (load_incluster_auth)

* lint

* lint

* working autodown and sky status -r

* lint

* add test_kubernetes_autodown

* lint

* address comments

* address comments

* lint

* deletion timeouts wip

* [k8s_cloud] Ray pod not created under current context namespace. (#2302)

'namespace' exists under 'context' key.

* head ssh port namespace fix

* [k8s-cloud] Typo in sky local --help. (#2308)

Typo.

* [k8s-cloud] Set build_image.sh to be executable. (#2307)

* Set build_image.sh to be executable.

* Use TAG to easily switch between registries.

* remove ingress

* remove debug statements

* UX and readme updates

* lint

* fix logging for 409 retry

* lint

* lint

* Debug dockerfile

* wip

* Fix GPU image

* Query cloud specific env vars in task setup (#2347)

* Query cloud specific env vars in task setup

* Make query_env_vars specific to Kubernetes cloud

* Address PR comments

* working GPU type selection for GKE and EKS. GFD needs work.

* TODO for auto-detection

* Add image toggling for CPU/GPU

* Add image toggling for CPU/GPU

* Fix none acce_type

* remove memory from j2

* Make resnet examples run again

* lint

* v100 readme

* dockerfile and smoketest

* fractional cpu and mem

* nits

* refactor utils

* lint and cleanup

* lint and cleanup

* lint and cleanup

* lint and cleanup

* lint and cleanup

* lint and cleanup

* lint

* lint

* manual lint

* manual isort

* test readme update

* Remove EKS

* lint

* add gpu labeler

* updates

* lint

* update script

* ux

* fix formatter

* test update

* test update

* fix test_optimizer_dryruns

* docs

* cleanup

* test readme update

* lint

* lint

* [k8s_cloud_beta1] Add sshjump host support. (#2369)

* Update build image

* fix image path

* fix merge

* cleanup

* lint

* fix utils ref

* typo

* refactor pod creation

* lint

* merge fixes

* portfix

* merge fixes

* [k8s_cloud_beta1] Sky down for a cluster deployed in Kubernetes to possibly remove sshjump pod. (#2425)

* Sky down for a kubernetes cluster to possibly terminate sshjump pod.

- If the sshjump pod's main container is reported as not having
  started, remove its pod and service. This minimizes the chance
  of leaving a dangling sshjump pod.

* Remove sshjump service in case of a failure to analyze sshjump.

- remove _request_timeout as it might not be needed due to
  terminationGracePeriodSeconds being set in sshjump template.

* Move sshjump analysis to kubernetes_utils.

* Apply changes per ./format.sh.

* Minor comment rephrase.

* Use sshjump_name from ray pod label.

- rather than from clouds.Kubernetes

* cleanup

* Add networking benchmarks

* comment

* comment

* lint

* autodown fixes

* lint

* fix label

* [k8s_cloud_beta1] Adding support for ssh using kubectl port-forward to access k8s instance (#2412)

* Add sshjump support.

* Update lcm script.

- add comments
- rename variables
- typo

* Set imagePullPolicy to IfNotPresent.

* add support for port-forward

* remove unused

* comments

* Disable ControlMaster for ssh_options_list

* nit

* update to disable rest of the ControlMaster

* command runner rsync update

* relocating run_on_k8s

* relocate run_on_k8s

* Make Kubernetes specific env variables available when joining a cluster via SSH

* merge k8s_cloud_beta1

* format

* remove redundant utils.py

* format and comments

* update with proxy_to_k8s

* Update sky/authentication.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* resolving comments on structures

* Update sky/utils/command_runner.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* document on nodeport/port-forward proxycommand

* error handling when socat is not installed

* removing KUBECONFIG from port-forward shell script

* nit

* nit

* Add support for nodeport

* Update sky/utils/kubernetes_utils.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* update

* switch svc when conflicting jump pod svc exist

* format

* Update sky/utils/kubernetes_utils.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* refactoring check for socat

* resolve comments

* add ServiceType enum and port-forward proxy script

* update k8s env var access

* add check for container status remove unused func

* nit

* update get_external_ip for portforward mode

* conditionally use sudo and quote values of env var

---------

Co-authored-by: Avi Weit <weit@il.ibm.com>
Co-authored-by: hemildesai <hemil.desai10@gmail.com>
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* refactor

* fix

* updates

* lint

* Update sky/skylet/providers/kubernetes/node_provider.py

* fix test

* [k8s] Showing reasons for provisioning failure in K8s (#2422)

* surface provision failure message

* nit

* nit

* format

* nit

* CPU message fix

* update Insufficient memory handling

* nit

* nit

* Update sky/skylet/providers/kubernetes/node_provider.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* Update sky/skylet/providers/kubernetes/node_provider.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* Update sky/skylet/providers/kubernetes/node_provider.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* Update sky/skylet/providers/kubernetes/node_provider.py

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* format

* update gpu failure message and condition

* fix GPU handling cases

* fix

* comment

* nit

* add try except block with general error handling

---------

Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>

* cleanup

* lint

* fix for ssh jump image_id

* comments

* ssh jump refactor

* lint

* image build fixes

---------

Co-authored-by: Avi Weit <weit@il.ibm.com>
Co-authored-by: Hemil Desai <hemil.desai10@gmail.com>
Co-authored-by: Doyoung Kim <34902420+landscapepainter@users.noreply.github.com>