Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skypilot with Kubernetes #3033

Merged
merged 23 commits into from
Sep 25, 2024
Merged
Show file tree
Hide file tree
Changes from 17 commits
Commits
Show all changes
23 commits
Select commit Hold shift + click to select a range
c5aead6
Add Skypilot Kubernetes integration for remote orchestration
safoinme Sep 21, 2024
342fa34
Add Skypilot Kubernetes integration for remote orchestration
safoinme Sep 21, 2024
7590584
Refactor SkypilotKubernetesOrchestrator prepare_environment_variable …
safoinme Sep 21, 2024
03951d8
Refactor SkypilotKubernetesOrchestrator prepare_environment_variable …
safoinme Sep 21, 2024
89718f0
Refactor SkypilotKubernetesOrchestrator prepare_environment_variable …
safoinme Sep 22, 2024
775253f
Refactor SkypilotBaseOrchestrator setup_credentials method
safoinme Sep 22, 2024
4b42940
Refactor SkypilotBaseOrchestrator setup_credentials method and Skypil…
safoinme Sep 22, 2024
305322e
Refactor SkypilotBaseOrchestrator setup_credentials method and Skypil…
safoinme Sep 22, 2024
edae0e8
Update src/zenml/integrations/skypilot/orchestrators/skypilot_base_vm…
safoinme Sep 23, 2024
2c68fa6
Merge branch 'develop' into feature/skypilot-kubernetes
safoinme Sep 23, 2024
38922b2
Merge branch 'feature/skypilot-kubernetes' of https://github.com/zenm…
safoinme Sep 23, 2024
364f8ac
Merge branch 'develop' into feature/skypilot-kubernetes
safoinme Sep 23, 2024
85e7f63
Refactor SkypilotKubernetesOrchestrator to handle missing service con…
safoinme Sep 23, 2024
2c421b0
Merge branch 'feature/skypilot-kubernetes' of https://github.com/zenm…
safoinme Sep 23, 2024
f4639f6
Refactor SkypilotBaseOrchestrator to use the correct virtual environm…
safoinme Sep 24, 2024
46194cc
specll check
safoinme Sep 24, 2024
abf58c4
Merge branch 'develop' into feature/skypilot-kubernetes
safoinme Sep 24, 2024
b3a4520
fix run command
safoinme Sep 24, 2024
1eef87a
Merge branch 'feature/skypilot-kubernetes' of https://github.com/zenm…
safoinme Sep 24, 2024
615c083
Refactor SkypilotKubernetesVMOrchestrator to remove unnecessary code
safoinme Sep 24, 2024
5b48068
Merge branch 'develop' into feature/skypilot-kubernetes
safoinme Sep 24, 2024
98973a6
Refactor SkypilotKubernetesVMOrchestrator to remove unnecessary code
safoinme Sep 24, 2024
126eefb
Merge branch 'feature/skypilot-kubernetes' of https://github.com/zenm…
safoinme Sep 24, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
77 changes: 77 additions & 0 deletions docs/book/component-guide/orchestrators/skypilot-vm.md
Original file line number Diff line number Diff line change
Expand Up @@ -242,6 +242,55 @@ The Lambda Labs orchestrator does not support some of the features like `job_rec
While testing the orchestrator, we noticed that the Lambda Labs orchestrator does not support the `down` flag. This means the orchestrator will not automatically tear down the cluster after all jobs finish. We recommend manually tearing down the cluster after all jobs finish to avoid unnecessary costs.
{% endhint %}
{% endtab %}

{% tab title="Kubernetes" %}
We need first to install the SkyPilot integration for Kubernetes, using the following two commands:

```shell
zenml integration install skypilot_kubernetes
```

To provision skypilot on kubernetes cluster, your orchestrator stack components needs to be configured to authenticate with a
[Service Connector](../../how-to/auth-management/service-connectors-guide.md). To configure the Service Connector, you need to register a new service connector configured with the appropriate credentials and permissions to access the K8s cluster. You can then use the service connector to configure your registered the Orchestrator stack component using the following command:

First, check that the Kubernetes service connector type is available using the following command:

```shell
zenml service-connector list-types --type kubernetes
```
```shell
┏━━━━━━━━━━━━┯━━━━━━━━━━━━┯━━━━━━━━━━━━┯━━━━━━━━━━━┯━━━━━━━┯━━━━━━━━┓
┃ │ │ RESOURCE │ AUTH │ │ ┃
┃ NAME │ TYPE │ TYPES │ METHODS │ LOCAL │ REMOTE ┃
┠────────────┼────────────┼────────────┼───────────┼───────┼────────┨
┃ Kubernetes │ 🌀 │ 🌀 │ password │ ✅ │ ✅ ┃
┃ Service │ kubernetes │ kubernetes │ token │ │ ┃
┃ Connector │ │ -cluster │ │ │ ┃
┗━━━━━━━━━━━━┷━━━━━━━━━━━━┷━━━━━━━━━━━━┷━━━━━━━━━━━┷━━━━━━━┷━━━━━━━━┛
```

Next, configure a service connector using the CLI or the dashboard with the AWS credentials. For example, the following command uses the local AWS CLI credentials to auto-configure the service connector:

```shell
zenml service-connector register kubernetes-skypilot --type kubernetes -i
```

This will automatically configure the service connector with the appropriate credentials and permissions to provision VMs on AWS. You can then use the service connector to configure your registered VM Orchestrator stack component using the following command:

```shell
# Register the orchestrator
zenml orchestrator register <ORCHESTRATOR_NAME> --flavor sky_kubernetes
# Connect the orchestrator to the service connector
zenml orchestrator connect <ORCHESTRATOR_NAME> --connector kubernetes-skypilot

# Register and activate a stack with the new orchestrator
zenml stack register <STACK_NAME> -o <ORCHESTRATOR_NAME> ... --set
```

{% hint style="warning" %}
Some of the features like `job_recovery`, `disk_tier`, `image_id`, `zone`, `idle_minutes_to_autostop`, `disk_size`, `use_spot` are not supported by the Kubernetes orchestrator. It is recommended not to use these features with the Kubernetes orchestrator and not to use [step-specific settings](skypilot-vm.md#configuring-step-specific-resources).
{% endhint %}
{% endtab %}
{% endtabs %}

#### Additional Configuration
Expand Down Expand Up @@ -392,6 +441,34 @@ skypilot_settings = SkypilotLambdaOrchestratorSettings(
)


@pipeline(
settings={
"orchestrator": skypilot_settings
}
)
```
{% endtab %}

{% tab title="Kubernetes" %}

**Code Example:**

```python
from zenml.integrations.skypilot_kubernetes.flavors.skypilot_orchestrator_kubernetes_vm_flavor import SkypilotKubernetesOrchestratorSettings

skypilot_settings = SkypilotKubernetesOrchestratorSettings(
cpus="2",
memory="16",
accelerators="V100:2",
image_id="ami-1234567890abcdef0",
disk_size=100,
cluster_name="my_cluster",
retry_until_up=True,
stream_logs=True
docker_run_args=["--gpus=all"]
)


@pipeline(
settings={
"orchestrator": skypilot_settings
Expand Down
1 change: 1 addition & 0 deletions src/zenml/integrations/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -69,6 +69,7 @@
from zenml.integrations.skypilot_gcp import SkypilotGCPIntegration # noqa
from zenml.integrations.skypilot_azure import SkypilotAzureIntegration # noqa
from zenml.integrations.skypilot_lambda import SkypilotLambdaIntegration # noqa
from zenml.integrations.skypilot_kubernetes import SkypilotKubernetesIntegration # noqa
from zenml.integrations.slack import SlackIntegration # noqa
from zenml.integrations.spark import SparkIntegration # noqa
from zenml.integrations.tekton import TektonIntegration # noqa
Expand Down
1 change: 1 addition & 0 deletions src/zenml/integrations/constants.py
Original file line number Diff line number Diff line change
Expand Up @@ -64,6 +64,7 @@
SKYPILOT_GCP = "skypilot_gcp"
SKYPILOT_AZURE = "skypilot_azure"
SKYPILOT_LAMBDA = "skypilot_lambda"
SKYPILOT_KUBERNETES = "skypilot_kubernetes"
SLACK = "slack"
SPARK = "spark"
TEKTON = "tekton"
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -250,6 +250,7 @@ def prepare_or_run_pipeline(
entrypoint_str = " ".join(command)
arguments_str = " ".join(args)

task_envs = environment
docker_environment_str = " ".join(
f"-e {k}={v}" for k, v in environment.items()
)
Expand All @@ -271,29 +272,33 @@ def prepare_or_run_pipeline(
f"sudo docker login --username $DOCKER_USERNAME --password "
f"$DOCKER_PASSWORD {stack.container_registry.config.uri}"
)
task_envs = {
"DOCKER_USERNAME": docker_username,
"DOCKER_PASSWORD": docker_password,
}
task_envs["DOCKER_USERNAME"] = docker_username
task_envs["DOCKER_PASSWORD"] = docker_password
else:
setup = None
task_envs = None

# Run the entire pipeline

# Set the service connector AWS profile ENV variable
self.prepare_environment_variable(set=True)

try:
if isinstance(self.cloud, sky.clouds.Kubernetes):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might seem mostly cosmetic, but I'll say it anyway: you're adding implementation specific conditions in the base class, which is supposed to be implementation agnostic. The correct way to do this is through static attributes, properties or methods that the implementations can override.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well i think i want to re-work the whole Skypilot integration if possible so instead of have a base abstraction and many sub integration which blocks one of the most important features of skypilot cross-cloud/env failover, tho I totally agree with your comment

run_command = f"$VIRTUAL_ENV{entrypoint_str} {arguments_str}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this doesn't work right, $VIRTUAL_ENV is for example /opt/venv and the python is in /opt/venv/bin/python. So this needs the bin directory to work, and then also a fallback in case $VIRTUAL_ENV is not set.

And we need to make sure this is actually working

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this is important @safoinme

setup = None
down = False
idle_minutes_to_autostop = None
safoinme marked this conversation as resolved.
Show resolved Hide resolved
else:
run_command = f"sudo docker run --rm {custom_run_args}{docker_environment_str} {image} {entrypoint_str} {arguments_str}"
down = settings.down
idle_minutes_to_autostop = settings.idle_minutes_to_autostop
task = sky.Task(
run=f"sudo docker run --rm {custom_run_args}{docker_environment_str} {image} {entrypoint_str} {arguments_str}",
run=run_command,
setup=setup,
envs=task_envs,
)
logger.debug(
f"Running run: sudo docker run --rm {custom_run_args}{docker_environment_str} {image} {entrypoint_str} {arguments_str}"
)
logger.debug(f"Running run: {setup}")
logger.debug(f"Running run: {run_command}")

task = task.set_resources(
sky.Resources(
cloud=self.cloud,
Expand All @@ -306,15 +311,24 @@ def prepare_or_run_pipeline(
job_recovery=settings.job_recovery,
region=settings.region,
zone=settings.zone,
image_id=settings.image_id,
image_id=image
if isinstance(self.cloud, sky.clouds.Kubernetes)
else settings.image_id,
disk_size=settings.disk_size,
disk_tier=settings.disk_tier,
)
)

# Set the cluster name
cluster_name = settings.cluster_name
if cluster_name is None:
if settings.cluster_name:
sky.exec(
task,
settings.cluster_name,
down=down,
stream_logs=settings.stream_logs,
backend=None,
detach_run=True,
)
else:
# Find existing cluster
for i in sky.status(refresh=True):
if isinstance(
Expand All @@ -324,21 +338,19 @@ def prepare_or_run_pipeline(
logger.info(
f"Found existing cluster {cluster_name}. Reusing..."
)
if cluster_name is None:
cluster_name = self.sanitize_cluster_name(
f"{orchestrator_run_name}"
)

# Launch the cluster
sky.launch(
task,
cluster_name,
retry_until_up=settings.retry_until_up,
idle_minutes_to_autostop=settings.idle_minutes_to_autostop,
down=settings.down,
stream_logs=settings.stream_logs,
detach_setup=True,
)
# Launch the cluster
sky.launch(
task,
cluster_name,
retry_until_up=settings.retry_until_up,
idle_minutes_to_autostop=idle_minutes_to_autostop,
down=down,
stream_logs=settings.stream_logs,
detach_setup=True,
)

except Exception as e:
logger.error(f"Pipeline run failed: {e}")
Expand Down
52 changes: 52 additions & 0 deletions src/zenml/integrations/skypilot_kubernetes/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
# Copyright (c) ZenML GmbH 2024. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at:
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
# or implied. See the License for the specific language governing
# permissions and limitations under the License.
"""Initialization of the Skypilot Kubernetes integration for ZenML.

The Skypilot integration sub-module powers an alternative to the local
orchestrator for a remote orchestration of ZenML pipelines on VMs.
"""
from typing import List, Type

from zenml.integrations.constants import (
SKYPILOT_KUBERNETES,
)
from zenml.integrations.integration import Integration
from zenml.stack import Flavor

SKYPILOT_KUBERNETES_ORCHESTRATOR_FLAVOR = "vm_kubernetes"


class SkypilotKubernetesIntegration(Integration):
"""Definition of Skypilot Kubernetes Integration for ZenML."""

NAME = SKYPILOT_KUBERNETES
# all 0.6.x versions of skypilot[kubernetes] are compatible
REQUIREMENTS = ["skypilot[kubernetes]~=0.6.1"]
APT_PACKAGES = ["openssh-client", "rsync"]

@classmethod
def flavors(cls) -> List[Type[Flavor]]:
"""Declare the stack component flavors for the Skypilot Kubernetes integration.

Returns:
List of stack component flavors for this integration.
"""
from zenml.integrations.skypilot_kubernetes.flavors import (
SkypilotKubernetesOrchestratorFlavor,
)

return [SkypilotKubernetesOrchestratorFlavor]


SkypilotKubernetesIntegration.check_installation()
26 changes: 26 additions & 0 deletions src/zenml/integrations/skypilot_kubernetes/flavors/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
# Copyright (c) ZenML GmbH 2024. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at:
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
# or implied. See the License for the specific language governing
# permissions and limitations under the License.
"""Skypilot integration flavor for Skypilot Kubernetes orchestrator."""

from zenml.integrations.skypilot_kubernetes.flavors.skypilot_orchestrator_kubernetes_vm_flavor import (
SkypilotKubernetesOrchestratorConfig,
SkypilotKubernetesOrchestratorFlavor,
SkypilotKubernetesOrchestratorSettings,
)

__all__ = [
"SkypilotKubernetesOrchestratorConfig",
"SkypilotKubernetesOrchestratorFlavor",
"SkypilotKubernetesOrchestratorSettings",
]
Loading
Loading