Databricks integrations (#2823)

* Add Databricks integration to ZenML
* Add entrypoint and wheeled orchestrator classes
* Add environment variable support to entrypoint.py and base_step.py
* Update Databricks integration requirements

  The Databricks integration now requires the "databricks-sdk" package instead of "workflows_authoring_toolkit".
* refactor: Remove unnecessary code in DatabricksIntegration class
* Update Databricks integration
* Update DatabricksOrchestrator to use empty strings for host, client_id, and client_secret
* Auto-update of LLM Finetuning template
* Refactor DatabricksOrchestratorConfig to use empty strings for host, client_id, and client_secret
* chore: Remove hardcoding of ZenML library in Databricks orchestrator utils
* Update Databricks orchestrator utils to remove hardcoding of ZenML library
* databricks deployer
* Update Databricks orchestrator utils to use correct import path
* Update Databricks integration
* Update Databricks integration
* Update Databricks integration
* format
* update host
* update pipeline_name
* Update Databricks orchestrator to use orchestrator_run_name instead of pipeline_name
* Update Databricks orchestrator to use orchestrator_run_name instead of pipeline_name
* Refactor DatabricksDeploymentService to use Client from zenml.client module
* add mlflow example as databricks example to delete later
* format
* update demo
* update demo
* update demo
* Auto-update of E2E template
* update demo
* Auto-update of E2E template
* update demo
* update demo and orchetrator
* remove logs
* rename demo
* rename demo
* rename demo
* Fixed info log
* Fixed smaller log messages
* Update DatabricksDeploymentService to handle missing endpoint secret name and token
* Simplify DatabricksOrchestrator environment variable handling
* Update wheeled_orchestrator.py with logger for wheel creation output
* update how wheel are run
* update load source
* Refactor DatabricksOrchestrator environment variable handling
* Refactor DatabricksOrchestrator environment variable handling
* Refactor DatabricksOrchestrator environment variable handling
* test env varibale
* Refactor DatabricksOrchestrator environment variable handling
* Remove unused job parameter definition in DatabricksOrchestrator
* Refactor DatabricksOrchestrator environment variable handling
* fix mypy and add docs
* Optimised images with calibre/image-actions
* Refactor DatabricksOrchestrator error message for job id retrieval
* Refactor DatabricksOrchestrator error message for job id retrieval
* update based on reviews
* Optimised images with calibre/image-actions
* add diagram
* Optimised images with calibre/image-actions
* Update docs/book/component-guide/model-deployers/databricks.md

  Co-authored-by: Hamza Tahir <hamza@zenml.io>
* remove demo and fix mypy
* add databricks to docs
* add alpha tag to databricks
* update logos

---------

Co-authored-by: AlexejPenner <thealexejpenner@gmail.com>
Co-authored-by: GitHub Actions <actions@github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Hamza Tahir <hamza@zenml.io>
zenml-io · Jul 16, 2024 · ef66cc0
1 parent 9be9fcd
commit ef66cc0
Showing 27 changed files with 2,258 additions and 2 deletions.
diff --git a/docs/book/.gitbook/assets/DatabricksPermessions.png b/docs/book/.gitbook/assets/DatabricksPermessions.png
diff --git a/docs/book/.gitbook/assets/DatabricksRunUI.png b/docs/book/.gitbook/assets/DatabricksRunUI.png
diff --git a/docs/book/.gitbook/assets/DatabricksUI.png b/docs/book/.gitbook/assets/DatabricksUI.png
diff --git a/docs/book/.gitbook/assets/Databricks_How_It_works.png b/docs/book/.gitbook/assets/Databricks_How_It_works.png
diff --git a/docs/book/component-guide/model-deployers/databricks.md b/docs/book/component-guide/model-deployers/databricks.md
@@ -0,0 +1,148 @@
+---
+description: >-
+  Deploying models to Databricks Inference Endpoints with Databricks
+---
+
+# Databricks
+
+
+Databricks Model Serving or Mosaic AI Model Serving provides a unified interface to deploy, govern, and query AI models. Each model you serve is available as a REST API that you can integrate into your web or client application.
+
+This service provides dedicated and autoscaling infrastructure managed by Databricks, allowing you to deploy models without dealing with containers and GPUs.
+
+
+{% hint style="info" %}
+The Databricks Model Deployer can be considered a managed service for deploying models with MLflow. This means you can switch between the MLflow and Databricks Model Deployers without changing your pipeline code, even for complex custom models.
+{% endhint %}
+
+## When to use it?
+
+You should use the Databricks Model Deployer if:
+
+*   you are already using Databricks for your data and ML workloads.
+*   you want to deploy AI models without dealing with containers and GPUs, through a unified interface to deploy, govern, and query models.
+*   you want dedicated and autoscaling infrastructure managed by Databricks, which makes it easier to deploy models at scale.
+*   enterprise security is a priority, and you need to deploy models into secure offline endpoints accessible only via a direct connection to your Virtual Private Cloud (VPC).
+*   your goal is to turn your models into production-ready APIs with minimal infrastructure or MLOps involvement.
+
+
+If you are looking for an easier way to deploy your models locally, you can use the [MLflow Model Deployer](mlflow.md) flavor.
+
+## How to deploy it?
+
+The Databricks Model Deployer flavor is provided by the Databricks ZenML integration, so you need to install it on your local machine to be able to deploy your models. You can do this by running the following command:
+
+```bash
+zenml integration install databricks -y
+```
+
+To register the Databricks model deployer with ZenML you need to run the following command:
+
+```bash
+zenml model-deployer register <MODEL_DEPLOYER_NAME> --flavor=databricks --host=<HOST> --client_id={{databricks.client_id}} --client_secret={{databricks.client_secret}}
+```
+
+{% hint style="info" %}
+We recommend creating a Databricks service account with the necessary permissions to create and run jobs. You can find more information on how to create a service account [here](https://docs.databricks.com/dev-tools/api/latest/authentication.html). You can generate a client_id and client_secret for the service account and use them to authenticate with Databricks.
+{% endhint %}
+
+We can now use the model deployer in our stack.
+
+```bash
+zenml stack update <CUSTOM_STACK_NAME> --model-deployer=<MODEL_DEPLOYER_NAME>
+```
+
+See the [databricks\_model\_deployer\_step](https://sdkdocs.zenml.io/latest/integration\_code\_docs/integrations-databricks/#zenml.integrations.databricks.steps.databricks\_deployer.databricks\_model\_deployer\_step) for an example of using the Databricks Model Deployer to deploy a model inside a ZenML pipeline step.
+
+## Configuration
+
+Within the `DatabricksServiceConfig` you can configure:
+
+
+* `model_name`: The name of the model that will be served. This is used to identify the model in the Databricks Model Registry.
+* `model_version`: The version of the model that will be served. This is used to identify the model version in the Databricks Model Registry.
+* `workload_size`: The size of the workload that the model will be serving. This can be `ServedModelInputWorkloadSize.SMALL`, `ServedModelInputWorkloadSize.MEDIUM`, or `ServedModelInputWorkloadSize.LARGE`. You can import this enum with `from databricks.sdk.service.serving import ServedModelInputWorkloadSize`.
+* `scale_to_zero_enabled`: A boolean flag to enable or disable the scale to zero feature.
+* `env_vars`: A dictionary of environment variables to be passed to the model serving container.
+* `workload_type`: The type of workload that the model will be serving. This can be `ServedModelInputWorkloadType.CPU`, `ServedModelInputWorkloadType.GPU_LARGE`, `ServedModelInputWorkloadType.GPU_MEDIUM`, `ServedModelInputWorkloadType.GPU_SMALL`, or `ServedModelInputWorkloadType.MULTIGPU_MEDIUM`. You can import this enum with `from databricks.sdk.service.serving import ServedModelInputWorkloadType`.
+* `endpoint_secret_name`: The name of the secret that will be used to secure the endpoint and authenticate requests.
+
+For more information and a full list of configurable attributes of the Databricks Model Deployer, check out the [SDK Docs](https://sdkdocs.zenml.io/latest/integration\_code\_docs/integrations-databricks/#zenml.integrations.databricks.model\_deployers) and Databricks endpoint [code](https://github.com/databricks/databricks\_hub/blob/5e3b603ccc7cd6523d998e75f82848215abf9415/src/databricks\_hub/hf\_api.py#L6957).
+
+### Run inference on a provisioned inference endpoint
+
+The following code example shows how to run inference against a provisioned inference endpoint:
+
+```python
+from typing import Annotated
+from zenml import step, pipeline
+from zenml.integrations.databricks.model_deployers import DatabricksModelDeployer
+from zenml.integrations.databricks.services import DatabricksDeploymentService
+
+
+# Load a prediction service deployed in another pipeline
+@step(enable_cache=False)
+def prediction_service_loader(
+    pipeline_name: str,
+    pipeline_step_name: str,
+    running: bool = True,
+    model_name: str = "default",
+) -> DatabricksDeploymentService:
+    """Get the prediction service started by the deployment pipeline.
+
+    Args:
+        pipeline_name: name of the pipeline that deployed the Databricks
+            inference endpoint
+        pipeline_step_name: the name of the step that deployed the Databricks
+            inference endpoint
+        running: when this flag is set, the step only returns a running service
+        model_name: the name of the model that is deployed
+    """
+    # get the Databricks model deployer stack component
+    model_deployer = DatabricksModelDeployer.get_active_model_deployer()
+
+    # fetch existing services with same pipeline name, step name and model name
+    existing_services = model_deployer.find_model_server(
+        pipeline_name=pipeline_name,
+        pipeline_step_name=pipeline_step_name,
+        model_name=model_name,
+        running=running,
+    )
+
+    if not existing_services:
+        raise RuntimeError(
+            f"No Databricks inference endpoint deployed by step "
+            f"'{pipeline_step_name}' in pipeline '{pipeline_name}' with name "
+            f"'{model_name}' is currently running."
+        )
+
+    return existing_services[0]
+
+
+# Use the service for inference
+@step
+def predictor(
+    service: DatabricksDeploymentService,
+    data: str
+) -> Annotated[str, "predictions"]:
+    """Run an inference request against a prediction service."""
+
+    prediction = service.predict(data)
+    return prediction
+
+
+@pipeline
+def databricks_deployment_inference_pipeline(
+    pipeline_name: str, pipeline_step_name: str = "databricks_model_deployer_step",
+):
+    inference_data = ...
+    model_deployment_service = prediction_service_loader(
+        pipeline_name=pipeline_name,
+        pipeline_step_name=pipeline_step_name,
+    )
+    predictions = predictor(model_deployment_service, inference_data)
+```
+
+For more information and a full list of configurable attributes of the Databricks Model Deployer, check out the [SDK Docs](https://sdkdocs.zenml.io/latest/integration\_code\_docs/integrations-databricks/#zenml.integrations.databricks.model\_deployers).
+
+<figure><img src="https://static.scarf.sh/a.png?x-pxid=f0b4f458-0a54-4fcd-aa95-d5ee424815bc" alt="ZenML Scarf"><figcaption></figcaption></figure>
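
Note on the model deployer guide above (an illustration, not part of this commit): the guide says each served model is available as a REST API. As a rough sketch of what a direct call outside of ZenML could look like, the snippet below only builds an HTTP request and does not send it. It assumes the Databricks Model Serving convention of posting JSON to `/serving-endpoints/<endpoint-name>/invocations` with a bearer token. The host, endpoint name, token, and payload are placeholders, and `build_invocation_request` is a hypothetical helper, not a ZenML or Databricks SDK function.

```python
import json
import urllib.request


def build_invocation_request(host, endpoint_name, token, payload):
    """Build (but do not send) a POST request for a model serving endpoint.

    Hypothetical helper for illustration; the URL layout is an assumption
    based on the Databricks Model Serving REST convention.
    """
    url = f"{host.rstrip('/')}/serving-endpoints/{endpoint_name}/invocations"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_invocation_request(
    host="https://example.cloud.databricks.com",  # placeholder workspace URL
    endpoint_name="my-endpoint",  # placeholder endpoint name
    token="dapi-placeholder",  # placeholder token, never hardcode real ones
    payload={"inputs": ["some text"]},  # payload shape depends on the model
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would actually send it. Inside a ZenML pipeline, `service.predict(data)` from the guide above is the intended route.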
diff --git a/docs/book/component-guide/orchestrators/databricks.md b/docs/book/component-guide/orchestrators/databricks.md
@@ -0,0 +1,192 @@
+---
+description: Orchestrating your pipelines to run on Databricks.
+---
+
+# Databricks Orchestrator
+
+[Databricks](https://www.databricks.com/) is a unified data analytics platform that combines the best of data warehouses and data lakes to offer an integrated solution for big data processing and machine learning. It provides a collaborative environment for data scientists, data engineers, and business analysts to work together on data projects. Databricks offers optimized performance and scalability for big data workloads.
+
+The Databricks orchestrator is an orchestrator flavor provided by the ZenML databricks integration that allows you to run your pipelines on Databricks. This integration enables you to leverage Databricks' powerful distributed computing capabilities and optimized environment for your ML pipelines within the ZenML framework.
+
+{% hint style="warning" %}
+The following features are currently in Alpha and may be subject to change. We recommend using them in a controlled environment and providing feedback to the ZenML team.
+{% endhint %}
+
+### When to use it
+
+You should use the Databricks orchestrator if:
+
+* you're already using Databricks for your data and ML workloads.
+* you want to leverage Databricks' powerful distributed computing capabilities for your ML pipelines.
+* you're looking for a managed solution that integrates well with other Databricks services.
+* you want to take advantage of Databricks' optimization for big data processing and machine learning.
+
+### Prerequisites
+
+You will need the following to start using the Databricks orchestrator:
+
+* An active Databricks workspace. How you create one depends on the cloud provider you are using; you can find more information here:
+    * [AWS](https://docs.databricks.com/en/getting-started/onboarding-account.html)
+    * [Azure](https://learn.microsoft.com/en-us/azure/databricks/getting-started/#--create-an-azure-databricks-workspace)
+    * [GCP](https://docs.gcp.databricks.com/en/getting-started/index.html)
+* An active Databricks account or service account with sufficient permissions to create and run jobs
+
+## How it works
+
+
+![Databricks How It works Diagram](../../.gitbook/assets/Databricks_How_It_works.png)
+
+The Databricks orchestrator in ZenML leverages the concept of Wheel Packages. When you run a pipeline with the Databricks orchestrator, ZenML creates a Python wheel package from your project. This wheel package contains all the necessary code and dependencies for your pipeline.
+
+Once the wheel package is created, ZenML uploads it to Databricks. ZenML then uses the Databricks SDK to create a job definition. This job definition includes information about the pipeline steps and ensures that each step is executed only after its upstream steps have successfully completed.
+
+The Databricks job is also configured with the necessary cluster settings to run. This includes specifying the version of Spark to use, the number of workers, the node type, and other configuration options.
+
+When the Databricks job is executed, it retrieves the wheel package from Databricks and runs the pipeline using the specified cluster configuration. The job ensures that the steps are executed in the correct order based on their dependencies.
+
+Once the job is completed, ZenML retrieves the logs and status of the job and updates the pipeline run accordingly. This allows you to monitor the progress of your pipeline and view the logs of each step.
+
+
+### How to use it
+
+To use the Databricks orchestrator, you first need to register it and add it to your stack. Before registering the orchestrator, you need to install the Databricks integration by running the following command:
+
+```shell
+zenml integration install databricks
+```
+
+This command will install the necessary dependencies, including the `databricks-sdk` package, which is required for authentication with Databricks. Once the integration is installed, you can proceed with registering the orchestrator and configuring the necessary authentication details.
+
+Then, we can register the orchestrator and use it in our active stack:
+
+```shell
+zenml orchestrator register databricks_orchestrator --flavor=databricks --host="https://xxxxx.x.azuredatabricks.net" --client_id={{databricks.client_id}} --client_secret={{databricks.client_secret}}
+```
+
+{% hint style="info" %}
+We recommend creating a Databricks service account with the necessary permissions to create and run jobs. You can find more information on how to create a service account [here](https://docs.databricks.com/dev-tools/api/latest/authentication.html). You can generate a client_id and client_secret for the service account and use them to authenticate with Databricks.
+
+![Databricks Service Account Permissions](../../.gitbook/assets/DatabricksPermessions.png)
+{% endhint %}
+
+```shell
+# Add the orchestrator to your stack
+zenml stack register databricks_stack -o databricks_orchestrator ... --set
+```
+
+You can now run any ZenML pipeline using the Databricks orchestrator:
+
+```shell
+python run.py
+```
+
+### Databricks UI
+
+Databricks comes with its own UI that you can use to find further details about your pipeline runs, such as the logs of your steps.
+
+![Databricks UI](../../.gitbook/assets/DatabricksUI.png)
+
+For any runs executed on Databricks, you can get the URL to the Databricks UI in Python using the following code snippet:
+
+```python
+from zenml.client import Client
+
+pipeline_run = Client().get_pipeline_run("<PIPELINE_RUN_NAME>")
+orchestrator_url = pipeline_run.run_metadata["orchestrator_url"].value
+```
+
+![Databricks Run UI](../../.gitbook/assets/DatabricksRunUI.png)
+
+
+### Run pipelines on a schedule
+
+The Databricks orchestrator supports running pipelines on a schedule using its [native scheduling capability](https://docs.databricks.com/en/workflows/jobs/schedule-jobs.html).
+
+**How to schedule a pipeline**
+
+```python
+from zenml.config.schedule import Schedule
+
+# Run a pipeline every 5th minute
+pipeline_instance.run(
+    schedule=Schedule(
+        cron_expression="*/5 * * * *"
+    )
+)
+```
+
+{% hint style="warning" %}
+The Databricks orchestrator only supports the `cron_expression` parameter of the `Schedule` object and will ignore all other parameters supplied to define the schedule.
+{% endhint %}
+
+{% hint style="warning" %}
+The Databricks orchestrator requires a Java timezone ID to be used together with the `cron_expression`. You can find a list of supported timezones [here](https://docs.oracle.com/middleware/1221/wcs/tag-ref/MISC/TimeZones.html). The timezone ID must be set in the settings of the orchestrator (see below for more information on how to set settings for the orchestrator).
+{% endhint %}
+
+**How to delete a scheduled pipeline**
+
+Note that ZenML only gets involved to schedule a run, but maintaining the lifecycle of the schedule is the responsibility of the user.
+
+In order to cancel a scheduled Databricks pipeline, you need to manually delete the schedule in Databricks (via the UI or the CLI).
+
+### Additional configuration
+
+For additional configuration of the Databricks orchestrator, you can pass `DatabricksOrchestratorSettings` which allows you to change the Spark version, number of workers, node type, autoscale settings, Spark configuration, Spark environment variables, and schedule timezone.
+
+```python
+from zenml.integrations.databricks.flavors.databricks_orchestrator_flavor import DatabricksOrchestratorSettings
+
+databricks_settings = DatabricksOrchestratorSettings(
+    spark_version="15.3.x-scala2.12",
+    num_workers="3",
+    node_type_id="Standard_D4s_v5",
+    policy_id=POLICY_ID,
+    autoscale=(2, 3),
+    spark_conf={},
+    spark_env_vars={},
+    schedule_timezone="America/Los_Angeles",  # or "PST"; you can get the timezone ID from here: https://docs.oracle.com/middleware/1221/wcs/tag-ref/MISC/TimeZones.html
+)
+```
+
+These settings can then be specified on either pipeline-level or step-level:
+
+```python
+# Either specify on pipeline-level
+@pipeline(
+    settings={
+        "orchestrator.databricks": databricks_settings,
+    }
+)
+def my_pipeline():
+    ...
+```
+
+We can also enable GPU support for the Databricks orchestrator by changing the `spark_version` and `node_type_id` to a GPU-enabled version and node type:
+
+```python
+from zenml.integrations.databricks.flavors.databricks_orchestrator_flavor import DatabricksOrchestratorSettings
+
+databricks_settings = DatabricksOrchestratorSettings(
+    spark_version="15.3.x-gpu-ml-scala2.12",
+    node_type_id="Standard_NC24ads_A100_v4",
+    policy_id=POLICY_ID,
+    autoscale=(1, 2),
+)
+```
+
+With these settings, the orchestrator will use a GPU-enabled Spark version and a GPU-enabled node type to run the pipeline on Databricks. The next section shows how to enable CUDA so that the GPU can deliver its full acceleration for your pipeline.
+
+#### Enabling CUDA for GPU-backed hardware
+
+Note that if you wish to use this orchestrator to run steps on a GPU, you will need to follow [the instructions on this page](../../how-to/training-with-gpus/training-with-gpus.md) to ensure that it works. This requires some extra settings customization and is essential for enabling CUDA so that the GPU can deliver its full acceleration.
+
+<figure><img src="https://static.scarf.sh/a.png?x-pxid=f0b4f458-0a54-4fcd-aa95-d5ee424815bc" alt="ZenML Scarf"><figcaption></figcaption></figure>
+
+
+Check out the [SDK docs](https://sdkdocs.zenml.io/latest/integration\_code\_docs/integrations-databricks/#zenml.integrations.databricks.flavors.databricks\_orchestrator\_flavor.DatabricksOrchestratorSettings) for a full list of available attributes and [this docs page](../../how-to/use-configuration-files/runtime-configuration.md) for more information on how to specify settings.
+
+For more information and a full list of configurable attributes of the Databricks orchestrator, check out the [SDK Docs](https://sdkdocs.zenml.io/latest/integration\_code\_docs/integrations-databricks/#zenml.integrations.databricks.orchestrators.databricks\_orchestrator.DatabricksOrchestrator).
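
Note on the orchestrator guide above (an illustration, not part of this commit): the "How it works" section describes a job definition in which each step runs only after its upstream steps, on a cluster configured from the orchestrator settings. The sketch below shows one way such a definition could be assembled as plain dictionaries. The field names (`task_key`, `depends_on`, `python_wheel_task`, `libraries`, `new_cluster`, `autoscale`) follow the Databricks Jobs API as I understand it; the step names, wheel path, package name, entry point, and parameters are hypothetical, and ZenML's actual implementation builds Databricks SDK objects and may differ.

```python
from graphlib import TopologicalSorter


def cluster_spec(spark_version, node_type_id, num_workers=None, autoscale=None):
    """Translate orchestrator-style settings into a cluster dictionary."""
    spec = {"spark_version": spark_version, "node_type_id": node_type_id}
    if autoscale is not None:
        # An autoscale range takes precedence over a fixed worker count here.
        min_workers, max_workers = autoscale
        spec["autoscale"] = {
            "min_workers": min_workers,
            "max_workers": max_workers,
        }
    elif num_workers is not None:
        spec["num_workers"] = num_workers
    return spec


def build_job_spec(run_name, steps, wheel_path, package_name, cluster):
    """Assemble a job payload with one wheel task per pipeline step.

    `steps` maps each step name to the list of its upstream step names.
    """
    # static_order() lists every step after all of its upstream steps and
    # raises graphlib.CycleError if the dependencies contain a cycle.
    ordered = list(TopologicalSorter(steps).static_order())
    tasks = []
    for name in ordered:
        tasks.append(
            {
                "task_key": name,
                "depends_on": [{"task_key": up} for up in steps.get(name, [])],
                "python_wheel_task": {
                    "package_name": package_name,
                    "entry_point": "entrypoint",  # hypothetical entry point
                    "parameters": ["--step", name],  # hypothetical arguments
                },
                "libraries": [{"whl": wheel_path}],
                "new_cluster": dict(cluster),
            }
        )
    return {"name": run_name, "tasks": tasks}


# Hypothetical four-step pipeline: two steps both depend on "train".
steps = {"load": [], "train": ["load"], "evaluate": ["train"], "report": ["train"]}
job = build_job_spec(
    run_name="my_pipeline_run",
    steps=steps,
    wheel_path="/path/to/my_pipeline-0.1-py3-none-any.whl",  # placeholder
    package_name="my_pipeline",
    cluster=cluster_spec("15.3.x-scala2.12", "Standard_D4s_v5", autoscale=(2, 3)),
)
for task in job["tasks"]:
    print(task["task_key"], "<-", [d["task_key"] for d in task["depends_on"]])
```

The topological sort only fixes the listing order. On Databricks, the `depends_on` entries are what would actually gate execution of each task.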
diff --git a/docs/book/toc.md b/docs/book/toc.md
@@ -187,6 +187,7 @@
   * [Kubernetes Orchestrator](component-guide/orchestrators/kubernetes.md)
   * [Google Cloud VertexAI Orchestrator](component-guide/orchestrators/vertex.md)
   * [AWS Sagemaker Orchestrator](component-guide/orchestrators/sagemaker.md)
+  * [Databricks Orchestrator](component-guide/orchestrators/databricks.md)
   * [Tekton Orchestrator](component-guide/orchestrators/tekton.md)
   * [Airflow Orchestrator](component-guide/orchestrators/airflow.md)
   * [Skypilot VM Orchestrator](component-guide/orchestrators/skypilot-vm.md)
@@ -223,6 +224,7 @@
   * [Seldon](component-guide/model-deployers/seldon.md)
   * [BentoML](component-guide/model-deployers/bentoml.md)
   * [Hugging Face](component-guide/model-deployers/huggingface.md)
+  * [Databricks](component-guide/model-deployers/databricks.md)
   * [Develop a Custom Model Deployer](component-guide/model-deployers/custom.md)
 * [👣 Step Operators](component-guide/step-operators/step-operators.md)
   * [Amazon SageMaker](component-guide/step-operators/sagemaker.md)

diff --git a/src/zenml/__init__.py b/src/zenml/__init__.py
@@ -56,6 +56,7 @@
 from zenml.new.steps.step_decorator import step
 from zenml.new.steps.step_context import get_step_context
 from zenml.steps.utils import log_step_metadata
+from zenml.entrypoints import entrypoint
 
 __all__ = [
     "ArtifactConfig",
@@ -74,4 +75,5 @@
     "save_artifact",
     "show",
     "step",
+    "entrypoint",
 ]
diff --git a/src/zenml/integrations/__init__.py b/src/zenml/integrations/__init__.py
@@ -23,6 +23,7 @@
 from zenml.integrations.azure import AzureIntegration  # noqa
 from zenml.integrations.bentoml import BentoMLIntegration  # noqa
 from zenml.integrations.bitbucket import BitbucketIntegration  # noqa
+from zenml.integrations.databricks import DatabricksIntegration  # noqa
 from zenml.integrations.comet import CometIntegration  # noqa
 from zenml.integrations.deepchecks import DeepchecksIntegration  # noqa
 from zenml.integrations.discord import DiscordIntegration  # noqa

diff --git a/src/zenml/integrations/constants.py b/src/zenml/integrations/constants.py
@@ -22,6 +22,7 @@
 BITBUCKET = "bitbucket"
 COMET = "comet"
 DASH = "dash"
+DATABRICKS = "databricks"
 DEEPCHECKS = "deepchecks"
 DISCORD = "discord"
 EVIDENTLY = "evidently"