[docs] Dedicated docs on how to skip building an image on pipeline run #3079

Merged: 26 commits, Oct 18, 2024 (changes from 17 commits shown)
Commits
ae03fd1
add some info on docker skip build
wjayesh Oct 8, 2024
73dd4e0
add docs on not building a docker image
wjayesh Oct 14, 2024
7687aae
update toc and title
wjayesh Oct 14, 2024
ddf105c
added text to stress that this doesnt always happen
wjayesh Oct 14, 2024
057aa73
Apply suggestions from code review
wjayesh Oct 14, 2024
1df5ba3
restructure headings
wjayesh Oct 14, 2024
b69e138
Merge branch 'docs/docker-skip-build' of https://github.com/zenml-io/…
wjayesh Oct 14, 2024
bbe9e95
more english
wjayesh Oct 14, 2024
516a214
Apply suggestions from code review
wjayesh Oct 15, 2024
0b966c1
Merge branch 'docs/docker-skip-build' of https://github.com/zenml-io/…
wjayesh Oct 14, 2024
d373e44
apply review changes
wjayesh Oct 16, 2024
563cb04
add how to reuse builds page
wjayesh Oct 16, 2024
75d947c
aoply hamza comments
wjayesh Oct 16, 2024
44dc550
add redirect for new page name
wjayesh Oct 16, 2024
e5cd75e
apply review changes
wjayesh Oct 16, 2024
a369a8c
move the artifact store block to the top
wjayesh Oct 16, 2024
b626e21
update redirect
wjayesh Oct 16, 2024
d2acb0a
add scarf
wjayesh Oct 16, 2024
a3d8da2
Update .gitbook.yaml
wjayesh Oct 17, 2024
d9daabc
link to code repository
wjayesh Oct 16, 2024
1a0d4dc
Merge branch 'develop' into docs/docker-skip-build
wjayesh Oct 17, 2024
b382722
Merge branch 'develop' into docs/docker-skip-build
wjayesh Oct 17, 2024
a958607
fix relative link
wjayesh Oct 16, 2024
dcbfd8d
Apply suggestions from code review
wjayesh Oct 17, 2024
a407181
Merge branch 'develop' into docs/docker-skip-build
wjayesh Oct 17, 2024
740150d
add where the code should be added
wjayesh Oct 16, 2024
4 changes: 2 additions & 2 deletions .gitbook.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -4,5 +4,5 @@ structure:
readme: introduction.md
summary: toc.md

#redirects:
# help: ./support.md
redirects:
how-to/customize-docker-builds/use-code-repositories-to-speed-up-docker-build-times: how-to/customize-docker-builds/how-to-reuse-builds.md
Original file line number Diff line number Diff line change
Expand Up @@ -133,7 +133,7 @@ def my_pipeline(...):
```

{% hint style="warning" %}
This is an advanced feature and may cause unintended behavior when running your pipelines. If you use this, ensure your code files are correctly included in the image you specified.
This is an advanced feature and may cause unintended behavior when running your pipelines. If you use this, ensure your code files are correctly included in the image you specified. Read in detail about this feature [here](./use-a-prebuilt-image.md) before proceeding.
{% endhint %}

<figure><img src="https://static.scarf.sh/a.png?x-pxid=f0b4f458-0a54-4fcd-aa95-d5ee424815bc" alt="ZenML Scarf"><figcaption></figcaption></figure>
Original file line number Diff line number Diff line change
@@ -1,8 +1,43 @@
# Use code repositories to speed up Docker build times
---
description: >
Learn how to reuse builds to speed up your pipeline runs.
---

While reusing Docker builds is useful, it can be limited. This is because specifying a custom build when running a pipeline will **not run the code on your client machine** but will use the code **included in the Docker images of the build**. As a consequence, even if you make local code changes, reusing a build will _always_ execute the code bundled in the Docker image, rather than the local code. Therefore, if you would like to reuse a Docker build AND make sure your local code changes are also downloaded into the image, you need to disconnect your code from the build.
# How to reuse builds

You can do so by connecting a git repository. Registering a code repository lets you avoid building images each time you run a pipeline **and** quickly iterate on your code. When running a pipeline that is part of a local code repository checkout, ZenML can instead build the Docker images without including any of your source files, and download the files inside the container before running your code. This greatly speeds up the building process and also allows you to reuse images that one of your colleagues might have built for the same stack.
When you run a pipeline, ZenML will check if a build with the same pipeline and stack exists. If it does, it will reuse that build. If it doesn't, ZenML will create a new build. This guide explains what a build is and the best practices around reusing builds.

## What is a build?

A pipeline build is an encapsulation of a pipeline and the stack it was run on. It contains the Docker images that were built for the pipeline with all the requirements from the stack, integrations and the user. Optionally, it also contains the pipeline code.

You can list all the builds for a pipeline using the CLI:

```bash
zenml pipeline builds list --pipeline_id='startswith:ab53ca'
```

You can also create a build manually using the CLI:

```bash
zenml pipeline build --stack vertex-stack my_module.my_pipeline_instance
```

You can use the options to specify the configuration file and the stack to use for the build. The source should be a path to a pipeline instance. Learn more about the build function [here](https://sdkdocs.zenml.io/latest/core_code_docs/core-new/#zenml.new.pipelines.pipeline.Pipeline.build).

## Reusing builds

As mentioned above, ZenML automatically finds an existing build that matches your pipeline and stack. However, you can also force it to use a specific build by [passing the build ID](../../how-to/use-configuration-files/what-can-be-configured.md#build-id) to the `build` parameter of the pipeline configuration.

While reusing Docker builds is useful, it can be limited. This is because specifying a custom build when running a pipeline will **not run the code on your client machine** but will use the code **included in the Docker images of the build**. As a consequence, even if you make local code changes, reusing a build will _always_ execute the code bundled in the Docker image, rather than the local code. Therefore, if you would like to reuse a Docker build AND make sure your local code changes are also downloaded into the image, you need to disconnect your code from the build. You can do this either by registering a code repository or by letting ZenML use the artifact store to upload your code.

## Use the artifact store to upload your code

You can also let ZenML use the artifact store to upload your code. This is the default behaviour if no code repository is detected and the `allow_download_from_artifact_store` flag is not set to `False` in your `DockerSettings`.

## Use code repositories to speed up Docker build times

One way to speed up Docker builds is to connect a git repository. Registering a [code repository](../../user-guide/production-guide/connect-code-repository.md) lets you avoid building images each time you run a pipeline **and** quickly iterate on your code. When running a pipeline that is part of a local code repository checkout, ZenML can instead build the Docker images without including any of your source files, and download the files inside the container before running your code. This greatly speeds up the building process and also allows you to reuse images that one of your colleagues might have built for the same stack.

ZenML will **automatically figure out which builds match your pipeline and reuse the appropriate build id**. Therefore, you **do not** need to explicitly pass in the build id when you have a clean repository state and a connected git repository. This approach is **highly recommended**. See an end to end example [here](../../user-guide/production-guide/connect-code-repository.md).

Expand All @@ -14,18 +49,18 @@ zenml integration install github
```
{% endhint %}

## Detecting local code repository checkouts
### Detecting local code repository checkouts

Once you have registered one or more code repositories, ZenML will check whether the files you use when running a pipeline are tracked inside one of those code repositories. This happens as follows:

* First, the [source root](./which-files-are-built-into-the-image.md) is computed
* Next, ZenML checks whether this source root directory is included in a local checkout of one of the registered code repositories
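
The second check above can be pictured as a path containment test. The following is a minimal sketch of that idea, not ZenML's actual implementation. The function name `find_containing_checkout` and the example paths are hypothetical and exist only for illustration.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_containing_checkout(
    source_root: Path, checkout_roots: Iterable[Path]
) -> Optional[Path]:
    """Return the first checkout root that contains source_root, or None."""
    for checkout_root in checkout_roots:
        try:
            # relative_to raises ValueError when source_root lies outside
            # checkout_root (the comparison is per path component).
            source_root.relative_to(checkout_root)
        except ValueError:
            continue
        return checkout_root
    return None


# Hypothetical local checkouts of registered code repositories.
checkouts = [Path("/work/other-repo"), Path("/work/my-repo")]
match = find_containing_checkout(Path("/work/my-repo/src"), checkouts)
```

Here `match` is the checkout root that contains the source root. A result of `None` would mean the files are not tracked in any registered repository.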

## Tracking code version for pipeline runs
### Tracking code version for pipeline runs

If a [local code repository checkout](#detecting-local-code-repository-checkouts) is detected when running a pipeline, ZenML will store a reference to the current commit for the pipeline run, so you'll be able to know exactly which code was used. Note that this reference is only tracked if your local checkout is clean (i.e. it does not contain any untracked or uncommitted files). This is to ensure that your pipeline is actually running with the exact code stored at the specific code repository commit.
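
If you want to confirm that your checkout is clean before running a pipeline, you can ask git directly. The sketch below assumes `git` is available on your `PATH`. It is a convenience check you run yourself, not the check ZenML performs internally, and the helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def is_clean_status(porcelain_output: str) -> bool:
    """A checkout is clean when `git status --porcelain` prints nothing."""
    return porcelain_output.strip() == ""


def checkout_is_clean(repo_path: Path) -> bool:
    """Run git inside repo_path and report whether the working tree is clean."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,  # raise if git fails, e.g. repo_path is not a repository
    )
    return is_clean_status(result.stdout)
```

`git status --porcelain` lists both uncommitted changes and untracked files, which matches the definition of a clean checkout given above.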

## Tips and best practices
### Tips and best practices

It is also important to take some additional points into consideration:

120 changes: 120 additions & 0 deletions docs/book/how-to/customize-docker-builds/use-a-prebuilt-image.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,120 @@
---
description: "Skip building an image for your ZenML pipeline altogether."
---

# Use a prebuilt image for pipeline execution

When you run a pipeline on a remote Stack, ZenML builds a Docker image on top of a base ZenML image and adds all of your project dependencies to it. If no code repository is registered and `allow_download_from_artifact_store` is set to `False` in your `DockerSettings`, ZenML also adds your pipeline code to the image. The build can take significant time depending on the size of your dependencies, the power of your local machine and the speed of your internet connection, because Docker must pull the base layers and push the final image to your container registry. Although the build only happens once and is skipped when ZenML detects no change in your environment, it can still be a bottleneck that slows down your pipeline execution.

To save time and costs, you can choose to not build a Docker image every time your pipeline runs. This guide shows you how to do it using a prebuilt image, what you should include in your image for the pipeline to run successfully and other tips.

{% hint style="info" %}
Note that using this feature means that you won't be able to leverage any updates you make to your code or dependencies, outside of what your image already contains.
{% endhint %}

## How to use this feature

The [DockerSettings](../../../../docs/book/how-to/customize-docker-builds/docker-settings-on-a-pipeline.md#specify-docker-settings-for-a-pipeline) class in ZenML lets you set a parent image for your pipeline runs and skip building an image on top of it.

Just set the `parent_image` attribute of the `DockerSettings` class to the image you want to use and set `skip_build` to `True`.

```python
from zenml import pipeline
from zenml.config import DockerSettings

docker_settings = DockerSettings(
    parent_image="my_registry.io/image_name:tag",
    skip_build=True,
)


@pipeline(settings={"docker": docker_settings})
def my_pipeline(...):
    ...
```

{% hint style="warning" %}
Make sure that this image is pushed to a registry from which the orchestrator, step operator and any other components that require the image can pull it, without any involvement by ZenML.
{% endhint %}

## What the parent image should contain

When you run a pipeline with a prebuilt image and skip the build process, ZenML will not build any image on top of it. This means that the image you pass to the `parent_image` attribute of the `DockerSettings` class has to contain all the dependencies needed to run your pipeline. It must also contain your code files if you don't have a code repository registered and the `allow_download_from_artifact_store` flag is set to `False`.

{% hint style="info" %}
Note that this is different from the case where you [only specify a parent image](./docker-settings-on-a-pipeline.md#using-a-pre-built-parent-image) and don't want to `skip_build`. In the latter, ZenML still builds the image but does it on top of your parent image and not the base ZenML image.
{% endhint %}
{% hint style="info" %}
If you're using an image that was already built by ZenML in a previous pipeline run, you don't need to worry about what goes in it as long as it was built for the **same stack** as your current pipeline run. You can use it directly.
{% endhint %}

The following points are derived from how ZenML builds an image internally and will help you make your own images.

### Your stack requirements

A ZenML Stack can have different components and each comes with its own requirements. You need to ensure that your image contains them. The following is how you can get a list of stack requirements.

```python
from zenml.client import Client

stack_name = "<YOUR_STACK>"  # replace with the name of your stack
# set your stack as active if it isn't already
Client().set_active_stack(stack_name)

# get the requirements for the active stack
active_stack = Client().active_stack
stack_requirements = active_stack.requirements()
```

### Integration requirements

For all integrations that you use in your pipeline, you need to have their dependencies installed too. You can get a list of them in the following way:

```python
import itertools

from zenml.enums import OperatingSystemType
from zenml.integrations.constants import HUGGINGFACE, PYTORCH
from zenml.integrations.registry import integration_registry

# define a list of all required integrations
required_integrations = [PYTORCH, HUGGINGFACE]

# Generate requirements for all required integrations
integration_requirements = set(
    itertools.chain.from_iterable(
        integration_registry.select_integration_requirements(
            integration_name=integration,
            target_os=OperatingSystemType.LINUX,
        )
        for integration in required_integrations
    )
)
```

### Any project-specific requirements

You can install all of these requirements, together with any other dependencies your project relies on, through a line in your `Dockerfile` like the following. It assumes you have collected all the requirements in one file.

```Dockerfile
RUN pip install <ANY_ARGS> -r FILE
```
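
One way to collect the requirements into a single file is sketched below. The helpers `merge_requirements` and `write_requirements_file` and the output file name are hypothetical. The merge is purely textual, so conflicting version pins are not resolved.

```python
from pathlib import Path
from typing import Iterable, List


def merge_requirements(*groups: Iterable[str]) -> List[str]:
    """Combine requirement groups into one sorted, de-duplicated list."""
    merged = set()
    for group in groups:
        # Strip whitespace and ignore blank entries.
        merged.update(req.strip() for req in group if req.strip())
    return sorted(merged)


def write_requirements_file(path: Path, requirements: Iterable[str]) -> None:
    """Write one requirement per line so `pip install -r` can read the file."""
    path.write_text("\n".join(requirements) + "\n")


# Example with placeholder values; in practice, pass the stack and
# integration requirement sets gathered in the sections above.
all_requirements = merge_requirements({"torch", "transformers"}, ["pandas", "torch"])
# write_requirements_file(Path("requirements.txt"), all_requirements)
```

The resulting file is what the `-r FILE` argument in the `Dockerfile` line above would point to.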

### Any system packages

If you have any `apt` packages that are needed for your application to function, be sure to include them too. This can be achieved in a `Dockerfile` as follows:

```Dockerfile
RUN apt-get update && apt-get install -y --no-install-recommends YOUR_APT_PACKAGES
```

### Your project code files

The files containing your pipeline and step code and all other necessary functions should be available in your execution environment.

- If you have a code repository registered, you don't need to include your code files in the image yourself. ZenML will download them from the repository to the appropriate location in the image.

- If you don't have a code repository but `allow_download_from_artifact_store` is set to `True` in your `DockerSettings` (`True` by default), ZenML will upload your code to the artifact store and make it available to the image.

- If both of these options are disabled, you need to include your code files in the image yourself. This approach is not recommended and you should use one of the above options.
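
The three cases above amount to a simple order of precedence. The sketch below is a simplified model of that decision, not ZenML's internal logic. The function name and the returned labels are hypothetical, and it ignores details such as whether the local checkout is clean.

```python
def resolve_code_source(
    has_code_repository: bool,
    allow_download_from_artifact_store: bool = True,
) -> str:
    """Return where the pipeline code is expected to come from at runtime."""
    if has_code_repository:
        # Code is downloaded from the registered repository.
        return "code_repository"
    if allow_download_from_artifact_store:
        # Code is uploaded to, and fetched from, the artifact store.
        return "artifact_store"
    # Neither option is available: the image itself must contain the code.
    return "image"
```

Only the last case requires you to copy your code into the prebuilt image yourself.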

Take a look at the [which files are built into the image](../../../../docs/book/how-to/customize-docker-builds/which-files-are-built-into-the-image.md) page to learn more about what to include.


{% hint style="info" %}
Note that you also need Python, `pip` and `zenml` installed in your image.
{% endhint %}
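
A quick way to sanity-check a candidate image is to run a small script inside it that looks for the required Python packages. This is a minimal sketch and the helper name `missing_modules` is hypothetical. It only checks that top-level packages are importable, not their versions.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(module_names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found by the interpreter."""
    return [
        name
        for name in module_names
        if importlib.util.find_spec(name) is None
    ]
```

Calling `missing_modules(["pip", "zenml"])` with the image's Python interpreter should return an empty list if both packages are installed.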
3 changes: 2 additions & 1 deletion docs/book/toc.md
Original file line number Diff line number Diff line change
Expand Up @@ -102,10 +102,11 @@
* [🐳 Customize Docker builds](how-to/customize-docker-builds/README.md)
* [Docker settings on a pipeline](how-to/customize-docker-builds/docker-settings-on-a-pipeline.md)
* [Docker settings on a step](how-to/customize-docker-builds/docker-settings-on-a-step.md)
* [Use a prebuilt image for pipeline execution](how-to/customize-docker-builds/use-a-prebuilt-image.md)
* [Specify pip dependencies and apt packages](how-to/customize-docker-builds/specify-pip-dependencies-and-apt-packages.md)
* [Use your own Dockerfiles](how-to/customize-docker-builds/use-your-own-docker-files.md)
* [Which files are built into the image](how-to/customize-docker-builds/which-files-are-built-into-the-image.md)
* [Use code repositories to automate Docker build reuse](how-to/customize-docker-builds/use-code-repositories-to-speed-up-docker-build-times.md)
* [How to reuse builds](how-to/customize-docker-builds/how-to-reuse-builds.md)
* [Define where an image is built](how-to/customize-docker-builds/define-where-an-image-is-built.md)
* [📔 Run remote pipelines from notebooks](how-to/run-remote-steps-and-pipelines-from-notebooks/README.md)
* [Limitations of defining steps in notebook cells](how-to/run-remote-steps-and-pipelines-from-notebooks/limitations-of-defining-steps-in-notebook-cells.md)