
[backend] Unable to disable cache for pipeline steps #10966

Closed
rmoesbergen opened this issue Jun 27, 2024 · 7 comments

Comments

@rmoesbergen

Environment

  • How did you deploy Kubeflow Pipelines (KFP)?
    Using Kustomize, in GKE

  • KFP version:
    Pipelines version 2.2.0

  • KFP SDK version:
    SDK version 1.8.22 and 2.7.0 (both have this issue)

Steps to reproduce

I have a pipeline like this:

        train_op = (
            train_loader.create_op(
                job_name=job_name,
                account=account,
            )
            .set_caching_options(False)
        )
        train_op.execution_options.caching_strategy.max_cache_staleness = "P0D"
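The snippet sets max_cache_staleness to "P0D", an ISO 8601 duration of zero days, so a cached result should already count as too old to reuse. The sketch below is a toy model of such a staleness check. It assumes the rule "reuse only entries strictly younger than the limit"; parse_staleness and may_reuse are hypothetical helpers, and the parser only handles the day/time subset of ISO 8601.

```python
import re
from datetime import timedelta

# Day/time subset of ISO 8601 durations: P[nD][T[nH][nM][nS]].
# Years, months, weeks and fractional values are deliberately not handled.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_staleness(value: str) -> timedelta:
    """Parse a duration such as "P0D" or "P1DT12H" into a timedelta."""
    match = _DURATION.match(value)
    # Reject "P", "PT" and "P1DT": the regex accepts them, but they carry
    # no value or an empty time part.
    if match is None or value.endswith(("P", "T")):
        raise ValueError(f"unsupported duration: {value!r}")
    parts = {name: int(num) for name, num in match.groupdict().items() if num is not None}
    return timedelta(**parts)


def may_reuse(entry_age: timedelta, max_staleness: str) -> bool:
    """Toy rule (an assumption): reuse only entries strictly younger than the limit."""
    return entry_age < parse_staleness(max_staleness)


print(parse_staleness("P0D"))                    # 0:00:00
print(may_reuse(timedelta(seconds=1), "P0D"))    # False
print(may_reuse(timedelta(hours=1), "P1DT12H"))  # True
```

Under this rule, "P0D" can never allow reuse, because no cached entry is younger than zero.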

When this pipeline is compiled and run, the Pod still gets these labels and annotations:

labels:
  app: kubeflow-job
  pipeline/runid: d2173b95-f465-47ac-a38a-470769c2064b
  pipelines.kubeflow.org/cache_enabled: 'true'
  pipelines.kubeflow.org/cache_id: ''
  pipelines.kubeflow.org/enable_caching: 'false'
annotations:
  pipelines.kubeflow.org/execution_cache_key: 852f1ec5f95c01d9c0e62b85072fa8092f5f7933e73a08ec96e6ebb74229391e

No matter what I try, Kubeflow keeps caching the steps. That makes no sense for us, because our underlying data changes even though the parameters stay the same. I also tried all of the suggestions here, including modifying the mutating admission webhook, but nothing works:

#4857
https://www.kubeflow.org/docs/components/pipelines/v1/overview/caching/
https://www.kubeflow.org/docs/components/pipelines/v2/caching/
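The symptom described above follows from a general property of cache keys: if the key is derived only from the step definition and its input parameters, a step whose parameters are unchanged maps to the same key even when the data it reads has changed. The sketch below is a toy illustration of that property, not KFP's actual key derivation; it only produces a SHA-256 hex digest of the same shape as the execution_cache_key annotation. The parameter values and the data_version parameter are hypothetical.

```python
import hashlib
import json


def toy_cache_key(component_name: str, parameters: dict) -> str:
    """Hash the step identity and its input parameters (toy model, not KFP's algorithm)."""
    # sort_keys makes the key independent of the order the parameters were given in.
    payload = json.dumps({"component": component_name, "parameters": parameters},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same parameters -> same key, regardless of what the underlying data looks like today.
monday = toy_cache_key("train", {"job_name": "nightly", "account": "acme"})
tuesday = toy_cache_key("train", {"job_name": "nightly", "account": "acme"})
print(monday == tuesday)  # True

# A parameter that tracks the data (hypothetical "data_version") changes the key.
with_version = toy_cache_key(
    "train", {"job_name": "nightly", "account": "acme", "data_version": "2024-06-27"})
print(monday == with_version)  # False
print(len(monday))  # 64
```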

The only thing that sort of works is reverting Kubeflow Pipelines to 2.0.5. With that version the annotations are still there, but Kubeflow somehow ignores them and does not cache.

Expected result

Kubeflow stops caching when I ask it to.



@tjhorner

tjhorner commented Jul 9, 2024

There's a discrepancy between the SDK and the backend about what label to use to control the caching behavior.

The backend uses pipelines.kubeflow.org/cache_enabled:

KFPCacheEnabledLabelKey string = "pipelines.kubeflow.org/cache_enabled"

But the SDK uses pipelines.kubeflow.org/enable_caching:

# Caching option
op.add_pod_label('pipelines.kubeflow.org/enable_caching',
                 str(op.enable_caching).lower())

(The same label is set here and in a few other locations.)
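To see whether a pod is affected by this mismatch, you can compare the two label keys directly. The sketch below is a hypothetical helper (not part of the KFP SDK) that takes a pod's labels as a plain dict, here the ones from the issue description, and reports when the key the SDK writes and the key the backend reads, as described in this comment, disagree.

```python
from typing import Optional

BACKEND_KEY = "pipelines.kubeflow.org/cache_enabled"  # key the backend reads
SDK_KEY = "pipelines.kubeflow.org/enable_caching"     # key the SDK writes


def caching_label_conflict(labels: dict) -> Optional[str]:
    """Return a message if the two caching labels disagree, else None."""
    backend = labels.get(BACKEND_KEY)
    sdk = labels.get(SDK_KEY)
    if backend is None or sdk is None:
        return None  # only one key (or neither) is present: nothing to compare
    if backend.lower() != sdk.lower():
        return f"{BACKEND_KEY}={backend!r} but {SDK_KEY}={sdk!r}"
    return None


# Labels from the pod in the issue description.
pod_labels = {
    "app": "kubeflow-job",
    "pipelines.kubeflow.org/cache_enabled": "true",
    "pipelines.kubeflow.org/cache_id": "",
    "pipelines.kubeflow.org/enable_caching": "false",
}
print(caching_label_conflict(pod_labels))
# pipelines.kubeflow.org/cache_enabled='true' but pipelines.kubeflow.org/enable_caching='false'
```

With the workaround label applied, both keys read 'false' and the helper returns None.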

As a workaround, you can manually add the label pipelines.kubeflow.org/cache_enabled: 'false' to your pods, for example:

train_op.add_pod_label("pipelines.kubeflow.org/cache_enabled", "false")

The SDK should be updated to use the correct label.

@gregsheremeta (Contributor)

Hm, I'll look into this. I can confirm that the annotations are not used in 2.0.5, which coincidentally is the version I tend to run. In 2.0.5, the only thing that matters is the enableCache option in the proto. The check is here.

I'll take a look at 2.2.0 and report back.

@gregsheremeta (Contributor)

/assign @gregsheremeta


github-actions bot commented Oct 9, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the lifecycle/stale label on Oct 9, 2024.
@gregsheremeta (Contributor)

/remove-lifecycle stale

google-oss-prow bot removed the lifecycle/stale label on Oct 14, 2024.
@gregsheremeta (Contributor)

gregsheremeta commented Oct 18, 2024

I tested this on KFP 2.3.0 with SDK 2.9.0 (both latest releases as of today), and I verified my previous comment -- it's only the enableCache option in the proto that controls this in KFP v2. And it works as expected for me.
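Put differently, caching is decided per task from the compiled pipeline spec. The sketch below shows how that field can be read, assuming the compiled YAML has already been loaded into a Python dict and treating a missing enableCache as false. task_cache_enabled is a hypothetical helper, not the driver's actual code.

```python
def task_cache_enabled(pipeline_spec: dict, task_name: str) -> bool:
    """Read root.dag.tasks.<task_name>.cachingOptions.enableCache from a loaded spec."""
    task = pipeline_spec["root"]["dag"]["tasks"][task_name]
    # An absent cachingOptions or enableCache is treated as "caching off".
    return bool(task.get("cachingOptions", {}).get("enableCache", False))


# Trimmed-down versions of the two compiled specs from this comment.
enabled_spec = {"root": {"dag": {"tasks": {
    "caching-enabled-component": {"cachingOptions": {"enableCache": True}}}}}}
disabled_spec = {"root": {"dag": {"tasks": {
    "caching-disabled-component": {"cachingOptions": {}}}}}}

print(task_cache_enabled(enabled_spec, "caching-enabled-component"))    # True
print(task_cache_enabled(disabled_spec, "caching-disabled-component"))  # False
```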

I created a pipeline with a single component/task. First, I enabled caching via my_task.set_caching_options(True). Here are the Python code and the resulting YAML.

from kfp import dsl

@dsl.component(base_image="python:3.9")
def caching_enabled_component():
    pass

@dsl.pipeline(name='caching-example-pipeline')
def caching_example_enabled_pipeline():
    my_task = caching_enabled_component()
    my_task.set_caching_options(True)

# PIPELINE DEFINITION
# Name: caching-example-pipeline
components:
  comp-caching-enabled-component:
    executorLabel: exec-caching-enabled-component
deploymentSpec:
  executors:
    exec-caching-enabled-component:
      container:
        args:
        - --executor_input
        - '{{$}}'
        - --function_to_execute
        - caching_enabled_component
        command:
        - sh
        - -c
        - "\nif ! [ -x \"$(command -v pip)\" ]; then\n    python3 -m ensurepip ||\
          \ python3 -m ensurepip --user || apt-get install python3-pip\nfi\n\nPIP_DISABLE_PIP_VERSION_CHECK=1\
          \ python3 -m pip install --quiet --no-warn-script-location 'kfp==2.9.0'\
          \ '--no-deps' 'typing-extensions>=3.7.4,<5; python_version<\"3.9\"' && \"\
          $0\" \"$@\"\n"
        - sh
        - -ec
        - 'program_path=$(mktemp -d)


          printf "%s" "$0" > "$program_path/ephemeral_component.py"

          _KFP_RUNTIME=true python3 -m kfp.dsl.executor_main                         --component_module_path                         "$program_path/ephemeral_component.py"                         "$@"

          '
        - "\nimport kfp\nfrom kfp import dsl\nfrom kfp.dsl import *\nfrom typing import\
          \ *\n\ndef caching_enabled_component():\n    pass\n\n"
        image: python:3.9
pipelineInfo:
  name: caching-example-pipeline
root:
  dag:
    tasks:
      caching-enabled-component:
        cachingOptions:
          enableCache: true
        componentRef:
          name: comp-caching-enabled-component
        taskInfo:
          name: caching-enabled-component
schemaVersion: 2.1.0
sdkVersion: kfp-2.9.0

When I upload and run this, the first run causes 3 pods to be created. When I run it a second time, the third pod (the one that runs my component) is skipped / cached. The driver log (second pod) outputs:
I1018 23:31:29.380282 19 driver.go:332] Use cache for task caching-enabled-component
And the UI renders the cache icon.

Next, I copied the pipeline to a new file and disabled caching via my_task.set_caching_options(False). Here are the Python code and the resulting YAML.

from kfp import dsl

@dsl.component(base_image="python:3.9")
def caching_disabled_component():
    pass

@dsl.pipeline(name='caching-example-pipeline')
def caching_example_disabled_pipeline():
    my_task = caching_disabled_component()
    my_task.set_caching_options(False)

# PIPELINE DEFINITION
# Name: caching-example-pipeline
components:
  comp-caching-disabled-component:
    executorLabel: exec-caching-disabled-component
deploymentSpec:
  executors:
    exec-caching-disabled-component:
      container:
        args:
        - --executor_input
        - '{{$}}'
        - --function_to_execute
        - caching_disabled_component
        command:
        - sh
        - -c
        - "\nif ! [ -x \"$(command -v pip)\" ]; then\n    python3 -m ensurepip ||\
          \ python3 -m ensurepip --user || apt-get install python3-pip\nfi\n\nPIP_DISABLE_PIP_VERSION_CHECK=1\
          \ python3 -m pip install --quiet --no-warn-script-location 'kfp==2.9.0'\
          \ '--no-deps' 'typing-extensions>=3.7.4,<5; python_version<\"3.9\"' && \"\
          $0\" \"$@\"\n"
        - sh
        - -ec
        - 'program_path=$(mktemp -d)


          printf "%s" "$0" > "$program_path/ephemeral_component.py"

          _KFP_RUNTIME=true python3 -m kfp.dsl.executor_main                         --component_module_path                         "$program_path/ephemeral_component.py"                         "$@"

          '
        - "\nimport kfp\nfrom kfp import dsl\nfrom kfp.dsl import *\nfrom typing import\
          \ *\n\ndef caching_disabled_component():\n    pass\n\n"
        image: python:3.9
pipelineInfo:
  name: caching-example-pipeline
root:
  dag:
    tasks:
      caching-disabled-component:
        cachingOptions: {}
        componentRef:
          name: comp-caching-disabled-component
        taskInfo:
          name: caching-disabled-component
schemaVersion: 2.1.0
sdkVersion: kfp-2.9.0

No matter how many times I run this, all three pods are created. The component is never cached / the third pod is never skipped.
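The pod counts reported in this comment can be reproduced with a small toy simulation. It assumes, as described above, that the first two pods of every run always start (the second is the driver), and that the third pod, which runs the component, is skipped only when caching is enabled and an earlier identical run has been recorded. simulate_runs is hypothetical and ignores everything else the real driver does.

```python
from typing import List


def simulate_runs(enable_cache: bool, runs: int) -> List[int]:
    """Pods created in each of `runs` identical runs (toy model, not the real driver)."""
    recorded_keys = set()
    key = "same-pipeline-same-inputs"  # every run is identical
    pod_counts = []
    for _ in range(runs):
        pods = 2  # the first two pods always run; the second one is the driver
        cache_hit = enable_cache and key in recorded_keys
        if not cache_hit:
            pods += 1  # third pod: the executor that runs the component
            recorded_keys.add(key)
        # On a cache hit the driver logs "Use cache for task ..." and skips the executor.
        pod_counts.append(pods)
    return pod_counts


print(simulate_runs(enable_cache=True, runs=3))   # [3, 2, 2]
print(simulate_runs(enable_cache=False, runs=3))  # [3, 3, 3]
```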

Apart from the names, the only difference between the two YAML files is:

41,43c41,42
<       caching-enabled-component:
<         cachingOptions:
<           enableCache: true
---
>       caching-disabled-component:
>         cachingOptions: {}
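The disabled task serializes as an empty cachingOptions: {} rather than enableCache: false. That is consistent with protobuf's proto3 JSON mapping, which by default omits fields holding their default value (false for a bool); this sketch assumes the compiled spec goes through that mapping. It mimics the rule with a plain dataclass. CachingOptions and to_json_like are hypothetical stand-ins, not the SDK's serializer.

```python
from dataclasses import dataclass, field, fields


@dataclass
class CachingOptions:
    """Hypothetical stand-in for the proto message; a proto3 bool defaults to False."""
    enable_cache: bool = field(default=False, metadata={"json_name": "enableCache"})


def to_json_like(message) -> dict:
    """Mimic proto3 JSON defaults: emit only fields that differ from their default."""
    return {
        f.metadata["json_name"]: getattr(message, f.name)
        for f in fields(message)
        if getattr(message, f.name) != f.default
    }


print(to_json_like(CachingOptions(enable_cache=True)))   # {'enableCache': True}
print(to_json_like(CachingOptions(enable_cache=False)))  # {}
```

This is why an empty cachingOptions reads as "caching disabled": the false value is simply not written out.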

root.dag.tasks.COMPONENT.cachingOptions is what controls caching, and it is the only thing the driver evaluates when it makes the caching decision.

Closing as working as expected.

/close


@gregsheremeta: Closing this issue.
