Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FeatureRequest]KFP execution cache #2904

Closed
7 tasks done
rui5i opened this issue Jan 23, 2020 · 20 comments · Fixed by #3602
Closed
7 tasks done

[FeatureRequest]KFP execution cache #2904

rui5i opened this issue Jan 23, 2020 · 20 comments · Fixed by #3602
Assignees
Labels
area/execution_cache kind/feature status/triaged Whether the issue has been explicitly triaged

Comments

@rui5i
Copy link
Contributor

rui5i commented Jan 23, 2020

  1. Mutate webhook
  • webhook service
  • execution_key generator
  • pod modification controller
  • max_cache_staleness support
  1. Mysql
  • persist execution_key
  1. Releasing
  • standalone deployment
  • mkp deployment
@rui5i
Copy link
Contributor Author

rui5i commented Jan 23, 2020

/assign @rui5i

@rmgogogo rmgogogo added status/triaged Whether the issue has been explicitly triaged kind/feature labels Feb 6, 2020
@rmgogogo
Copy link
Contributor

rmgogogo commented Feb 6, 2020

not sure which is the first phase target, from SDK-to-API, right?

@rui5i rui5i changed the title [FeatureRequest]KFP retry on certain step [FeatureRequest]KFP execution cache Feb 6, 2020
@rui5i
Copy link
Contributor Author

rui5i commented Feb 6, 2020

Sorry for the confusing. Update the issue content. We change the project to "KFP execution cache" and retry from a certain step will become a underlying CUJ. There will not be any new APIs. The design doc will be sent out really soon.

@elikatsis
Copy link
Member

Hello @rui5i,

That's a really nice feature!
Any update on the design doc? I think it should be an item in the upcoming KFP community meeting.

Thanks!

@rui5i
Copy link
Contributor Author

rui5i commented Feb 11, 2020

Hi @elikatsis ,

Thanks for your feedback! The design doc currently is under KFP team review. I am happy to present on the upcoming KFP community meeting!

@elikatsis
Copy link
Member

Ping!
I see there is no entry in the agenda. Any update on that?

@Padarn
Copy link
Contributor

Padarn commented Mar 27, 2020

Also interested in any docs on this, or updates on the release. I saw some meeting notes here: https://docs.google.com/document/d/1KB5KD8TvcrnxQX0xluHRnUdRYkM2-5-vB2V1wlxS4GY/edit#heading=h.gt0qfhljl8xo but its not clear to me if a decision has been made

@rui5i
Copy link
Contributor Author

rui5i commented Mar 27, 2020

Also interested in any docs on this, or updates on the release. I saw some meeting notes here: https://docs.google.com/document/d/1KB5KD8TvcrnxQX0xluHRnUdRYkM2-5-vB2V1wlxS4GY/edit#heading=h.gt0qfhljl8xo but its not clear to me if a decision has been made

Hi, thanks for asking! The link you provided is our caching design doc. We are trying to make it available on 0.3.1. I'll let you know after it's release.

@Padarn
Copy link
Contributor

Padarn commented Mar 27, 2020

Awesome. Feel free to ask if there is anything can help from community.

@elikatsis
Copy link
Member

Hi again,
nice job with the implementation so far! Keep it up!

I wanted to ask, are there any plans on what will you be showing as step's logs in KFP UI? Or the cached steps will appear with empty logs?
I haven't spotted any change related to that, so I assume, correct me if I'm wrong, that if we were to deploy the feature and cache a step at this very moment, logs would be empty.

@rui5i
Copy link
Contributor Author

rui5i commented Mar 31, 2020

Hi again,
nice job with the implementation so far! Keep it up!

I wanted to ask, are there any plans on what will you be showing as step's logs in KFP UI? Or the cached steps will appear with empty logs?
I haven't spotted any change related to that, so I assume, correct me if I'm wrong, that if we were to deploy the feature and cache a step at this very moment, logs would be empty.

Hi @elikatsis ,

Thanks for checking in! Currently, if a step's result is taken from cache, then the step log will show "This step output is taken from cache." https://github.com/kubeflow/pipelines/blob/master/backend/src/cache/server/mutation.go#L119. In the future we may explore if it's possible to show the link of previous run/step.

@rui5i
Copy link
Contributor Author

rui5i commented Apr 10, 2020

Kubeflow Pipelines step caching is now released in 0.4.0 and after. Close this issue.

@rui5i rui5i closed this as completed Apr 10, 2020
@Bobgy
Copy link
Contributor

Bobgy commented Apr 15, 2020

/reopen

TODO:

I will finish the UI integration

@k8s-ci-robot
Copy link
Contributor

@Bobgy: Reopened this issue.

In response to this:

/reopen

TODO:

  • Add an UI indication a step is cached
    /assign @Bobgy

I will finish the UI integration

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Bobgy
Copy link
Contributor

Bobgy commented Apr 23, 2020

@rui5i @Ark-kun I took a look at a cached pod and read through the mutation hook's code, it seems we don't have any information UI can use to indicate a pod is cached. Is that correct?

If so, where should we add these information?

@Ark-kun
Copy link
Contributor

Ark-kun commented Apr 24, 2020

Looks like we did not add any extra labels, so the labels/annotations of reused pods are same.
We could add a few annotations here: https://github.com/kubeflow/pipelines/blob/master/backend/src/cache/server/mutation.go#L126 For example we can include the original start time and end time.

However, given a pod, it's pretty easy to detect that it was skipped (it does not have any Argo containers):

Command: []string{`echo`, `"This step output is taken from cache."`},

There is also a way to detect the skipped pods based on the WorkflowStatus alone: All output artifacts have the pod name in the URI. But for skipped pods, the pod name does not match the URIs. This genius idea belongs to @rui5i. (And now there are always some output artifacts since we've enabled log archiving).

@Bobgy
Copy link
Contributor

Bobgy commented Apr 24, 2020

@Ark-kun I'm a little worried about if we are relying too much on argo details:

  • we need to let UI understand argo artifact path structure, there is a part that's the workflow name
  • UI needs to understand cache server modifies argo outputs to reuse past outputs
  • There are no other systems modifying argo outputs

The last two points probably is also dangerous to cache server...

It really sounds to me we should contribute the caching solution to argo workflow natively and add it to workflow status. (just personal gut feeling, I guess that's not practical now)

In the mean time, I'll let UI use the hack first to first get it working.

@Bobgy
Copy link
Contributor

Bobgy commented Apr 24, 2020

Digging through some related argo issues, I understand argo isn't really taking the feature request.

@Bobgy
Copy link
Contributor

Bobgy commented Apr 24, 2020

@Ark-kun Can I assume we only cache successful steps?

@Ark-kun
Copy link
Contributor

Ark-kun commented Apr 29, 2020

It really sounds to me we should contribute the caching solution to argo workflow natively and add it to workflow status. (just personal gut feeling, I guess that's not practical now)

I'm positive about that. Half a year ago I even started the project to add caching to Argo, but did not finish it. The caching requires some persistence and Argo did not have any DB at that time.

@Ark-kun Can I assume we only cache successful steps?

We only reuse the successfull steps.

In the future we'll start reusing still-running steps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/execution_cache kind/feature status/triaged Whether the issue has been explicitly triaged
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants