WIP: feat(api/sdk/backend): implement PipelineConfig and setting ttl on pipelines. Fixes: #10899 #10942
Conversation
Hi @gregsheremeta. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hi, I need some pointers from folks much more familiar with the DSL and kfp-kubernetes and protobuf. See my initial PR message. @connor-mccarthy @chensun @james-jwu @zijianjoy please help or suggest a DSL / protobuf / kfp-kubernetes expert who can assist. Thanks.
This should not go in the k8s platform spec. The name may give that impression, but when we look at the proto file for it here, we can see that its primary use case is to provide an SDK API for manipulating k8s-native objects (e.g. pod spec, PVC, secrets, configmaps). The TTL should be a pipeline abstraction; the k8s-specific implementation should be handled compiler-side, which you have already done. I think a better location would be the deployment_spec, which seems to be intended for pipeline-level configurations.
Hm, ok, that line of thinking makes sense. Let me try that. Thanks!
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@@ -1874,6 +1877,10 @@ def create_pipeline_spec(
     # Schema version 2.1.0 is required for kfp-pipeline-spec>0.1.13
     pipeline_spec.schema_version = '2.1.0'

+    # pipeline-level config options
+    if pipeline_config.completed_pipeline_ttl_seconds is not None and pipeline_config.completed_pipeline_ttl_seconds > 0:
May we raise an error if <= 0?
How will the user know that the argument was ignored?
good point. I think the behavior should be:
-1 = 0 = infinite timeout = same as not specifying anything (per https://cloud.google.com/apis/design/design_patterns#integer_types)
< -1 = should return an invalid argument error
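A minimal sketch of that proposed validation, with an illustrative helper name (not code from this PR):

```python
def validate_completed_pipeline_ttl(ttl_seconds):
    """Proposed semantics: -1 and 0 mean no TTL, the same as not setting
    the option at all; anything below -1 is an invalid argument."""
    if ttl_seconds is None or ttl_seconds in (0, -1):
        return None  # treated as unset: no TTL is written into the pipeline spec
    if ttl_seconds < -1:
        raise ValueError(
            f'Invalid completed_pipeline_ttl_seconds: {ttl_seconds}. '
            'Use a positive number of seconds, or 0/-1 for no TTL.')
    return ttl_seconds
```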
Some error handling message would be nice to have here. For instance:
if ttl_seconds <= 0:
    raise ValueError("Invalid TTL value: {}. The value must be 1 or greater.".format(ttl_seconds))
@chensun / @connor-mccarthy could we please get an ack on this approach, namely the proto addition here, so we can proceed with the above?
/ok-to-test
/lgtm
Just left two nitpick comments. Totally optional.
@@ -27,7 +28,8 @@ def pipeline(func: Optional[Callable] = None,
              name: Optional[str] = None,
              description: Optional[str] = None,
              pipeline_root: Optional[str] = None,
-             display_name: Optional[str] = None) -> Callable:
+             display_name: Optional[str] = None,
+             pipeline_config: pipeline_config.PipelineConfig = None) -> Callable:
Optional nitpick: Add the new pipeline_config parameter to the docstring?
done
New changes are detected. LGTM label has been removed.
@gregsheremeta: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests.
I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
// begin platform level / workflow level config

// duration in seconds after which the pipeline platform will do garbage collection
optional int32 completed_pipeline_ttl = 2;
As per @droctothorpe's comment from yesterday's meeting, what do we think about providing a comment here about logging? Something along the lines of:
"Note: when persisted logging is unavailable, any tasks that are garbage collected will result in the loss of the associated logs for those tasks in the UI."
Adding this here will surface it in the API docs, FYI.
note to self: this should be a Duration
ref: https://cloud.google.com/apis/design/naming_convention#time_and_duration
To summarize my understanding of @zijianjoy's comments from the July 3rd KFP call (last topic discussed): there are various levels where we could implement garbage collection of underlying pods:
As well as being able to do (2), (3), (4) via the SDK or UI, or some combination thereof. What exists today is the ability to do (4) via the SDK. This PR allows users to do (2) via the SDK, as it is very annoying for users to do this for every task. The implementation is extensible enough that we could easily add the capability to do (3)/(2) as well, though this would likely require additional DB columns for pipelines (for detecting the TTL, then simply checking for this during compilation). For (4) we could simply have a config option (also checked for during compilation). I think this addition is simple enough that we need not worry about the growing complexity of adding support for these varying levels in the future. @zijianjoy / @chensun please let us know your thoughts on the above.
// begin platform level / workflow level config

// duration in seconds after which the pipeline platform will do garbage collection
optional int32 completed_pipeline_ttl = 2;
Would suggest changing this type to Duration. Duration explicitly represents time intervals, making the intent of the field more apparent.
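For illustration, if the proto field were a google.protobuf.Duration, the SDK side could populate it from a seconds value roughly like this (a sketch, not code from this PR):

```python
from google.protobuf import duration_pb2

def ttl_to_duration(ttl_seconds: int) -> duration_pb2.Duration:
    # Duration carries explicit seconds/nanos, which makes the intent of the
    # field clearer than a bare int32 number of seconds.
    return duration_pb2.Duration(seconds=ttl_seconds)

print(ttl_to_duration(3600))  # -> seconds: 3600
```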
I think TTL after completion at the pod level makes the most sense. This can be configured at deployment time. And if we want to override the deployment setting for individual pipelines/tasks, setting it via the kfp-kubernetes extension and explicitly calling out that it's a TTL for pods would be my preference. My rationale is that TTL after completion at the pipeline (and pipeline run) level is kind of misleading: we're not cleaning up pipeline run records from the database, so post-TTL runs still show up when listing runs via the UI or SDK. The only things that get garbage collected are their pods (or maybe some Argo Workflow record in addition to the pods).
Thanks @chensun, great points! I agree that
However, while the end result is pod clean-up, the implementation uses the backing engine's TTL API, i.e. it is not directly interacting with k8s-native resources, but rather going through the engine's CRD API. Based on the kubernetes-extension proto, it seems to me to be highly geared towards Kubernetes-native components at the task pod level. Adding a field that
As you mentioned, in Argo's case the Workflow resource is also cleaned up. So something like
I think @gregsheremeta plans to explore the kubernetes-extension approach again, but maybe we can use a different name for the API field so that its intentions are clearer, regardless of whether it's in
While we look into this, it would be useful to know how strongly opinionated you are towards this change going into
I've extracted the PipelineConfig addition to its own PR, #11112.
Regarding the TTL portion of this PR, I don't see anything in Argo Workflows or Kubernetes to allow setting TTLs or self-destructs on individual pods. For pod-level TTL, I think we would have to implement a custom controller, or find one that already exists that we could reuse. Alternatively, we can stick with pipeline-level TTL, but we'd have to clean up database rows (runs, maybe other stuff?) in addition to setting the Workflow TTL field. This option sounds preferable to me.
Closing this PR while I research the above and wait for #11112 to land.
Maybe we can disambiguate between etcd TTL (for runs) and KFP database TTL (for runs). I think it's important to be able to configure both separately.
We use the KFP database for long-term storage, reproducibility, and model governance. As a result, we need either long-term or indefinite TTL in the KFP database. In contrast, when it comes to etcd, reduced TTL is advisable. A large volume of non-GCed Argo Workflow custom resources can lead to K8s API latency and can even crash a cluster. Don't ask me how I know. Okay, fine: we were testing a scheduled run that created a new run / workflow every 10 seconds and there was no TTL.
etcd / Workflow CR TTL was configurable in v1. IMHO, that gap should be closed in v2, and it is probably a smaller, self-contained issue that can be decoupled from database TTL and pod-specific TTL (both of which are meaningful issues in their own right, ofc).
Side note: something to be aware of with pod TTL is that if Workflow TTL is less than pod TTL, the pod will be GCed earlier than anticipated, since the pods have an owner reference to the Workflow CR, i.e. deletions cascade.
Also, FWIW, the people concerned about database cleanup are usually platform admins, not end users. As a result, end users will likely not really care about database TTL. With that in mind, it might make more sense to handle database cleanup through an external process that platform admins can control.
Apologies if any of the above is stating the obvious!
This was in the back of my mind, but I didn't think it was important :) Happy to hear a different perspective, though. That said, I strongly dislike when we leak implementation details into the KFP SDK. I already dislike that we have to leak setting a TTL into the SDK. I'll dislike it even more if we make it even leakier by going from
Design Doc: Kubeflow Pipelines - Design - Implement TTL for pipeline runs - Google Docs
Will continue the convo in the design doc.
I don't think DB cleanup should be something to set in the SDK, but rather an env var on the Persistence Agent. My idea is a flag to keep run history in the database, with a default value set to
Description of your changes:
Fixes: #10899
Note: This is functioning end to end. This is currently WIP only because I'd like to get an ack from a maintainer on the approach, at which point I'll split the proto changes into a separate PR, and then rebase this on top of that once it's merged. I'll then also add tests.
In KFP v1, there was a function kfp.dsl.PipelineConf().set_ttl_seconds_after_finished(seconds) that garbage collected completed pods after a specified time. However, in KFP v2, this function has been deprecated. The entire PipelineConf is deprecated, actually, and I don't see a replacement. This PR adds a replacement.
I add a section to the protobuf to store pipeline-level config options. The right place appears to be PipelineDeploymentConfig. That actually gets cast into the pipeline_spec struct later, which is a bit confusing (and also causes ints to get kind-of-cast to floats because of https://stackoverflow.com/questions/51818125/how-to-use-ints-in-a-protobuf-struct), but ultimately it works.
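A quick illustration of the int-to-float quirk mentioned above: google.protobuf.Struct only has a double number type, so integer values come back as floats (the field name below is just an example).

```python
from google.protobuf import struct_pb2

s = struct_pb2.Struct()
s['completedPipelineTtl'] = 120   # stored in Value.number_value, which is a double
print(s['completedPipelineTtl'])  # -> 120.0
```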
I add a new PipelineConfig class in the DSL. This class will store pipeline-level config options just like the old PipelineConf did. set_completed_pipeline_ttl_seconds is the first config (and the only one in there for now).
I modify @dsl.pipeline to take a PipelineConfig instance as a parameter. The compiler is enhanced to read those config options and apply them to the pipeline proto that the compiler generates.
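For example, usage in the DSL looks roughly like this (a sketch based on the description above; the exact import path and constructor of PipelineConfig are assumptions, and the pipeline name is illustrative):

```python
from kfp import dsl
from kfp.dsl import PipelineConfig  # import path is an assumption

# Pipeline-level config: garbage-collect the finished run's Workflow and pods
# after 120 seconds.
config = PipelineConfig()
config.set_completed_pipeline_ttl_seconds(120)

@dsl.pipeline(name='ttl-demo', pipeline_config=config)
def ttl_demo_pipeline():
    ...  # pipeline tasks go here
```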
I enhance the backend Argo Workflows compiler to look for known config options in the PipelineDeploymentConfig section of the proto. completedPipelineTtl is the only one for now. If found, it's added to the Workflow CR in the correct place (Workflow.Spec.TTLStrategy).
Upon using this enhancement, I can see that a Workflow is created with the proper TTL set. And as expected, after set_completed_pipeline_ttl_seconds seconds, the Workflow and all Pods are deleted by Argo Workflows' garbage collection mechanism.
test pipeline:
resulting pipeline spec:
resulting Workflow:
workflow controller logs:
@chensun can you please review or recommend an alternate reviewer?
below is old -- keeping it for posterity
I am guessing that the platform config is the right place to put it (need guidance, please). In this WIP, when I add the following to the end of a pipeline, the Workflow (and its pods) are deleted after 2 minutes. (completedPipelineTtl is the new field I added in the proto. Previously there was only a map of configs for the tasks, so it used to look like this:)
Now the hard part: I'm not sure how to proceed with changing the SDK/DSL to get it to output that pipeline spec. It seems like using kfp-kubernetes is the correct way to go because this is a platform config option we want to set. However, there are no examples of doing pipeline-scope config -- all of the current kfp-kubernetes functions work on tasks, not pipelines. I poked at using something like this:
pipelines/kubernetes_platform/python/kfp/kubernetes/common.py
Lines 19 to 23 in bf5104f
but for Pipelines, with the hopes of it looking like this in the DSL. But the Pipeline class is pretty barren and I don't see a way to get at platform_config from it. I think I'm blocked until I get some pointers from folks much more familiar with the DSL and kfp-kubernetes. @connor-mccarthy @chensun please help or suggest a DSL expert who can.