Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VolumeOp doesn't support GC after workflow deletion #1779

Closed
hongye-sun opened this issue Aug 8, 2019 · 21 comments · Fixed by #4831
Closed

VolumeOp doesn't support GC after workflow deletion #1779

hongye-sun opened this issue Aug 8, 2019 · 21 comments · Fixed by #4831
Assignees
Labels
Milestone

Comments

@hongye-sun
Copy link
Contributor

The PVC created from VolumeOp is not automatically recycled after workflow is deleted.

I wonder if owner reference can work with PVC when it is set with workflow's reference. If it works, we can provide an option in volumeop to set it when creating the PVC.

If it doesn't work, we should provide a way to either use Argo's volume API or support a delete PVC op in exit handler.

@prcastro
Copy link
Contributor

What is the practical consequence of this? I'm having an issue with bound PVCs from pipelines that ran long ago.

@elikatsis
Copy link
Member

Hi all,
I'm working on a solution using owner references that seems to be working just fine!
I'll follow up with a PR soon.

/assign elikatsis

@elikatsis
Copy link
Member

As I mentioned in the previous comment, I've been working on solving this issue using OwnerReferences.

An OwnerReference requires two basic fields that refer to the actual Kubernetes resource: name and uid.
Argo enables access to those using Argo (v2.3.0) variables: {{workflow.name}}, {{workflow.uid}}.

PR #1720, however, changed {{workflow.uid}} so that it points to runID instead, making it impossible to add OwnerReferences.

I'm not sure I understand the rationale behind this. [ cc @IronPan ]
Why don't we use {{workflow.labels.pipeline/runid}}?
[ cc @Ark-kun since it's used in the DSL in RUN_ID_PLACEHOLDER ]
It feels very odd having {{workflow.uid}} modified and not function as it's supposed to (i.e., be replaced with workflow's UUID), especially when there's a valid way to do what we desire.

@karlschriek
Copy link

I would be very interested in this. We have workflows where we sometimes need to dump out fairly large files to disk, but they are only relevant while the component is executing. Currently we have to always manually delete the PVCs afterwards

@elikatsis
Copy link
Member

@Ark-kun @IronPan friendly ping 😄

@PatrickXYS
Copy link
Member

Willing to help, and will send a PR for adding support a delete PVC op.

@elikatsis
Copy link
Member

elikatsis commented Apr 22, 2020

@Ark-kun @IronPan

Hi there! Pinging again for an answer to this comment: #1779 (comment)

Since it's been some time (KFP has had many releases, KF ships with a KFP version containing #1720), can we move to using a different Argo magic string and not override the existing {{workflow.uid}}?
Is it OK if I submit a PR fixing that behavior?

If we go with {{workflow.labels.pipeline/runid}} rather than a new {{kfp.run.id}}, there will also be no problem with incompatibilities between SDK and apiserver.

@Ark-kun
Copy link
Contributor

Ark-kun commented Apr 28, 2020

I generally agree with the suggestion.

@stale
Copy link

stale bot commented Jul 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Jul 27, 2020
@PatrickXYS
Copy link
Member

I don't think it's fixed now, please keep open until we fix.

@stale stale bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Jul 27, 2020
@marrrcin
Copy link

Is there any workaround to remove the actual volume (disk) after the pipeline finishes?
Using

my_volume = dsl.VolumeOp(
        name="create-volume",
        resource_name="asd",
        size="10Gi",
        modes=dsl.VOLUME_MODE_RWO,
    )
dsl.ResourceOp(
        name="delete-volume",
        k8s_resource=my_volume.k8s_resource,
        action="delete"
    )

still leaves the unused disk in GCE.

@stale
Copy link

stale bot commented Nov 15, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Nov 15, 2020
@jackwhelpton
Copy link
Contributor

This is still a sorely-needed feature for us, would be good to keep the ticket open until it is resolved.

@stale stale bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Nov 15, 2020
@Gsantomaggio
Copy link
Member

Agree with @jackwhelpton
I would reopen it.
Thank you

@talhairfanbentley
Copy link

Hey, did anyone found any workaround for this?

@PatrickXYS
Copy link
Member

@kubeflow/wg-pipeline-leads I think this is a well-known issue and all the cloud-provider should get the same issue.

As I can see, we met similar issue here. Can you put some efforts on this issue and I can help as well?

/cc @Bobgy

@Bobgy
Copy link
Contributor

Bobgy commented Nov 21, 2020

/assign @chensun
Can you take a look?
Looks like this is requested by a lot of people

@elikatsis
Copy link
Member

Since we have now upgraded Argo client version used by the backend (apart from the deployment) we can take advantage of the setOwnerReference field of the ResourceTemplate (source).

Using this we will be able to set an owner reference of the workflow to PVCs created via the SDK, without having to deal with the [odd] design described above (#1779 (comment)).

@Bobgy I'm preparing a PR for this and will open it pretty soon :)

@PatrickXYS
Copy link
Member

PatrickXYS commented Nov 23, 2020

@elikatsis Feel free to cc me the PR

Would like to help as well

elikatsis added a commit to arrikto/kubeflow-pipelines that referenced this issue Nov 26, 2020
Argo supports a field in the ResourceTemplate that makes the controller
add an owner reference of the workflow to the created resource since
v2.4.0 [1].
With the upgrade of Argo client [2] and deployment [3] we are now able
to exploit it.

We set it to 'false' by default on all ResourceOps (actually, leave it
empty) and 'true' for VolumeOps. This allows the garbage collection of
PVCs upon workflow cleanup without any further change [4].

[1] https://github.com/argoproj/argo/blob/v2.4.0/pkg/apis/workflow/v1alpha1/workflow_types.go#L1044-L1045
[2] kubeflow#4498
[3] kubeflow#3537
[4] kubeflow#1779

Signed-off-by: Ilias Katsakioris <elikatsis@arrikto.com>
elikatsis added a commit to arrikto/kubeflow-pipelines that referenced this issue Nov 26, 2020
…ow#1779

Argo supports a field in the ResourceTemplate that makes the controller
add an owner reference of the workflow to the created resource since
v2.4.0 [1].
With the upgrade of Argo client [2] and deployment [3] we are now able
to exploit it.

We set it to 'false' by default on all ResourceOps (actually, leave it
empty) and 'true' for VolumeOps. This allows the garbage collection of
PVCs upon workflow cleanup without any further change [4].

[1] https://github.com/argoproj/argo/blob/v2.4.0/pkg/apis/workflow/v1alpha1/workflow_types.go#L1044-L1045
[2] kubeflow#4498
[3] kubeflow#3537
[4] kubeflow#1779

Signed-off-by: Ilias Katsakioris <elikatsis@arrikto.com>
elikatsis added a commit to arrikto/kubeflow-pipelines that referenced this issue Dec 17, 2020
…ow#1779

Argo supports a field in the ResourceTemplate that makes the controller
add an owner reference of the workflow to the created resource since
v2.4.0 [1].
With the upgrade of Argo client [2] and deployment [3] we are now able
to exploit it.

We set it to 'false' by default on all ResourceOps (actually, leave it
empty) and 'true' for VolumeOps. This allows the garbage collection of
PVCs upon workflow cleanup without any further change [4].

[1] https://github.com/argoproj/argo/blob/v2.4.0/pkg/apis/workflow/v1alpha1/workflow_types.go#L1044-L1045
[2] kubeflow#4498
[3] kubeflow#3537
[4] kubeflow#1779

Signed-off-by: Ilias Katsakioris <elikatsis@arrikto.com>
elikatsis added a commit to arrikto/kubeflow-pipelines that referenced this issue Jan 8, 2021
…ow#1779

Argo supports a field in the ResourceTemplate that makes the controller
add an owner reference of the workflow to the created resource since
v2.4.0 [1].
With the upgrade of Argo client [2] and deployment [3] we are now able
to exploit it.

We set it to 'false' by default on all ResourceOps (actually, leave it
empty) and 'true' for VolumeOps. This allows the garbage collection of
PVCs upon workflow cleanup without any further change [4].

[1] https://github.com/argoproj/argo/blob/v2.4.0/pkg/apis/workflow/v1alpha1/workflow_types.go#L1044-L1045
[2] kubeflow#4498
[3] kubeflow#3537
[4] kubeflow#1779

Signed-off-by: Ilias Katsakioris <elikatsis@arrikto.com>
elikatsis added a commit to arrikto/kubeflow-pipelines that referenced this issue Jan 14, 2021
…ow#1779

Argo supports a field in the ResourceTemplate that makes the controller
add an owner reference of the workflow to the created resource since
v2.4.0 [1].
With the upgrade of Argo client [2] and deployment [3] we are now able
to exploit it.

We set it to 'false' by default on all ResourceOps (actually, leave it
empty).

Setting the field to 'true' for VolumeOps allows the garbage collection
of PVCs upon workflow cleanup [4].

[1] https://github.com/argoproj/argo/blob/v2.4.0/pkg/apis/workflow/v1alpha1/workflow_types.go#L1044-L1045
[2] kubeflow#4498
[3] kubeflow#3537
[4] kubeflow#1779

Signed-off-by: Ilias Katsakioris <elikatsis@arrikto.com>
@RoyerRamirez
Copy link

Hi, I'm following along. Have we found a fix for this issue? We currently run many jobs in parallel, and having to go back to delete the disks after they are used can be a pain. Not to mention, it can get costly over time if users don't clean up their environment.

@elikatsis
Copy link
Member

@RoyerRamirez we've got #4831 targeting this but it's not merged yet.

elikatsis added a commit to arrikto/kubeflow-pipelines that referenced this issue Mar 12, 2021
…ow#1779

Argo supports a field in the ResourceTemplate that makes the controller
add an owner reference of the workflow to the created resource since
v2.4.0 [1].
With the upgrade of Argo client [2] and deployment [3] we are now able
to exploit it.

We set it to 'false' by default on all ResourceOps (actually, leave it
empty).

Setting the field to 'true' for VolumeOps allows the garbage collection
of PVCs upon workflow cleanup [4].

[1] https://github.com/argoproj/argo/blob/v2.4.0/pkg/apis/workflow/v1alpha1/workflow_types.go#L1044-L1045
[2] kubeflow#4498
[3] kubeflow#3537
[4] kubeflow#1779

Signed-off-by: Ilias Katsakioris <elikatsis@arrikto.com>
google-oss-robot pushed a commit that referenced this issue Mar 12, 2021
…4831)

Argo supports a field in the ResourceTemplate that makes the controller
add an owner reference of the workflow to the created resource since
v2.4.0 [1].
With the upgrade of Argo client [2] and deployment [3] we are now able
to exploit it.

We set it to 'false' by default on all ResourceOps (actually, leave it
empty).

Setting the field to 'true' for VolumeOps allows the garbage collection
of PVCs upon workflow cleanup [4].

[1] https://github.com/argoproj/argo/blob/v2.4.0/pkg/apis/workflow/v1alpha1/workflow_types.go#L1044-L1045
[2] #4498
[3] #3537
[4] #1779

Signed-off-by: Ilias Katsakioris <elikatsis@arrikto.com>
magdalenakuhn17 pushed a commit to magdalenakuhn17/pipelines that referenced this issue Oct 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment