Fix cleanup with remote work dir #3836

Closed · wants to merge 2 commits

Conversation

bentsherman
Member

Closes #3645 and #2683

Fixes an issue with cleanup on a remote work directory, which is caused by the plugin manager being shut down before the cleanup runs.

I moved the cleanup code to happen after publishing is done but before the plugin manager and cache db are shut down. However, I haven't been able to verify that the cleanup works. When I run the hello pipeline on AWS Batch with -trace nextflow, I no longer get the error message, but the task directories also aren't deleted.

Currently I have this code:

            log.trace "Cleaning-up workdir"
            cache.eachRecord { HashCode hash, TraceRecord record ->
                log.trace "Cleaning task: ${hash.toString()}"

                def deleted = cache.removeTaskEntry(hash)
                if( deleted ) {
                    // delete folder
                    FileHelper.deletePath(FileHelper.asPath(record.workDir))
                }
            }
            log.trace "Clean workdir complete"

The log shows the outer trace messages, but not the inner message for each task. It's like the task records aren't there during the iteration.

The only difference from the old code is that the cleanup now uses the session's cache db reference rather than opening the cache itself; that was necessary because at that point the session's cache had been closed. Also, the old version used openForRead() instead of open(), but that shouldn't matter, right?

@pditommaso what do you suggest?

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman
Member Author

I think the issue is that closing the cache db also waits for any pending writes to finish. So I will try closing and re-opening the cache like before.
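
For reference, here is a minimal sketch of what the close-then-reopen approach might look like, using only the names from the snippet above (whether removeTaskEntry is valid on a cache opened with openForRead is an assumption here):

    // hypothetical sketch: the session's cache db has already been saved and closed,
    // so re-opening it in read mode sees all task records flushed to disk
    cache.openForRead()
    try {
        cache.eachRecord { HashCode hash, TraceRecord record ->
            log.trace "Cleaning task: ${hash}"
            // remove the cache entry first, then delete the task work dir it points to
            if( cache.removeTaskEntry(hash) )
                FileHelper.deletePath(FileHelper.asPath(record.workDir))
        }
    }
    finally {
        cache.close()
    }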

@bentsherman bentsherman marked this pull request as ready for review April 6, 2023 16:52
@bentsherman
Member Author

That did the trick. Cleanup works!

So now the Session destroy is as follows (see the sketch after this list):

  • shutdown publishing
  • shutdown executors
  • save, close cache db
  • clean work dir if enabled (it will re-open cache db)
  • shutdown plugin manager
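
In code terms, roughly (the method names below are illustrative, not the actual Session API):

    // rough sketch of the resulting Session.destroy() ordering; names are hypothetical
    void destroy() {
        shutdownPublishing()        // wait for publishDir file transfers to complete
        shutdownExecutors()         // wait for task executors to terminate
        cache.close()               // save and close the cache db, flushing pending writes
        if( cleanupEnabled )
            cleanupWorkDir()        // re-opens the cache db and deletes task work dirs
        shutdownPluginManager()     // plugins (e.g. nf-amazon) stay up until cleanup is done
    }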

@pditommaso all yours to review and merge

@bentsherman
Member Author

@pditommaso bump

@bentsherman
Member Author

@pditommaso here are the recent commits I found related to the shutdown logic:

  1. Fix thread pool race condition on shutdown: 8d2b058
  2. Fix issue with empty report file: 9cc4f07

The first commit moved the file transfer shutdown from the start of the session destroy to after it. This change caused issues when publishing to S3, because the plugin manager was shut down before all of the files had been published.

The second commit fixed this problem by creating separate thread pools for publishing files and the Tower archiver.

Basically, I think the introduction of the Tower archiver created a circular dependency between waiting for file transfers and shutting down the plugin system, and it was resolved by giving the Tower archiver its own file transfer pool.

This PR does not change any of the file transfer shutdown. It only moves the work dir cleanup to before the plugin manager shutdown, which is necessary to make it work with cloud storage. This change also puts it before deleting the script classes dir and updating the history file, neither of which is related to the cache DB.

I feel confident that we can merge this change. A good stress test would be to run the nf-core rnaseq test on Tower with archiving and cleanup enabled; that should exercise all of the shutdown logic. Why don't we merge it, run the test, and fix any error we find?

@pditommaso
Member

I'm not inclined to merge this. The cleanup has been implemented as an experimental feature; when used with object storage on large real-world pipelines, it can hang the execution for a long time and result in extra charges. I don't think we should promote this behaviour.

@bentsherman
Member Author

Fair enough. It will be replaced by automatic task cleanup anyway.

@bentsherman bentsherman closed this Oct 5, 2023
@Shians

Shians commented Jan 7, 2025

I would like this to be reconsidered. I am using AWS Batch with Fargate and S3 to process some large VCF files into a smaller size that I pull down for local use. The storage costs for the files in the staging areas far exceed the cost of waiting for them to be cleaned up. I think it should be left to the users whether they want to incur additional execution time costs to reduce storage costs.

@bentsherman bentsherman deleted the 3645-fix-cleanup-s3 branch January 16, 2025 20:03
@CharlesARoy

@Shians I'm in the same boat and would love to see this functionality implemented (as would many other people from what I can see), but I don't think it's likely in the near future.

If you use one main folder in S3 as the Nextflow work directory, you could create a Lifecycle Rule for the bucket (on the AWS side) that automatically deletes files with a given prefix after some period of time (minimum is 1 day). This would even work if you use a versioned S3 bucket (which probably adds another layer of complexity for a Nextflow-based implementation).
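
For illustration only, such a rule could be created with the AWS SDK for Java v2 (the bucket name, prefix, and retention periods below are placeholders, not values from this thread):

    import software.amazon.awssdk.services.s3.S3Client
    import software.amazon.awssdk.services.s3.model.*

    // hypothetical example: expire objects under the 'work/' prefix after 1 day
    def s3 = S3Client.create()
    s3.putBucketLifecycleConfiguration(
        PutBucketLifecycleConfigurationRequest.builder()
            .bucket('my-nextflow-bucket')                     // placeholder bucket name
            .lifecycleConfiguration(
                BucketLifecycleConfiguration.builder()
                    .rules(
                        LifecycleRule.builder()
                            .id('expire-nextflow-work')
                            .filter(LifecycleRuleFilter.builder().prefix('work/').build())
                            .expiration(LifecycleExpiration.builder().days(1).build())
                            // for versioned buckets, also expire noncurrent versions
                            .noncurrentVersionExpiration(
                                NoncurrentVersionExpiration.builder().noncurrentDays(1).build())
                            .status(ExpirationStatus.ENABLED)
                            .build())
                    .build())
            .build())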

In my case, I create a Nextflow work folder in each project folder, so there isn't a single prefix I can use to limit the application of a lifecycle rule. Consequently, I'd love to be able to refer to Nextflow work files using object tags. It seems that Nextflow already automatically tags the metadata files with nextflow.io/metadata = true, e.g.:
[Screenshot: S3 object tags showing nextflow.io/metadata = true]

However (and this seems like an oversight), Nextflow fails to tag the non-metadata files (i.e. the process input files) it creates in the work folder. If this were fixed, I'd recommend tagging the non-metadata files differently so that they can be handled separately from the metadata files by lifecycle rules.

@bentsherman or @pditommaso do you have any insight as to how Nextflow tags the S3 objects that it does and what code would need to be modified so that it would also tag the non-metadata files in the work folder?

@pditommaso
Member

S3 tagging is supported via the Fusion file system. See the Fusion tags documentation.
