Fix cleanup with remote work dir #3836

Closed · wants to merge 2 commits

Conversation

bentsherman
Member

Closes #3645 and #2683

Fixes an issue with cleanup on a remote work directory, which is caused by the plugin manager being shut down before the cleanup runs.

I moved the cleanup code to happen after publishing is done but before the plugin manager and cache db are shut down. However, I haven't been able to verify that the cleanup works. When I run the hello pipeline on AWS Batch with -trace nextflow, I no longer get the error message, but the task directories also aren't deleted.

Currently I have this code:

            log.trace "Cleaning-up workdir"
            cache.eachRecord { HashCode hash, TraceRecord record ->
                log.trace "Cleaning task: ${hash.toString()}"

                def deleted = cache.removeTaskEntry(hash)
                if( deleted ) {
                    // delete folder
                    FileHelper.deletePath(FileHelper.asPath(record.workDir))
                }
            }
            log.trace "Clean workdir complete"

The log shows the outer trace messages, but not the inner message for each task. It's like the task records aren't there during the iteration.

The only difference from the old code is that the cleanup now uses the session's cache db reference rather than opening the cache itself; that was necessary because at that point the session's cache had been closed. Also, the old version used openForRead() instead of open(), but that shouldn't matter, right?

@pditommaso what do you suggest?

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman
Member Author

I think the issue is that closing the cache db also waits for any pending writes to finish. So I will try closing and re-opening the cache like before.
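
For reference, here is a minimal sketch of what the close-then-reopen approach might look like, using only the names from the snippet above (whether removeTaskEntry is valid on a cache opened with openForRead is an assumption here):

    // hypothetical sketch: the session's cache db has already been saved and closed,
    // so re-opening it in read mode sees all task records flushed to disk
    cache.openForRead()
    try {
        cache.eachRecord { HashCode hash, TraceRecord record ->
            log.trace "Cleaning task: ${hash}"
            // remove the cache entry first, then delete the task work dir it points to
            if( cache.removeTaskEntry(hash) )
                FileHelper.deletePath(FileHelper.asPath(record.workDir))
        }
    }
    finally {
        cache.close()
    }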

@bentsherman bentsherman marked this pull request as ready for review April 6, 2023 16:52
@bentsherman
Member Author

That did the trick. Cleanup works!

So now the Session destroy is as follows (see the sketch after this list):

  • shutdown publishing
  • shutdown executors
  • save, close cache db
  • clean work dir if enabled (it will re-open cache db)
  • shutdown plugin manager
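
In code terms, roughly (the method names below are illustrative, not the actual Session API):

    // rough sketch of the resulting Session.destroy() ordering; names are hypothetical
    void destroy() {
        shutdownPublishing()        // wait for publishDir file transfers to complete
        shutdownExecutors()         // wait for task executors to terminate
        cache.close()               // save and close the cache db, flushing pending writes
        if( cleanupEnabled )
            cleanupWorkDir()        // re-opens the cache db and deletes task work dirs
        shutdownPluginManager()     // plugins (e.g. nf-amazon) stay up until cleanup is done
    }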

@pditommaso all yours to review and merge

@bentsherman
Member Author

@pditommaso bump

@bentsherman
Member Author

@pditommaso here are the recent commits I found related to the shutdown logic:

  1. Fix thread pool race condition on shutdown: 8d2b058
  2. Fix issue with empty report file: 9cc4f07

The first commit moved the file transfer shutdown from the start of the session destroy to after it. This change caused issues when publishing to S3, because the plugin manager was shut down before all of the files had been published.

The second commit fixed this problem by creating separate thread pools for publishing files and the Tower archiver.

Basically, I think the introduction of the Tower archiver created a circular dependency between waiting for file transfers and shutting down the plugin system, and it was resolved by giving the Tower archiver its own file transfer pool.

This PR does not change any of the file transfer shutdown. It only moves the work dir cleanup to before the plugin manager shutdown, which is necessary to make it work with cloud storage. This change also puts it before deleting the script classes dir and updating the history file, neither of which is related to the cache DB.

I feel confident that we can merge this change. A good stress test would be to run the nf-core rnaseq test on Tower with archiving and cleanup enabled; that should exercise all of the shutdown logic. Why don't we merge it, run the test, and fix any error we find?

@pditommaso
Member

I'm not inclined to merge this. The cleanup has been implemented as an experimental feature; when used with object storage on large real-world pipelines, it can hang the execution for a long time and result in extra charges. I don't think we should promote this behaviour.

@bentsherman
Member Author

Fair enough. It will be replaced by automatic task cleanup anyway.

@bentsherman bentsherman closed this Oct 5, 2023
@Shians

Shians commented Jan 7, 2025

I would like this to be reconsidered. I am using AWS Batch with Fargate and S3 to process some large VCF files into a smaller size that I pull down for local use. The storage costs for the files in the staging areas far exceed the cost of waiting for them to be cleaned up. I think it should be left to the users whether they want to incur additional execution time costs to reduce storage costs.

@bentsherman bentsherman deleted the 3645-fix-cleanup-s3 branch January 16, 2025 20:03
@CharlesARoy

@Shians I'm in the same boat and would love to see this functionality implemented (as would many other people from what I can see), but I don't think it's likely in the near future.

If you use one main folder in S3 as the Nextflow work directory, you could create a Lifecycle Rule for the bucket (on the AWS side) that automatically deletes files with a given prefix after some period of time (minimum is 1 day). This would even work if you use a versioned S3 bucket (which probably adds another layer of complexity for a Nextflow-based implementation).
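
For illustration only, such a rule could be created with the AWS SDK for Java v2 (the bucket name, prefix, and retention periods below are placeholders, not values from this thread):

    import software.amazon.awssdk.services.s3.S3Client
    import software.amazon.awssdk.services.s3.model.*

    // hypothetical example: expire objects under the 'work/' prefix after 1 day
    def s3 = S3Client.create()
    s3.putBucketLifecycleConfiguration(
        PutBucketLifecycleConfigurationRequest.builder()
            .bucket('my-nextflow-bucket')                     // placeholder bucket name
            .lifecycleConfiguration(
                BucketLifecycleConfiguration.builder()
                    .rules(
                        LifecycleRule.builder()
                            .id('expire-nextflow-work')
                            .filter(LifecycleRuleFilter.builder().prefix('work/').build())
                            .expiration(LifecycleExpiration.builder().days(1).build())
                            // for versioned buckets, also expire noncurrent versions
                            .noncurrentVersionExpiration(
                                NoncurrentVersionExpiration.builder().noncurrentDays(1).build())
                            .status(ExpirationStatus.ENABLED)
                            .build())
                    .build())
            .build())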

In my case, I create a Nextflow work folder in each project folder, so there isn't a single prefix I can use to limit the application of a lifecycle rule. Consequently, I'd love to be able to refer to Nextflow work files using object tags. It seems that Nextflow already automatically tags the metadata files with nextflow.io/metadata = true, e.g.:
[Screenshot: S3 object tags showing nextflow.io/metadata = true]

However (and this seems like an oversight), Nextflow fails to tag the non-metadata files (i.e. the process input files) it creates in the work folder. If this were fixed, I'd recommend tagging the non-metadata files differently so that they can be handled separately from the metadata files by lifecycle rules.

@bentsherman or @pditommaso do you have any insight as to how Nextflow tags the S3 objects that it does and what code would need to be modified so that it would also tag the non-metadata files in the work folder?

@pditommaso
Member

S3 tagging is supported via the Fusion file system. See the Fusion tags documentation.
