-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow squashfs image as gtdbtk database. #785
base: dev
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok this doesn't look too bad if it's sort of processed as an 'alternative' compression format to `.tar.gz.
Some things already:
- Missing changelog update
- Missing description somewhere in
usage.md
of the functionality (even if a user does not need to make an active action). - Please put the new process in it's own file (nf-core convention)
I would be curious to know from some Nextflow experts if they see any flaws in this in regards to cloud computing or something like that... will the containerOptions be able to correctly evaluate the variable and database_image stuff etc, as I'm not that familiar with that directive? @FriederikeHanssen @maxulysse are my go-tos (but maybe they can pass the question on to someone else)
Can I get some context on why the containerOptions are necessary in the first place? |
Does this only work with Singularity? And if so, is that desirable/do we need guard rails to fail gently if running with e.g. Docker or Conda? |
@FriederikeHanssen,
Sorry, @FriederikeHanssen, that was a non-justification, but rethinking it I realized that I don't need to mount each depth-2 path of the database. I only need to mount the depth-1 path. 32157ad The |
In case database is a file-system image, inspect the image to discover level-2 paths. Output is a string of mount options for discovered paths.
@jfy133, I don't know how to update parameter documentation, but I have updated file nextflow_schema.json. Will the former file be updated automatically once the PR has been merged? |
Passing GTDB-Tk database as squash-fs requires singularity.
@prototaxites, this should work with both Singularity and Apptainer. I will leave Docker as an exercise. ;-) I have added a check of container engine. Conda does not conflict, but it needs to be combined with a supported container engine. |
It is sufficient to discover 1st level directory.
Yes, that's the file you need to update! Sorry, I always forget to explain about that 😅 . You can use |
@nf-core-bot fix linting |
I think we gotta make sure this works with docker since all but singularity and conda use the docker image and convert it to their respective input (apptainer, podman, charliecloud etc.) It could break stuff for a bunch of people. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check that Docker works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The module changes should come via a module update, so this PR cannot proceed until this is done!
This is also sort of a good thing, as it will give the wider community a chance to give input whether they think this is a good system/idea.
EDIT: basically what Rike said 😅
@@ -8,7 +8,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 | |||
### `Added` | |||
|
|||
- [#777](https://github.com/nf-core/mag/pull/777) - Improved input validation through additional JSON keywords and error messages (by @agusinac) | |||
|
|||
- [#785](https://github.com/nf-core/mag/pull/785) - Allow squash-fs image as GTDB-Tk database. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- [#785](https://github.com/nf-core/mag/pull/785) - Allow squash-fs image as GTDB-Tk database. | |
- [#785](https://github.com/nf-core/mag/pull/785) - Allow squash-fs image as input for GTDB-Tk database (by @muniheart ) |
/* Mount the database image and inspect directory structure. Generate a container mount directive for each | ||
* depth-2 file and directory. This is for compatability with GTDBTK_DB_PREPARATION which removes first path | ||
* component during tar extraction. | ||
*/ | ||
process GTDBTK_IMAGE_INSPECT { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/* Mount the database image and inspect directory structure. Generate a container mount directive for each | |
* depth-2 file and directory. This is for compatability with GTDBTK_DB_PREPARATION which removes first path | |
* component during tar extraction. | |
*/ | |
process GTDBTK_IMAGE_INSPECT { | |
/* | |
* Mount the database image and inspect directory structure. Generate a container mount directive for each | |
* depth-2 file and directory. This is for compatibility with GTDBTK_DB_PREPARATION which removes first path | |
* component during tar extraction. | |
*/ | |
process GTDBTK_IMAGE_INSPECT { |
@@ -3,10 +3,12 @@ process GTDBTK_CLASSIFYWF { | |||
label 'process_medium' | |||
conda "${moduleDir}/environment.yml" | |||
container "${workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 'https://depot.galaxyproject.org/singularity/gtdbtk:2.4.0--pyhdfd78af_1' : 'biocontainers/gtdbtk:2.4.0--pyhdfd78af_1'}" | |||
containerOptions "$mountOpts" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you cannot modify an nf-core module within a pipeline like this.
you will need to first make a PR to nf-core/modules
, get a review there, then we run in the pipeline nextflow modules update gtdbtk/classifywf
to update the module here.
Note I'm already making a couple of updates to this module too!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General question, is it frowned upon to just patch a module with nf-core modules patch
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depends. Is it something that everyone that uses the module needs? -> update nf-core/modules. Is it a one of for some reason? -> patch it in the pipeline. Patching is definitely better than copying the module to a local one to get a tiny change in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you cannot modify an nf-core module within a pipeline like this.
you will need to first make a PR to
nf-core/modules
, get a review there, then we run in the pipelinenextflow modules update gtdbtk/classifywf
to update the module here.Note I'm already making a couple of updates to this module too!
@jfy133, which branch of nf-core/modules
should I fork? I don't see branch dev
and nf-core docs doesn't specify branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One other thought - rather than modify the GTDB-Tk module, containerOptions can be set in conf/modules.conf
:
process {
withName: GTDBTK_WF {
containerOptions = { ${ db.getExtension() == "sqfs" && workflow.containerEngine == 'singularity' ? "whatever_mount_options" : "" }
}
}
Can easily be extended to accommodate all the containerisation solutions. Module might need a bit of finagling still to find the right location of the database but overall seems cleaner.
|
||
input: | ||
tuple val(meta) , path("bins/*") | ||
tuple val(db_name), path("database/*") | ||
val mountOpts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't modify nf-core module.
The squash-fs image built from GTDB-Tk database must have top-level directory removed. This obviates the need for image inspection. Mount options for the squash-fs image are passed to GTDBTK_CLASSIFYWF in a configuration file.
I have closed #8230 in favor of setting mount options in configuration file as described here:
where
|
I have one other thought about this, which is does this work with the updated GTDB-Tk module inputs in the PR from @jfy133 here? The new PR doesn't hard-code the database path to |
@muniheart This PR looks OK code wise to me I think as it's just an if/else statement now, but please add more documentation with examples of how to use and configure the pipeline at the bottom of |
Thank you, @jfy133. I have updated |
Can you merge in the recent changes from |
Co-authored-by: Jim Downie <19718667+prototaxites@users.noreply.github.com>
Thank you, @jfy133. I have updated
@jfy133 & @prototaxites, it appears that #788 has broken compatibility with squash-fs image. Input |
@FriederikeHanssen this PR does not have a conflict with docker base |
@jfy133 & @prototaxites I don't see any way to proceed with this PR, given #788. Any suggestions? |
Sorry about the intermediate update to the module @muniheart :( the old method was not working properly for some people either (and was not really a good 'nextflowy' way of doing things in the first place :( ), so as it was bug I had to priorities that. If I understand correctly, your solution should still work - you just need to change the name of the |
AH I just saw this... I think you can still refer to the input Unless I misunderstand? |
@nf-core-bot fix linting |
Hi all, have been musing about how this can still be added - there are two ways I can see this working, which may or may not work from a technical perspective. There is one assumption I'm making:
So the options are:
|
Hi @jfy133 & @prototaxites:
That is my observation.
I have tried this and I observe that, inside the container's shell, I see the contents of the src directory, not the contents of the image fs. Doesn't matter if the src directory is empty.
The fundamental difference between the antimash code and my case is that antimash references the mount-source with an absolute path outside the container.
@jfy133, would you be willing to modify the module such that, in the case
as database location? |
Spare thought: could we modify the process so that the input format for the db is an env() with name GTDBTK_DATA_PATH? That avoids the file exists check entirely and simplifies the module by removing the find line to set the required environment variable. https://www.nextflow.io/docs/latest/process.html#input-environment-variables-env |
@prototaxites, the find is required because the database has a top-level directory, e.g. /release_220, that GTDB-Tk can't handle, so removing the find would place the burden of discovering it's name on the user. |
I think the ENV method is the less 'invasive' method (assuming that it works). The only problem is we should make it very clear that it is only for special cases - it is not recommended for general use using that method, as then you don't stage the database in the work dir as Nextflow wants you too. I have a feeling this may break resume when using that system. But this way the only module code change is indeed just putting and if statement that if the path(dB) is true, then run find, otherwise we don't set the variable in the module - and the user is expected to set the ENV in their module config and ensure that that path is present on all of their worker nodes etc. Optional input is straight forward with The containerOptions hack for antiSMASH is extremely dirty so like others have said - I really don't recommend setting those in the module itself. |
I had to add the find precisely because people were having problems specifying the correct path :/ |
Hypothetically, that could be moved to the GTDB db channel preparation step inside a map operator - |
Yes true! But we are getting further and further away from how most modules work, which sort of defeats the point of standardisation... |
With this PR, the GTDBTK database can be a squash-fs image with extension '.sqfs'. I haven't yet added the ability to create an image. To create an image from a database in tar.gz format, see #779 (comment).
I welcome suggestions on how to add a separate workflow to build a squash-fs image from a tar-ball.