Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow squashfs image as gtdbtk database. #785

Open
wants to merge 32 commits into
base: dev
Choose a base branch
from

Conversation

muniheart
Copy link

With this PR, the GTDBTK database can be a squash-fs image with extension '.sqfs'. I haven't yet added the ability to create an image. To create an image from a database in tar.gz format, see #779 (comment).

I welcome suggestions on how to add a separate workflow to build a squash-fs image from a tar-ball.

Copy link
Member

@jfy133 jfy133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok this doesn't look too bad if it's sort of processed as an 'alternative' compression format to `.tar.gz.

Some things already:

  • Missing changelog update
  • Missing description somewhere in usage.md of the functionality (even if a user does not need to make an active action).
  • Please put the new process in it's own file (nf-core convention)

I would be curious to know from some Nextflow experts if they see any flaws in this in regards to cloud computing or something like that... will the containerOptions be able to correctly evaluate the variable and database_image stuff etc, as I'm not that familiar with that directive? @FriederikeHanssen @maxulysse are my go-tos (but maybe they can pass the question on to someone else)

@FriederikeHanssen
Copy link
Contributor

Can I get some context on why the containerOptions are necessary in the first place?

@prototaxites
Copy link
Contributor

Does this only work with Singularity? And if so, is that desirable/do we need guard rails to fail gently if running with e.g. Docker or Conda?

@muniheart
Copy link
Author

muniheart commented Apr 4, 2025

Can I get some context on why the containerOptions are necessary in the first place?

@FriederikeHanssen, according to gtdbtk.nf

        // The classifywf module expects a list of the _contents_ of the GTDB
        // database, not just the directory itself (I'm not sure why). But
        // for now we generate this list before putting into a channel,
        // then grouping again to pass to the module.
        // Then make up meta id to match expected channel cardinality for GTDBTK

To avoid the mess of staging files that are bind-mounted to a container ( see @jfy133 's comment: #779 ), this PR creates a mount directive for each database file at depth-2.

Sorry, @FriederikeHanssen, that was a non-justification, but rethinking it I realized that I don't need to mount each depth-2 path of the database. I only need to mount the depth-1 path. 32157ad

The containerOptions is required because the image file system has a depth-1 path, in my case /release220, that must be skipped for compatibility with GTDBTK_DB_PREPARATION and GTDBTK_CLASSIFYWF. This path, used in the mount option to specify the desired root directory of the image, is not known a priori. It is discovered by, GTDBTK_IMAGE_INSPECT and passed to GTDBTK_CLASSIFYWF where it is referenced in containerOptions.

@muniheart
Copy link
Author

@jfy133, I don't know how to update parameter documentation, but I have updated file nextflow_schema.json. Will the former file be updated automatically once the PR has been merged?

@muniheart
Copy link
Author

Does this only work with Singularity? And if so, is that desirable/do we need guard rails to fail gently if running with e.g. Docker or Conda?

@prototaxites, this should work with both Singularity and Apptainer. I will leave Docker as an exercise. ;-) I have added a check of container engine. Conda does not conflict, but it needs to be combined with a supported container engine.

It is sufficient to discover 1st level directory.
@jfy133
Copy link
Member

jfy133 commented Apr 7, 2025

@jfy133, I don't know how to update parameter documentation, but I have updated file nextflow_schema.json. Will the former file be updated automatically once the PR has been merged?

Yes, that's the file you need to update! Sorry, I always forget to explain about that 😅 .

You can use nf-core pipelines schema build to get a GUI to make this a bit easier :)

@jfy133
Copy link
Member

jfy133 commented Apr 7, 2025

@nf-core-bot fix linting

@FriederikeHanssen
Copy link
Contributor

I will leave Docker as an exercise.

I think we gotta make sure this works with docker since all but singularity and conda use the docker image and convert it to their respective input (apptainer, podman, charliecloud etc.) It could break stuff for a bunch of people.

Copy link
Contributor

@FriederikeHanssen FriederikeHanssen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check that Docker works

Copy link
Member

@jfy133 jfy133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The module changes should come via a module update, so this PR cannot proceed until this is done!

This is also sort of a good thing, as it will give the wider community a chance to give input whether they think this is a good system/idea.

EDIT: basically what Rike said 😅

@@ -8,7 +8,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### `Added`

- [#777](https://github.com/nf-core/mag/pull/777) - Improved input validation through additional JSON keywords and error messages (by @agusinac)

- [#785](https://github.com/nf-core/mag/pull/785) - Allow squash-fs image as GTDB-Tk database.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- [#785](https://github.com/nf-core/mag/pull/785) - Allow squash-fs image as GTDB-Tk database.
- [#785](https://github.com/nf-core/mag/pull/785) - Allow squash-fs image as input for GTDB-Tk database (by @muniheart )

Comment on lines 1 to 5
/* Mount the database image and inspect directory structure. Generate a container mount directive for each
* depth-2 file and directory. This is for compatability with GTDBTK_DB_PREPARATION which removes first path
* component during tar extraction.
*/
process GTDBTK_IMAGE_INSPECT {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/* Mount the database image and inspect directory structure. Generate a container mount directive for each
* depth-2 file and directory. This is for compatability with GTDBTK_DB_PREPARATION which removes first path
* component during tar extraction.
*/
process GTDBTK_IMAGE_INSPECT {
/*
* Mount the database image and inspect directory structure. Generate a container mount directive for each
* depth-2 file and directory. This is for compatibility with GTDBTK_DB_PREPARATION which removes first path
* component during tar extraction.
*/
process GTDBTK_IMAGE_INSPECT {

@@ -3,10 +3,12 @@ process GTDBTK_CLASSIFYWF {
label 'process_medium'
conda "${moduleDir}/environment.yml"
container "${workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 'https://depot.galaxyproject.org/singularity/gtdbtk:2.4.0--pyhdfd78af_1' : 'biocontainers/gtdbtk:2.4.0--pyhdfd78af_1'}"
containerOptions "$mountOpts"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you cannot modify an nf-core module within a pipeline like this.

you will need to first make a PR to nf-core/modules, get a review there, then we run in the pipeline nextflow modules update gtdbtk/classifywf to update the module here.

Note I'm already making a couple of updates to this module too!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General question, is it frowned upon to just patch a module with nf-core modules patch?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depends. Is it something that everyone that uses the module needs? -> update nf-core/modules. Is it a one of for some reason? -> patch it in the pipeline. Patching is definitely better than copying the module to a local one to get a tiny change in

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you cannot modify an nf-core module within a pipeline like this.

you will need to first make a PR to nf-core/modules, get a review there, then we run in the pipeline nextflow modules update gtdbtk/classifywf to update the module here.

Note I'm already making a couple of updates to this module too!

@jfy133, which branch of nf-core/modules should I fork? I don't see branch dev and nf-core docs doesn't specify branch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other thought - rather than modify the GTDB-Tk module, containerOptions can be set in conf/modules.conf:

process {
    withName: GTDBTK_WF {
        containerOptions = { ${ db.getExtension() == "sqfs" && workflow.containerEngine == 'singularity' ? "whatever_mount_options" : "" }
    }
}

Can easily be extended to accommodate all the containerisation solutions. Module might need a bit of finagling still to find the right location of the database but overall seems cleaner.


input:
tuple val(meta) , path("bins/*")
tuple val(db_name), path("database/*")
val mountOpts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
val mountOpts
val mount_opts

Should be snake case 1

Footnotes

  1. https://nf-co.re/docs/guidelines/components/modules#naming-conventions

Don't modify nf-core module.
The squash-fs image built from GTDB-Tk database must have top-level directory removed.  This obviates the need for image inspection.  Mount options for the squash-fs image are passed to GTDBTK_CLASSIFYWF in a configuration file.
@muniheart
Copy link
Author

I have closed #8230 in favor of setting mount options in configuration file as described here:

process {
        withName: GTDBTK_CLASSIFYWF {
                containerOptions = "-B $params.gtdb_db:\$NXF_TASK_WORKDIR/database:image-src=<image_root>"
        }
}

where <image_root> can be discovered with squashfs-tools using,

 unsquashfs -d'' -l -max-depth 1 image.squashfs

@prototaxites
Copy link
Contributor

I have one other thought about this, which is does this work with the updated GTDB-Tk module inputs in the PR from @jfy133 here? The new PR doesn't hard-code the database path to ${workdir}/database so it might need to be finagled to mount the squashfs at the right location in the container for the module to pass the directory path.

@jfy133
Copy link
Member

jfy133 commented Apr 9, 2025

@muniheart This PR looks OK code wise to me I think as it's just an if/else statement now, but please add more documentation with examples of how to use and configure the pipeline at the bottom of usage.md (e.g. under a section called 'A note on using squashfs image input for GTDB' or something

@muniheart
Copy link
Author

Thank you, @jfy133. I have updated usage.md.

@prototaxites
Copy link
Contributor

Can you merge in the recent changes from dev and check that this is still working for you?

@muniheart
Copy link
Author

muniheart commented Apr 9, 2025

Thank you, @jfy133. I have updated

Can you merge in the recent changes from dev and check that this is still working for you?

@jfy133 & @prototaxites, it appears that #788 has broken compatibility with squash-fs image. Input db is now required to be an absolute path, outside the file-system of the image.

@muniheart
Copy link
Author

muniheart commented Apr 9, 2025

I will leave Docker as an exercise.

I think we gotta make sure this works with docker since all but singularity and conda use the docker image and convert it to their respective input (apptainer, podman, charliecloud etc.) It could break stuff for a bunch of people.

@FriederikeHanssen this PR does not have a conflict with docker base containers images. As I see it, docker engine lacks the flexibility of singularity/apptainer in mounting file-system images. Maybe someone more familiar with docker engine will show us how to build a docker volume as a file-system image from the GTDB-Tk database and mount it through containerOptions.

@muniheart
Copy link
Author

@jfy133 & @prototaxites I don't see any way to proceed with this PR, given #788. Any suggestions?

@jfy133
Copy link
Member

jfy133 commented Apr 11, 2025

Sorry about the intermediate update to the module @muniheart :( the old method was not working properly for some people either (and was not really a good 'nextflowy' way of doing things in the first place :( ), so as it was bug I had to priorities that.

If I understand correctly, your solution should still work - you just need to change the name of the /database part of the mounting options in the custom config - no?

@jfy133
Copy link
Member

jfy133 commented Apr 11, 2025

Thank you, @jfy133. I have updated

Can you merge in the recent changes from dev and check that this is still working for you?

@jfy133 & @prototaxites, it appears that #788 has broken compatibility with squash-fs image. Input db is now required to be an absolute path, outside the file-system of the image.

AH I just saw this...

I think you can still refer to the input db in the mounting options as seen here with antismash_dir : https://github.com/nf-core/modules/blob/8b06d86f6a82b6203f239ad409f606fdf71ec697/modules/nf-core/antismash/antismashlite/main.nf#L10C1-L17C1

Unless I misunderstand?
(u

@jfy133
Copy link
Member

jfy133 commented Apr 11, 2025

@nf-core-bot fix linting

@prototaxites
Copy link
Contributor

Hi all, have been musing about how this can still be added - there are two ways I can see this working, which may or may not work from a technical perspective.

There is one assumption I'm making:

  • We can't pass a direct path to some location mounted in the container as the GTDB database as the Nextflow path() input specifier will check if it exists before the process begins

So the options are:

  1. Using some groovy code, create an empty directory with a known name somewhere in the work directory, and pass this as the database input. Then mount the Squashfs file on top of that. This only works if Singularity will let you layer a Squashfs file over a symlinked directory - I don't know if this is the case. Or potentially if you have a known path you're creating in the work directory, you could create the directory, mount the Squashfs there, and then pass that to the process - this might get around the file exists check as we have explicitly created the directory locally.

  2. Use the above suggested by @jfy133. It will require a change to the GTDB-Tk module such that if the database is not provided, the env var pointing to the path location is unset, and thus GTDB-Tk looks in its install directory. You mount the Squashfs file to point where it should be in the install directory. This requires that you know where that install directory is, which might change by container/conda env/Python version - so it's a moving target for the local configuration.

@muniheart
Copy link
Author

Hi @jfy133 & @prototaxites:

There is one assumption I'm making:

* We can't pass a direct path to some location mounted in the container as the GTDB database as the Nextflow `path()` input specifier will check if it exists before the process begins

That is my observation.

So the options are:

1. Using some groovy code, create an empty directory with a known name somewhere in the work directory, and pass this as the database input. Then mount the Squashfs file on top of that. This only works if Singularity will let you layer a Squashfs file over a symlinked directory - I don't know if this is the case. Or potentially if you have a known path you're creating in the work directory, you could create the directory, mount the Squashfs there, and then pass that to the process - this might get around the file exists check as we have explicitly created the directory locally.

I have tried this and I observe that, inside the container's shell, I see the contents of the src directory, not the contents of the image fs. Doesn't matter if the src directory is empty.

2. Use the above suggested by @jfy133.

The fundamental difference between the antimash code and my case is that antimash references the mount-source with an absolute path outside the container.

    It will require a change to the GTDB-Tk module such that if the database is not provided, the env var pointing to the path location is unset, and thus GTDB-Tk looks in its install directory. You mount the Squashfs file to point where it should be in the install directory. This requires that you know where that install directory is, which might change by container/conda env/Python version - so it's a moving target for the local configuration.

@jfy133, would you be willing to modify the module such that, in the case db is null, the module would use at least one of:

  • the value of an env variable, $GTDBTK_DATA_PATH if you like, that could be passed in containerOptions

    $ containerOptions "--env GTDB_DB_PATH=database -B ${params.gtdb_db}:\$NXF_TASK_WORKDIR/database:image-src=/"

  • a default value, e.g. ./database

as database location?

@prototaxites
Copy link
Contributor

Spare thought: could we modify the process so that the input format for the db is an env() with name GTDBTK_DATA_PATH? That avoids the file exists check entirely and simplifies the module by removing the find line to set the required environment variable.

https://www.nextflow.io/docs/latest/process.html#input-environment-variables-env

@muniheart
Copy link
Author

Spare thought: could we modify the process so that the input format for the db is an env() with name GTDBTK_DATA_PATH? That avoids the file exists check entirely and simplifies the module by removing the find line to set the required environment variable.

https://www.nextflow.io/docs/latest/process.html#input-environment-variables-env

@prototaxites, the find is required because the database has a top-level directory, e.g. /release_220, that GTDB-Tk can't handle, so removing the find would place the burden of discovering it's name on the user.

@jfy133
Copy link
Member

jfy133 commented Apr 13, 2025

Hi @jfy133 & @prototaxites:

There is one assumption I'm making:

* We can't pass a direct path to some location mounted in the container as the GTDB database as the Nextflow `path()` input specifier will check if it exists before the process begins

That is my observation.

So the options are:

1. Using some groovy code, create an empty directory with a known name somewhere in the work directory, and pass this as the database input. Then mount the Squashfs file on top of that. This only works if Singularity will let you layer a Squashfs file over a symlinked directory - I don't know if this is the case. Or potentially if you have a known path you're creating in the work directory, you could create the directory, mount the Squashfs there, and then pass that to the process - this might get around the file exists check as we have explicitly created the directory locally.

I have tried this and I observe that, inside the container's shell, I see the contents of the src directory, not the contents of the image fs. Doesn't matter if the src directory is empty.

2. Use the above suggested by @jfy133.

The fundamental difference between the antimash code and my case is that antimash references the mount-source with an absolute path outside the container.

    It will require a change to the GTDB-Tk module such that if the database is not provided, the env var pointing to the path location is unset, and thus GTDB-Tk looks in its install directory. You mount the Squashfs file to point where it should be in the install directory. This requires that you know where that install directory is, which might change by container/conda env/Python version - so it's a moving target for the local configuration.

@jfy133, would you be willing to modify the module such that, in the case db is null, the module would use at least one of:

  • the value of an env variable, $GTDBTK_DATA_PATH if you like, that could be passed in containerOptions

    $ containerOptions "--env GTDB_DB_PATH=database -B ${params.gtdb_db}:\$NXF_TASK_WORKDIR/database:image-src=/"

  • a default value, e.g. ./database

as database location?

I think the ENV method is the less 'invasive' method (assuming that it works). The only problem is we should make it very clear that it is only for special cases - it is not recommended for general use using that method, as then you don't stage the database in the work dir as Nextflow wants you too. I have a feeling this may break resume when using that system.

But this way the only module code change is indeed just putting and if statement that if the path(dB) is true, then run find, otherwise we don't set the variable in the module - and the user is expected to set the ENV in their module config and ensure that that path is present on all of their worker nodes etc. Optional input is straight forward with [] so no modification required there in the module as you say.

The containerOptions hack for antiSMASH is extremely dirty so like others have said - I really don't recommend setting those in the module itself.

@jfy133
Copy link
Member

jfy133 commented Apr 13, 2025

Spare thought: could we modify the process so that the input format for the db is an env() with name GTDBTK_DATA_PATH? That avoids the file exists check entirely and simplifies the module by removing the find line to set the required environment variable.

https://www.nextflow.io/docs/latest/process.html#input-environment-variables-env

I had to add the find precisely because people were having problems specifying the correct path :/

@prototaxites
Copy link
Contributor

Spare thought: could we modify the process so that the input format for the db is an env() with name GTDBTK_DATA_PATH? That avoids the file exists check entirely and simplifies the module by removing the find line to set the required environment variable.
https://www.nextflow.io/docs/latest/process.html#input-environment-variables-env

I had to add the find precisely because people were having problems specifying the correct path :/

Hypothetically, that could be moved to the GTDB db channel preparation step inside a map operator - db.listFiles().find { it.name == “metadata” }.getParent() or similar?

@jfy133
Copy link
Member

jfy133 commented Apr 13, 2025

Spare thought: could we modify the process so that the input format for the db is an env() with name GTDBTK_DATA_PATH? That avoids the file exists check entirely and simplifies the module by removing the find line to set the required environment variable.
https://www.nextflow.io/docs/latest/process.html#input-environment-variables-env

I had to add the find precisely because people were having problems specifying the correct path :/

Hypothetically, that could be moved to the GTDB db channel preparation step inside a map operator - db.listFiles().find { it.name == “metadata” }.getParent() or similar?

Yes true!

But we are getting further and further away from how most modules work, which sort of defeats the point of standardisation...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants