Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

containerized code only for remote mpi mapping hpc type #5667

Conversation

unkcpz
Copy link
Member

@unkcpz unkcpz commented Sep 26, 2022

Fixes #5627

This is the implementation separated from #5507.

@unkcpz unkcpz force-pushed the design/new/containerized-code-only-for-remote-mpi-mapping-hpc-type branch 7 times, most recently from 99ba4c0 to cdb81de Compare September 27, 2022 12:33
@unkcpz unkcpz requested a review from sphuber September 27, 2022 12:35
@unkcpz unkcpz force-pushed the design/new/containerized-code-only-for-remote-mpi-mapping-hpc-type branch from 13d2d0c to c81d416 Compare September 28, 2022 09:37
filepath_executable: doubler.py
filepath_files: PLACEHOLDER_PORTABLE_PATH_DOUBLER
image: docker://clearlinux/numpy-mp
engine_command: singularity exec --bind $PWD:$PWD --env APPEND_PATH=$PWD {image}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--env APPEND_PATH=$PWD is needed otherwise doubler.py will not be in the PATH.

default_calc_job_plugin: core.templatereplacer
filepath_executable: doubler.py
filepath_files: PLACEHOLDER_PORTABLE_PATH_DOUBLER
image: docker://clearlinux/numpy-mp
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During the nightly test, the image is truly fetched to the CI test server.

@unkcpz unkcpz force-pushed the design/new/containerized-code-only-for-remote-mpi-mapping-hpc-type branch 2 times, most recently from e16c467 to 1cd1c64 Compare September 28, 2022 13:24
@unkcpz
Copy link
Member Author

unkcpz commented Sep 28, 2022

Fix the nightly test for portable containerized code, I find it can be very useful.
I think we are very close to finishing this. @sphuber can you have a look at this and then I'll merge and update the document from #5507

@unkcpz
Copy link
Member Author

unkcpz commented Sep 28, 2022

Just keep as a note here, while I prepare the nightly test for portable containerized code, a python script using numpy to do the operation. I find it might be useful if the use case can be more generalized, where the current portable containerized code requires setting the script and the executable and it binds to the code entity. There can be a way that the container provides the environment and the executable is python, the script can be then passed as input.

So it is basically an environment-specific running code but needs some general CalcJob to support.

@unkcpz unkcpz force-pushed the design/new/containerized-code-only-for-remote-mpi-mapping-hpc-type branch 2 times, most recently from eba7e14 to 826848c Compare September 28, 2022 14:34
Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @unkcpz . Looks all good to me, I would just really change the term image to image_name since it really just references the image name. Can you fix that and then add the docs? Then I can go through those as well

aiida/orm/nodes/data/code/abstract.py Outdated Show resolved Hide resolved
aiida/orm/nodes/data/code/containerized.py Outdated Show resolved Hide resolved
aiida/orm/nodes/data/code/containerized.py Outdated Show resolved Hide resolved
aiida/orm/nodes/data/code/containerized.py Outdated Show resolved Hide resolved
aiida/orm/nodes/data/code/containerized.py Outdated Show resolved Hide resolved
@unkcpz
Copy link
Member Author

unkcpz commented Oct 4, 2022

@sphuber thanks a lot for the review.

I would just really change the term image to image_name since it really just references the image name.

I change all the "image" to "image_name" including the template of the engine_command. The other reason I didn't change it was I thought "image" is more general to include the exact image name and also other formats such as the path to .sif file the URL of dockerhub link etc. But I think the image name is also very general and good to be used here.

Can you fix that and then add the docs?

Added.

@unkcpz unkcpz requested a review from sphuber October 4, 2022 15:41
@unkcpz unkcpz force-pushed the design/new/containerized-code-only-for-remote-mpi-mapping-hpc-type branch from 7de14a5 to 3e6bdfa Compare October 4, 2022 15:48
@unkcpz
Copy link
Member Author

unkcpz commented Oct 13, 2022

@sphuber @ltalirz, can you find time to have a look at the documentation I added?

Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @unkcpz . I gave it a first pass

docs/source/howto/run_containerized_codes.rst Outdated Show resolved Hide resolved
docs/source/howto/run_containerized_codes.rst Outdated Show resolved Hide resolved
docs/source/howto/run_containerized_codes.rst Outdated Show resolved Hide resolved
docs/source/howto/run_containerized_codes.rst Outdated Show resolved Hide resolved
docs/source/howto/run_containerized_codes.rst Outdated Show resolved Hide resolved
docs/source/howto/run_containerized_codes.rst Outdated Show resolved Hide resolved
docs/source/howto/run_containerized_codes.rst Outdated Show resolved Hide resolved
docs/source/howto/run_containerized_codes.rst Outdated Show resolved Hide resolved
docs/source/howto/run_containerized_codes.rst Outdated Show resolved Hide resolved
docs/source/howto/run_containerized_codes.rst Outdated Show resolved Hide resolved
Copy link
Member Author

@unkcpz unkcpz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sphuber thanks very much for the review, the phrase really improved a lot, way more formal than the draft version. I also adapt some by adding more details.
I'll trigger the second pass when doc tests are all passed.

docs/source/howto/run_containerized_codes.rst Outdated Show resolved Hide resolved
docs/source/howto/run_containerized_codes.rst Outdated Show resolved Hide resolved
docs/source/howto/run_containerized_codes.rst Outdated Show resolved Hide resolved
label: numpy-double-portable
description: "containerized portable code of numpy multiply by 2"
default_calc_job_plugin: core.templatereplacer
filepath_executable: doubler.py
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, that's a good point. I'll add it.


Installed containerized code
============================

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, added. But the Markdown table is not well indented, do I need to do it by hand or is there an online tool for it?

@unkcpz unkcpz force-pushed the design/new/containerized-code-only-for-remote-mpi-mapping-hpc-type branch from d3e4e40 to 79c5205 Compare October 14, 2022 13:55
@unkcpz unkcpz requested a review from sphuber October 14, 2022 14:28
Copy link
Member

@ltalirz ltalirz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @unkcpz !

After discussing with @unkcpz in some detail, I believe we both came to the conclusion that there is currently no strong use case for the "portable" containerized code.
If the use case is to have a handful of static files and executables, that is easily done by updating the container image.
If it is about a more dynamic generation of different scripts, then setting up a separate code for each new version of the scripts is not the solution (and this use case is already covered by aiida-dynamic-workflows).

We therefore agree that the "portable" version of the containerized code can be deleted, reducing both the maintenance burden on our side, as well as making terminology easier to understand for users [1].

@sphuber What do you think?

[1] E.g., for containerized codes, "installed" can easily be misunderstood as meaning that the container image is installed on the computer.
However, the distinction of whether the image is present or not is actually not made at all in this implementation (which is fine to me).
In this sense, both "installed" and "portable" containerized codes are in fact equally portable - both of them can run on any computer that has the required image & container runtime.

@@ -0,0 +1,160 @@
.. _how-to:run-containerized_codes:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, running a containerized code should be so similar to running a regular code that we don't need a separate how-to for it.

As a first step, this separate section is ok, but in my view eventually this should be condensed drastically and integrated into run_codes.

docs/source/howto/run_containerized_codes.rst Outdated Show resolved Hide resolved
docs/source/howto/run_containerized_codes.rst Outdated Show resolved Hide resolved
Installed containerized code
============================

The :class:`aiida.orm.nodes.data.code.containerized.InstalledContainerizedCode` data plugin allows running a code inside a container that is available on a computer that is configured in AiiDA through the :class:`~aiida.orm.computers.Computer` class.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will need a brief explanation of what "available" means here

  • Image URI for a container registry that is accessible from the computer
  • Image file on the file system of the computer

docs/source/howto/run_containerized_codes.rst Outdated Show resolved Hide resolved
Note that this example requires Singularity to be installed on the target computer and that the ``filepath_executable`` is already part of the container image.

.. note::
The first time that the containerized code is run, the image will be automatically pulled from Docker Hub, which may takes a while if the image is large.
Copy link
Member

@ltalirz ltalirz Oct 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this may seem trivial, this (that the image will be pulled at first use) is important information and needs to be mentioned early

@sphuber
Copy link
Contributor

sphuber commented Oct 14, 2022

Thanks a lot @unkcpz !

After discussing with @unkcpz in some detail, I believe we both came to the conclusion that there is currently no strong use case for the "portable" containerized code. If the use case is to have a handful of static files and executables, that is easily done by updating the container image. If it is about a more dynamic generation of different scripts, then setting up a separate code for each new version of the scripts is not the solution (and this use case is already covered by aiida-dynamic-workflows).

We therefore agree that the "portable" version of the containerized code can be deleted, reducing both the maintenance burden on our side, as well as making terminology easier to understand for users [1].

@sphuber What do you think?

[1] E.g., for containerized codes, "installed" can easily be misunderstood as meaning that the container image is installed on the computer. However, the distinction of whether the image is present or not is actually not made at all in this implementation (which is fine to me). In this sense, both "installed" and "portable" containerized codes are in fact equally portable - both of them can run on any computer that has the required image & container runtime.

To be honest, I haven't really had the chance to experiment with this myself and I haven't really worked a lot with containers myself, so I didn't critically review the split in that much detail. I assumed that it apparently made sense and that these were two different use-cases. But as you explain it now, it makes perfect sense to only support the single form for now. We can always add on later if it turns out this really is useful. So all good for me to reduce the complexity 👍

@unkcpz
Copy link
Member Author

unkcpz commented Oct 17, 2022

@ltalirz thanks! I am so desperate to admit the portable containerized code can be removed. I really spend time in bring up the fake use cases. But you are right, the aiida-dynamic-workflow with the containerized installed code is the correct way for what I thought to realize.

I remove the portable container code and update the doc upon your review.

@unkcpz unkcpz force-pushed the design/new/containerized-code-only-for-remote-mpi-mapping-hpc-type branch 2 times, most recently from 6083972 to f85ae37 Compare October 17, 2022 17:03
@unkcpz unkcpz requested a review from ltalirz October 18, 2022 06:59
@sphuber sphuber force-pushed the design/new/containerized-code-only-for-remote-mpi-mapping-hpc-type branch 3 times, most recently from 9888994 to 19c38c1 Compare October 18, 2022 11:43
This implementation of the `AbstractCode` interface allows running a
`CalcJob` within a container on a target computer. The data plugin
stores the name of the container image, the executable within that
container, the command with which the container should be launched and
the `Computer` on which the container can be run.
@sphuber sphuber force-pushed the design/new/containerized-code-only-for-remote-mpi-mapping-hpc-type branch from 19c38c1 to d620d3c Compare October 18, 2022 11:52
@sphuber
Copy link
Contributor

sphuber commented Oct 18, 2022

@ltalirz I have discussed a final time with @unkcpz and we think that the current implementation is good enough to be merged. We just made one final simplification and named the plugin simply ContainerizedCode with entry point core.code.containerized (there was no need to keep the Installed infix).

I have also taken your suggestions for the docs into account and have simplified the instructions. It is true that the process of setting up the code and using it, is no different from the other code types (which was the whole point of the design) so it should be integrated neatly into the "How-to run a code" section. Nevertheless, the setup will require some more detailed information specific to containerized codes, such as what technologies are supported and how to define the engine_command settings for each of those. I have currently added those to the ContainerizedCode topic.

I was planning to add a reference in the "How to run a code" section, but I realized that that documentation still refers to the deprecated code setup verdi code setup. I will submit a PR soon after this, to update that section where I will then also seamlessly integrate the containerized option. So I think for now the docs in this PR are good enough. Let me know what you think @unkcpz

@sphuber
Copy link
Contributor

sphuber commented Oct 18, 2022

Note that at some point, we will have to update the topic section with more detailed information for each of the supported containerization technologies, such as examples of how to define the engine_command. For now, we just provide the example based on Singularity

@ltalirz
Copy link
Member

ltalirz commented Oct 18, 2022

Cheers, I'll give a quick last check now then.

named the plugin simply ContainerizedCode with entry point core.code.containerized

I would have suggested exactly the same

sphuber
sphuber previously approved these changes Oct 18, 2022
Copy link
Member

@ltalirz ltalirz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks a lot @unkcpz and @sphuber

The PR is looking good; here are some final comments from my side

aiida/orm/nodes/data/code/containerized.py Outdated Show resolved Hide resolved
aiida/orm/nodes/data/code/containerized.py Outdated Show resolved Hide resolved
aiida/orm/nodes/data/code/containerized.py Outdated Show resolved Hide resolved
docs/source/howto/run_containerized_codes.rst Outdated Show resolved Hide resolved
docs/source/howto/run_containerized_codes.rst Outdated Show resolved Hide resolved

computer = load_computer('some-computer')
computer.set_use_double_quotes(True)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is worth mentioning somewhere here how the implementation interacts with the withmpi option, ideally with an example.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@unkcpz any idea what information would be relevant here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is to give an example of how the command line for an MPI job could look like, which illustrates that the container runtime itself is called through MPI.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sphuber can you add a test for the withmpi=True, I think it is a critical test.

I'll address other comments after the group meeting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added the test and added an example of the runline that is generated when MPI is enabled.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks Seb, I think your explanation is very clear.


* `Singularity <https://singularity-docs.readthedocs.io/en/latest/>`__
* `Sarus <https://sarus.readthedocs.io/en/stable/>`__
* `NVIDIA/enroot <https://github.com/NVIDIA/enroot/>`__
Copy link
Member

@ltalirz ltalirz Oct 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on second thought, vanilla enroot may fall into the docker category after all (not supported), since you would need to call mpirun inside the container...

are there any plans to support such scenarios?

P.S. It is actually already possible to use enroot containers with AiiDA via the pyxis plugin for slurm, which adds some extra slurm pragmas like

#SBATCH --container-image ...
#SBATCH --container-mounts

One can then tell these to AiiDa as extra scheduler parameters.
It's not great from a provenance perspective, though, since the mounts, image, etc. are all recorded in a slurm-specific way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there any plans to support such scenarios?

I don't know. Since we just spent a lot of time to reduce the scope of this PR, I would be wary of increasing it again on a whim. Should we maybe move enroot to the category of unsupported/untested cases and work on this in a later PR when it has been properly tested?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the mention of enroot for now until we figure out if and how it can be supported.

Supported container technologies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``ContainerizedCode`` is compatible with a variety of containerization technologies:
Copy link
Member

@ltalirz ltalirz Oct 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to make this subsection most useful, please include for each engine an example of what to put into the engine_command field.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, and I already put this in my comment. I didn't add these examples since they weren't there in the original PR and I have no idea what they should be. I guess if @unkcpz already has them, we can quickly add them. But if that requires more work, maybe this can be done at a later stage.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add the example for Sarus which I tested.

Copy link
Member Author

@unkcpz unkcpz Oct 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I add the command to line below in new commit.

docs/source/topics/data_types.rst Show resolved Hide resolved
* `NVIDIA/enroot <https://github.com/NVIDIA/enroot/>`__
* `Shifter <https://github.com/NERSC/shifter>`__

Using `Docker <https://www.docker.com/>`__ directly is currently not supported, since Docker has to run with root permissions and it cannot be launched as a normal MPI program to propagate execution context to the container application.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the MPI part, but I don't understand what the permissions for running docker has to do with AiiDA compatibility?

Either the user has the rights to run docker or they don't - I don't think this is relevant for the integration.

Suggested change
Using `Docker <https://www.docker.com/>`__ directly is currently not supported, since Docker has to run with root permissions and it cannot be launched as a normal MPI program to propagate execution context to the container application.
Using the `Docker <https://www.docker.com/>`__ container runtime is currently not supported, since Docker cannot be launched as a MPI program to propagate execution context to the container application.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@unkcpz fine to remove the mention of the root permissions?

Copy link
Member Author

@unkcpz unkcpz Oct 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the MPI part, but I don't understand what the permissions for running docker has to do with AiiDA compatibility?

This is for the file created inside the container and file generated inside the container. If in the Dockerfile no USER is specified, the container with run with the UID=0 the root user. The file it created can not be cleaned up by clean workdir for example.
I'll try rephrase it to make it more clear.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the time being, I have rephrased it as


Using `Docker <https://www.docker.com/>`__ directly is currently not supported because:

* Docker needs to be run as the root user and the files created in the working directory inside the container will be owned by root, which prevents AiiDA from deleting those files after execution.
* Docker cannot be launched as a normal MPI program to propagate execution context to the container application.

Support may be added at a later time.

is that correct and intelligible?

Copy link
Member Author

@unkcpz unkcpz Oct 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is the default case I'd say but there is workaround and not so exact. Maybe better use the phrase from their official documentation is better?

"The Docker daemon always runs as the root user."

So I'd write as:

The Docker daemon always runs as the root user and the files created in the working directory inside the container will usually be owned by root if uid is not specified in the image, which prevents AiiDA from deleting those files after execution.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I modify it in new commit.

@sphuber
Copy link
Contributor

sphuber commented Oct 18, 2022

Ok, great I think this should be ready for a final review then @ltalirz I think we addressed all your final comments.

@sphuber sphuber requested review from ltalirz and sphuber October 18, 2022 15:40
Copy link
Member

@ltalirz ltalirz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's go!


computer = load_computer('some-computer')
computer.set_use_double_quotes(True)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks Seb, I think your explanation is very clear.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Containerized codes
3 participants