containerized code only for remote mpi mapping hpc type #5667

unkcpz · 2022-09-26T16:56:31Z

This is the implementation separated from #5507.

The singularity nightly test is added and passed.
Also test for the portable type containerized code I have it implemented but better to be merged here after Templatereplacer parser make more general #5669

unkcpz · 2022-09-28T09:39:19Z

.github/config/doubler-singularity-portable.yaml

+filepath_executable: doubler.py
+filepath_files: PLACEHOLDER_PORTABLE_PATH_DOUBLER
+image: docker://clearlinux/numpy-mp
+engine_command: singularity exec --bind $PWD:$PWD --env APPEND_PATH=$PWD {image}


--env APPEND_PATH=$PWD is needed otherwise doubler.py will not be in the PATH.

unkcpz · 2022-09-28T09:40:02Z

.github/config/doubler-singularity-portable.yaml

+default_calc_job_plugin: core.templatereplacer
+filepath_executable: doubler.py
+filepath_files: PLACEHOLDER_PORTABLE_PATH_DOUBLER
+image: docker://clearlinux/numpy-mp


During the nightly test, the image is truly fetched to the CI test server.

unkcpz · 2022-09-28T13:28:06Z

Fix the nightly test for portable containerized code, I find it can be very useful.
I think we are very close to finishing this. @sphuber can you have a look at this and then I'll merge and update the document from #5507

unkcpz · 2022-09-28T13:35:53Z

Just keep as a note here, while I prepare the nightly test for portable containerized code, a python script using numpy to do the operation. I find it might be useful if the use case can be more generalized, where the current portable containerized code requires setting the script and the executable and it binds to the code entity. There can be a way that the container provides the environment and the executable is python, the script can be then passed as input.

So it is basically an environment-specific running code but needs some general CalcJob to support.

sphuber

Thanks @unkcpz . Looks all good to me, I would just really change the term image to image_name since it really just references the image name. Can you fix that and then add the docs? Then I can go through those as well

aiida/orm/nodes/data/code/abstract.py

aiida/orm/nodes/data/code/containerized.py

unkcpz · 2022-10-04T15:41:18Z

@sphuber thanks a lot for the review.

I would just really change the term image to image_name since it really just references the image name.

I change all the "image" to "image_name" including the template of the engine_command. The other reason I didn't change it was I thought "image" is more general to include the exact image name and also other formats such as the path to .sif file the URL of dockerhub link etc. But I think the image name is also very general and good to be used here.

Can you fix that and then add the docs?

Added.

unkcpz · 2022-10-13T10:32:28Z

@sphuber @ltalirz, can you find time to have a look at the documentation I added?

sphuber

Thanks @unkcpz . I gave it a first pass

docs/source/howto/run_containerized_codes.rst

unkcpz

@sphuber thanks very much for the review, the phrase really improved a lot, way more formal than the draft version. I also adapt some by adding more details.
I'll trigger the second pass when doc tests are all passed.

docs/source/howto/run_containerized_codes.rst

unkcpz · 2022-10-14T12:52:39Z

docs/source/howto/run_containerized_codes.rst

+        label: numpy-double-portable
+        description: "containerized portable code of numpy multiply by 2"
+        default_calc_job_plugin: core.templatereplacer
+        filepath_executable: doubler.py


True, that's a good point. I'll add it.

unkcpz · 2022-10-14T13:35:05Z

docs/source/howto/run_containerized_codes.rst

+
+Installed containerized code
+============================
+


True, added. But the Markdown table is not well indented, do I need to do it by hand or is there an online tool for it?

ltalirz

Thanks a lot @unkcpz !

After discussing with @unkcpz in some detail, I believe we both came to the conclusion that there is currently no strong use case for the "portable" containerized code.
If the use case is to have a handful of static files and executables, that is easily done by updating the container image.
If it is about a more dynamic generation of different scripts, then setting up a separate code for each new version of the scripts is not the solution (and this use case is already covered by aiida-dynamic-workflows).

We therefore agree that the "portable" version of the containerized code can be deleted, reducing both the maintenance burden on our side, as well as making terminology easier to understand for users [1].

@sphuber What do you think?

[1] E.g., for containerized codes, "installed" can easily be misunderstood as meaning that the container image is installed on the computer.
However, the distinction of whether the image is present or not is actually not made at all in this implementation (which is fine to me).
In this sense, both "installed" and "portable" containerized codes are in fact equally portable - both of them can run on any computer that has the required image & container runtime.

ltalirz · 2022-10-14T14:42:52Z

docs/source/howto/run_containerized_codes.rst

@@ -0,0 +1,160 @@
+.. _how-to:run-containerized_codes:


Ideally, running a containerized code should be so similar to running a regular code that we don't need a separate how-to for it.

As a first step, this separate section is ok, but in my view eventually this should be condensed drastically and integrated into run_codes.

docs/source/howto/run_containerized_codes.rst

ltalirz · 2022-10-14T16:07:00Z

docs/source/howto/run_containerized_codes.rst

+Installed containerized code
+============================
+
+The :class:`aiida.orm.nodes.data.code.containerized.InstalledContainerizedCode` data plugin allows running a code inside a container that is available on a computer that is configured in AiiDA through the :class:`~aiida.orm.computers.Computer` class.


We will need a brief explanation of what "available" means here

Image URI for a container registry that is accessible from the computer

Image file on the file system of the computer

docs/source/howto/run_containerized_codes.rst

ltalirz · 2022-10-14T16:12:16Z

docs/source/howto/run_containerized_codes.rst

+Note that this example requires Singularity to be installed on the target computer and that the ``filepath_executable`` is already part of the container image.
+
+.. note::
+    The first time that the containerized code is run, the image will be automatically pulled from Docker Hub, which may takes a while if the image is large.


Although this may seem trivial, this (that the image will be pulled at first use) is important information and needs to be mentioned early

sphuber · 2022-10-14T20:08:12Z

Thanks a lot @unkcpz !

After discussing with @unkcpz in some detail, I believe we both came to the conclusion that there is currently no strong use case for the "portable" containerized code. If the use case is to have a handful of static files and executables, that is easily done by updating the container image. If it is about a more dynamic generation of different scripts, then setting up a separate code for each new version of the scripts is not the solution (and this use case is already covered by aiida-dynamic-workflows).

We therefore agree that the "portable" version of the containerized code can be deleted, reducing both the maintenance burden on our side, as well as making terminology easier to understand for users [1].

@sphuber What do you think?

[1] E.g., for containerized codes, "installed" can easily be misunderstood as meaning that the container image is installed on the computer. However, the distinction of whether the image is present or not is actually not made at all in this implementation (which is fine to me). In this sense, both "installed" and "portable" containerized codes are in fact equally portable - both of them can run on any computer that has the required image & container runtime.

To be honest, I haven't really had the chance to experiment with this myself and I haven't really worked a lot with containers myself, so I didn't critically review the split in that much detail. I assumed that it apparently made sense and that these were two different use-cases. But as you explain it now, it makes perfect sense to only support the single form for now. We can always add on later if it turns out this really is useful. So all good for me to reduce the complexity 👍

unkcpz · 2022-10-17T17:01:13Z

@ltalirz thanks! I am so desperate to admit the portable containerized code can be removed. I really spend time in bring up the fake use cases. But you are right, the aiida-dynamic-workflow with the containerized installed code is the correct way for what I thought to realize.

I remove the portable container code and update the doc upon your review.

This implementation of the `AbstractCode` interface allows running a `CalcJob` within a container on a target computer. The data plugin stores the name of the container image, the executable within that container, the command with which the container should be launched and the `Computer` on which the container can be run.

sphuber · 2022-10-18T12:01:17Z

@ltalirz I have discussed a final time with @unkcpz and we think that the current implementation is good enough to be merged. We just made one final simplification and named the plugin simply ContainerizedCode with entry point core.code.containerized (there was no need to keep the Installed infix).

I have also taken your suggestions for the docs into account and have simplified the instructions. It is true that the process of setting up the code and using it, is no different from the other code types (which was the whole point of the design) so it should be integrated neatly into the "How-to run a code" section. Nevertheless, the setup will require some more detailed information specific to containerized codes, such as what technologies are supported and how to define the engine_command settings for each of those. I have currently added those to the ContainerizedCode topic.

I was planning to add a reference in the "How to run a code" section, but I realized that that documentation still refers to the deprecated code setup verdi code setup. I will submit a PR soon after this, to update that section where I will then also seamlessly integrate the containerized option. So I think for now the docs in this PR are good enough. Let me know what you think @unkcpz

sphuber · 2022-10-18T12:03:04Z

Note that at some point, we will have to update the topic section with more detailed information for each of the supported containerization technologies, such as examples of how to define the engine_command. For now, we just provide the example based on Singularity

ltalirz · 2022-10-18T12:41:34Z

Cheers, I'll give a quick last check now then.

named the plugin simply ContainerizedCode with entry point core.code.containerized

I would have suggested exactly the same

ltalirz

thanks a lot @unkcpz and @sphuber

The PR is looking good; here are some final comments from my side

aiida/orm/nodes/data/code/containerized.py

docs/source/howto/run_containerized_codes.rst

ltalirz · 2022-10-18T12:51:47Z

docs/source/topics/data_types.rst

+
+        computer = load_computer('some-computer')
+        computer.set_use_double_quotes(True)
+


I think it is worth mentioning somewhere here how the implementation interacts with the withmpi option, ideally with an example.

@unkcpz any idea what information would be relevant here?

What I mean is to give an example of how the command line for an MPI job could look like, which illustrates that the container runtime itself is called through MPI.

@sphuber can you add a test for the withmpi=True, I think it is a critical test.

I'll address other comments after the group meeting.

I have added the test and added an example of the runline that is generated when MPI is enabled.

thanks Seb, I think your explanation is very clear.

ltalirz · 2022-10-18T13:00:49Z

docs/source/topics/data_types.rst

+
+* `Singularity <https://singularity-docs.readthedocs.io/en/latest/>`__
+* `Sarus <https://sarus.readthedocs.io/en/stable/>`__
+* `NVIDIA/enroot <https://github.com/NVIDIA/enroot/>`__


on second thought, vanilla enroot may fall into the docker category after all (not supported), since you would need to call mpirun inside the container...

are there any plans to support such scenarios?

P.S. It is actually already possible to use enroot containers with AiiDA via the pyxis plugin for slurm, which adds some extra slurm pragmas like

#SBATCH --container-image ... #SBATCH --container-mounts

One can then tell these to AiiDa as extra scheduler parameters.
It's not great from a provenance perspective, though, since the mounts, image, etc. are all recorded in a slurm-specific way.

are there any plans to support such scenarios?

I don't know. Since we just spent a lot of time to reduce the scope of this PR, I would be wary of increasing it again on a whim. Should we maybe move enroot to the category of unsupported/untested cases and work on this in a later PR when it has been properly tested?

I removed the mention of enroot for now until we figure out if and how it can be supported.

ltalirz · 2022-10-18T13:05:26Z

docs/source/topics/data_types.rst

+Supported container technologies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The ``ContainerizedCode`` is compatible with a variety of containerization technologies:


In order to make this subsection most useful, please include for each engine an example of what to put into the engine_command field.

I agree, and I already put this in my comment. I didn't add these examples since they weren't there in the original PR and I have no idea what they should be. I guess if @unkcpz already has them, we can quickly add them. But if that requires more work, maybe this can be done at a later stage.

I'll add the example for Sarus which I tested.

I add the command to line below in new commit.

docs/source/topics/data_types.rst

ltalirz · 2022-10-18T13:11:59Z

docs/source/topics/data_types.rst

+* `NVIDIA/enroot <https://github.com/NVIDIA/enroot/>`__
+* `Shifter <https://github.com/NERSC/shifter>`__
+
+Using `Docker <https://www.docker.com/>`__ directly is currently not supported, since Docker has to run with root permissions and it cannot be launched as a normal MPI program to propagate execution context to the container application.


I get the MPI part, but I don't understand what the permissions for running docker has to do with AiiDA compatibility?

Either the user has the rights to run docker or they don't - I don't think this is relevant for the integration.

Suggested change

Using `Docker <https://www.docker.com/>`__ directly is currently not supported, since Docker has to run with root permissions and it cannot be launched as a normal MPI program to propagate execution context to the container application.

Using the `Docker <https://www.docker.com/>`__ container runtime is currently not supported, since Docker cannot be launched as a MPI program to propagate execution context to the container application.

@unkcpz fine to remove the mention of the root permissions?

I get the MPI part, but I don't understand what the permissions for running docker has to do with AiiDA compatibility?

This is for the file created inside the container and file generated inside the container. If in the Dockerfile no USER is specified, the container with run with the UID=0 the root user. The file it created can not be cleaned up by clean workdir for example.
I'll try rephrase it to make it more clear.

For the time being, I have rephrased it as

Using `Docker <https://www.docker.com/>`__ directly is currently not supported because: * Docker needs to be run as the root user and the files created in the working directory inside the container will be owned by root, which prevents AiiDA from deleting those files after execution. * Docker cannot be launched as a normal MPI program to propagate execution context to the container application. Support may be added at a later time.

is that correct and intelligible?

That is the default case I'd say but there is workaround and not so exact. Maybe better use the phrase from their official documentation is better?

"The Docker daemon always runs as the root user."

So I'd write as:

The Docker daemon always runs as the root user and the files created in the working directory inside the container will usually be owned by root if uid is not specified in the image, which prevents AiiDA from deleting those files after execution.

Okay, I modify it in new commit.

sphuber · 2022-10-18T15:40:27Z

Ok, great I think this should be ready for a final review then @ltalirz I think we addressed all your final comments.

ltalirz

let's go!

ltalirz · 2022-10-18T15:55:35Z

docs/source/topics/data_types.rst

+
+        computer = load_computer('some-computer')
+        computer.set_use_double_quotes(True)
+


thanks Seb, I think your explanation is very clear.

This was referenced Sep 26, 2022

Design/containerized code new #5665

Closed

Attmpt to decouple this_code setup #5664

Merged

unkcpz force-pushed the design/new/containerized-code-only-for-remote-mpi-mapping-hpc-type branch 7 times, most recently from 99ba4c0 to cdb81de Compare September 27, 2022 12:33

unkcpz requested a review from sphuber September 27, 2022 12:35

unkcpz force-pushed the design/new/containerized-code-only-for-remote-mpi-mapping-hpc-type branch from 13d2d0c to c81d416 Compare September 28, 2022 09:37

unkcpz commented Sep 28, 2022

View reviewed changes

unkcpz force-pushed the design/new/containerized-code-only-for-remote-mpi-mapping-hpc-type branch 2 times, most recently from e16c467 to 1cd1c64 Compare September 28, 2022 13:24

unkcpz force-pushed the design/new/containerized-code-only-for-remote-mpi-mapping-hpc-type branch 2 times, most recently from eba7e14 to 826848c Compare September 28, 2022 14:34

sphuber requested changes Oct 4, 2022

View reviewed changes

unkcpz requested a review from sphuber October 4, 2022 15:41

unkcpz force-pushed the design/new/containerized-code-only-for-remote-mpi-mapping-hpc-type branch from 7de14a5 to 3e6bdfa Compare October 4, 2022 15:48

sphuber requested changes Oct 13, 2022

View reviewed changes

unkcpz commented Oct 14, 2022

View reviewed changes

unkcpz force-pushed the design/new/containerized-code-only-for-remote-mpi-mapping-hpc-type branch from d3e4e40 to 79c5205 Compare October 14, 2022 13:55

unkcpz requested a review from sphuber October 14, 2022 14:28

ltalirz reviewed Oct 14, 2022

View reviewed changes

unkcpz force-pushed the design/new/containerized-code-only-for-remote-mpi-mapping-hpc-type branch 2 times, most recently from 6083972 to f85ae37 Compare October 17, 2022 17:03

unkcpz requested a review from ltalirz October 18, 2022 06:59

sphuber force-pushed the design/new/containerized-code-only-for-remote-mpi-mapping-hpc-type branch 3 times, most recently from 9888994 to 19c38c1 Compare October 18, 2022 11:43

sphuber force-pushed the design/new/containerized-code-only-for-remote-mpi-mapping-hpc-type branch from 19c38c1 to d620d3c Compare October 18, 2022 11:52

sphuber previously approved these changes Oct 18, 2022

View reviewed changes

ltalirz reviewed Oct 18, 2022

View reviewed changes

Address PR review comments

0620b73

sphuber dismissed their stale review via 0620b73 October 18, 2022 14:42

after review: rephrase docker exclu and and sarus engine command

c3298dd

sphuber requested review from ltalirz and sphuber October 18, 2022 15:40

sphuber approved these changes Oct 18, 2022

View reviewed changes

ltalirz approved these changes Oct 18, 2022

View reviewed changes

sphuber merged commit 4b0f8b8 into aiidateam:main Oct 18, 2022

sphuber mentioned this pull request Oct 18, 2022

containerized code: setting and running containerized code (docker, sarus, singularity, conda) #5507

Closed

5 tasks


		computer = load_computer('some-computer')
		computer.set_use_double_quotes(True)

	Using `Docker <https://www.docker.com/>`__ directly is currently not supported, since Docker has to run with root permissions and it cannot be launched as a normal MPI program to propagate execution context to the container application.
	Using the `Docker <https://www.docker.com/>`__ container runtime is currently not supported, since Docker cannot be launched as a MPI program to propagate execution context to the container application.


		Installed containerized code
		============================

containerized code only for remote mpi mapping hpc type #5667

containerized code only for remote mpi mapping hpc type #5667

Conversation

unkcpz commented Sep 26, 2022 • edited by sphuber Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

unkcpz commented Sep 28, 2022

unkcpz commented Sep 28, 2022 • edited Loading

sphuber left a comment

Choose a reason for hiding this comment

unkcpz commented Oct 4, 2022 • edited Loading

unkcpz commented Oct 13, 2022

sphuber left a comment

Choose a reason for hiding this comment

unkcpz left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ltalirz left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ltalirz Oct 14, 2022 • edited Loading

Choose a reason for hiding this comment

sphuber commented Oct 14, 2022

unkcpz commented Oct 17, 2022

sphuber commented Oct 18, 2022

sphuber commented Oct 18, 2022

ltalirz commented Oct 18, 2022

ltalirz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ltalirz Oct 18, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ltalirz Oct 18, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

unkcpz Oct 18, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

unkcpz Oct 18, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

unkcpz Oct 18, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sphuber commented Oct 18, 2022

ltalirz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

unkcpz commented Sep 26, 2022 •

edited by sphuber

Loading

unkcpz commented Sep 28, 2022 •

edited

Loading

unkcpz commented Oct 4, 2022 •

edited

Loading

unkcpz left a comment •

edited

Loading

ltalirz left a comment •

edited

Loading

ltalirz Oct 14, 2022 •

edited

Loading

ltalirz Oct 18, 2022 •

edited

Loading

ltalirz Oct 18, 2022 •

edited

Loading

unkcpz Oct 18, 2022 •

edited

Loading

unkcpz Oct 18, 2022 •

edited

Loading

unkcpz Oct 18, 2022 •

edited

Loading