-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
containerized code only for remote mpi mapping hpc type #5667
containerized code only for remote mpi mapping hpc type #5667
Conversation
99ba4c0
to
cdb81de
Compare
13d2d0c
to
c81d416
Compare
filepath_executable: doubler.py | ||
filepath_files: PLACEHOLDER_PORTABLE_PATH_DOUBLER | ||
image: docker://clearlinux/numpy-mp | ||
engine_command: singularity exec --bind $PWD:$PWD --env APPEND_PATH=$PWD {image} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
--env APPEND_PATH=$PWD
is needed otherwise doubler.py
will not be in the PATH
.
default_calc_job_plugin: core.templatereplacer | ||
filepath_executable: doubler.py | ||
filepath_files: PLACEHOLDER_PORTABLE_PATH_DOUBLER | ||
image: docker://clearlinux/numpy-mp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
During the nightly test, the image is truly fetched to the CI test server.
e16c467
to
1cd1c64
Compare
Just keep as a note here, while I prepare the nightly test for portable containerized code, a python script using So it is basically an environment-specific running code but needs some general |
eba7e14
to
826848c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @unkcpz . Looks all good to me, I would just really change the term image
to image_name
since it really just references the image name. Can you fix that and then add the docs? Then I can go through those as well
@sphuber thanks a lot for the review.
I change all the "image" to "image_name" including the template of the
Added. |
7de14a5
to
3e6bdfa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @unkcpz . I gave it a first pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sphuber thanks very much for the review, the phrase really improved a lot, way more formal than the draft version. I also adapt some by adding more details.
I'll trigger the second pass when doc tests are all passed.
label: numpy-double-portable | ||
description: "containerized portable code of numpy multiply by 2" | ||
default_calc_job_plugin: core.templatereplacer | ||
filepath_executable: doubler.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, that's a good point. I'll add it.
|
||
Installed containerized code | ||
============================ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, added. But the Markdown table is not well indented, do I need to do it by hand or is there an online tool for it?
d3e4e40
to
79c5205
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @unkcpz !
After discussing with @unkcpz in some detail, I believe we both came to the conclusion that there is currently no strong use case for the "portable" containerized code.
If the use case is to have a handful of static files and executables, that is easily done by updating the container image.
If it is about a more dynamic generation of different scripts, then setting up a separate code for each new version of the scripts is not the solution (and this use case is already covered by aiida-dynamic-workflows
).
We therefore agree that the "portable" version of the containerized code can be deleted, reducing both the maintenance burden on our side, as well as making terminology easier to understand for users [1].
@sphuber What do you think?
[1] E.g., for containerized codes, "installed" can easily be misunderstood as meaning that the container image is installed on the computer.
However, the distinction of whether the image is present or not is actually not made at all in this implementation (which is fine to me).
In this sense, both "installed" and "portable" containerized codes are in fact equally portable - both of them can run on any computer that has the required image & container runtime.
@@ -0,0 +1,160 @@ | |||
.. _how-to:run-containerized_codes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, running a containerized code should be so similar to running a regular code that we don't need a separate how-to for it.
As a first step, this separate section is ok, but in my view eventually this should be condensed drastically and integrated into run_codes
.
Installed containerized code | ||
============================ | ||
|
||
The :class:`aiida.orm.nodes.data.code.containerized.InstalledContainerizedCode` data plugin allows running a code inside a container that is available on a computer that is configured in AiiDA through the :class:`~aiida.orm.computers.Computer` class. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will need a brief explanation of what "available" means here
- Image URI for a container registry that is accessible from the computer
- Image file on the file system of the computer
Note that this example requires Singularity to be installed on the target computer and that the ``filepath_executable`` is already part of the container image. | ||
|
||
.. note:: | ||
The first time that the containerized code is run, the image will be automatically pulled from Docker Hub, which may takes a while if the image is large. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although this may seem trivial, this (that the image will be pulled at first use) is important information and needs to be mentioned early
To be honest, I haven't really had the chance to experiment with this myself and I haven't really worked a lot with containers myself, so I didn't critically review the split in that much detail. I assumed that it apparently made sense and that these were two different use-cases. But as you explain it now, it makes perfect sense to only support the single form for now. We can always add on later if it turns out this really is useful. So all good for me to reduce the complexity 👍 |
@ltalirz thanks! I am so desperate to admit the portable containerized code can be removed. I really spend time in bring up the fake use cases. But you are right, the aiida-dynamic-workflow with the containerized installed code is the correct way for what I thought to realize. I remove the portable container code and update the doc upon your review. |
6083972
to
f85ae37
Compare
9888994
to
19c38c1
Compare
This implementation of the `AbstractCode` interface allows running a `CalcJob` within a container on a target computer. The data plugin stores the name of the container image, the executable within that container, the command with which the container should be launched and the `Computer` on which the container can be run.
19c38c1
to
d620d3c
Compare
@ltalirz I have discussed a final time with @unkcpz and we think that the current implementation is good enough to be merged. We just made one final simplification and named the plugin simply I have also taken your suggestions for the docs into account and have simplified the instructions. It is true that the process of setting up the code and using it, is no different from the other code types (which was the whole point of the design) so it should be integrated neatly into the "How-to run a code" section. Nevertheless, the setup will require some more detailed information specific to containerized codes, such as what technologies are supported and how to define the I was planning to add a reference in the "How to run a code" section, but I realized that that documentation still refers to the deprecated code setup |
Note that at some point, we will have to update the topic section with more detailed information for each of the supported containerization technologies, such as examples of how to define the |
Cheers, I'll give a quick last check now then.
I would have suggested exactly the same |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
computer = load_computer('some-computer') | ||
computer.set_use_double_quotes(True) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is worth mentioning somewhere here how the implementation interacts with the withmpi
option, ideally with an example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@unkcpz any idea what information would be relevant here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I mean is to give an example of how the command line for an MPI job could look like, which illustrates that the container runtime itself is called through MPI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sphuber can you add a test for the withmpi=True
, I think it is a critical test.
I'll address other comments after the group meeting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added the test and added an example of the runline that is generated when MPI is enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks Seb, I think your explanation is very clear.
docs/source/topics/data_types.rst
Outdated
|
||
* `Singularity <https://singularity-docs.readthedocs.io/en/latest/>`__ | ||
* `Sarus <https://sarus.readthedocs.io/en/stable/>`__ | ||
* `NVIDIA/enroot <https://github.com/NVIDIA/enroot/>`__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on second thought, vanilla enroot
may fall into the docker
category after all (not supported), since you would need to call mpirun
inside the container...
are there any plans to support such scenarios?
P.S. It is actually already possible to use enroot containers with AiiDA via the pyxis plugin for slurm, which adds some extra slurm pragmas like
#SBATCH --container-image ...
#SBATCH --container-mounts
One can then tell these to AiiDa as extra scheduler parameters.
It's not great from a provenance perspective, though, since the mounts, image, etc. are all recorded in a slurm-specific way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are there any plans to support such scenarios?
I don't know. Since we just spent a lot of time to reduce the scope of this PR, I would be wary of increasing it again on a whim. Should we maybe move enroot to the category of unsupported/untested cases and work on this in a later PR when it has been properly tested?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the mention of enroot for now until we figure out if and how it can be supported.
Supported container technologies | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
The ``ContainerizedCode`` is compatible with a variety of containerization technologies: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to make this subsection most useful, please include for each engine an example of what to put into the engine_command
field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, and I already put this in my comment. I didn't add these examples since they weren't there in the original PR and I have no idea what they should be. I guess if @unkcpz already has them, we can quickly add them. But if that requires more work, maybe this can be done at a later stage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add the example for Sarus which I tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I add the command to line below in new commit.
docs/source/topics/data_types.rst
Outdated
* `NVIDIA/enroot <https://github.com/NVIDIA/enroot/>`__ | ||
* `Shifter <https://github.com/NERSC/shifter>`__ | ||
|
||
Using `Docker <https://www.docker.com/>`__ directly is currently not supported, since Docker has to run with root permissions and it cannot be launched as a normal MPI program to propagate execution context to the container application. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get the MPI part, but I don't understand what the permissions for running docker has to do with AiiDA compatibility?
Either the user has the rights to run docker or they don't - I don't think this is relevant for the integration.
Using `Docker <https://www.docker.com/>`__ directly is currently not supported, since Docker has to run with root permissions and it cannot be launched as a normal MPI program to propagate execution context to the container application. | |
Using the `Docker <https://www.docker.com/>`__ container runtime is currently not supported, since Docker cannot be launched as a MPI program to propagate execution context to the container application. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@unkcpz fine to remove the mention of the root permissions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get the MPI part, but I don't understand what the permissions for running docker has to do with AiiDA compatibility?
This is for the file created inside the container and file generated inside the container. If in the Dockerfile
no USER
is specified, the container with run with the UID=0
the root user. The file it created can not be cleaned up by clean workdir for example.
I'll try rephrase it to make it more clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the time being, I have rephrased it as
Using `Docker <https://www.docker.com/>`__ directly is currently not supported because:
* Docker needs to be run as the root user and the files created in the working directory inside the container will be owned by root, which prevents AiiDA from deleting those files after execution.
* Docker cannot be launched as a normal MPI program to propagate execution context to the container application.
Support may be added at a later time.
is that correct and intelligible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is the default case I'd say but there is workaround and not so exact. Maybe better use the phrase from their official documentation is better?
"The Docker daemon always runs as the root user."
So I'd write as:
The Docker daemon always runs as the root user and the files created in the working directory inside the container will usually be owned by root if
uid
is not specified in the image, which prevents AiiDA from deleting those files after execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I modify it in new commit.
Ok, great I think this should be ready for a final review then @ltalirz I think we addressed all your final comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's go!
|
||
computer = load_computer('some-computer') | ||
computer.set_use_double_quotes(True) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks Seb, I think your explanation is very clear.
Fixes #5627
This is the implementation separated from #5507.