Putting here some thoughts following the work I did on easybuilders/easybuild-easyblocks#3917
I think that, in general, we do not want MPI-related commands to fail due to a lack of resources.
E.g. many easyconfig files already set environment variables to allow oversubscription of resources at steps that invoke mpirun or similar.
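To make the environment-variable approach concrete, here is a minimal sketch of what such a setting amounts to. It assumes Open MPI 4.x and the MCA parameter OMPI_MCA_rmaps_base_oversubscribe; other MPI families (and, as far as I know, Open MPI 5) need different variables. The helper name oversubscribe_env is hypothetical and not part of EasyBuild.

```python
import os


def oversubscribe_env(base_env=None):
    """Return a copy of the environment with oversubscription enabled.

    Assumption: Open MPI 4.x, which reads MCA parameters from variables
    prefixed with OMPI_MCA_. This is an illustration, not EasyBuild code.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Allow mpirun to start more ranks than there are available slots
    env["OMPI_MCA_rmaps_base_oversubscribe"] = "1"
    return env
```

The returned dict could be passed as the env argument of subprocess.run when launching a test step that calls mpirun.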
I've given this some thought and I feel that implementing this would require two decisions:
- Oversubscription vs. (or in conjunction with) CPU pinning
- Have this behavior on by default or on demand
Oversubscription vs. (or in conjunction with) CPU pinning
I think both would have their pros and cons.
Oversubscription would allow us to defer choosing the cores to the resource manager, but it might not be possible to enforce max_parallel.
For example, the HPL test will not run with fewer than 4 processes.
Let's say we set max_parallel = 2 and allow oversubscription. If the system running EB has more than 2 CPUs, running
mpirun -n 4 --map-by :oversubscribe ...
will actually end up using more CPUs than what was requested with max_parallel.
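The arithmetic behind this concern can be sketched as follows. This is a deliberately simplified model that assumes mpirun places one rank per core until it runs out of cores; the real mapping depends on the MPI implementation and its options. Both function names are hypothetical.

```python
def cpus_used_with_oversubscribe(nr_ranks, available_cpus):
    """Simplified model: one rank per core until the cores run out."""
    return min(nr_ranks, available_cpus)


def exceeds_max_parallel(nr_ranks, available_cpus, max_parallel):
    """True if oversubscription alone would use more CPUs than max_parallel."""
    return cpus_used_with_oversubscribe(nr_ranks, available_cpus) > max_parallel
```

With 4 ranks, max_parallel = 2 and 8 CPUs on the machine, exceeds_max_parallel(4, 8, 2) is true under this model, while on a machine with only 2 CPUs it is false.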
I am still exploring the capabilities of --map-by and --hostfile for OpenMPI, but I am not sure whether it is possible to tell mpirun to never use more than a specific number of processors, even if more are available, without explicitly pinning the cores.
On the other hand, we could do something similar to what I did in easybuilders/easybuild-easyblocks#3917, so that we default to pinning if one of the following is true:
- mpi is not available
- self.cfg.parallel is lower than a specified requested value
There are a few cons here:
- We have to know how CPUs are numbered. I am assuming we defer to hwloc, so we might always be able to do a bind-to core and then give sequential numbers (e.g. req=7, max_parallel = 3 --> --cpu-set 0,1,2,0,1,2,0). (In the HPL PR, at the time of writing, I am binding all processes to 0, as I am not sure this is always reliable.)
- We risk binding to a CPU that is already in use by the machine instead of one that is currently free.
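The sequential numbering in the first point could be generated with a simple round-robin. This is only a sketch: the function name round_robin_cpu_set and its cpu_ids parameter are hypothetical, and it assumes logical CPU IDs 0..max_parallel-1 are valid unless an explicit list is passed in.

```python
def round_robin_cpu_set(nr_ranks, max_parallel, cpu_ids=None):
    """Build a comma-separated CPU list that cycles over max_parallel CPUs.

    cpu_ids is an optional list of CPU IDs to draw from (assumption: the
    caller knows which IDs are usable); by default 0..max_parallel-1 is used.
    """
    if nr_ranks < 1 or max_parallel < 1:
        raise ValueError("nr_ranks and max_parallel must be positive")
    if cpu_ids is None:
        cpu_ids = range(max_parallel)
    # Never use more than max_parallel distinct CPUs
    pool = list(cpu_ids)[:max_parallel]
    if not pool:
        raise ValueError("no CPU IDs available")
    return ",".join(str(pool[i % len(pool)]) for i in range(nr_ranks))
```

For req=7 and max_parallel = 3 this returns the list from the example above. On Linux, passing sorted(os.sched_getaffinity(0)) as cpu_ids would at least restrict the pool to CPUs the current process is allowed to use, although that does not tell us whether those CPUs are currently busy.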
I will try to investigate this more and come up with (hopefully not more than 2) viable implementations for this, but if someone more experienced could give some ideas/feedback, it would be really appreciated.
Where could this be implemented
- If we decide this behavior should always be on, this could be done at toolchain loading by setting the appropriate environment variables depending on the MPI family.
- An alternative could be to modify https://github.com/easybuilders/easybuild-framework/blob/develop/easybuild/tools/toolchain/mpi.py#L273 to add an extra parameter oversubscribe, which means we would then need to properly enforce that every MPI command is generated through this helper function.
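As a rough illustration of the second option, a helper with an oversubscribe parameter might look like the sketch below. It is a standalone function, not the framework's actual helper or its signature, and it only covers the Open MPI command form used earlier in this issue; build_mpirun_cmd is a hypothetical name.

```python
def build_mpirun_cmd(cmd, nr_ranks, oversubscribe=False):
    """Return an mpirun command line for cmd with nr_ranks ranks.

    Assumption: Open MPI, where ':oversubscribe' is accepted as a
    --map-by modifier. Other MPI families would need their own branch.
    """
    parts = ["mpirun", "-n", str(int(nr_ranks))]
    if oversubscribe:
        parts += ["--map-by", ":oversubscribe"]
    parts.append(cmd)
    return " ".join(parts)
```

The point of centralising this is that easyblocks would no longer hard-code mpirun flags themselves, which only works if every MPI command really goes through the helper.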