
MPI_Intercomm_merge not working after second ULFM shrink. #11724

Closed · Robyroc opened this issue May 30, 2023 · 17 comments · Fixed by #11736

@Robyroc commented May 30, 2023

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v5.0.x branch, hash 907350a

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Git clone, compiled with ../configure --with-ft=ulfm

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

f34a7ce21e7a89aba44fa396eb84de1bebde3ad3 3rd-party/openpmix (v4.2.4rc1-16-gf34a7ce2)
c4925aa5cc03159da04b94dcda9f559329150724 3rd-party/prrte (v3.0.1rc2-17-gc4925aa5cc)
c1cfc910d92af43f8c27807a9a84c9c13f4fbc65 config/oac (heads/main)

Please describe the system on which you are running

  • Operating system/version: Red Hat 8.6
  • Computer hardware: Arm Cavium ThunderX2
  • Network type: single node

Details of the problem

The following code, which uses ULFM functionality to recover from two faults, deadlocks.

#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#include "mpi-ext.h"

#define FIRST_FAIL 1
#define SECOND_FAIL 0

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm comm, parent, temp, inter, inter2, unordered;
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_get_parent(&parent);
    int len, rc, error;
    char errstr[MPI_MAX_ERROR_STRING];
    if (parent == MPI_COMM_NULL)
    {
        /* Original processes: duplicate MPI_COMM_WORLD; rank FIRST_FAIL is
         * killed after the first barrier to inject the first failure. */
        MPI_Comm_dup(MPI_COMM_WORLD, &comm);
        MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
        MPI_Barrier(comm);
        printf("%d, after first barrier\n", rank);
        if (rank == FIRST_FAIL)
            raise(SIGINT);
        rc = MPI_Barrier(comm);

        MPI_Error_string(rc, errstr, &len);
        printf("%d: Second barrier: %s\n", rank, errstr);
        /* First recovery: shrink away the failed process, spawn a replacement,
         * merge it into an intracommunicator, and renumber via MPI_Comm_split. */
        MPIX_Comm_shrink(comm, &temp);
        MPI_Comm_free(&comm);
        char** spawn_argv = (char**)malloc(sizeof(char*) * 2);
        spawn_argv[0] = (char*)malloc(sizeof(char) * 2);
        sprintf(spawn_argv[0], "%d", FIRST_FAIL);
        spawn_argv[1] = NULL;
        MPI_Comm_spawn(argv[0], spawn_argv, 1, MPI_INFO_NULL, 0, temp, &inter, &error);
        MPI_Intercomm_merge(inter, 0, &unordered);
        MPI_Comm_split(unordered, 0, rank, &comm);
        MPI_Comm_free(&unordered);
        MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
        MPI_Barrier(comm);
        printf("%d, after second barrier\n", rank);

        if (rank == SECOND_FAIL)
            raise(SIGINT);
        rc = MPI_Barrier(comm);
        MPI_Error_string(rc, errstr, &len);
        printf("%d: Last barrier: %s\n", rank, errstr);
        /* Second recovery: same procedure after the second failure. */
        MPIX_Comm_shrink(comm, &temp);
        MPI_Comm_free(&comm);
        sprintf(spawn_argv[0], "%d", SECOND_FAIL);
        MPI_Comm_spawn(argv[0], spawn_argv, 1, MPI_INFO_NULL, 0, temp, &inter2, &error);
        MPI_Intercomm_merge(inter2, 0, &unordered);
        MPI_Comm_split(unordered, 0, rank, &comm);
        MPI_Comm_free(&unordered);
        MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
        MPI_Barrier(comm);
        printf("%d, after last barrier\n", rank);
    }
    else
    {
        /* Respawned replacement processes: recover the intended rank from argv
         * and join the survivors through the parent intercommunicator. */
        rank = atoi(argv[1]);
        printf("%d: respawned\n", rank);
        if (rank == FIRST_FAIL)
        {
            MPI_Intercomm_merge(parent, 0, &unordered);
            MPI_Comm_split(unordered, 0, rank, &comm);
            MPI_Comm_free(&unordered);
            MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
            MPI_Barrier(comm);
            printf("%d, after second barrier\n", rank);

            rc = MPI_Barrier(comm);
            MPI_Error_string(rc, errstr, &len);
            printf("%d: Last barrier: %s\n", rank, errstr);
            MPIX_Comm_shrink(comm, &temp);
            MPI_Comm_free(&comm);
            char** spawn_argv = (char**)malloc(sizeof(char*) * 2);
            spawn_argv[0] = (char*)malloc(sizeof(char) * 2);
            sprintf(spawn_argv[0], "%d", SECOND_FAIL);
            spawn_argv[1] = NULL;
            MPI_Comm_spawn(argv[0], spawn_argv, 1, MPI_INFO_NULL, 0, temp, &inter2, &error);
            MPI_Intercomm_merge(inter2, 0, &unordered);
            MPI_Comm_split(unordered, 0, rank, &comm);
            MPI_Comm_free(&unordered);
            MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
            MPI_Barrier(comm);
            printf("%d, after last barrier\n", rank);
        }
        else if (rank == SECOND_FAIL)
        {
            MPI_Intercomm_merge(parent, 0, &unordered);
            MPI_Comm_split(unordered, 0, rank, &comm);
            MPI_Comm_free(&unordered);
            MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
            MPI_Barrier(comm);
            printf("%d, after last barrier\n", rank);
        }
    }
    MPI_Finalize();
    return 0;
}

In particular, a run with -np 4 --with-ft ulfm produces the following output:

1, after first barrier
0, after first barrier
2, after first barrier
3, after first barrier
2: Second barrier: MPI_ERR_PROC_FAILED: Process Failure
3: Second barrier: MPI_ERR_PROC_FAILED: Process Failure
0: Second barrier: MPI_ERR_PROC_FAILED: Process Failure
1: respawned
0, after second barrier
3, after second barrier
1, after second barrier
2, after second barrier
1: Last barrier: MPI_ERR_PROC_FAILED: Process Failure
2: Last barrier: MPI_ERR_PROC_FAILED: Process Failure
3: Last barrier: MPI_ERR_PROC_FAILED: Process Failure
0: respawned

and then terminates only when the job's time limit is reached. Analysing the problem a bit deeper, it seems that the MPI_Intercomm_merge between the second spawned process and all the others does not complete. The processes are spawned correctly (as shown by the "respawned" lines). Moreover, I managed to make the example work in some cases by changing the ranks of the processes to fail.
I know I should perform an MPIX_Comm_agree right after the barrier to be sure that all the processes see the same outcome, but in this run they do see the same outcome, as shown in the output.
Is there something wrong with my example, or is it a problem with the current v5.0.x branch?
Thanks in advance.

@bosilca (Member) commented May 31, 2023

The problem seems to be rooted in incorrect MODEX information collected by the processes of the second spawn about the processes from the first spawn. This translates into an incorrect selection of the collective module on the second intercomm_merge for all processes of the second spawn, and thus into a deadlock.

Until a fix is found, you should disable the HAN collective (--mca coll ^han). Please note that it is possible that other components use the MODEX location information (which, as described above, is incorrect for processes of the second spawn), in which case they will also make flawed decisions.
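For reference, a complete launch line with that workaround applied would look roughly like the following, assuming the reproducer above was built as ./reproducer (a placeholder name); exact launcher options may differ between versions:

mpirun -np 4 --with-ft ulfm --mca coll ^han ./reproducer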

@rhc54 (Contributor) commented May 31, 2023

Could you perhaps clarify what is wrong with the modex information? Are the proc IDs incorrectly matched with the payload data? Or something else?

@bosilca (Member) commented May 31, 2023

The relative location of processes is incorrect (more precisely, local processes are marked as non-local).

@rhc54 (Contributor) commented May 31, 2023

Hmmm...node location isn't part of the modex, so I'm assuming there is some other info being accessed here? I'm trying to figure out what info is incorrect and when. Is it after a failure occurs and is recovered from? Is it upon comm_spawn and before any failures? What does OMPI use to mark procs as non-local (hostname, nodeID...)?

I'm trying to understand if this is a PMIx problem or something in OMPI - so far, I'm not able to determine it. It sounds to me from the description that the problem occurs after a ULFM event, which makes sense - the job-level info isn't being updated, and so the original proc locations are still in the datastore. Someone would need to figure out how to do that - otherwise, ULFM will always result in incorrect data after an event.

@bosilca (Member) commented May 31, 2023

It has nothing to do with failures. I'm tracking this issue down to ompi/dpm/dpm.c. The processes from the second spawn have an empty peer_ranks list, so all the processes on the parent communicator are marked with the default non-local tag. This is incorrect, as in my case all processes are located on the same node.

@rhc54 (Contributor) commented May 31, 2023

Ah, okay - sounds like the problem is elsewhere then. I'll try to look and see why spawn isn't generating the ranks. Can you point me to the line numbers in dpm.c where this is done?

@bosilca (Member) commented May 31, 2023

As far as I can see, the process names are correct. The list is something like this job1;1.1:job1;1.2:job2;2.1:job3:3.*:1, which is what it should be according to the application: two processes from the first instance, one from the second instance (aka the first spawn), and then the current process from the third instance (aka the second spawn) with the * identifier followed by the number of processes, 1 in this case. However, starting from line 409 in dpm.c I don't think the code is correct, as it builds the list of local procs only from the jobid of the local process, which will then, at line 448, fail to mark any local process as such.

That being said, the rest of the processes (the original two and the one from the first spawn) correctly mark all processes as being on the same node, but I have not figured out how.
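To make the described logic concrete, here is a small self-contained toy. It is not the actual ompi/dpm/dpm.c code; every name and data structure below is hypothetical. It only contrasts restricting the locality lookup to the caller's own jobid with a per-jobid lookup, in the spirit of the patch that eventually closed this issue.

#include <stdio.h>
#include <string.h>

typedef struct {
    int jobid;        /* spawn instance the process belongs to */
    int vpid;         /* rank within its job */
    const char *node; /* host the process runs on */
} proc_t;

/* Buggy variant: only peers belonging to my own jobid are considered,
 * so a peer from any other jobid silently falls back to "non-local". */
static int is_local_buggy(const proc_t *me, const proc_t *peers, int n,
                          const proc_t *target)
{
    for (int i = 0; i < n; i++) {
        if (peers[i].jobid != me->jobid)
            continue;                               /* other jobs ignored */
        if (peers[i].jobid == target->jobid && peers[i].vpid == target->vpid)
            return 0 == strcmp(peers[i].node, me->node);
    }
    return 0;                                       /* default: non-local */
}

/* Fixed variant: look the target up under its own jobid. */
static int is_local_fixed(const proc_t *me, const proc_t *peers, int n,
                          const proc_t *target)
{
    for (int i = 0; i < n; i++) {
        if (peers[i].jobid == target->jobid && peers[i].vpid == target->vpid)
            return 0 == strcmp(peers[i].node, me->node);
    }
    return 0;
}

int main(void)
{
    /* Three jobs on one node, mirroring the reproducer: the original job (1),
     * the first spawn (2), and the second spawn (3). */
    const proc_t all[] = {
        { 1, 0, "node0" }, { 1, 1, "node0" },  /* survivors of the original job */
        { 2, 0, "node0" },                     /* replacement from the first spawn */
        { 3, 0, "node0" },                     /* replacement from the second spawn */
    };
    const int n = (int)(sizeof(all) / sizeof(all[0]));
    const proc_t *me = &all[3];                /* the second spawned process */

    for (int i = 0; i < n; i++)
        printf("job %d rank %d: buggy=%-9s fixed=%s\n", all[i].jobid, all[i].vpid,
               is_local_buggy(me, all, n, &all[i]) ? "local" : "non-local",
               is_local_fixed(me, all, n, &all[i]) ? "local" : "non-local");
    return 0;
}

With the buggy variant, the process from the second spawn sees every peer from jobs 1 and 2 as non-local even though they all run on node0, which matches the symptom described above.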

@rhc54 (Contributor) commented May 31, 2023

I can try to take a gander a little later today, assuming you haven't solved it by then.

@bosilca (Member) commented May 31, 2023

Please do so; I have another urgent matter to take care of right now and will only get back to this later today.

@rhc54 (Contributor) commented Jun 1, 2023

Afraid I cannot get this code to run at all, no matter what I do - it just segfaults immediately upon calling MPI_Comm_spawn. Can someone provide a reproducer minus the faults, since those don't seem to be part of the problem?

@bosilca (Member) commented Jun 1, 2023

Try with the following patch.

diff --git a/src/mca/rmaps/base/rmaps_base_map_job.c b/src/mca/rmaps/base/rmaps_base_map_job.c
index b151914a8c..9b0adcf4e0 100644
--- a/src/mca/rmaps/base/rmaps_base_map_job.c
+++ b/src/mca/rmaps/base/rmaps_base_map_job.c
@@ -302,7 +302,7 @@ void prte_rmaps_base_map_job(int fd, short args, void *cbdata)
     /* we always inherit a parent's oversubscribe flag unless the job assigned it */
     if (NULL != parent
         && !(PRTE_MAPPING_SUBSCRIBE_GIVEN & PRTE_GET_MAPPING_DIRECTIVE(jdata->map->mapping))) {
-        if (PRTE_MAPPING_NO_OVERSUBSCRIBE & PRTE_GET_MAPPING_DIRECTIVE(parent->map->mapping)) {
+        if ((NULL != parent->map) && (PRTE_MAPPING_NO_OVERSUBSCRIBE & PRTE_GET_MAPPING_DIRECTIVE(parent->map->mapping))) {
             PRTE_SET_MAPPING_DIRECTIVE(jdata->map->mapping, PRTE_MAPPING_NO_OVERSUBSCRIBE);
         } else {
             PRTE_UNSET_MAPPING_DIRECTIVE(jdata->map->mapping, PRTE_MAPPING_NO_OVERSUBSCRIBE);

@rhc54 (Contributor) commented Jun 1, 2023

I think it is something else in the app itself. I can run a simple spawn program (only executes one comm_spawn) just fine. However, when I try to run the example app from above, it crashes on the first spawn attempt.

I'm not saying that the above change might not be required for the second spawn - I cannot get that far to know. All I can say is that the example app is crashing in the first spawn, while my test program that contains only one spawn works just fine.

@Robyroc (Author) commented Jun 1, 2023

@rhc54: Afraid I cannot get this code to run at all, no matter what I do - it just segfaults immediately upon calling MPI_Comm_spawn. Can someone provide a reproducer minus the faults, since those don't seem to be part of the problem?

I created a fault-free program featuring the same problem, substituting the shrinks with splits. The code is available here. I also tried a different approach of just performing two consecutive spawns (here), but that does not show the problem and terminates correctly.
This makes me think that, rather than being a ULFM issue, it is more about the fact that the split (and the shrink, which may be built on top of it) does not propagate some needed information. Is this reasonable?

@bosilca (Member) commented Jun 1, 2023

The problem is in spawn, where only partial information is exchanged between processes in different jobids (because of spawn), so they cannot correctly place their peers in the hardware hierarchy. As I suggested above, if you run without the specific modules that need accurate placement information, the test runs to completion.

@rhc54 (Contributor) commented Jun 1, 2023

I confirm that the multi-spawn code (minus splits) works fine, at least for me on my box. Could be that I'm not using the "han" module - I'm not doing anything specific about it. I do hang on the fault-free code case, but I see the "peer_ranks" array always being filled, and we correctly determine that the procs are all on the local node.

For me, one proc is hanging in ompi_comm_nextcid - if I'm reading it correctly, it looks like it is rank=0 in the initial job.

Anyway, I'm not able to debug further as this is getting a little arcane for me in the MPI space. I'm not seeing any problems in spawning the job, nor in PRRTE returning the correct locality info after all the spawns. Could be we are getting blocked in nextcid due to something about all the splits and merges going on, but I honestly have no idea. I got lost in all that MPI communicator spaghetti 🤷‍♂️

@rhc54 (Contributor) commented Jun 1, 2023

I should note that I updated my PMIx and PRRTE submodule pointers to the current HEAD of their respective master branches. I don't know why that would make a difference here, but something to consider.

bosilca added a commit to bosilca/ompi that referenced this issue Jun 6, 2023
The original code was merging the local modex with the modex of the
local processes on the first jobid. This led to incorrect, and
mismatched, information among processes when joining processes from
multiple jobids (such as when merging the second spawn).

This patch iterates over all the jobids on the list of "to connect"
processes and adds their information to the local modex.

Fixes open-mpi#11724.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
bosilca added a commit to bosilca/ompi that referenced this issue Aug 15, 2023
@rhc54 (Contributor) commented Aug 18, 2023

I thought perhaps I would circle around and look at this again to see if something needs to be done on the PMIx side of things. However, reading through it again, it appears that (at least for me) the confusion was caused by the "modex" term. Looking at the code change, there actually isn't anything being done that has anything to do with the "modex" operation. True, you use the OPAL modex recv macro to retrieve a piece of info - but it isn't info that is actually exchanged. It's just the list of local peers, and the locality for any member of that list.

None of which gets exchanged in the "modex". So it appears that whatever is going on (which I still cannot reproduce without crashing on the Mac), it has nothing to do with the PMIx support - and the given patch seems to fix the problems in the OMPI code.

I'm not quite sure why people say this won't scale well - it is all completely local operations happening independently in each participating proc. I suppose the loop-over-a-loop logic might be considered a tad ugly, but this isn't a performance-critical operation anyway - and like I said, it doesn't involve any multi-process interactions.
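For readers unfamiliar with what "the list of local peers, and the locality" means in practice, the sketch below shows how that information can be queried directly through the public PMIx client API. It is a minimal illustration, assuming a PMIx-enabled launch (e.g. under mpirun/prterun); it is not the OMPI-internal code path that goes through the OPAL modex macros.

#include <stdio.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t me, target;
    pmix_value_t *val = NULL;

    if (PMIX_SUCCESS != PMIx_Init(&me, NULL, 0))
        return 1;

    /* Job-level info such as the node-local peer list is stored against the
     * wildcard rank of the namespace. */
    PMIX_LOAD_PROCID(&target, me.nspace, PMIX_RANK_WILDCARD);
    if (PMIX_SUCCESS == PMIx_Get(&target, PMIX_LOCAL_PEERS, NULL, 0, &val)) {
        printf("local peers of %s: %s\n", me.nspace, val->data.string);
        PMIX_VALUE_RELEASE(val);
    }

    /* Relative locality of one specific peer (here: rank 0 of my own job). */
    PMIX_LOAD_PROCID(&target, me.nspace, 0);
    if (PMIX_SUCCESS == PMIx_Get(&target, PMIX_LOCALITY_STRING, NULL, 0, &val)) {
        printf("locality of rank 0: %s\n", val->data.string);
        PMIX_VALUE_RELEASE(val);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}

For a merged intercommunicator that spans multiple namespaces (as after the second spawn here), the same lookups have to be performed against each peer's own namespace, which is essentially what the referenced patch arranges for.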

bosilca added a commit to bosilca/ompi that referenced this issue Feb 14, 2024