
MPI_Intercomm_merge not working after second ULFM shrink. #11724

Closed · Robyroc opened this issue May 30, 2023 · 17 comments · Fixed by #11736

@Robyroc commented May 30, 2023

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v5.0.x branch, hash 907350a

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Git clone, compiled with ../configure --with-ft=ulfm

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

f34a7ce21e7a89aba44fa396eb84de1bebde3ad3 3rd-party/openpmix (v4.2.4rc1-16-gf34a7ce2)
c4925aa5cc03159da04b94dcda9f559329150724 3rd-party/prrte (v3.0.1rc2-17-gc4925aa5cc)
c1cfc910d92af43f8c27807a9a84c9c13f4fbc65 config/oac (heads/main)

Please describe the system on which you are running

  • Operating system/version: Red Hat 8.6
  • Computer hardware: Arm Cavium ThunderX2
  • Network type: single node

Details of the problem

The following code, which uses ULFM functionality to recover from two faults, deadlocks.

#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#include "mpi-ext.h"

#define FIRST_FAIL 1
#define SECOND_FAIL 0

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm comm, parent, temp, inter, inter2, unordered;
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_get_parent(&parent);
    int len, rc, error;
    char errstr[MPI_MAX_ERROR_STRING];
    if (parent == MPI_COMM_NULL)
    {
        /* Original processes: duplicate MPI_COMM_WORLD; rank FIRST_FAIL is
         * killed after the first barrier to inject the first failure. */
        MPI_Comm_dup(MPI_COMM_WORLD, &comm);
        MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
        MPI_Barrier(comm);
        printf("%d, after first barrier\n", rank);
        if (rank == FIRST_FAIL)
            raise(SIGINT);
        rc = MPI_Barrier(comm);

        MPI_Error_string(rc, errstr, &len);
        printf("%d: Second barrier: %s\n", rank, errstr);
        /* First recovery: shrink away the failed process, spawn a replacement,
         * merge it into an intracommunicator, and renumber via MPI_Comm_split. */
        MPIX_Comm_shrink(comm, &temp);
        MPI_Comm_free(&comm);
        char** spawn_argv = (char**)malloc(sizeof(char*) * 2);
        spawn_argv[0] = (char*)malloc(sizeof(char) * 2);
        sprintf(spawn_argv[0], "%d", FIRST_FAIL);
        spawn_argv[1] = NULL;
        MPI_Comm_spawn(argv[0], spawn_argv, 1, MPI_INFO_NULL, 0, temp, &inter, &error);
        MPI_Intercomm_merge(inter, 0, &unordered);
        MPI_Comm_split(unordered, 0, rank, &comm);
        MPI_Comm_free(&unordered);
        MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
        MPI_Barrier(comm);
        printf("%d, after second barrier\n", rank);

        if (rank == SECOND_FAIL)
            raise(SIGINT);
        rc = MPI_Barrier(comm);
        MPI_Error_string(rc, errstr, &len);
        printf("%d: Last barrier: %s\n", rank, errstr);
        /* Second recovery: same procedure after the second failure. */
        MPIX_Comm_shrink(comm, &temp);
        MPI_Comm_free(&comm);
        sprintf(spawn_argv[0], "%d", SECOND_FAIL);
        MPI_Comm_spawn(argv[0], spawn_argv, 1, MPI_INFO_NULL, 0, temp, &inter2, &error);
        MPI_Intercomm_merge(inter2, 0, &unordered);
        MPI_Comm_split(unordered, 0, rank, &comm);
        MPI_Comm_free(&unordered);
        MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
        MPI_Barrier(comm);
        printf("%d, after last barrier\n", rank);
    }
    else
    {
        /* Respawned replacement processes: recover the intended rank from argv
         * and join the survivors through the parent intercommunicator. */
        rank = atoi(argv[1]);
        printf("%d: respawned\n", rank);
        if (rank == FIRST_FAIL)
        {
            MPI_Intercomm_merge(parent, 0, &unordered);
            MPI_Comm_split(unordered, 0, rank, &comm);
            MPI_Comm_free(&unordered);
            MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
            MPI_Barrier(comm);
            printf("%d, after second barrier\n", rank);

            rc = MPI_Barrier(comm);
            MPI_Error_string(rc, errstr, &len);
            printf("%d: Last barrier: %s\n", rank, errstr);
            MPIX_Comm_shrink(comm, &temp);
            MPI_Comm_free(&comm);
            char** spawn_argv = (char**)malloc(sizeof(char*) * 2);
            spawn_argv[0] = (char*)malloc(sizeof(char) * 2);
            sprintf(spawn_argv[0], "%d", SECOND_FAIL);
            spawn_argv[1] = NULL;
            MPI_Comm_spawn(argv[0], spawn_argv, 1, MPI_INFO_NULL, 0, temp, &inter2, &error);
            MPI_Intercomm_merge(inter2, 0, &unordered);
            MPI_Comm_split(unordered, 0, rank, &comm);
            MPI_Comm_free(&unordered);
            MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
            MPI_Barrier(comm);
            printf("%d, after last barrier\n", rank);
        }
        else if (rank == SECOND_FAIL)
        {
            MPI_Intercomm_merge(parent, 0, &unordered);
            MPI_Comm_split(unordered, 0, rank, &comm);
            MPI_Comm_free(&unordered);
            MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
            MPI_Barrier(comm);
            printf("%d, after last barrier\n", rank);
        }
    }
    MPI_Finalize();
    return 0;
}

In particular, a run with -np 4 --with-ft ulfm produces the following output:

1, after first barrier
0, after first barrier
2, after first barrier
3, after first barrier
2: Second barrier: MPI_ERR_PROC_FAILED: Process Failure
3: Second barrier: MPI_ERR_PROC_FAILED: Process Failure
0: Second barrier: MPI_ERR_PROC_FAILED: Process Failure
1: respawned
0, after second barrier
3, after second barrier
1, after second barrier
2, after second barrier
1: Last barrier: MPI_ERR_PROC_FAILED: Process Failure
2: Last barrier: MPI_ERR_PROC_FAILED: Process Failure
3: Last barrier: MPI_ERR_PROC_FAILED: Process Failure
0: respawned

and then terminates only when the job's time limit is reached. Analysing the problem a bit deeper, it seems that the MPI_Intercomm_merge between the second spawned process and all the others does not complete. The processes are spawned correctly (as shown by the "respawned" lines). Moreover, I managed to make the example work in some cases by changing the ranks of the processes to fail.
I know I should perform an MPIX_Comm_agree right after the barrier to be sure that all the processes see the same outcome, but in this run they do see the same outcome, as shown in the output.
Is there something wrong with my example, or is it a problem with the current v5.0.x branch?
Thanks in advance.

@bosilca (Member) commented May 31, 2023

The problem seems to be rooted in incorrect MODEX information collected by the processes of the second spawn about the processes from the first spawn. This translates into an incorrect selection of the collective module on the second intercomm_merge for all processes of the second spawn, and thus into a deadlock.

Until a fix is found, you should disable the HAN collective (--mca coll ^han). Please note that it is possible that other components use the MODEX location information (which, as described above, is incorrect for processes of the second spawn), in which case they will also make flawed decisions.
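For reference, a complete launch line with that workaround applied would look roughly like the following, assuming the reproducer above was built as ./reproducer (a placeholder name); exact launcher options may differ between versions:

mpirun -np 4 --with-ft ulfm --mca coll ^han ./reproducer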

@rhc54 (Contributor) commented May 31, 2023

Could you perhaps clarify what is wrong with the modex information? Are the proc IDs incorrectly matched with the payload data? Or something else?

@bosilca (Member) commented May 31, 2023

The relative location of processes is incorrect (more precisely, local processes are marked as non-local).

@rhc54 (Contributor) commented May 31, 2023

Hmmm...node location isn't part of the modex, so I'm assuming there is some other info being accessed here? I'm trying to figure out what info is incorrect and when. Is it after a failure occurs and is recovered from? Is it upon comm_spawn and before any failures? What does OMPI use to mark procs as non-local (hostname, nodeID...)?

I'm trying to understand if this is a PMIx problem or something in OMPI - so far, I'm not able to determine it. It sounds to me from the description that the problem occurs after a ULFM event, which makes sense - the job-level info isn't being updated, and so the original proc locations are still in the datastore. Someone would need to figure out how to do that - otherwise, ULFM will always result in incorrect data after an event.

@bosilca (Member) commented May 31, 2023

It has nothing to do with failures. I'm tracking this issue down to ompi/dpm/dpm.c. The processes from the second spawn have an empty peer_ranks list, so all the processes on the parent communicator are marked with the default non-local tag. This is incorrect, as in my case all processes are located on the same node.

@rhc54 (Contributor) commented May 31, 2023

Ah, okay - sounds like the problem is elsewhere then. I'll try to look and see why spawn isn't generating the ranks. Can you point me to the line numbers in dpm.c where this is done?

@bosilca (Member) commented May 31, 2023

As far as I can see, the process names are correct. The list is something like this job1;1.1:job1;1.2:job2;2.1:job3:3.*:1, which is what it should be according to the application: two processes from the first instance, one from the second instance (aka the first spawn), and then the current process from the third instance (aka the second spawn) with the * identifier followed by the number of processes, 1 in this case. However, starting from line 409 in dpm.c I don't think the code is correct, as it builds the list of local procs only from the jobid of the local process, which will then, at line 448, fail to mark any local process as such.

That being said, the rest of the processes (the original two and the one from the first spawn) correctly mark all processes as being on the same node, but I have not figured out how.
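To make the described logic concrete, here is a small self-contained toy. It is not the actual ompi/dpm/dpm.c code; every name and data structure below is hypothetical. It only contrasts restricting the locality lookup to the caller's own jobid with a per-jobid lookup, in the spirit of the patch that eventually closed this issue.

#include <stdio.h>
#include <string.h>

typedef struct {
    int jobid;        /* spawn instance the process belongs to */
    int vpid;         /* rank within its job */
    const char *node; /* host the process runs on */
} proc_t;

/* Buggy variant: only peers belonging to my own jobid are considered,
 * so a peer from any other jobid silently falls back to "non-local". */
static int is_local_buggy(const proc_t *me, const proc_t *peers, int n,
                          const proc_t *target)
{
    for (int i = 0; i < n; i++) {
        if (peers[i].jobid != me->jobid)
            continue;                               /* other jobs ignored */
        if (peers[i].jobid == target->jobid && peers[i].vpid == target->vpid)
            return 0 == strcmp(peers[i].node, me->node);
    }
    return 0;                                       /* default: non-local */
}

/* Fixed variant: look the target up under its own jobid. */
static int is_local_fixed(const proc_t *me, const proc_t *peers, int n,
                          const proc_t *target)
{
    for (int i = 0; i < n; i++) {
        if (peers[i].jobid == target->jobid && peers[i].vpid == target->vpid)
            return 0 == strcmp(peers[i].node, me->node);
    }
    return 0;
}

int main(void)
{
    /* Three jobs on one node, mirroring the reproducer: the original job (1),
     * the first spawn (2), and the second spawn (3). */
    const proc_t all[] = {
        { 1, 0, "node0" }, { 1, 1, "node0" },  /* survivors of the original job */
        { 2, 0, "node0" },                     /* replacement from the first spawn */
        { 3, 0, "node0" },                     /* replacement from the second spawn */
    };
    const int n = (int)(sizeof(all) / sizeof(all[0]));
    const proc_t *me = &all[3];                /* the second spawned process */

    for (int i = 0; i < n; i++)
        printf("job %d rank %d: buggy=%-9s fixed=%s\n", all[i].jobid, all[i].vpid,
               is_local_buggy(me, all, n, &all[i]) ? "local" : "non-local",
               is_local_fixed(me, all, n, &all[i]) ? "local" : "non-local");
    return 0;
}

With the buggy variant, the process from the second spawn sees every peer from jobs 1 and 2 as non-local even though they all run on node0, which matches the symptom described above.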

@rhc54 (Contributor) commented May 31, 2023

I can try to take a gander a little later today, assuming you haven't solved it by then.

@bosilca (Member) commented May 31, 2023

Please do so; I have another urgent matter to take care of right now and will only get back to this later today.

@rhc54 (Contributor) commented Jun 1, 2023

Afraid I cannot get this code to run at all, no matter what I do - it just segfaults immediately upon calling MPI_Comm_spawn. Can someone provide a reproducer minus the faults, since those don't seem to be part of the problem?

@bosilca (Member) commented Jun 1, 2023

Try with the following patch.

diff --git a/src/mca/rmaps/base/rmaps_base_map_job.c b/src/mca/rmaps/base/rmaps_base_map_job.c
index b151914a8c..9b0adcf4e0 100644
--- a/src/mca/rmaps/base/rmaps_base_map_job.c
+++ b/src/mca/rmaps/base/rmaps_base_map_job.c
@@ -302,7 +302,7 @@ void prte_rmaps_base_map_job(int fd, short args, void *cbdata)
     /* we always inherit a parent's oversubscribe flag unless the job assigned it */
     if (NULL != parent
         && !(PRTE_MAPPING_SUBSCRIBE_GIVEN & PRTE_GET_MAPPING_DIRECTIVE(jdata->map->mapping))) {
-        if (PRTE_MAPPING_NO_OVERSUBSCRIBE & PRTE_GET_MAPPING_DIRECTIVE(parent->map->mapping)) {
+        if ((NULL != parent->map) && (PRTE_MAPPING_NO_OVERSUBSCRIBE & PRTE_GET_MAPPING_DIRECTIVE(parent->map->mapping))) {
             PRTE_SET_MAPPING_DIRECTIVE(jdata->map->mapping, PRTE_MAPPING_NO_OVERSUBSCRIBE);
         } else {
             PRTE_UNSET_MAPPING_DIRECTIVE(jdata->map->mapping, PRTE_MAPPING_NO_OVERSUBSCRIBE);

@rhc54 (Contributor) commented Jun 1, 2023

I think it is something else in the app itself. I can run a simple spawn program (only executes one comm_spawn) just fine. However, when I try to run the example app from above, it crashes on the first spawn attempt.

I'm not saying that the above change might not be required for the second spawn - I cannot get that far to know. All I can say is that the example app is crashing in the first spawn, while my test program that contains only one spawn works just fine.

@Robyroc (Author) commented Jun 1, 2023

@rhc54: Afraid I cannot get this code to run at all, no matter what I do - it just segfaults immediately upon calling MPI_Comm_spawn. Can someone provide a reproducer minus the faults, since those don't seem to be part of the problem?

I created a fault-free program featuring the same problem, substituting the shrinks with splits. The code is available here. I also tried a different approach of just performing two consecutive spawns (here), but that does not show the problem and terminates correctly.
This makes me think that, rather than being a ULFM issue, it is more about the fact that the split (and the shrink, which may be built on top of it) does not propagate some needed information. Is this reasonable?

@bosilca (Member) commented Jun 1, 2023

The problem is in spawn, where only partial information is exchanged between processes in different jobids (because of spawn), so they cannot correctly place their peers in the hardware hierarchy. As I suggested above, if you run without the specific modules that need accurate placement information, the test runs to completion.

@rhc54 (Contributor) commented Jun 1, 2023

I confirm that the multi-spawn code (minus splits) works fine, at least for me on my box. Could be that I'm not using the "han" module - I'm not doing anything specific about it. I do hang on the fault-free code case, but I see the "peer_ranks" array always being filled, and we correctly determine that the procs are all on the local node.

For me, one proc is hanging in ompi_comm_nextcid - if I'm reading it correctly, it looks like it is rank=0 in the initial job.

Anyway, I'm not able to debug further as this is getting a little arcane for me in the MPI space. I'm not seeing any problems in spawning the job, nor in PRRTE returning the correct locality info after all the spawns. Could be we are getting blocked in nextcid due to something about all the splits and merges going on, but I honestly have no idea. I got lost in all that MPI communicator spaghetti 🤷‍♂️

@rhc54 (Contributor) commented Jun 1, 2023

I should note that I updated my PMIx and PRRTE submodule pointers to the current HEAD of their respective master branches. I don't know why that would make a difference here, but something to consider.

bosilca added a commit to bosilca/ompi that referenced this issue Jun 6, 2023
The original code was merging the local modex with the modex of the
local processes on the first jobid. This led to incorrect, and
mismatched, information among processes when joining processes from
multiple jobids (such as when merging the second spawn).

This patch iterates over all the jobids on the list of "to connect"
processes and adds their information to the local modex.

Fixes open-mpi#11724.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
bosilca added a commit to bosilca/ompi that referenced this issue Aug 15, 2023
@rhc54 (Contributor) commented Aug 18, 2023

I thought perhaps I would circle around and look at this again to see if something needs to be done on the PMIx side of things. However, reading through it again, it appears that (at least for me) the confusion was caused by the "modex" term. Looking at the code change, there actually isn't anything being done that has anything to do with the "modex" operation. True, you use the OPAL modex recv macro to retrieve a piece of info - but it isn't info that is actually exchanged. It's just the list of local peers, and the locality for any member of that list.

None of which gets exchanged in the "modex". So it appears that whatever is going on (which I still cannot reproduce without crashing on the Mac), it has nothing to do with the PMIx support - and the given patch seems to fix the problems in the OMPI code.

I'm not quite sure why people say this won't scale well - it is all completely local operations happening independently in each participating proc. I suppose the loop-over-a-loop logic might be considered a tad ugly, but this isn't a performance-critical operation anyway - and like I said, it doesn't involve any multi-process interactions.
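For readers unfamiliar with what "the list of local peers, and the locality" means in practice, the sketch below shows how that information can be queried directly through the public PMIx client API. It is a minimal illustration, assuming a PMIx-enabled launch (e.g. under mpirun/prterun); it is not the OMPI-internal code path that goes through the OPAL modex macros.

#include <stdio.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t me, target;
    pmix_value_t *val = NULL;

    if (PMIX_SUCCESS != PMIx_Init(&me, NULL, 0))
        return 1;

    /* Job-level info such as the node-local peer list is stored against the
     * wildcard rank of the namespace. */
    PMIX_LOAD_PROCID(&target, me.nspace, PMIX_RANK_WILDCARD);
    if (PMIX_SUCCESS == PMIx_Get(&target, PMIX_LOCAL_PEERS, NULL, 0, &val)) {
        printf("local peers of %s: %s\n", me.nspace, val->data.string);
        PMIX_VALUE_RELEASE(val);
    }

    /* Relative locality of one specific peer (here: rank 0 of my own job). */
    PMIX_LOAD_PROCID(&target, me.nspace, 0);
    if (PMIX_SUCCESS == PMIx_Get(&target, PMIX_LOCALITY_STRING, NULL, 0, &val)) {
        printf("locality of rank 0: %s\n", val->data.string);
        PMIX_VALUE_RELEASE(val);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}

For a merged intercommunicator that spans multiple namespaces (as after the second spawn here), the same lookups have to be performed against each peer's own namespace, which is essentially what the referenced patch arranges for.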

bosilca added a commit to bosilca/ompi that referenced this issue Feb 14, 2024