MPI-4: Broken P{send|recv}_init routines #10390

Closed
dalcinl opened this issue May 16, 2022 · 3 comments · Fixed by #10061
Comments

@dalcinl
Contributor

dalcinl commented May 16, 2022

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

git branch v5.0.x

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

mkdir BUILD
cd BUILD
../configure --prefix=/home/devel/mpi/openmpi/5.0.0 --without-ofi --without-ucx --with-pmix=internal --enable-debug --enable-mem-debug --disable-man-pages --disable-sphinx && make -j 32 install

NOTE: I'm configuring without OFI and UCX; PMIx is the internal copy, and hwloc is the system package (Fedora 36, version 2.5.0).

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

$ git submodule status
 8c39d8e6a95d6fa78e765b5e86c324bd8a4ecd56 3rd-party/openpmix (v4.1.2-58-g8c39d8e6)
 f75647a0518b5a476011f543200fca1cf8600cb8 3rd-party/prrte (v2.0.2-99-gf75647a051)

Please describe the system on which you are running

  • Operating system/version: Linux 5.17.6 (Fedora 35)
  • Computer hardware: AMD Ryzen Threadripper PRO 3995WX 64-Cores
  • Network type: isolated

Details of the problem

I could not find any Open MPI-specific test for MPI-4 partitioned communication. Therefore, I wrote my own trivial reproducer:

#include <mpi.h>

int main(int argc, char *argv[])
{
    int buf[1] = {0};
    MPI_Request sreq, rreq;

    MPI_Init(&argc, &argv);

    MPI_Psend_init(buf, 1, 1, MPI_INT, 0, 0, MPI_COMM_SELF, MPI_INFO_NULL, &sreq);
    MPI_Precv_init(buf, 1, 1, MPI_INT, 0, 0, MPI_COMM_SELF, MPI_INFO_NULL, &rreq);

    MPI_Request_free(&sreq);
    MPI_Request_free(&rreq);
    MPI_Finalize();
    return 0;
}
$ mpicc --version
gcc (GCC) 11.3.1 20220421 (Red Hat 11.3.1-2)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ mpicc -g3 tmp.c

$ ./a.out 
[localhost:2807591] *** Process received signal ***
[localhost:2807591] Signal: Segmentation fault (11)
[localhost:2807591] Signal code: Address not mapped (1)
[localhost:2807591] Failing at address: (nil)
[localhost:2807591] [ 0] /lib64/libc.so.6(+0x55e30)[0x7f67508efe30]
[localhost:2807591] *** End of error message ***
Segmentation fault (core dumped)

Valgrind is not helpful:

$ valgrind -q ./a.out 
==2807606== Jump to the invalid address stated on the next line
==2807606==    at 0x0: ???
==2807606==    by 0x4011C8: main (tmp.c:10)
==2807606==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==2807606== 
[localhost:2807606] *** Process received signal ***
[localhost:2807606] Signal: Segmentation fault (11)
[localhost:2807606] Signal code: Invalid permissions (2)
[localhost:2807606] Failing at address: (nil)
[localhost:2807606] [ 0] /lib64/libc.so.6(+0x55e30)[0x4d71e30]
[localhost:2807606] *** End of error message ***
Segmentation fault (core dumped)

I do not understand why the jump address is 0x0. The symbol is definitely in the library:

$ nm /home/devel/mpi/openmpi/5.0.0/lib/libmpi.so | grep Psend_init
0000000000133ac5 W MPI_Psend_init
0000000000133ac5 T PMPI_Psend_init

Perhaps a compiler bug?

@awlauria
Contributor

@mdosanjh

@mdosanjh
Contributor

mdosanjh commented May 24, 2022

It looks like the initialization for partitioned communication didn't make it into the conversion to ompi/instance/instance.c.

I have a draft of a fix at #10061. I'll work on getting it polished and merged.
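
For readers wondering how a symbol can be present in the library while the call still jumps to 0x0: one plausible mechanism, consistent with a missing initialization step, is a public wrapper that dispatches through a function-pointer table that was never filled in. The sketch below is illustrative only. The names part_module, part_module_init, and my_psend_init are hypothetical and are not Open MPI's actual symbols or code.

```c
#include <stddef.h>

/* Hypothetical stand-in for a component's table of entry points.
 * These names are illustrative, not Open MPI's real symbols. */
typedef int (*psend_init_fn)(int partitions);

struct part_module {
    psend_init_fn psend_init;   /* stays NULL until the module is initialized */
};

/* Static storage: every pointer in the table starts out as NULL. */
static struct part_module selected_module;

static int real_psend_init(int partitions)
{
    return partitions > 0 ? 0 : -1;   /* 0 = success, -1 = bad argument */
}

/* Assumed analogue of the initialization step that went missing. */
void part_module_init(void)
{
    selected_module.psend_init = real_psend_init;
}

/* Public wrapper: this symbol exists in the library whether or not the
 * table behind it was filled in. Without the NULL check, calling through
 * an unset pointer is a jump to address 0x0. */
int my_psend_init(int partitions)
{
    if (selected_module.psend_init == NULL) {
        return -2;   /* module not initialized */
    }
    return selected_module.psend_init(partitions);
}
```

In this sketch, calling my_psend_init before part_module_init returns -2 instead of crashing. Dropping the NULL check would give the same kind of failure as the valgrind trace above, where the caller's frame is valid but the target address is 0x0.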

@jsquyres
Member

@awlauria This feels like a 5.0 blocker.
