
Multiple failures running collective-big-count tests with OMPI main branch and 'ftagree' collective component #10191

Open
drwootton opened this issue Mar 30, 2022 · 0 comments


Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Open MPI main branch

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from current main branch (3/22/22)

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

 git submodule status
 1b86a35db2816ee9c0f3a41988005a2ba7d29adb 3rd-party/openpmix (v1.1.3-3481-g1b86a35d)
 91f791e209ccbdfb4b8647900d292ef51d52f37d 3rd-party/prrte (psrvr-v2.0.0rc1-4319-g91f791e2)

Please describe the system on which you are running

  • Operating system/version: RHEL 8.4
  • Computer hardware: Single Power8 node
  • Network type: Localhost

Details of the problem

I ran the set of self-checking tests from ompi-tests-public/collective-big-count with the collective components specified as --mca coll ftagree,basic,sm,self,inter,libnbc.

The following testcases had failures. The remaining testcases were successful:

  • test_allgather_uniform_count
  • test_alltoall_uniform_count
  • test_gather_uniform_count
  • test_scatter_uniform_count

The tests were compiled by running make in the directory containing the source files.

The following environment variables were set for all tests:

BIGCOUNT_HOSTS          : -np 3
BIGCOUNT_MEMORY_PERCENT : 70
BIGCOUNT_MEMORY_DIFF    : 10
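For reproducibility, the setup above can be sketched as a shell fragment (the long binary path from the report is shortened here; this is a dry run that only prints the mpirun invocation rather than launching it, since a real run needs an Open MPI build and roughly 70% of node memory):

```shell
# Environment reported in the issue, exported for all test runs.
export BIGCOUNT_HOSTS="-np 3"
export BIGCOUNT_MEMORY_PERCENT=70
export BIGCOUNT_MEMORY_DIFF=10

# Dry run: echo the invocation instead of executing it.
echo mpirun ${BIGCOUNT_HOSTS} \
  -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} \
  -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} \
  --mca coll ftagree,basic,sm,self,inter,libnbc \
  ./test_allgather_uniform_count
```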

These errors look like the same errors I saw with the tuned collective component (#10190). I don't know whether these failures are in common code or whether they are triggered by the same or similar problems in each component's code.

This command fails with a self-check error message followed by a SIGSEGV in MPI_Wait

mpirun ${BIGCOUNT_HOSTS} -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} --mca coll ftagree,basic,sm,self,inter,libnbc /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count

This is the self-check error message and traceback

---------------------
Results from MPI_Iallgather(int x 6442450941 = 25769803764 or 24.0 GB):  MPI_IN_PLACE
Rank  0: ERROR: DI in     3489677312 of     6442450941 slots (  54.2 % wrong)
Rank  1: ERROR: DI in     3489677310 of     6442450941 slots (  54.2 % wrong)
Rank  2: ERROR: DI in     1342193665 of     6442450941 slots (  20.8 % wrong)
--------------------- Adjust count to fit in memory: 2147483647 x  60.0% = 1288490188
Root  : payload    61847529024  57.6 GB =  16 dt x 1288490188 count x   3 peers x   1.0 inflation
Peer  : payload    61847529024  57.6 GB =  16 dt x 1288490188 count x   3 peers x   1.0 inflation
Total : payload   185542587072 172.8 GB =  57.6 GB root +  57.6 GB x   2 local peers
---------------------
Results from MPI_Iallgather(double _Complex x 3865470564 = 61847529024 or 57.6 GB):  MPI_IN_PLACE
[c656f6n01:1822338] *** Process received signal ***
[c656f6n01:1822338] Signal: Segmentation fault (11)
[c656f6n01:1822338] Signal code: Address not mapped (1)
[c656f6n01:1822338] Failing at address: 0x1ff9a2999990
[c656f6n01:1822338] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[c656f6n01:1822338] [ 1] /usr/lib64/libc.so.6(+0xb083c)[0x20000074083c]
[c656f6n01:1822338] [ 2] /u/dwootton/ompi-master/lib/libmpi.so.0(+0x3f85e8)[0x2000004785e8]
[c656f6n01:1822338] [ 3] /u/dwootton/ompi-master/lib/libmpi.so.0(mca_pml_ob1_send_request_schedule_once+0x2a4)[0x20000047c034]
[c656f6n01:1822338] [ 4] /u/dwootton/ompi-master/lib/libmpi.so.0(+0x3ea700)[0x20000046a700]
[c656f6n01:1822338] [ 5] /u/dwootton/ompi-master/lib/libmpi.so.0(+0x3ea7c0)[0x20000046a7c0]
[c656f6n01:1822338] [ 6] /u/dwootton/ompi-master/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_ack+0x2e4)[0x20000046cca8]
[c656f6n01:1822338] [ 7] /u/dwootton/ompi-master/lib/libopen-pal.so.0(mca_btl_sm_poll_handle_frag+0x200)[0x2000009f076c]
[c656f6n01:1822338] [ 8] /u/dwootton/ompi-master/lib/libopen-pal.so.0(+0xf0890)[0x2000009f0890]
[c656f6n01:1822338] [ 9] /u/dwootton/ompi-master/lib/libopen-pal.so.0(+0xf0c08)[0x2000009f0c08]
[c656f6n01:1822338] [10] /u/dwootton/ompi-master/lib/libopen-pal.so.0(opal_progress+0x5c)[0x20000093d5b0]
[c656f6n01:1822338] [11] /u/dwootton/ompi-master/lib/libmpi.so.0(+0xded50)[0x20000015ed50]
[c656f6n01:1822338] [12] /u/dwootton/ompi-master/lib/libmpi.so.0(ompi_request_default_wait+0x38)[0x20000015edfc]
[c656f6n01:1822338] [13] /u/dwootton/ompi-master/lib/libmpi.so.0(MPI_Wait+0x194)[0x200000258e38]
[c656f6n01:1822338] [14] /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count[0x100030d8]
[c656f6n01:1822338] [15] /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_allgather_uniform_count[0x100029d0]
[c656f6n01:1822338] [16] /usr/lib64/libc.so.6(+0x24c78)[0x2000006b4c78]
[c656f6n01:1822338] [17] /usr/lib64/libc.so.6(__libc_start_main+0xb4)[0x2000006b4e64]
[c656f6n01:1822338] *** End of error message ***

This command fails with a self-check error message followed by a SIGSEGV in MPI_Wait

mpirun ${BIGCOUNT_HOSTS} -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} --mca coll ftagree,basic,sm,self,inter,libnbc /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_alltoall_uniform_count

This is the self-check error message and traceback

---------------------
Results from MPI_Alltoall(int x 6442450941 = 25769803764 or 24.0 GB): MPI_IN_PLACE
Rank  1: ERROR: DI in     2147483647 of     2147483647 slots ( 100.0 % wrong)
Rank  2: ERROR: DI in     4294967294 of     2147483647 slots ( 200.0 % wrong)
Rank  0: ERROR: DI in     5368676346 of     2147483647 slots ( 250.0 % wrong)
--------------------- Adjust count to fit in memory: 2147483647 x  60.0% = 1288490188
Root  : payload    61847529024  57.6 GB =  16 dt x 1288490188 count x   3 peers x   1.0 inflation
Peer  : payload    61847529024  57.6 GB =  16 dt x 1288490188 count x   3 peers x   1.0 inflation
Total : payload   185542587072 172.8 GB =  57.6 GB root +  57.6 GB x   2 local peers
[c656f6n01:1823554] *** Process received signal ***
[c656f6n01:1823554] Signal: Segmentation fault (11)
[c656f6n01:1823554] Signal code: Address not mapped (1)
[c656f6n01:1823554] Failing at address: 0x1ff9a2999990
[c656f6n01:1823554] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[c656f6n01:1823554] [ 1] /usr/lib64/libc.so.6(+0xb083c)[0x20000074083c]
[c656f6n01:1823554] [ 2] /u/dwootton/ompi-master/lib/libmpi.so.0(mca_coll_base_alltoall_intra_basic_inplace+0x22c)[0x2000002b3c94]
[c656f6n01:1823554] [ 3] /u/dwootton/ompi-master/lib/libmpi.so.0(ompi_coll_base_alltoall_intra_basic_linear+0x8c)[0x2000002b5684]
[c656f6n01:1823554] [ 4] /u/dwootton/ompi-master/lib/libmpi.so.0(PMPI_Alltoall+0x538)[0x200000193cd4]
[c656f6n01:1823554] [ 5] /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_alltoall_uniform_count[0x10002dd0]
[c656f6n01:1823554] [ 6] /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_alltoall_uniform_count[0x1000289c]
[c656f6n01:1823554] [ 7] /usr/lib64/libc.so.6(+0x24c78)[0x2000006b4c78]
[c656f6n01:1823554] [ 8] /usr/lib64/libc.so.6(__libc_start_main+0xb4)[0x2000006b4e64]
[c656f6n01:1823554] *** End of error message ***

This command fails with a self-check error message followed by a double free or storage corruption.

mpirun ${BIGCOUNT_HOSTS} -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} --mca coll ftagree,basic,sm,self,inter,libnbc /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_gather_uniform_count

This is the self-check error message and the traceback

---------------------
Results from MPI_Igather(double _Complex x 6442450941 = 103079215056 or 96.0 GB):
Rank  0: ERROR: DI in     4294967292 of     6442450941 slots (  66.7 % wrong)
double free or corruption (out)
[c656f6n01:1823979] *** Process received signal ***
[c656f6n01:1823979] Signal: Aborted (6)
[c656f6n01:1823979] Signal code:  (-6)
[c656f6n01:1823979] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x2000000604d8]
[c656f6n01:1823979] [ 1] /usr/lib64/libc.so.6(gsignal+0xd8)[0x2000006d44d8]
[c656f6n01:1823979] [ 2] /usr/lib64/libc.so.6(abort+0x164)[0x2000006b462c]
[c656f6n01:1823979] [ 3] /usr/lib64/libc.so.6(+0x908bc)[0x2000007208bc]
[c656f6n01:1823979] [ 4] /usr/lib64/libc.so.6(+0x9b828)[0x20000072b828]
[c656f6n01:1823979] [ 5] /usr/lib64/libc.so.6(+0x9e0ec)[0x20000072e0ec]
[c656f6n01:1823979] [ 6] ./test_gather_uniform_count[0x100030b0]
[c656f6n01:1823979] [ 7] ./test_gather_uniform_count[0x10002920]
[c656f6n01:1823979] [ 8] /usr/lib64/libc.so.6(+0x24c78)[0x2000006b4c78]
[c656f6n01:1823979] [ 9] /usr/lib64/libc.so.6(__libc_start_main+0xb4)[0x2000006b4e64]
[c656f6n01:1823979] *** End of error message ***

This command fails with self-check error messages.

mpirun ${BIGCOUNT_HOSTS} -x BIGCOUNT_MEMORY_PERCENT=${BIGCOUNT_MEMORY_PERCENT} -x BIGCOUNT_MEMORY_DIFF=${BIGCOUNT_MEMORY_DIFF} --mca coll ftagree,basic,sm,self,inter,libnbc /u/dwootton/bigcount-master/BigCountUpstream/ompi-tests-public/collective-big-count/./test_scatter_uniform_count

This is the self-check error message

---------------------
Results from MPI_Iscatter(int x 6442450941 = 25769803764 or 24.0 GB):
Rank  2: ERROR: DI in     2147483647 of     2147483647 slots ( 100.0 % wrong)
Rank  1: PASSED
Rank  0: PASSED
---------------------
Results from MPI_Iscatter(double _Complex x 6442450941 = 103079215056 or 96.0 GB):
Rank  1: PASSED
Rank  2: ERROR: DI in     2147483647 of     2147483647 slots ( 100.0 % wrong)
Rank  0: PASSED
=====> FAIL (2): Scatter (./test_scatter_uniform_count)
