deadlock running netcdf test with openmpi 4.0.2 #7109

Closed

opoplawski opened this issue Oct 26, 2019 · 4 comments

@opoplawski (Contributor)

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

4.0.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Fedora packages

Please describe the system on which you are running

  • Operating system/version: Fedora 31 and 32
  • Network type: localhost

Details of the problem

The netcdf tst_parallel3 program hangs. A backtrace shows:

(gdb) bt
#0  0x00007f90c197529b in sched_yield () from /lib64/libc.so.6
#1  0x00007f90c1ac8a05 in ompi_request_default_wait () from /usr/lib64/openmpi/lib/libmpi.so.40
#2  0x00007f90c1b2b35c in ompi_coll_base_sendrecv_actual () from /usr/lib64/openmpi/lib/libmpi.so.40
#3  0x00007f90c1b2bb73 in ompi_coll_base_allreduce_intra_recursivedoubling () from /usr/lib64/openmpi/lib/libmpi.so.40
#4  0x00007f90be96e9c5 in mca_fcoll_vulcan_file_write_all () from /usr/lib64/openmpi/lib/openmpi/mca_fcoll_vulcan.so
#5  0x00007f90be9fada0 in mca_common_ompio_file_write_at_all () from /usr/lib64/openmpi/lib/libmca_common_ompio.so.41
#6  0x00007f90beb0610b in mca_io_ompio_file_write_at_all () from /usr/lib64/openmpi/lib/openmpi/mca_io_ompio.so
#7  0x00007f90c1af033f in PMPI_File_write_at_all () from /usr/lib64/openmpi/lib/libmpi.so.40
#8  0x00007f90c1627d7b in H5FD_mpio_write () from /usr/lib64/openmpi/lib/libhdf5.so.103
#9  0x00007f90c14636ee in H5FD_write () from /usr/lib64/openmpi/lib/libhdf5.so.103
#10 0x00007f90c1442eb3 in H5F__accum_write () from /usr/lib64/openmpi/lib/libhdf5.so.103
#11 0x00007f90c1543729 in H5PB_write () from /usr/lib64/openmpi/lib/libhdf5.so.103
#12 0x00007f90c144d69c in H5F_block_write () from /usr/lib64/openmpi/lib/libhdf5.so.103
#13 0x00007f90c161cd10 in H5C_apply_candidate_list () from /usr/lib64/openmpi/lib/libhdf5.so.103
#14 0x00007f90c161ad02 in H5AC__run_sync_point () from /usr/lib64/openmpi/lib/libhdf5.so.103
#15 0x00007f90c161bd4f in H5AC__flush_entries () from /usr/lib64/openmpi/lib/libhdf5.so.103
#16 0x00007f90c13b154d in H5AC_flush () from /usr/lib64/openmpi/lib/libhdf5.so.103
#17 0x00007f90c1446761 in H5F__flush_phase2.part.0 () from /usr/lib64/openmpi/lib/libhdf5.so.103
#18 0x00007f90c1448e64 in H5F__flush () from /usr/lib64/openmpi/lib/libhdf5.so.103
#19 0x00007f90c144dc08 in H5F_flush_mounts_recurse () from /usr/lib64/openmpi/lib/libhdf5.so.103
#20 0x00007f90c144f171 in H5F_flush_mounts () from /usr/lib64/openmpi/lib/libhdf5.so.103
#21 0x00007f90c143e3a5 in H5Fflush () from /usr/lib64/openmpi/lib/libhdf5.so.103
#22 0x00007f90c1c178c0 in sync_netcdf4_file (h5=0x56527e439b10) at ../../libhdf5/hdf5file.c:222
#23 0x00007f90c1c1816e in NC4_enddef (ncid=<optimized out>) at ../../libhdf5/hdf5file.c:544
#24 0x00007f90c1bd94f3 in nc_enddef (ncid=65536) at ../../libdispatch/dfile.c:1004
#25 0x000056527d0def27 in test_pio (flag=0) at ../../nc_test4/tst_parallel3.c:206
#26 0x000056527d0de62c in main (argc=<optimized out>, argv=<optimized out>) at ../../nc_test4/tst_parallel3.c:91

The processes are busy-waiting, consuming full CPU.

A workaround is to set OMPI_MCA_fcoll=^vulcan in the environment, which disables the vulcan fcoll component.
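
For reference, the collective call at the bottom of the stack is MPI_File_write_at_all. A minimal, self-contained sketch of that call path (hypothetical file name and sizes, not the netcdf test, and not expected to reproduce the hang on its own) looks like this:

/* Minimal sketch of the collective write path seen in the backtrace.
 * Hypothetical file name and sizes; this is not the netcdf test and is
 * not expected to reproduce the hang by itself.
 * Build: mpicc write_at_all_sketch.c -o write_at_all_sketch
 * Run:   mpirun -np 4 ./write_at_all_sketch
 */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "testfile.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes one block at its own offset. The call is collective:
     * every rank of the communicator must reach it, and the fcoll component
     * must make consistent decisions (e.g. how many aggregators to use) on
     * all ranks, otherwise its internal collectives no longer match. */
    int buf[1024];
    for (int i = 0; i < 1024; i++)
        buf[i] = rank;

    MPI_Offset offset = (MPI_Offset)rank * sizeof(buf);
    MPI_File_write_at_all(fh, offset, buf, 1024, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}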

edgargabriel self-assigned this Oct 29, 2019
edgargabriel added a commit to edgargabriel/ompi that referenced this issue Oct 29, 2019
This is based on a bug reported on the mailing list using a netcdf testcase.
The problem occurs if processes are using a custom file view, but on some
of them it appears as if the default file view is being used. Because of that,
the simple-grouping option led to a different number of aggregators being used
on different processes, and ultimately to a deadlock. This patch fixes the
problem by no longer using the file_view size for the calculation in the
simple-grouping option, but the contiguous chunk size (which is identical on
all processes).

Fixes issue open-mpi#7109

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
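
To illustrate the failure mode the commit describes (an illustrative sketch, not the actual vulcan code: file_view_size, chunk_size, and bytes_per_aggregator are hypothetical stand-ins): if the aggregator count is derived from a quantity that can differ across ranks, the ranks disagree on the collective communication pattern inside the fcoll component; deriving it from a quantity that is identical everywhere removes that possibility.

/* Illustrative sketch only -- not the vulcan implementation.
 * file_view_size, chunk_size and bytes_per_aggregator are hypothetical
 * stand-ins used to show why a rank-local input can produce mismatched
 * aggregator counts while a globally identical input cannot.
 * Build: mpicc aggregator_sketch.c -o aggregator_sketch
 * Run:   mpirun -np 4 ./aggregator_sketch
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Stand-ins: the view size may differ per rank (custom view on some
     * ranks, default view on others), the chunk size is the same everywhere. */
    long file_view_size = (rank % 2 == 0) ? 4096 : 1048576;
    long chunk_size     = 4096;

    /* Simple-grouping-style heuristic: one aggregator per N bytes
     * (hypothetical constant, for illustration only). */
    const long bytes_per_aggregator = 65536;
    int aggr_from_view  = (int)(file_view_size * nprocs / bytes_per_aggregator) + 1;
    int aggr_from_chunk = (int)(chunk_size     * nprocs / bytes_per_aggregator) + 1;

    /* Check whether all ranks agree on each count. */
    int min_v, max_v, min_c, max_c;
    MPI_Allreduce(&aggr_from_view,  &min_v, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);
    MPI_Allreduce(&aggr_from_view,  &max_v, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
    MPI_Allreduce(&aggr_from_chunk, &min_c, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);
    MPI_Allreduce(&aggr_from_chunk, &max_c, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("view-based count : %d..%d (%s)\n", min_v, max_v,
               min_v == max_v ? "consistent" : "mismatch -> collectives would not match");
        printf("chunk-based count: %d..%d (always consistent)\n", min_c, max_c);
    }

    MPI_Finalize();
    return 0;
}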
@edgargabriel (Member)

@opoplawski thank you once again for the bug report. I found the issue that leads to the deadlock; a fix is currently being applied to master and should be available in v4.0.3. Just as a note (for documentation purposes), this bug should not be present in the 3.0.x and 3.1.x series.

edgargabriel added a commit to edgargabriel/ompi that referenced this issue Nov 25, 2019
(cherry picked from commit ad5d0df)
@hppritcha (Member)

@opoplawski if you have a chance, could you check whether this problem is fixed in the 4.0.x release stream? Nightly tarballs that you can use for testing are at https://www.open-mpi.org/nightly/v4.0.x/

KKraljic pushed a commit to KKraljic/ompi that referenced this issue Dec 8, 2019
@opoplawski (Contributor, Author)

As near as I can tell, it appears to be fixed there. Thanks!

bosilca pushed a commit to bosilca/ompi that referenced this issue Dec 27, 2019
@hppritcha (Member)

Closed by #7193.

cniethammer pushed a commit to cniethammer/ompi that referenced this issue May 10, 2020
(cherry picked from commit ad5d0df)
svenstaro pushed a commit to archlinux/svntogit-packages that referenced this issue Jul 22, 2020
open-mpi/ompi#7109
