deadlock running netcdf test with openmpi 4.0.2 #7109

Closed

opoplawski opened this issue Oct 26, 2019 · 4 comments

@opoplawski (Contributor)

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

4.0.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Fedora packages

Please describe the system on which you are running

  • Operating system/version: Fedora 31 and 32
  • Network type: localhost

Details of the problem

The netcdf tst_parallel3 program hangs. A backtrace shows:

(gdb) bt
#0  0x00007f90c197529b in sched_yield () from /lib64/libc.so.6
#1  0x00007f90c1ac8a05 in ompi_request_default_wait () from /usr/lib64/openmpi/lib/libmpi.so.40
#2  0x00007f90c1b2b35c in ompi_coll_base_sendrecv_actual () from /usr/lib64/openmpi/lib/libmpi.so.40
#3  0x00007f90c1b2bb73 in ompi_coll_base_allreduce_intra_recursivedoubling () from /usr/lib64/openmpi/lib/libmpi.so.40
#4  0x00007f90be96e9c5 in mca_fcoll_vulcan_file_write_all () from /usr/lib64/openmpi/lib/openmpi/mca_fcoll_vulcan.so
#5  0x00007f90be9fada0 in mca_common_ompio_file_write_at_all () from /usr/lib64/openmpi/lib/libmca_common_ompio.so.41
#6  0x00007f90beb0610b in mca_io_ompio_file_write_at_all () from /usr/lib64/openmpi/lib/openmpi/mca_io_ompio.so
#7  0x00007f90c1af033f in PMPI_File_write_at_all () from /usr/lib64/openmpi/lib/libmpi.so.40
#8  0x00007f90c1627d7b in H5FD_mpio_write () from /usr/lib64/openmpi/lib/libhdf5.so.103
#9  0x00007f90c14636ee in H5FD_write () from /usr/lib64/openmpi/lib/libhdf5.so.103
#10 0x00007f90c1442eb3 in H5F__accum_write () from /usr/lib64/openmpi/lib/libhdf5.so.103
#11 0x00007f90c1543729 in H5PB_write () from /usr/lib64/openmpi/lib/libhdf5.so.103
#12 0x00007f90c144d69c in H5F_block_write () from /usr/lib64/openmpi/lib/libhdf5.so.103
#13 0x00007f90c161cd10 in H5C_apply_candidate_list () from /usr/lib64/openmpi/lib/libhdf5.so.103
#14 0x00007f90c161ad02 in H5AC__run_sync_point () from /usr/lib64/openmpi/lib/libhdf5.so.103
#15 0x00007f90c161bd4f in H5AC__flush_entries () from /usr/lib64/openmpi/lib/libhdf5.so.103
#16 0x00007f90c13b154d in H5AC_flush () from /usr/lib64/openmpi/lib/libhdf5.so.103
#17 0x00007f90c1446761 in H5F__flush_phase2.part.0 () from /usr/lib64/openmpi/lib/libhdf5.so.103
#18 0x00007f90c1448e64 in H5F__flush () from /usr/lib64/openmpi/lib/libhdf5.so.103
#19 0x00007f90c144dc08 in H5F_flush_mounts_recurse () from /usr/lib64/openmpi/lib/libhdf5.so.103
#20 0x00007f90c144f171 in H5F_flush_mounts () from /usr/lib64/openmpi/lib/libhdf5.so.103
#21 0x00007f90c143e3a5 in H5Fflush () from /usr/lib64/openmpi/lib/libhdf5.so.103
#22 0x00007f90c1c178c0 in sync_netcdf4_file (h5=0x56527e439b10) at ../../libhdf5/hdf5file.c:222
#23 0x00007f90c1c1816e in NC4_enddef (ncid=<optimized out>) at ../../libhdf5/hdf5file.c:544
#24 0x00007f90c1bd94f3 in nc_enddef (ncid=65536) at ../../libdispatch/dfile.c:1004
#25 0x000056527d0def27 in test_pio (flag=0) at ../../nc_test4/tst_parallel3.c:206
#26 0x000056527d0de62c in main (argc=<optimized out>, argv=<optimized out>) at ../../nc_test4/tst_parallel3.c:91

The processes are busy-waiting, consuming full CPU.

A workaround is to set OMPI_MCA_fcoll=^vulcan in the environment, which disables the vulcan fcoll component.
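
For reference, the collective call at the bottom of the stack is MPI_File_write_at_all. A minimal, self-contained sketch of that call path (hypothetical file name and sizes, not the netcdf test, and not expected to reproduce the hang on its own) looks like this:

/* Minimal sketch of the collective write path seen in the backtrace.
 * Hypothetical file name and sizes; this is not the netcdf test and is
 * not expected to reproduce the hang by itself.
 * Build: mpicc write_at_all_sketch.c -o write_at_all_sketch
 * Run:   mpirun -np 4 ./write_at_all_sketch
 */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "testfile.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes one block at its own offset. The call is collective:
     * every rank of the communicator must reach it, and the fcoll component
     * must make consistent decisions (e.g. how many aggregators to use) on
     * all ranks, otherwise its internal collectives no longer match. */
    int buf[1024];
    for (int i = 0; i < 1024; i++)
        buf[i] = rank;

    MPI_Offset offset = (MPI_Offset)rank * sizeof(buf);
    MPI_File_write_at_all(fh, offset, buf, 1024, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}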

edgargabriel self-assigned this Oct 29, 2019
edgargabriel added a commit to edgargabriel/ompi that referenced this issue Oct 29, 2019
This is based on a bug reported on the mailing list using a netcdf testcase.
The problem occurs if processes are using a custom file view, but on some
of them it appears as if the default file view is being used. Because of that,
the simple-grouping option led to a different number of aggregators being used
on different processes, and ultimately to a deadlock. This patch fixes the
problem by no longer using the file_view size for the calculation in the
simple-grouping option, but the contiguous chunk size (which is identical on
all processes).

Fixes issue open-mpi#7109

Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
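
To illustrate the failure mode the commit describes (an illustrative sketch, not the actual vulcan code: file_view_size, chunk_size, and bytes_per_aggregator are hypothetical stand-ins): if the aggregator count is derived from a quantity that can differ across ranks, the ranks disagree on the collective communication pattern inside the fcoll component; deriving it from a quantity that is identical everywhere removes that possibility.

/* Illustrative sketch only -- not the vulcan implementation.
 * file_view_size, chunk_size and bytes_per_aggregator are hypothetical
 * stand-ins used to show why a rank-local input can produce mismatched
 * aggregator counts while a globally identical input cannot.
 * Build: mpicc aggregator_sketch.c -o aggregator_sketch
 * Run:   mpirun -np 4 ./aggregator_sketch
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Stand-ins: the view size may differ per rank (custom view on some
     * ranks, default view on others), the chunk size is the same everywhere. */
    long file_view_size = (rank % 2 == 0) ? 4096 : 1048576;
    long chunk_size     = 4096;

    /* Simple-grouping-style heuristic: one aggregator per N bytes
     * (hypothetical constant, for illustration only). */
    const long bytes_per_aggregator = 65536;
    int aggr_from_view  = (int)(file_view_size * nprocs / bytes_per_aggregator) + 1;
    int aggr_from_chunk = (int)(chunk_size     * nprocs / bytes_per_aggregator) + 1;

    /* Check whether all ranks agree on each count. */
    int min_v, max_v, min_c, max_c;
    MPI_Allreduce(&aggr_from_view,  &min_v, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);
    MPI_Allreduce(&aggr_from_view,  &max_v, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
    MPI_Allreduce(&aggr_from_chunk, &min_c, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);
    MPI_Allreduce(&aggr_from_chunk, &max_c, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("view-based count : %d..%d (%s)\n", min_v, max_v,
               min_v == max_v ? "consistent" : "mismatch -> collectives would not match");
        printf("chunk-based count: %d..%d (always consistent)\n", min_c, max_c);
    }

    MPI_Finalize();
    return 0;
}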
@edgargabriel (Member)

@opoplawski thank you once again for the bug report. I found the issue that leads to the deadlock; a fix is currently being applied to master and should be available in v4.0.3. Just as a note (for documentation purposes), this bug should not be present in the 3.0.x and 3.1.x series.

edgargabriel added a commit to edgargabriel/ompi that referenced this issue Nov 25, 2019
(cherry picked from commit ad5d0df)
@hppritcha (Member)

@opoplawski if you have a chance, could you check whether this problem is fixed in the 4.0.x release stream? Nightly tarballs that you can use for testing are at https://www.open-mpi.org/nightly/v4.0.x/

KKraljic pushed a commit to KKraljic/ompi that referenced this issue Dec 8, 2019
@opoplawski (Contributor, Author)

As near as I can tell, it appears to be fixed there. Thanks!

bosilca pushed a commit to bosilca/ompi that referenced this issue Dec 27, 2019
@hppritcha (Member)

Closed by #7193.

cniethammer pushed a commit to cniethammer/ompi that referenced this issue May 10, 2020
(cherry picked from commit ad5d0df)
svenstaro pushed a commit to archlinux/svntogit-packages that referenced this issue Jul 22, 2020
open-mpi/ompi#7109
