deadlock running netcdf test with openmpi 4.0.2 #7109
Comments
This is based on a bug reported on the mailing list using a netcdf testcase. The problem occurs if processes are using a custom file view, but on some of them it appears as if the default file view is being used. Because of that, the simple-grouping option led to a different number of aggregators being used on different processes, and ultimately to a deadlock. This patch fixes the problem by basing the calculation in the simple-grouping option on the contiguous chunk size (which is identical on all processes) instead of the file_view size.
Fixes issue open-mpi#7109
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
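To illustrate the idea behind the fix: if each rank derives the aggregator count from a rank-local quantity (such as its own file view size), ranks can compute different counts, build mismatched collective communication patterns, and deadlock; deriving the count from a value that is identical on every rank keeps the grouping consistent. The sketch below is only an illustration of that principle, not the Open MPI ompio/fcoll code; pick_num_aggregators() and the 4 MiB divisor are invented for this example.

```c
/* Illustration only: derive the aggregator count from a value that is
 * identical on all ranks, so every rank builds the same grouping.
 * pick_num_aggregators() and the 4 MiB divisor are made up for this
 * sketch; they are not the Open MPI implementation. */
#include <mpi.h>
#include <stdio.h>

static int pick_num_aggregators(long common_chunk_size, int nprocs)
{
    /* one aggregator per 4 MiB of the globally agreed chunk size,
     * capped at the number of processes */
    int n = (int)(common_chunk_size / (4L * 1024 * 1024)) + 1;
    return n < nprocs ? n : nprocs;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* a quantity that is identical on every rank; because the input is
     * the same everywhere, every rank computes the same aggregator count
     * and the resulting collective pattern matches on all ranks */
    long common_chunk_size = 8L * 1024 * 1024;
    int naggr = pick_num_aggregators(common_chunk_size, nprocs);

    printf("rank %d of %d: %d aggregators\n", rank, nprocs, naggr);
    MPI_Finalize();
    return 0;
}
```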
@opoplawski thank you once again for the bug report. I found the issue that leads to the deadlock; a fix is currently being applied to master and should be available in v4.0.3. Just as a note (for documentation purposes), this bug should not be present in the 3.0.x and 3.1.x series.
@opoplawski if you have a chance, could you check whether this problem is fixed in the 4.0.x release stream? Nightly tarballs that you can use for testing are at https://www.open-mpi.org/nightly/v4.0.x/
As near as I can tell, it appears to be fixed there. Thanks!
Closed by #7193.
Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
4.0.2
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Fedora packages
Please describe the system on which you are running
Details of the problem
The netcdf tst_parallel3 program hangs; a backtrace shows the processes spinning at full CPU rather than making progress.
The workaround is to set OMPI_MCA_fcoll=^vulcan (i.e., exclude the vulcan fcoll component).
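For context, the I/O pattern involved is a collective write through a custom file view. The following is a hypothetical minimal sketch of that pattern, not the netcdf tst_parallel3 source; the filename, block size, and datatypes are chosen arbitrarily for illustration.

```c
/* Hypothetical sketch of the I/O pattern reported to hang: each rank
 * installs a custom (subarray) file view and performs a collective write. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 64;                      /* local block length        */
    int gsizes[1] = { n * nprocs };        /* global array size         */
    int lsizes[1] = { n };                 /* this rank's piece         */
    int starts[1] = { n * rank };          /* offset of this rank       */

    MPI_Datatype filetype;
    MPI_Type_create_subarray(1, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "testfile.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);

    int buf[64];
    for (int i = 0; i < n; i++) buf[i] = rank;
    MPI_File_write_all(fh, buf, n, MPI_INT, MPI_STATUS_IGNORE);  /* collective write */

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}
```

The same exclusion can also be applied on the command line (equivalent to the environment variable above), e.g. mpirun --mca fcoll ^vulcan ./a.out, which makes Open MPI pick a different fcoll component.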