Hang in MPI_File_write_at_all in 4.0.2 #7414
Comments
Could you provide me with a test case for your code? Also, what file system are you using?
Thanks. I will try the patch in #7193 and see if that fixes things. I've been unable to create a smaller test case. This happens on Lustre and on a Linux scratch file system, so it seems to be file-system agnostic.
@gsjaardema I think the file system information is still useful here. On Lustre, Open MPI actually does not use ompio by default, but falls back to romio321. Even if ompio were used on Lustre (which can happen if the romio component for whatever reason disqualifies itself), the vulcan component is not selected for File_write_at_all operations; instead, the dynamic_gen2 component should be used, a component written specifically for Lustre and not used on any other file system. If all of these configurations (romio, ompio with vulcan and/or dynamic_gen2) show a deadlock, there is a non-zero chance that the bug is not actually in the MPI library, but somewhere further up the software stack.
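To help narrow this down, the io and fcoll components can be pinned explicitly for a test run, e.g. with --mca io romio321 or --mca io ompio --mca fcoll dynamic_gen2 on the mpirun command line. A rough C sketch of the same idea, assuming OMPI_MCA_* environment variables are picked up when set before MPI_Init (the component names here are just the ones mentioned above, not a recommendation):

```c
/* Sketch: pin Open MPI's parallel-I/O components for a comparison run.
 * Assumes OMPI_MCA_* environment variables set before MPI_Init are
 * honored; equivalently, pass "--mca io ompio --mca fcoll dynamic_gen2"
 * (or "--mca io romio321") to mpirun. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Select the io framework component and, for ompio, the fcoll component. */
    setenv("OMPI_MCA_io", "ompio", 1);
    setenv("OMPI_MCA_fcoll", "dynamic_gen2", 1);

    MPI_Init(&argc, &argv);

    /* ... the MPI_File_open / MPI_File_write_at_all calls would go here ... */

    MPI_Finalize();
    return 0;
}
```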
I applied the patch in #7193 and the hang disappeared on the Linux workstation. I will try to rebuild on the cluster with Lustre, hopefully tomorrow, and see what happens there. Thanks.
Closing the issue. Can always reopen if necessary. |
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
Version v4.0.2
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
I have tried this with a couple of different installations: one built by someone else and installed on my system, and one that I built myself via Spack.
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running
Details of the problem
My application uses HDF5-1.10.6 and calls down into MPI. The call that is hanging is MPI_File_write_at_all on 16 ranks. For this call, ranks 0..7 are outputting the same amount of data at different offsets (non-overlapping and contiguous) and ranks 8..15 have no data to output.
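For reference, the access pattern boils down to something like the sketch below. This is not the actual HDF5 call path (which I have not been able to reduce); the file name, element count, and datatype are placeholders. Run on 16 ranks, e.g. mpiexec -np 16 ./a.out.

```c
/* Minimal sketch of the access pattern described above: ranks 0..7 write
 * the same amount of data at contiguous, non-overlapping offsets via
 * MPI_File_write_at_all, while ranks 8..15 participate in the collective
 * with a zero count. Buffer size, file name, and datatype are made up. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    const int count = 1024 * 1024;               /* same amount on ranks 0..7 */
    double *buf = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++) buf[i] = (double)rank;

    /* Ranks 0..7 write contiguous, non-overlapping blocks;
     * ranks 8..15 call the collective with nothing to write. */
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
    int mycount = (rank < 8) ? count : 0;

    MPI_File_write_at_all(fh, offset, buf, mycount, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```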
When the job hangs, I can examine the stack in the debugger. The 8 ranks that have data to output are in mca_fcoll_vulcan_file_read_all and the 8 ranks with no data have returned. Further probing in the debugger shows that ranks 0..7 think that the ompi_file_t procs_in_group setting is 1 and includes all procs, but ranks 8..15 think that the procs_in_group value is 2, consisting of 0..7 and 8..15. This is probably what is causing the issue, but I am unable to track down where they are diverging in this setting.

If I run the same executable using an openmpi-4.0.1 mpiexec (without any recompiling), everything runs to completion.
I am not really asking for a solution, since my details are probably a little scarce, but rather looking for something to try or ways to help debug this.