
Hang in MPI_File_write_at_all in 4.0.2 #7414

Closed
gsjaardema opened this issue Feb 17, 2020 · 6 comments
@gsjaardema

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Version v4.0.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

I have tried this with a couple of different installations: one built by someone else and installed on my system, and one that I built myself via Spack.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: Linux RHEL-7
  • Computer hardware:
  • Network type:

Details of the problem

My application uses HDF5-1.10.6 and calls down into MPI. The call that is hanging is MPI_File_write_at_all on 16 ranks. For this call, ranks 0..7 are outputting the same amount of data at different offsets (non-overlapping and contiguous) and ranks 8..15 have no data to output.

When the job hangs, I can examine the stack in the debugger. The 8 ranks that have data to output are in mca_fcoll_vulcan_file_read_all and the 8 ranks with no data have returned.

Further probing in the debugger shows that ranks 0..7 see an ompi_file_t procs_in_group setting of 1, with a single group containing all procs, while ranks 8..15 see a value of 2, with groups 0..7 and 8..15. This is probably what is causing the issue, but I have been unable to track down where the ranks diverge on this setting.

If I run the same executable using an openmpi-4.0.1 mpiexec (without any recompiling), everything runs to completion.

I'm not really asking for a solution, since my details are probably a bit scarce, but rather for something to try or for ways to help debug this.
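
In case it helps to illustrate the pattern: below is a hypothetical, minimal sketch of the access pattern described above. The file name, chunk size, and half/half rank split are assumptions on my part, and the real application goes through HDF5, so this is not the actual failing code. Run it on 16 ranks so that 8 ranks write and 8 ranks contribute zero bytes.

    /* Hypothetical reproducer sketch: first half of the ranks write equal,
     * contiguous, non-overlapping chunks; second half writes nothing but
     * still enters the collective. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const MPI_Offset chunk = 1048576;   /* bytes per writing rank (assumed size) */
        const int writers = size / 2;       /* first half of the ranks have data */

        char *buf = malloc((size_t)chunk);
        memset(buf, rank, (size_t)chunk);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "repro.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        if (rank < writers) {
            /* contiguous, non-overlapping offsets for ranks 0..writers-1 */
            MPI_File_write_at_all(fh, (MPI_Offset)rank * chunk, buf, (int)chunk,
                                  MPI_BYTE, MPI_STATUS_IGNORE);
        } else {
            /* ranks with no data to output still have to enter the collective */
            MPI_File_write_at_all(fh, 0, buf, 0, MPI_BYTE, MPI_STATUS_IGNORE);
        }

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }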

@edgargabriel
Member

Thank you for the bug report. There is a good chance that this issue is identical to the bug reported in issue #7109.

This should be fixed by PR #7193, which will be included in the upcoming 4.0.3 release. I will, however, run some tests over the next few days to confirm that.

edgargabriel self-assigned this Feb 17, 2020
@edgargabriel
Member

Could you provide me with a test case for your code? Also, what file system are you using?

@gsjaardema
Author

Thanks. I will try the patch in #7193 and see if that fixes things. I've been unable to create a smaller test case. This happens on Lustre and on a Linux scratch file system, so it seems to be file-system agnostic.

@edgargabriel
Member

@gsjaardema I think the file system information is still useful here. On Lustre, Open MPI actually does not use ompio by default, but falls back to romio321. Even if ompio were used on Lustre (which can happen if the romio component disqualifies itself for whatever reason), the vulcan component is not selected for File_write_at_all operations; instead the dynamic_gen2 component should be used, a component written specifically for Lustre and not used on any other file system.

If all of these configurations (romio, and ompio with vulcan and/or dynamic_gen2) show a deadlock, there is a non-zero chance that the bug is not actually in the MPI library, but somewhere further up the software stack.
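
One way to narrow this down (a sketch; ./my_app and the rank count are placeholders for your actual launch line) is to force the io and fcoll component selection explicitly with MCA parameters and see which combinations reproduce the hang:

    ompi_info | grep fcoll                                           # list available fcoll components
    mpirun -np 16 --mca io romio321 ./my_app                         # force the ROMIO path
    mpirun -np 16 --mca io ompio --mca fcoll vulcan ./my_app         # force ompio + vulcan
    mpirun -np 16 --mca io ompio --mca fcoll dynamic_gen2 ./my_app   # force ompio + dynamic_gen2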

@gsjaardema
Author

gsjaardema commented Mar 2, 2020

I applied the patch in #7193 and the hang disappeared on the Linux workstation. I will try to rebuild on the cluster with Lustre, hopefully tomorrow, and see what happens there. Thanks.

@edgargabriel
Member

Closing the issue. Can always reopen if necessary.
