v4.0.0 hangs for simple message send & recv in mca_btl_vader_component_progress? #6258
Comments
Hang also occurs with current master acc2a70 |
You said your network type is "Ethernet" -- does that mean you're using the TCP BTL? |
I don't think so. I'm simply using |
I came here to post this exact issue (also narrowed it down to a 6185592 byte threshold). I'm glad someone else already typed it up. I'm on macOS 10.14.2. OpenMPI was installed with homebrew and configured with "--disable-silent-rules --enable-ipv6 --with-libevent=/usr/local/opt/libevent". Here is my test program:
#include <mpi.h>
#include <iostream>
#define N 6185593
int main()
{
    char data[N];
    MPI_Init(NULL, NULL);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(data, N, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        MPI_Recv(data, N, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
        std::cout << status._ucount << " " << status.MPI_ERROR << std::endl;
    }
    MPI_Finalize();
    return 0;
}
which I try running with |
Also worth noting that the issue does not occur when I test with MPICH (installed with homebrew) |
I am seeing this one as well, using 4.0.0 from homebrew on Mojave (10.14.1). A quite complex MPI application works fine on other platforms and used to work on the Mac. I recently updated to a brand new one, and now a send and receive pair deadlocks when the message passes a particular size. I haven't checked the size precisely, but it is comparable to the OP's. When I attach to the processes with lldb and move up a couple of frames for clarity, I get the following. For one half:
And for the other end:
If burrowing further down the stack would help, I'd happily oblige. BTW, both processes had received SIGSTOP when I attached, but that is probably a coincidence. |
Same problem here (OSX, OpenMPI 4.0 via Homebrew). For anyone arriving here looking for a workaround: you can use a different BTL, e.g.
$ mpirun --mca btl self,sm,tcp
or (if you don't have sm)
$ mpirun --mca btl self,tcp |
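For reference, the same BTL selection can also be made through the environment instead of the mpirun command line. This is only a minimal sketch, assuming Open MPI's usual convention that an MCA parameter can be set through an OMPI_MCA_-prefixed environment variable; ./a.out and the process count of 2 are placeholders, not taken from this thread.

```shell
# Restrict Open MPI to the self and tcp BTLs, which bypasses vader.
# Intended to match: mpirun --mca btl self,tcp ...
export OMPI_MCA_btl=self,tcp

# Alternative (assumption: caret exclusion syntax): drop only vader
# and keep every other available BTL.
# export OMPI_MCA_btl=^vader

# Launch only when mpirun and the test binary are actually present.
# ./a.out is a placeholder for your own executable.
if command -v mpirun >/dev/null 2>&1 && [ -x ./a.out ]; then
    mpirun -np 2 ./a.out
fi
```

Setting the variable in the shell applies to every later mpirun in that session, which may be convenient while waiting for a fixed release; unset it afterwards to return to the default BTL selection.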
Can you all try the latest 4.0.1rc nightly snapshot tarball? |
I can reproduce on different flavors of OSX, but not on Linux. The issue seems to come from vader, as if I force the use of TCP (--mca btl tcp,self) the program correctly completes. I'll take a look. |
@gpaulsen @hppritcha This has the potential to be a v4.0.x blocker. I have marked it as so to make sure it isn't missed. Please evaluate. |
Do we know if this happens on v3.0.x or v3.1.x? I ask because we're just about to do RCs for those 2. |
Not happening on 3.1 for me. |
...answering my own question... I am able to replicate on v4.0.0, v4.0.1rc1, and v4.0.x HEAD on my MBP MacOS 10.14.3. I am not able to replicate with v3.0.x HEAD and v3.1.x HEAD. |
Yesterday I updated to OSX 10.14.3 and gcc 7.4.0. I cannot replicate this issue anymore. |
FWIW, I'm at 10.14.3, and I can replicate. But I am using the MacOS gcc (i.e., clang), not a homebrew gcc:
|
@bosilca and I have investigated:
Looks like at least f62d26d and 6ffc7cc and b51c8f8 were missed coming over to v4.0.x from master. These appear to be the main ones we need; there may be one or two more that could be worthwhile to come over. PR inbound shortly... |
Also -- we confirmed: this is not an issue for master. It's just commits that we didn't bring over to v4.0.x. |
v4.0.x: Cherry-pick fixes for issue #6258 from master (vader fixes)
The missing commits have been merged into v4.0.x. Closing. |
Hello! I am a student just learning MPI, but I encountered a similar hang (in MPI_Probe) when a previous MPI_Isend was sending a message of size 0. Do you think that is something worth opening an issue for or exploring further? Thanks! |
The hang likewise occurred in mca_btl_vader_component_progress |
We just released Open MPI v4.0.1 yesterday (https://www.mail-archive.com/announce@lists.open-mpi.org/msg00122.html); can you please try again with that version? If the problem persists, please open a new issue (vs. commenting on a closed issue). Thanks! |
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
v4.0.0
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Installed from source, but the same error occurs with the installation from homebrew on macOS. The configure options were all default (i.e., ./configure --prefix=... && make -j10 && make install). The compiler used to compile Open MPI and the example was the system compiler, which is gcc 4.3.4.
Please describe the system on which you are running
Details of the problem
A relatively simple case (repo attached) involving 2 processes -- one doing a (non-blocking) send followed by a wait, the other doing a matching (non-blocking) recv followed by a wait -- will hang once the message exceeds a certain size (6185592 bytes => OK, 6185593 bytes => hang).
When the send & recv are changed to their blocking counterpart, the hang still occurs.
The problem did not occur in previous versions of Open MPI; in particular, 3.1.3 seems fine.
open-mpi4_hang_repo.c.zip