OMPI 4.0.1 TCP connection errors beyond 86 nodes #6786
Can you try the latest 4.0.x nightly snapshot from https://www.open-mpi.org/nightly/v4.0.x/ ?
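If it helps, the sequence is roughly the following (the tarball name and install prefix are placeholders; use whichever snapshot is currently listed on that page):

```sh
# Download and build a v4.0.x nightly snapshot (tarball name below is a placeholder)
wget https://www.open-mpi.org/nightly/v4.0.x/openmpi-v4.0.x-YYYYMMDDHHMM-xxxxxxx.tar.bz2
tar xjf openmpi-v4.0.x-*.tar.bz2
cd openmpi-v4.0.x-*/
./configure --prefix=$HOME/ompi-4.0.x-nightly
make -j 8 all install
```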
Jeff,
EDIT: Added proper verbatim quoting
This seems to be related to #6198.
@dfaraj did you build Open MPI with … ? Can you run … ?
Using the nightly build from July 8, I can now run with 120 nodes.
I have built the same OMPI using OFI instead of PSM2 directly, and now that endpoint error is gone.
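For reference, the only difference between the two builds is the configure line, something like this (install prefixes and the libfabric path are placeholders):

```sh
# Build using the PSM2 MTL directly against libpsm2
./configure --prefix=$HOME/ompi-psm2 --with-psm2
# Build using the OFI MTL through libfabric (libfabric then uses its psm2 provider on OPA)
./configure --prefix=$HOME/ompi-ofi --with-libfabric=/usr
```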
Using Intel compiler 2018 and the Open MPI Jul 09, 2019 nightly tarball, I get the following errors: undefined reference to `mpi_type_struct_'. From the previous posts it seems that using "--enable-mpi1-compatibility" solves the problem.
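In case it helps anyone hitting the same link error, the flag goes on the Open MPI configure line (prefix is a placeholder):

```sh
# Re-enable the removed MPI-1 APIs (e.g. MPI_Type_struct) in the Open MPI build
./configure --prefix=$HOME/ompi-4.0.x-nightly --enable-mpi1-compatibility
make -j 8 all install
```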
I guess you are using the …
More updates: I have 2 Xeon nodes (2 sockets, each socket has 20 cores) and each node has one HFI. When I run with just 20 cores, things work; the moment I go beyond one socket, I get errors.
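Roughly what I am running (hostname, slot counts, and benchmark path are placeholders, not the exact commands):

```sh
# 20 ranks -- stays within one socket, this works
mpirun -np 20 --host node01:40 --bind-to core ./osu_mbw_mr
# 40 ranks -- spills onto the second socket, this is where the errors start
mpirun -np 40 --host node01:40 --bind-to core ./osu_mbw_mr
```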
From the above discussion, Open MPI is not working for 87+ nodes. Is there any way to run Open MPI 4.0.1 with -mca pml ucx on more than 100 nodes? Can we use the recent nightly master tarball with the --enable-mpi1-compatibility option?
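Concretely, what we would like to run is something like this (hostfile and binary are placeholders):

```sh
# 100+ nodes, forcing the UCX PML
mpirun -np 200 --hostfile hosts_100nodes --map-by node -mca pml ucx ./osu_mbw_mr
```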
No.
Any updates?
I'm lost in the conversation here. Is the problem being discussed the TCP/SSH 86-process issue, or some PSM issue? If this is no longer the TCP/SSH 86-process issue, this issue should be closed and a new issue should be opened to discuss the new question (please don't mix multiple topics on a single GitHub issue -- thanks).
Is there any update to Open MPI with which we can run our application on 87+ nodes, and with the --enable-mpi1-compatibility option?
A lot of fixes have gone into the v4.0.x branch in the run-time area. Can you try the latest v4.0.x nightly snapshot? For the MPI-1 compatibility, you should talk to your upstream application providers and (strongly) encourage them to upgrade their source code -- those APIs were deprecated in 1996, and were finally removed in 2012. It's time for them to stop being used. FWIW: we're likely (but not guaranteed) to continue the MPI-1 compatibility in Open MPI v5.0 -- beyond that, I can't promise anything.
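A quick (and not exhaustive) way to spot the removed MPI-1 calls in an application's source tree (the src/ path is a placeholder):

```sh
# Symbols removed in MPI-3.0 (2012); matches need porting to the MPI-2 replacements,
# e.g. MPI_Type_struct -> MPI_Type_create_struct, MPI_Address -> MPI_Get_address
grep -rniE 'MPI_(Address|Type_struct|Type_extent|Type_hindexed|Type_hvector|Type_ub|Type_lb|Errhandler_(create|get|set))' src/
```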
Thanks, Jeff for the information! |
I posted a fix to the …
I just tested the latest nightly build and I no longer see this problem on OPA or EDR fabrics.
Thanks for verifying, @dfaraj!
Thank you for taking the time to submit an issue!
Background information
We have an OPA cluster of 288 nodes. All nodes run the same OS image, have passwordless ssh set up, and the firewall is disabled. We run basic OSU osu_mbw_mr tests on 2, 4, ... 86 nodes and the tests complete successfully. Once we hit 88+ nodes we get TCP connection errors.
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
4.0.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Downloaded 4.0.1 from the Open MPI site.
Please describe the system on which you are running
Details of the problem
Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.
When we run with
n=86
it works fine; with
n=88
we get the TCP error described earlier.
If I do:
n=88
it works.
If I set
n=160
it hangs. I don't think it is really hanging, though; it is more likely doing ssh to every node and just going very slowly.
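The runs are all of this general shape (hostfile handling and paths are illustrative only, not the exact commands):

```sh
# Take the first $n nodes of the cluster and run the OSU multi-pair bandwidth test
n=86
head -n $n all_hosts > hosts
mpirun -np $((2*n)) --hostfile hosts --map-by node ./osu_mbw_mr
```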
EDIT: Put in proper verbatim markup