v4.1.x: btl tcp: Add workaround for "dropped connection" issue #8966

Merged
jsquyres merged 1 commit into open-mpi:v4.1.x
May 17, 2021

jsquyres
Member

@jsquyres jsquyres commented May 17, 2021

Work around a race condition in the TCP BTL's proc setup code.
The Cisco MTT results have been failing on TCP tests due to a
"dropped connection" message some percentage of the time.
Some digging shows that the issue happens in a combination of
multiple NICs and multiple threads. The race is detailed in
#3035 (comment).

This patch doesn't fix the race, but avoids it by forcing
the MPI layer to complete all calls to add_procs across the
entire job before any process leaves MPI_INIT. This increases
start-up time and so reduces the scalability of the TCP BTL,
but that is better than hanging.

The long-term fix is to do all endpoint setup in the first
call to add_procs for a given remote proc, removing the
race. This patch is a workaround until that fix can
be developed.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
(cherry picked from commit 2acc4b7)
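
The ordering guarantee described above can be illustrated with a small simulation. This is only a sketch of the idea, not the Open MPI code: threads stand in for MPI processes, and `run_job`, `init`, and the `endpoints` dictionary are hypothetical names invented for the illustration. Each simulated process finishes its add_procs step, then waits at a job-wide barrier before it is allowed to leave initialization.

```python
import threading


def run_job(num_procs):
    """Simulate a job where no process leaves init until all have finished add_procs."""
    events = []
    events_lock = threading.Lock()
    # Job-wide barrier: released only when every simulated process has arrived.
    all_added = threading.Barrier(num_procs)

    def record(event):
        with events_lock:
            events.append(event)

    def init(rank):
        # Hypothetical stand-in for add_procs: set up one endpoint per peer.
        endpoints = {peer: "ready" for peer in range(num_procs) if peer != rank}
        record(("add_procs_done", rank, len(endpoints)))
        # Block until every process in the job has completed add_procs.
        all_added.wait()
        record(("left_init", rank))

    threads = [threading.Thread(target=init, args=(rank,)) for rank in range(num_procs)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return events


if __name__ == "__main__":
    for event in run_job(4):
        print(event)
```

The barrier is also where the start-up cost comes from: every process waits for the slowest one, so initialization time grows with the size of the job.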

@jsquyres jsquyres added this to the v4.1.2 milestone May 17, 2021
@jsquyres jsquyres requested review from hppritcha and bwbarrett May 17, 2021 18:09
@jsquyres
Member Author

Somehow this fix was applied to the v4.0.x branch (see #8721) but we missed adding it to the v4.1.x branch.

@jsquyres jsquyres changed the title from 'btl tcp: Add workaround for "dropped connection" issue' to 'v4.1.x: btl tcp: Add workaround for "dropped connection" issue' May 17, 2021
@jsquyres jsquyres linked an issue May 17, 2021 that may be closed by this pull request
@jsquyres jsquyres merged commit e6b751e into open-mpi:v4.1.x May 17, 2021
@jsquyres jsquyres deleted the pr/fix-tcp-btl-race-condition branch May 17, 2021 19:40
Successfully merging this pull request may close these issues.

master: TCP BTL addressing fail