Merge pull request #8966 from jsquyres/pr/fix-tcp-btl-race-condition
v4.1.x: btl tcp: Add workaround for "dropped connection" issue
jsquyres authored May 17, 2021
2 parents 2b3f043 + e1612b0 commit e6b751e
Showing 1 changed file with 18 additions and 0 deletions.
18 changes: 18 additions & 0 deletions opal/mca/btl/tcp/btl_tcp_component.c
@@ -1291,6 +1291,24 @@ mca_btl_base_module_t** mca_btl_tcp_component_init(int *num_btl_modules,
}
}

/* Avoid a race in wire-up when using threads (progress or user)
and multiple BTL modules. The details of the race are in
https://github.com/open-mpi/ompi/issues/3035#issuecomment-429500032,
but the summary is that the lookup code in
component_recv_handler() below assumes that add_procs() is
atomic across all active TCP BTL modules, but in multi-threaded
code, that isn't guaranteed, because the locking is inside
add_procs(), and add_procs() is called once per module. This
isn't a proper fix, but will solve the "dropped connection"
problem until we can come up with a more complete fix to how we
initialize procs, endpoints, and modules in the TCP BTL. */
if (mca_btl_tcp_component.tcp_num_btls > 1 &&
    (enable_mpi_threads || 0 < mca_btl_tcp_progress_thread_trigger)) {
    for( i = 0; i < mca_btl_tcp_component.tcp_num_btls; i++) {
        mca_btl_tcp_component.tcp_btls[i]->super.btl_flags |= MCA_BTL_FLAGS_SINGLE_ADD_PROCS;
    }
}

#if OPAL_CUDA_SUPPORT
mca_common_cuda_stage_one_init();
#endif /* OPAL_CUDA_SUPPORT */
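
The condition in the hunk above can be read in isolation: the flag is set on every module only when there is more than one TCP BTL module and some form of threading (user threads or a progress thread) is active. The following is a minimal standalone sketch of that logic, not the Open MPI code itself. The type `sketch_module_t`, the constant `SKETCH_FLAG_SINGLE_ADD_PROCS` and its value, and the function `sketch_apply_workaround()` are all hypothetical stand-ins invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration only; these are NOT the
   real Open MPI definitions, and the flag value is arbitrary. */
#define SKETCH_FLAG_SINGLE_ADD_PROCS 0x1u

typedef struct {
    uint32_t flags;
} sketch_module_t;

/* Mirror the shape of the workaround: if there is more than one
   module and threading is in play (user threads enabled, or a
   positive progress-thread trigger), OR the flag into every module.
   Returns the number of modules that were flagged. */
static size_t sketch_apply_workaround(sketch_module_t *modules, size_t count,
                                      bool threads_enabled, int progress_trigger)
{
    assert(modules != NULL || count == 0);

    if (count <= 1 || !(threads_enabled || progress_trigger > 0)) {
        return 0; /* single module or no threads: leave flags untouched */
    }
    for (size_t i = 0; i < count; i++) {
        /* |= preserves any flags the module already carries */
        modules[i].flags |= SKETCH_FLAG_SINGLE_ADD_PROCS;
    }
    return count;
}
```

Note that the flag is OR-ed in rather than assigned, so existing bits on each module survive, and that the single-module and non-threaded cases are left unchanged, matching the guard in the patch.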