
ompi/instance: fix cleanup function registration order #13073

Open

hominhquan wants to merge 1 commit into main from mqho/instance_finalize

Conversation

hominhquan

  • Register the PML cleanup in the finalize list of the instance domain ('ompi_instance_common_domain') before RTE/OPAL init.

  • The reason is that RTE init (ompi_rte_init()) calls opal_init(), which in turn switches the internal tracking domain to OPAL's ('opal_init_domain'); a PML cleanup registered after that point would therefore be mis-registered as belonging to 'opal_init_domain' instead of the intended 'ompi_instance_common_domain'. (A small model of this is sketched below.)

  • The consequence of such mis-registration is that, at MPI_Finalize(), this PML cleanup (*_del_procs()) is executed by RTE finalize; depending on the registration order, it may pull the rug out from under components that are still progressing (*_progress()).

  • This may be the root cause of issue #10117 (Intermittent crashes inside MPI_Finalize).
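
Below is a minimal, self-contained C model of the registration-order problem described above. The domain_t type and the helpers register_cleanup(), fake_opal_init() and run_domain_cleanups() are illustrative stand-ins, not OPAL's actual API; the only point is that a cleanup callback is appended to whichever domain is currently tracked, so registering it before or after opal_init() decides who runs it.

#include <stdio.h>

typedef void (*cleanup_fn_t)(void);

typedef struct domain {
    const char  *name;
    cleanup_fn_t fns[8];
    int          count;
} domain_t;

static domain_t instance_domain = { "ompi_instance_common_domain", { 0 }, 0 };
static domain_t opal_domain     = { "opal_init_domain",            { 0 }, 0 };
static domain_t *tracked        = &instance_domain;   /* currently tracked domain */

static void register_cleanup(cleanup_fn_t fn) {
    tracked->fns[tracked->count++] = fn;               /* appended to the *current* domain */
}

static void run_domain_cleanups(domain_t *d) {
    printf("finalizing %s\n", d->name);
    for (int i = d->count - 1; i >= 0; --i) {
        d->fns[i]();                                   /* reverse registration order in this model */
    }
}

static void cleanup_pml(void) { puts("  PML cleanup (*_del_procs)"); }

static void fake_opal_init(void) {
    tracked = &opal_domain;                            /* models opal_init() switching the tracked domain */
}

int main(void) {
    /* The fix: register the PML cleanup *before* RTE/OPAL init, so it
     * lands in the instance domain ... */
    register_cleanup(cleanup_pml);
    fake_opal_init();
    /* ... if the two calls above were swapped, the cleanup would belong
     * to opal_init_domain and run during RTE finalize instead. */

    run_domain_cleanups(&instance_domain);             /* instance finalize: runs the PML cleanup */
    run_domain_cleanups(&opal_domain);                 /* RTE/OPAL finalize: nothing left to do here */
    return 0;
}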


Hello! The Git Commit Checker CI bot found a few problems with this PR:

f33c3a2: ompi/instance: fix cleanup function registration o...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@hominhquan force-pushed the mqho/instance_finalize branch 2 times, most recently from 9631015 to 6d59b77 on January 31, 2025 at 17:18
- Register the PML cleanup in the finalize list of the instance domain
  ('ompi_instance_common_domain') before RTE/OPAL init.

- The reason is that RTE init (ompi_rte_init()) calls opal_init(), which in
  turn switches the internal tracking domain to OPAL's ('opal_init_domain');
  a PML cleanup registered after that point would be mis-registered as
  belonging to 'opal_init_domain' instead of the intended
  'ompi_instance_common_domain'.

- The consequence of such mis-registration is that, at MPI_Finalize(), this
  PML cleanup (*_del_procs()) is executed by RTE finalize; depending on the
  registration order, it may pull the rug out from under components that
  are still progressing (*_progress()).

- This may be the root cause of issue open-mpi#10117

Signed-off-by: Minh Quan Ho <minh-quan.ho@sipearl.com>
@hppritcha
Member

This patch isn't going to work. The problem is that, by the time ompi_mpi_instance_cleanup_pml is invoked, the PML framework has already been closed. For some reason this happens to work for the OB1 PML; I'm not sure exactly why that is yet.

Could you provide a traceback from one of the failed CI tests when using Open MPI main or 5.0.x? This area of Open MPI was completely rewritten between the 4.1.x and 5.0.x releases, so tracebacks from failures with the 4.1.x releases wouldn't be that helpful for addressing this issue in newer releases.

@hominhquan
Author

Thanks @hppritcha for your comment. I admit that I don't know the complete init/finalize protocol between the different modules. The only thing I have noticed is that ompi_mpi_instance_cleanup_pml() was (somehow wrongly?) registered as belonging to opal_init_domain, whereas I guess it should be in ompi_instance_common_domain? The result is that ompi_mpi_instance_cleanup_pml() is called by RTE finalize, and not by instance finalize (ompi_mpi_instance_finalize()).

@hppritcha can you confirm what the expected registration order of ompi_mpi_instance_cleanup_pml() should be, relative to the other finalize functions in RTE?

@hominhquan
Author

Or perhaps the opal_finalize_cleanup_domain(&ompi_instance_common_domain) call should be moved to before all of the mca_base_framework_close(ompi_framework_dependencies[j]) calls, and always kept after ompi_rte_finalize()? A toy model of that ordering is sketched below.
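
Here is a minimal, self-contained toy model of the ordering I have in mind (cleanup_pml(), framework_close() and cleanup_domain() are hypothetical stand-ins for the real Open MPI functions, and RTE finalize is omitted); it only illustrates why the domain cleanup has to run while the framework state still exists.

#include <stdio.h>
#include <stdlib.h>

static int *pml_framework_state;               /* stands in for the open PML framework */

/* stands in for ompi_mpi_instance_cleanup_pml(): it needs the framework state */
static void cleanup_pml(void) {
    printf("PML cleanup sees framework state %d\n", *pml_framework_state);
}

/* stands in for mca_base_framework_close(): the framework state goes away */
static void framework_close(void) {
    free(pml_framework_state);
    pml_framework_state = NULL;
}

/* stands in for opal_finalize_cleanup_domain(): runs a registered cleanup */
static void cleanup_domain(void (*fn)(void)) {
    fn();
}

int main(void) {
    pml_framework_state  = malloc(sizeof(*pml_framework_state));
    *pml_framework_state = 42;

    cleanup_domain(cleanup_pml);   /* proposed order: instance-domain cleanups first ...     */
    framework_close();             /* ... frameworks closed afterwards. Reversing these two  */
                                   /* calls would make cleanup_pml() dereference freed state. */
    return 0;
}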

@hppritcha
Member

This code has been used for who knows how many millions of mpirun job runs since the 5.0.0 release. Finalizing MPI as the last instance is closed (an instance maps to an MPI session) needs to be carefully orchestrated. The PML cleanup is attempting to make sure all procs have been removed from the PML (as well as any opened via the OSC framework) prior to closing the frameworks. You may have found a bug, but this is not the way to solve it. Could you please provide a traceback of one of your CI failures using either one of the 5.0.x releases or main?

Also, as a way to double-check your changes, you should configure your builds with --enable-mca-dso. This configure option makes a framework close call also cause the component to be dlclose'd. Doing this will make it obvious that this patch won't work; even the OB1 PML fails with a segfault:

#9  <signal handler called>
#10 0x00007fffea8cd851 in ?? ()
#11 0x00007ffff79f7341 in ompi_mpi_instance_cleanup_pml () at instance/instance.c:189
#12 0x00007ffff70c293e in opal_finalize_cleanup_domain (domain=0x7ffff7d9ad80 <ompi_instance_common_domain>) at runtime/opal_finalize_core.c:129
#13 0x00007ffff79f9ce4 in ompi_mpi_instance_finalize_common () at instance/instance.c:967
#14 0x00007ffff79f9e1a in ompi_mpi_instance_finalize (instance=0x7ffff7d9aa60 <ompi_mpi_instance_default>) at instance/instance.c:984
#15 0x00007ffff79ee2f3 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:294
#16 0x00007ffff7a364fd in PMPI_Finalize () at finalize.c:52
#17 0x0000000000400a56 in main (argc=1, argv=0x7fffffffd8d8) at ring_c.c:75

@hominhquan
Author

I observed the race while testing the enabling of the async OPAL progress thread in #13074. The observations are as follows:

  • Each MPI process spawns a new thread that loops on opal_progress().
  • At the end of osu_ireduce, the main thread hits MPI_Finalize and calls RTE finalize, which calls ompi_mpi_instance_cleanup_pml() before the other opal_*_finalize() functions.
  • In the case of single-node shared-memory communication, ompi_mpi_instance_cleanup_pml() calls sm_del_procs(), which destroys all endpoints (mca_btl_base_endpoint_t) while the progress thread is still running the registered *_progress() callbacks; the progress thread crashes if one of those callbacks accesses the endpoints (e.g. mca_btl_sm_component_progress()).
  • Below is an example backtrace of the crash:
(gdb) break PMPI_Finalize
(gdb) continue

Thread 1 "osu_ireduce" hit Breakpoint 2, PMPI_Finalize () at ../../../../../../../../../../ompi/mpi/c/finalize.c:46
46          if (MPI_PARAM_CHECK) {
Missing separate debuginfos, use: dnf debuginfo-install zlib-1.2.11-40.el9.aarch64
(gdb) i th
  Id   Target Id                                         Frame
* 1    Thread 0xffffaee57440 (LWP 2744553) "osu_ireduce" PMPI_Finalize () at ../../../../../../../../../../ompi/mpi/c/finalize.c:46
  2    Thread 0xffffae7c90c0 (LWP 2744946) "osu_ireduce" _opal_progress () at ../../../../../../../../opal/runtime/opal_progress.c:308
  3    Thread 0xffffadc4f0c0 (LWP 2745057) "osu_ireduce" 0x0000ffffaeb3cd7c in epoll_pwait () from /lib64/libc.so.6
  4    Thread 0xffffacef00c0 (LWP 2745066) "osu_ireduce" 0x0000ffffaeb2e890 in read () from /lib64/libc.so.6

(gdb) t 2
[Switching to thread 2 (Thread 0xffffae7c90c0 (LWP 2744946))]
#0  _opal_progress () at ../../../../../../../../opal/runtime/opal_progress.c:308
308             events += (callbacks[i])();
(gdb) bt
#0  _opal_progress () at ../../../../../../../../opal/runtime/opal_progress.c:308
#1  0x0000ffffaeee8c5c in opal_progress_async_thread_engine (obj=<optimized out>) at ../../../../../../../../opal/runtime/opal_progress.c:350
#2  0x0000ffffaead26b8 in start_thread () from /lib64/libc.so.6
#3  0x0000ffffaeb3cbdc in thread_start () from /lib64/libc.so.6
(gdb) c
Continuing.
[Thread 0xffffacef00c0 (LWP 2745066) exited]

Thread 2 "osu_ireduce" received signal SIGSEGV, Segmentation fault.
mca_btl_sm_check_fboxes () at ../../../../../../../../../../../opal/mca/btl/sm/btl_sm_fbox.h:241
241                 const mca_btl_sm_fbox_hdr_t hdr = mca_btl_sm_fbox_read_header(
(gdb) bt
#0  mca_btl_sm_check_fboxes () at ../../../../../../../../../../../opal/mca/btl/sm/btl_sm_fbox.h:241
#1  mca_btl_sm_component_progress () at ../../../../../../../../../../../opal/mca/btl/sm/btl_sm_component.c:559
#2  0x0000ffffaeee8e80 in _opal_progress () at ../../../../../../../../opal/runtime/opal_progress.c:308
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
  • I think this crash rarely occurs, since MPI_Finalize is usually called by the main thread after the application has waited for all communications to complete, so normally no *_progress() callback is executing at the same time as finalize; but there may be cases I have missed among all the available MCA components. A minimal model of the race is sketched below.
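
For reference, here is a small self-contained pthread model of the race (fake_endpoint_t, fake_progress() and progress_engine() are hypothetical stand-ins, not Open MPI code). It performs the safe shutdown ordering; the comments point out the reversed ordering that reproduces the crash shown above.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

typedef struct { int value; } fake_endpoint_t;     /* stands in for mca_btl_base_endpoint_t */

static fake_endpoint_t *endpoint;
static atomic_bool keep_progressing = true;

/* stands in for a *_progress() callback such as mca_btl_sm_component_progress() */
static int fake_progress(void) {
    return endpoint->value;                        /* would crash if the endpoint were already freed */
}

/* stands in for the async progress thread looping on opal_progress() */
static void *progress_engine(void *arg) {
    (void) arg;
    while (atomic_load(&keep_progressing)) {
        fake_progress();
    }
    return NULL;
}

int main(void) {
    endpoint = calloc(1, sizeof(*endpoint));

    pthread_t tid;
    pthread_create(&tid, NULL, progress_engine, NULL);
    usleep(1000);                                  /* let the progress thread run for a bit */

    /* Finalize path: stop the progress thread *before* destroying endpoints.
     * Freeing the endpoint first (the mis-ordered cleanup described above)
     * would let fake_progress() dereference freed memory, i.e. the SIGSEGV
     * seen in mca_btl_sm_check_fboxes() in the backtrace. */
    atomic_store(&keep_progressing, false);
    pthread_join(tid, NULL);
    free(endpoint);

    puts("clean shutdown: progress thread stopped before endpoints were freed");
    return 0;
}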
