Skip to content

Commit

Permalink
ompi/instance: fix cleanup function registration order
Browse files Browse the repository at this point in the history
- Append PML cleanup into the finalize of the instance domain ('ompi_instance_common_domain')
  before RTE/OPAL init.

- The reason is RTE init (ompi_rte_init()) will call opal_init(), which in turn
  will set the internal tracking domain to OPAL's one ('opal_init_domain'), and
  this PML cleanup function would be mis-registered as belonging to 'opal_init_domain'
  instead of the current 'ompi_instance_common_domain'.

- The consequence of such mis-registration is that: at MPI_Finalize(), this
  PML cleanup (*_del_procs()) will be executed by RTE; and, depending on
  their registration order, this may cut the grass under the feet of other
  running components (*_progress())

- This may be the root cause of issue #10117

Signed-off-by: Minh Quan Ho <minh-quan.ho@sipearl.com>
  • Loading branch information
Minh Quan Ho committed Jan 31, 2025
1 parent c5e02ab commit 9631015
Showing 1 changed file with 8 additions and 4 deletions.
12 changes: 8 additions & 4 deletions ompi/instance/instance.c
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@
* reserved.
* Copyright (c) 2023 Jeffrey M. Squyres. All rights reserved.
* Copyright (c) 2024 NVIDIA Corporation. All rights reserved.
* Copyright (c) 2025 SiPearl. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
Expand Down Expand Up @@ -381,6 +382,10 @@ static int ompi_mpi_instance_init_common (int argc, char **argv)
opal_finalize_domain_init (&ompi_instance_common_domain, "ompi_mpi_instance_init_common");
opal_finalize_set_domain (&ompi_instance_common_domain);

/* Append PML cleanup into the finalize of this domain ('ompi_instance_common_domain')
before RTE init */
ompi_mpi_instance_append_finalize (ompi_mpi_instance_cleanup_pml);

if (OPAL_SUCCESS != (ret = opal_arch_set_fortran_logical_size(sizeof(ompi_fortran_logical_t)))) {
return ompi_instance_print_error ("ompi_mpi_init: opal_arch_set_fortran_logical_size failed", ret);
}
Expand Down Expand Up @@ -638,8 +643,6 @@ static int ompi_mpi_instance_init_common (int argc, char **argv)
return ompi_instance_print_error ("ompi_group_init() failed", ret);
}

ompi_mpi_instance_append_finalize (ompi_mpi_instance_cleanup_pml);

/* initialize communicator subsystem */
if (OMPI_SUCCESS != (ret = ompi_comm_init ())) {
opal_mutex_unlock (&instance_lock);
Expand Down Expand Up @@ -906,8 +909,6 @@ static int ompi_mpi_instance_finalize_common (void)
mca_mpool_base_tree_print (ompi_debug_show_mpi_alloc_mem_leaks);
}

opal_finalize_cleanup_domain (&ompi_instance_common_domain);

if (NULL != ompi_mpi_main_thread) {
OBJ_RELEASE(ompi_mpi_main_thread);
ompi_mpi_main_thread = NULL;
Expand Down Expand Up @@ -962,6 +963,9 @@ static int ompi_mpi_instance_finalize_common (void)

ompi_proc_finalize();

/* Should be called in reverse order of init, i.e. after RTE finalize */
opal_finalize_cleanup_domain (&ompi_instance_common_domain);

ompi_mpi_instance_release ();

return OMPI_SUCCESS;
Expand Down

0 comments on commit 9631015

Please sign in to comment.