ompi/instance: fix cleanup function registration order #13073
base: main
Conversation
Hello! The Git Commit Checker CI bot found a few problems with this PR: f33c3a2: ompi/instance: fix cleanup function registration o...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
Force-pushed from 9631015 to 6d59b77.
- Append the PML cleanup to the finalize list of the instance domain ('ompi_instance_common_domain') before RTE/OPAL init.
- The reason is that RTE init (ompi_rte_init()) calls opal_init(), which in turn sets the internal tracking domain to OPAL's own ('opal_init_domain'); as a result, this PML cleanup function would be mis-registered as belonging to 'opal_init_domain' instead of the current 'ompi_instance_common_domain'.
- The consequence of this mis-registration is that, at MPI_Finalize(), the PML cleanup (*_del_procs()) is executed by RTE and, depending on registration order, may pull the rug out from under other still-running components (*_progress()).
- This may be the root cause of issue open-mpi#10117

Signed-off-by: Minh Quan Ho <minh-quan.ho@sipearl.com>
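To make the registration-order problem concrete, here is a minimal, self-contained sketch. The names (`domain_t`, `set_current_domain`, `register_cleanup`, `fake_opal_init`) are illustrative stand-ins, not the actual OPAL/OMPI finalize API; the only point is that cleanups are filed under whichever domain is "current" at registration time, so registering after opal_init() has switched the domain files the PML cleanup in the wrong place.

```c
/*
 * Illustrative sketch only -- the names below are simplified stand-ins,
 * not the real OPAL/OMPI finalize API.  It shows why registration order
 * matters when cleanups land in whichever domain is "current" at call time.
 */
#include <stdio.h>

typedef struct domain { const char *name; } domain_t;

static domain_t opal_init_domain            = { "opal_init_domain" };
static domain_t ompi_instance_common_domain = { "ompi_instance_common_domain" };

/* One global "current" domain; cleanups are filed wherever it points. */
static domain_t *current_domain = NULL;

static void set_current_domain(domain_t *d) { current_domain = d; }

static void register_cleanup(const char *what)
{
    printf("cleanup '%s' registered in domain '%s'\n", what, current_domain->name);
}

static void fake_opal_init(void)
{
    /* Stand-in for opal_init(): it retargets the tracking domain to OPAL's own. */
    set_current_domain(&opal_init_domain);
}

int main(void)
{
    /* Buggy order: RTE/OPAL init runs first and retargets the domain,
     * so the PML cleanup is filed under opal_init_domain. */
    set_current_domain(&ompi_instance_common_domain);
    fake_opal_init();
    register_cleanup("pml_del_procs (buggy order)");

    /* Order intended by this PR: append the PML cleanup while the instance
     * domain is still current, then let RTE/OPAL init run. */
    set_current_domain(&ompi_instance_common_domain);
    register_cleanup("pml_del_procs (fixed order)");
    fake_opal_init();

    return 0;
}
```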
This patch isn't going to work. The problem is that by the time ompi_mpi_instance_cleanup_pml is invoked, the PML framework has already been closed. For some reason this happens to work for the OB1 PML; not sure exactly why that is yet. Could you provide a traceback, by chance, from one of the failed CI tests when using Open MPI main or 5.0.x? This area of Open MPI was completely rewritten between the 4.1.x and 5.0.x releases, so tracebacks from failures using the 4.1.x releases wouldn't be that helpful for addressing this issue in newer releases.
Thanks @hppritcha for your comment. I admit that I don't know the complete init/finalize protocol between the different modules. The only fact I notice is that the [...]. @hppritcha, can you confirm what the expected registration order of [...] should be?
Or perhaps the [...]?
This code has been used for who knows how many millions of mpirun job runs since the 5.0.0 release. Finalizing MPI as the last instance is being closed (an instance maps to an MPI session) needs to be carefully orchestrated. The PML cleanup is attempting to make sure all procs have been removed from the PML (as well as any opened via the OSC framework) prior to closing the frameworks. You may have found a bug, but this is not the way to solve it. Could you please provide a traceback of one of your CI failures using either one of the 5.0.x releases or main? Also, as a way to double-check your changes, you should configure your builds with [...]
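For illustration, here is a minimal sketch of the ordering constraint described above. The helpers (`pml_del_procs`, `close_pml_framework`, `framework_open`) are simplified stand-ins, not the real OMPI PML API; they only show that removing procs from the PML has to happen while the framework is still open, which is why running that cleanup too late (or while a progress thread is still polling) is the concern in this discussion.

```c
/*
 * Conceptual sketch only -- simplified stand-ins, not the real OMPI calls.
 * Illustrates the finalize ordering constraint: del_procs must run before
 * the PML framework is closed.
 */
#include <stddef.h>
#include <stdio.h>

static int framework_open = 1;

/* Stand-in for the PML's del_procs cleanup hook. */
static int pml_del_procs(size_t nprocs)
{
    if (!framework_open) {
        fprintf(stderr, "error: PML framework already closed\n");
        return -1;
    }
    printf("removed %zu procs from the PML\n", nprocs);
    return 0;
}

static void close_pml_framework(void)
{
    framework_open = 0;
    printf("PML framework closed\n");
}

int main(void)
{
    /* Correct finalize order: del_procs first, then close the framework. */
    pml_del_procs(4);
    close_pml_framework();

    /* The failure mode under discussion: the cleanup fires too late. */
    pml_del_procs(4);   /* reports an error because the framework is gone */
    return 0;
}
```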
I observed the race when testing the enabling of the async OPAL progress thread in #13074. The observation is as follows: [...]
Append the PML cleanup to the finalize list of the instance domain ('ompi_instance_common_domain') before RTE/OPAL init.
The reason is that RTE init (ompi_rte_init()) calls opal_init(), which in turn sets the internal tracking domain to OPAL's own ('opal_init_domain'); as a result, this PML cleanup function would be mis-registered as belonging to 'opal_init_domain' instead of the current 'ompi_instance_common_domain'.
The consequence of this mis-registration is that, at MPI_Finalize(), the PML cleanup (*_del_procs()) is executed by RTE and, depending on registration order, may pull the rug out from under other still-running components (*_progress()).
This may be the root cause of issue Intermittent crashes inside MPI_Finalize #10117.