Create an OPAL "core" library for internal usage #10779
Conversation
This branch is functional and synced to The wrapper compilers and Early eyes and feedback on this branch as it stands now are welcome. I will only push new commits from this point, and we can squash later. I'm out of the office until next Tuesday, so I will resume work on this PR then.
Data point: Open MPI. Below are the results for Open MPI.
I honestly have forgotten - IIRC, your "mpirun" wrapper has a dependency on OPAL only for installdirs so it can properly deal with any prefix (envar or whatever) - is that correct? I'm not sure it all really matters, but you could probably strip things down even further for just the "mpirun" wrapper if you were so inclined. Installdirs shouldn't have much in the way of dependencies (certainly not hwloc and pmix, for example).
#9893?
Josh took a patch I was working on to try and work around some of the really ugly parts of your patch around the MCA system. I thought one of the patches still listed you as author; that might have gotten boogered up, but a lot of the hard bits were from your patch.
I'm just trying to understand if there was a conceptual issue with the approach proposed there.
Brian asked for help with his branch here, and I was asked to pick it up and drive it to completion for the v5 timeline. I saw some changes from George's branch were mixed in, so I assumed that Brian had already merged in the delta between the two approaches. I finished the work on Brian's branch, then squashed/rebased the branch to bring it up to the present. I added George and Brian as co-authors to the main commit due to their efforts in the branch up to the point where I picked it up. The approach in this PR seems to fit nicely with the existing infrastructure. We could move some code around to make it a bit clearer which is "core" OPAL and which is "top/main" OPAL, but I'm hoping that the documentation (also a part of this PR) will cover that. Note: This is still in a draft PR status because I wanted to bounce it off of CI. I'll work on bringing it back for review later today.
Force-pushed from 40347e7 to 42b9422.
```diff
@@ -63,6 +63,8 @@ static opal_output_stream_t verbose;
 static char *output_dir = NULL;
 static char *output_prefix = NULL;

+static opal_output_pmix_cleanup_fn_t output_pmix_cleanup_fn = NULL;
```
Unlike `pmix`, `output` is now part of the core functionality. Why are we using `pmix` in the name of this type in the core library?
That is correct. `output` is part of the OPAL core, but `pmix` is not part of the OPAL core. The `output.c` code had this in it (lines 786 to 787 at b2c26c5):

```c
/* register it to be ignored */
opal_pmix_register_cleanup(filename, false, true, false);
```
The call was inserted to tell the host RM not to clean up this file when cleaning up session directories and whatnot when the job completes. The `open_file` function may be called whenever a new framework/component opens a new output handle. So I needed to find a way to get a pointer to the `opal_pmix_register_cleanup()` OPAL wrapper around the `PMIx_Job_control_nb` PMIx function into the OPAL core without linking to it (so tools wouldn't complain about the missing symbol). To do this I added a function registration mechanism. I added the `pmix` name as a reminder of how it connects.
```diff
@@ -474,6 +478,22 @@ void opal_output_set_output_file_info(const char *dir, const char *prefix, char
     }
 }

+void opal_output_register_pmix_cleanup_fn(opal_output_pmix_cleanup_fn_t cleanup_fn)
```
I'm not sure I understand what the goal of this function is. Reading through it, it seems to call the cleanup with the currently registered `output_pmix_cleanup_fn` function and then install a new one?
The `open_file` function might have been called in a few of the OPAL core frameworks/components before we can register the handler during the OPAL main initialization (which knows the PMIx symbol). This function first registers all of the previously opened files (if any) before setting the handler for any files to be opened later. It makes sure we cover them all.
@artemry-nv Mellanox CI failed on
I think the next test down needs to be commented out as well (for now, until it is fixed). There also look to be some other cases that exercise
@artemry-nv I think the problem is this line that needs to be uncommented so that the
* Fixes Issue open-mpi#9869
* Split `libopen-pal.so` into two libraries:
  - `libopen-pal_core.la`: Internal "core" portion of OPAL containing the essential source and MCA needed for the mpicc/mpirun tools to link against. The "core" library is not installed.
  - `libopen-pal.la`: Includes "core" plus all of the other OPAL project sources. The `.so` version of this is installed.
* The "core" library contains the following:
  - `opal/class`
  - `opal/mca/backtrace`
  - `opal/mca/dl`
  - `opal/mca/installdirs`
  - `opal/mca/threads`
  - `opal/mca/timer`
  - `opal/runtime/*_core.[c|h]`
  - `opal/runtime/opal_info_support.c`
  - `opal/util` (most - see `Makefile.am`)
  - `opal/util/keyval`
* The "core" library is linked into the following tools instead of the full `libopen-pal.so`:
  - `ompi/tools/mpirun`
  - `ompi/tools/wrappers` (by extension of `opal/tools/wrappers`)
  - `opal/tools/wrappers`
* The `opal/runtime` files were divided into a 'core' set vs a 'main' set

Co-authored-by: George Bosilca <bosilca@icl.utk.edu>
Co-authored-by: Brian Barrett <bbarrett@amazon.com>
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
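The split described in the commit message can be sketched in Automake/Libtool terms. This is only an illustration of the pattern (a non-installed convenience library folded into the installed one); the source lists are made up and the tree's actual `Makefile.am` layering is more involved:

```makefile
# Sketch only: build "core" as a libtool convenience library
# (noinst_ => never installed), then fold it into the installed
# library via LIBADD. Source file names are illustrative.
noinst_LTLIBRARIES = libopen-pal_core.la
lib_LTLIBRARIES    = libopen-pal.la

libopen_pal_core_la_SOURCES = \
        class/opal_object.c \
        runtime/opal_init_core.c

libopen_pal_la_SOURCES = runtime/opal_init.c
libopen_pal_la_LIBADD  = libopen-pal_core.la
```

Tools such as the wrapper compilers can then list `libopen-pal_core.la` in their `_LDADD` instead of the full library, keeping hwloc/pmix out of their link line.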
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Force-pushed from 7e66a7f to 8112e9b.
I squashed the commits down to two (functional commit and documentation commit). This is ready for final review.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).