
Fix MPI communicator leak #841
Merged: 9 commits, Mar 28, 2023
Conversation

@bprather (Collaborator) commented Mar 3, 2023

This fixes an issue seen in long runs that perform many reductions, where, depending on the MPI implementation, the program could run out of available file handles or exhaust host memory entirely. The cause was allocation of an MPI communicator via MPI_Comm_dup on creation of each Reduce or AllReduce object, without an accompanying call to MPI_Comm_free when the object and/or its copies were deleted.

This issue was previously fixed on the Riot branch by adding a destructor to ReductionBase, which called MPI_Comm_free.
However, that fix relies on the MPI implementation ref-counting communicators such that calling MPI_Comm_free twice on the same communicator does not result in a double free. It appears not all implementations do this, at least in my experience using that fix in KHARMA.

This fix instead declares the member variable comm as a std::shared_ptr, with MPI_Comm_free registered as a custom deleter.

It also adds some PARTHENON_MPI_CHECKs that were missing.
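For illustration, the pattern looks roughly like this (a simplified sketch with stand-in names, not the exact ReductionBase code; the real calls are wrapped in PARTHENON_MPI_CHECK):

```cpp
#include <memory>
#include <mpi.h>

// Simplified stand-in for the real (templated) ReductionBase.
struct ReductionLikeObject {
  // One heap-allocated handle shared by every copy of this object.
  // MPI_Comm_free is the deleter, so it runs exactly once, when the
  // last copy is destroyed.
  std::shared_ptr<MPI_Comm> comm;

  ReductionLikeObject() : comm(new MPI_Comm, MPI_Comm_free) {
    MPI_Comm_dup(MPI_COMM_WORLD, comm.get());
  }
};
```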

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

@bprather changed the title from "Attempt to fix MPI communicator leak without introducing a double free" to "WIP: Fix MPI communicator leak" on Mar 3, 2023
@bprather (Collaborator, Author) commented Mar 3, 2023

This is "finished" but not tested to work. When I get confirmation it fixes the issue downstream I'll un-WIP it.

@bprather (Collaborator, Author) commented Mar 3, 2023

@par-hermes format

@Yurlungur (Collaborator) left a comment

Thanks for the fix!

@lroberts36 (Collaborator) commented Mar 6, 2023

This is another one that we ran into on the Riot branch of Parthenon that never got merged back into main... sorry. I used a pretty similar strategy, but without the smart pointer. (see lroberts36/merge-sparse-with-jdolence-sparse)

Edit: Oh, I didn't see that @bprather commented on this already.

Comment on lines 71 to 73
MPI_Comm *comm_tmp;
PARTHENON_MPI_CHECK(MPI_Comm_dup(MPI_COMM_WORLD, comm_tmp));
comm = std::shared_ptr<MPI_Comm>(comm_tmp, MPI_Comm_free);
@Yurlungur (Collaborator) commented Mar 6, 2023

Looks like this causes a runtime error, probably because MPI_Comm_dup doesn't allocate storage for the handle (comm_tmp is passed in uninitialized). I suggest the following:

Suggested change
-  MPI_Comm *comm_tmp;
-  PARTHENON_MPI_CHECK(MPI_Comm_dup(MPI_COMM_WORLD, comm_tmp));
-  comm = std::shared_ptr<MPI_Comm>(comm_tmp, MPI_Comm_free);
+  comm = std::shared_ptr<MPI_Comm>(new MPI_Comm, MPI_Comm_free);
+  PARTHENON_MPI_CHECK(MPI_Comm_dup(MPI_COMM_WORLD, comm.get()));

A collaborator replied:

With this setup we could also put the initialization of comm in the var definition, e.g.,

std::shared_ptr<MPI_Comm> comm(new MPI_Comm, MPI_Comm_free);

@@ -59,20 +59,26 @@ struct ReductionBase {
   T val;
 #ifdef MPI_PARALLEL
   MPI_Request req;
-  MPI_Comm comm;
+  std::shared_ptr<MPI_Comm> comm;
A collaborator commented:

Since this is a pointer now, it might also make sense to rename it pcomm for clarity.

@BenWibking (Collaborator) commented:

This fix does not work for me on Crusher:

MPICH ERROR [Rank 0] [job id 281158.0] [Mon Mar 13 14:49:57 2023] [crusher123] - Abort(604584204) (rank 0 in comm 0): Fatal error in PMPI_Comm_dup: Invalid argument, error stack:
PMPI_Comm_dup(170): MPI_Comm_dup(MPI_COMM_WORLD, new_comm=(nil)) failed
PMPI_Comm_dup(145): Null pointer in parameter newcomm

@BenWibking (Collaborator) commented:

Ok, with the suggested changes from @Yurlungur, it works.

@BenWibking (Collaborator) commented:

We also need this fix for AthenaPK.

Ben Prather added 2 commits March 17, 2023 14:25
@bprather (Collaborator, Author) commented:

Sorry I'm a bit late on this. This wasn't passing tests initially, and I found I needed to scope the PoissonDriver object in that example, just as @BenWibking apparently did for the Ascent PR.

I think more people may get caught by this issue, though: we might want to establish that all Parthenon objects be set off in a code block inside main() so that we can rely on the appropriate destructors being called before MPI_Finalize.
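Roughly this kind of layout, sketched here with hypothetical stand-ins for the Parthenon init/finalize calls and the driver (not the actual example code):

```cpp
#include <mpi.h>

// Hypothetical stand-in for anything that owns an MPI communicator
// (e.g. a driver holding Reduce/AllReduce members).
struct Driver {
  MPI_Comm comm;
  Driver() { MPI_Comm_dup(MPI_COMM_WORLD, &comm); }
  ~Driver() { MPI_Comm_free(&comm); }
  void Execute() { /* ... work ... */ }
};

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);  // stands in for ParthenonInit
  {
    // Scope the driver so its destructor -- and therefore MPI_Comm_free --
    // runs here, before MPI_Finalize below.
    Driver driver;
    driver.Execute();
  }
  MPI_Finalize();  // stands in for ParthenonFinalize
  return 0;
}
```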

@bprather changed the title from "WIP: Fix MPI communicator leak" to "Fix MPI communicator leak" on Mar 17, 2023
@Yurlungur (Collaborator) commented:

Please don't forget to add the CHANGELOG @bprather :)

@Yurlungur (Collaborator) commented:

Approved assuming tests pass

@BenWibking (Collaborator) commented:

Ok, great. Actually it was @pgrete who discovered the finalization issue :)

@pgrete (Collaborator) left a comment

Nice clean solution!

@pgrete enabled auto-merge (squash) March 20, 2023 14:36
@BenWibking (Collaborator) commented:

It looks like the Poisson solver regression test fails:

9/10 Test #64: regression_mpi_test:poisson .....................***Failed    3.73 sec

@bprather (Collaborator, Author) commented Mar 20, 2023

Hmm, this is exactly the error & location I was seeing before adopting the Driver scoping fix -- but I no longer see it on MPICH. What MPI stack do the tests use? Maybe, e.g., OpenMPI holds onto comm handles more tightly?

Is there a way to access the full CTest log for the automated tests? That would hopefully confirm it's the same error and provide any implementation-specific context.

@BenWibking (Collaborator) commented Mar 20, 2023

I see what looks like the full log here: https://github.com/parthenon-hpc-lab/parthenon/actions/runs/4469416747/jobs/7851492978?pr=841#step:8:10205

The configure step says it's using OpenMPI:

-- Found MPI_CXX: /opt/openmpi/lib/libmpi_cxx.so (found version "3.1") 
-- Found MPI: TRUE (found version "3.1") found components: CXX 

@bprather (Collaborator, Author) commented:

Ah, should have checked the compile output, thanks. I'll try to reproduce this by switching to OpenMPI on my test setup.

Re: the log, I saw that much -- I think there may be an artifact, usually build/Testing/Temporary/LastTest.log, with a more complete picture that includes stderr. That would capture any useful "MPI crashing due to X" messages the uploaded log may have missed.

@BenWibking (Collaborator) commented:

It would be useful to have the stderr output. It looks like the artifact upload is canceled if the tests fail, but this can be changed: https://docs.github.com/en/actions/learn-github-actions/expressions#always

@bprather (Collaborator, Author) commented:

Okay, here's the situation as near as I can tell. I was able to reproduce the issue when compiling inside the cuda11.6-mpi-hdf5-ascent container, but never on my host machine regardless of MPI version.

Rather than MPI, the key difference appears to be the version of HDF5, with 1.12.2 specifically causing the issue (I believe even 1.12.1 is not affected, but I'm not sure of this yet). This appears to be due to an attempted call to MPI_Comm_dup during MPI-IO output in ParthenonFinalize, which causes a segfault (curiously, the backtrace was omitted from both CTest logs -- I had to run manually to get it).

Still testing different stacks (e.g., not sure if HDF5 1.12.2/MPICH exhibits this issue) but that's the basic idea. So, it'll be a little while to fix I think :/

@BenWibking (Collaborator) commented:

> Rather than MPI, the key difference appears to be the version of HDF5, with 1.12.2 specifically causing the issue (I believe even 1.12.1 is not affected, but I'm not sure of this yet). This appears to be due to an attempted call to MPI_Comm_dup during MPI-IO output in ParthenonFinalize, which causes a segfault (curiously, the backtrace was omitted from both CTest logs -- I had to run manually to get it).

Wow, that's incredibly annoying of HDF5. Can we switch the CI container to use 1.12.1, or do we have to use 1.12.2 for some other reason?

@Yurlungur (Collaborator) commented:

Why exactly does MPI_Comm_dup cause a segfault when HDF5 does it? That seems like a deeper cause for concern.

> Can we switch the CI container to use 1.12.1, or do we have to use 1.12.2 for some other reason?

I don't think we're tied specifically to 1.12.2 but it would be good to know what's going on here.

@bprather (Collaborator, Author) commented:

This is a cool bug, which is to say it's driving me insane. It is a race condition, in that it only appears when running under MPI with > 1 rank, and then only randomly:

$ ctest -R poisson
Test project /home/bprather/Code/parthenon/build
    Start 63: regression_test:poisson
1/2 Test #63: regression_test:poisson ..........   Passed    3.25 sec
    Start 64: regression_mpi_test:poisson
2/2 Test #64: regression_mpi_test:poisson ......   Passed    2.67 sec

100% tests passed, 0 tests failed out of 2

Label Time Summary:
mpi           =  10.69 sec*proc (1 test)
poisson       =  13.94 sec*proc (2 tests)
regression    =  13.94 sec*proc (2 tests)
serial        =   3.25 sec*proc (1 test)

Total Test time (real) =   5.93 sec
$ ctest -R poisson
Test project /home/bprather/Code/parthenon/build
    Start 63: regression_test:poisson
1/2 Test #63: regression_test:poisson ..........   Passed    3.28 sec
    Start 64: regression_mpi_test:poisson
2/2 Test #64: regression_mpi_test:poisson ......***Failed    4.57 sec

50% tests passed, 1 tests failed out of 2

Label Time Summary:
mpi           =  18.27 sec*proc (1 test)
poisson       =  21.54 sec*proc (2 tests)
regression    =  21.54 sec*proc (2 tests)
serial        =   3.28 sec*proc (1 test)

Total Test time (real) =   7.85 sec

The following tests FAILED:
	 64 - regression_mpi_test:poisson (Failed)
Errors while running CTest
Output from these tests are in: /home/bprather/Code/parthenon/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.

which is a delight to test for! FWIW I've never actually triggered the bug with 2 ranks, either, only 4.

It does not appear to be related to HDF5 or OpenMPI version, in that I can reproduce it with varying frequencies on every combination I've tried (I haven't tested on MPICH, since "don't use OpenMPI" is not a valid workaround). Sprinkling barriers on all MPI communicators and COMM_WORLD does not help at all. Even waiting 5s after the driver finishes executing before writing the output does not help.

Initially, I had thought this might be a pool/allocator issue: MPI frees "Comm1," HDF5 asks for a communicator, MPI gives it the memory that backed "Comm1" as "Comm2." Then, MPI sees that "Comm1" is unused, frees the backing memory, and thereby pulls the rug out from under "Comm2."

However, since we can wait 5s from the last MPI call and still hit the race condition, I'm now doubtful this is the case. Worse, I have no theory to replace it. The fact is, if any communicator has been freed during runtime, I can wait a seemingly arbitrary amount of time, and the next MPI_Comm_dup call from HDF5 will still be subject to this race condition and potentially fail. What!?

@Yurlungur (Collaborator) commented:

@bprather out of curiosity, does @lroberts36's version, which just added a destructor to the reducers, fix the issue? If so, maybe that's just the easiest solution?

@bprather (Collaborator, Author) commented:

I strongly believe it will not, because MPI_Comm_free is not actually reference-counted in the way this PR implements it.

According to the standard, each communicator should be reference counted between processes, by incrementing a shared counter in MPI_Comm_dup and decrementing on free. This is intended solely to allow any pending operations to complete even if one process has called free, by keeping the comm alive until its last message is processed. (As an interesting point, the refcount is only "recommended" as a way to implement this linger behavior).

Calling MPI_Comm_free (in, e.g., an object destructor) and then trying to use the communicator you just freed (in, e.g., a copy of the object) from within the same process will always segfault (indeed, the standard requires that free set the communicator handle to MPI_COMM_NULL regardless of whether the communicator itself is still alive). You could imagine calling MPI_Comm_dup in all object constructors, and then being able to call free in the destructor with confidence, but that's just this PR with more needless comm allocations.

MPI has a synchronous version of free without the lingering behavior, which waits to return until the communicator is well and truly dead. Switching this out for free in the code changes nothing, confirming my suspicion that this really isn't a race condition with lingering communicators.
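For concreteness, that dup-in-every-constructor alternative would look something like this (purely illustrative, not code from this PR or from Riot): every copy pays for another communicator, which is exactly what the shared_ptr deleter avoids.

```cpp
#include <mpi.h>

// Hypothetical dup-per-copy reducer: the per-object free is safe, but each
// copy allocates its own communicator.
struct DupPerCopyReducer {
  MPI_Comm comm;
  DupPerCopyReducer() { MPI_Comm_dup(MPI_COMM_WORLD, &comm); }
  DupPerCopyReducer(const DupPerCopyReducer &other) {
    MPI_Comm_dup(other.comm, &comm);  // each copy gets its own dup...
  }
  DupPerCopyReducer &operator=(const DupPerCopyReducer &other) {
    if (this != &other) {
      MPI_Comm_free(&comm);
      MPI_Comm_dup(other.comm, &comm);
    }
    return *this;
  }
  ~DupPerCopyReducer() { MPI_Comm_free(&comm); }  // ...so this free is safe.
};
```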

@Yurlungur (Collaborator) commented:

Actually hey I think I know what it is. @bprather move the reducer objects in Poisson out of the driver and into mutable params.

The problem is that the driver object persists beyond the MPI_Finalize call in ParthenonFinalize, but params do not, because the mesh object is cleared.

I had this same issue with reducers over Kokkos views. That's why the views are in params, not the driver, in the first place.

@bprather (Collaborator, Author) commented:

I think what you're describing is actually the original issue I ran into, which printed something about MPI_Comm_free called after MPI_Finalize etc etc. That issue was fixed by scoping the driver to be deallocated before ParthenonFinalize was called, which was a clean solution Philipp and I both arrived at for different issues, and seems like a tidy general best-practice to rely on (though, we should document that we apparently now rely on it).

I've verified that this issue is orthogonal to that one -- it appears regardless of Driver scoping when creating the last output files, which happens as a part of Execute() before the Driver-member communicators are even deallocated -- the code crashes well before any finalization can occur (or mysteriously completes all that finalization successfully, because race condition). Rather, there are a couple of step-scoped Reduce objects for vectors "for fun," which appear to be causing the issue as they're deallocated before the last files are written.

Continuing to poke at this a bit but probably no news before Mon.

@bprather (Collaborator, Author) commented:

After experimenting with this, I can find no reliable way of reproducing it on every invocation, even inside the testing container. However, adding MPI barriers alone, without the absurd waiting times between operations, actually made the race condition much less frequent. I found I could achieve the same stabilizing effect as the barriers (at least, on my machine) by switching out Comm_free for Comm_disconnect as the deleter.

I make no claim that this solves the underlying problem, but it does make the error so infrequent on my machine (5% of invocations? Less?) that I can't pursue it much further without a more reliable instance of the error occurring at all.

I'd recommend as many people as possible try this fix: either the fix works for all cases, or someone finds a way to more reliably reproduce the race condition here.
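For reference, the swap amounts to changing the deleter on the shared pointer, roughly as below (a sketch assuming the member shape from the earlier suggestion, with PARTHENON_MPI_CHECK omitted; not necessarily the final code):

```cpp
#include <memory>
#include <mpi.h>

struct ReducerSketch {
  std::shared_ptr<MPI_Comm> comm;
  ReducerSketch()
      // MPI_Comm_disconnect (instead of MPI_Comm_free) waits for pending
      // communication to complete before destroying the communicator.
      : comm(new MPI_Comm, MPI_Comm_disconnect) {
    MPI_Comm_dup(MPI_COMM_WORLD, comm.get());
  }
};
```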

@Yurlungur (Collaborator) commented:

Does this change resolve the issue on the CI? If so, that's probably good enough.

@bprather disabled auto-merge March 28, 2023 21:13
@bprather enabled auto-merge March 28, 2023 21:13
@bprather (Collaborator, Author) commented:

A manual CI run passed, so I've toggled auto-merge to trigger everything with actual reporting. I guess, speak now or etc etc
