
Fix MPI communicator leak #841
Merged: 9 commits, Mar 28, 2023
Conversation

@bprather (Collaborator) commented Mar 3, 2023

This fixes an issue seen in long runs that perform many reductions, where, depending on the MPI implementation, the program could run out of available file handles or exhaust host memory entirely. The cause was allocation of an MPI communicator via MPI_Comm_dup on creation of each Reduce or AllReduce object, without an accompanying call to MPI_Comm_free when the object and/or its copies were deleted.

This issue was previously fixed on the Riot branch by adding a destructor to ReductionBase, which called MPI_Comm_free.
However, that fix relies on the MPI implementation ref-counting communicators such that calling MPI_Comm_free twice on the same communicator does not result in a double free. It appears not all implementations do this, at least in my experience using that fix in KHARMA.

This fix instead declares the member variable comm as a std::shared_ptr, with MPI_Comm_free registered as a custom deleter.

It also adds some PARTHENON_MPI_CHECKs that were missing.
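For illustration, the pattern looks roughly like this (a simplified sketch with stand-in names, not the exact ReductionBase code; the real calls are wrapped in PARTHENON_MPI_CHECK):

```cpp
#include <memory>
#include <mpi.h>

// Simplified stand-in for the real (templated) ReductionBase.
struct ReductionLikeObject {
  // One heap-allocated handle shared by every copy of this object.
  // MPI_Comm_free is the deleter, so it runs exactly once, when the
  // last copy is destroyed.
  std::shared_ptr<MPI_Comm> comm;

  ReductionLikeObject() : comm(new MPI_Comm, MPI_Comm_free) {
    MPI_Comm_dup(MPI_COMM_WORLD, comm.get());
  }
};
```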

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

@bprather changed the title from "Attempt to fix MPI communicator leak without introducing a double free" to "WIP: Fix MPI communicator leak" on Mar 3, 2023
@bprather (Collaborator, Author) commented Mar 3, 2023

This is "finished" but not tested to work. When I get confirmation it fixes the issue downstream I'll un-WIP it.

@bprather (Collaborator, Author) commented Mar 3, 2023

@par-hermes format

@Yurlungur (Collaborator) left a comment

Thanks for the fix!

@lroberts36 (Collaborator) commented Mar 6, 2023

This is another one that we ran into on the Riot branch of Parthenon that never got merged back into main... sorry. I used a pretty similar strategy, but without the smart pointer. (see lroberts36/merge-sparse-with-jdolence-sparse)

Edit: Oh, I didn't see that @bprather commented on this already.

Comment on lines 71 to 73
MPI_Comm *comm_tmp;
PARTHENON_MPI_CHECK(MPI_Comm_dup(MPI_COMM_WORLD, comm_tmp));
comm = std::shared_ptr<MPI_Comm>(comm_tmp, MPI_Comm_free);
@Yurlungur (Collaborator) commented Mar 6, 2023

Looks like this causes a runtime error, probably because MPI_Comm_dup doesn't allocate storage for the handle (comm_tmp is passed in uninitialized). I suggest the following:

Suggested change
-  MPI_Comm *comm_tmp;
-  PARTHENON_MPI_CHECK(MPI_Comm_dup(MPI_COMM_WORLD, comm_tmp));
-  comm = std::shared_ptr<MPI_Comm>(comm_tmp, MPI_Comm_free);
+  comm = std::shared_ptr<MPI_Comm>(new MPI_Comm, MPI_Comm_free);
+  PARTHENON_MPI_CHECK(MPI_Comm_dup(MPI_COMM_WORLD, comm.get()));

A collaborator replied:

With this setup we could also put the initialization of comm in the var definition, e.g.,

std::shared_ptr<MPI_Comm> comm(new MPI_Comm, MPI_Comm_free);

@@ -59,20 +59,26 @@ struct ReductionBase {
   T val;
 #ifdef MPI_PARALLEL
   MPI_Request req;
-  MPI_Comm comm;
+  std::shared_ptr<MPI_Comm> comm;
A collaborator commented:

Since this is a pointer now, it might also make sense to rename it pcomm for clarity.

@BenWibking (Collaborator) commented:

This fix does not work for me on Crusher:

MPICH ERROR [Rank 0] [job id 281158.0] [Mon Mar 13 14:49:57 2023] [crusher123] - Abort(604584204) (rank 0 in comm 0): Fatal error in PMPI_Comm_dup: Invalid argument, error stack:
PMPI_Comm_dup(170): MPI_Comm_dup(MPI_COMM_WORLD, new_comm=(nil)) failed
PMPI_Comm_dup(145): Null pointer in parameter newcomm

@BenWibking (Collaborator) commented:

Ok, with the suggested changes from @Yurlungur, it works.

@BenWibking (Collaborator) commented:

We also need this fix for AthenaPK.

Ben Prather added 2 commits March 17, 2023 14:25
@bprather (Collaborator, Author) commented:

Sorry I'm a bit late on this. This wasn't passing tests initially, and I found I needed to scope the PoissonDriver object in that example, just as @BenWibking apparently did for the Ascent PR.

I think more people may get caught by this issue, though: we might want to establish that all Parthenon objects be set off in a code block inside main() so that we can rely on the appropriate destructors being called before MPI_Finalize.
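Roughly this kind of layout, sketched here with hypothetical stand-ins for the Parthenon init/finalize calls and the driver (not the actual example code):

```cpp
#include <mpi.h>

// Hypothetical stand-in for anything that owns an MPI communicator
// (e.g. a driver holding Reduce/AllReduce members).
struct Driver {
  MPI_Comm comm;
  Driver() { MPI_Comm_dup(MPI_COMM_WORLD, &comm); }
  ~Driver() { MPI_Comm_free(&comm); }
  void Execute() { /* ... work ... */ }
};

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);  // stands in for ParthenonInit
  {
    // Scope the driver so its destructor -- and therefore MPI_Comm_free --
    // runs here, before MPI_Finalize below.
    Driver driver;
    driver.Execute();
  }
  MPI_Finalize();  // stands in for ParthenonFinalize
  return 0;
}
```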

@bprather changed the title from "WIP: Fix MPI communicator leak" to "Fix MPI communicator leak" on Mar 17, 2023
@Yurlungur (Collaborator) commented:

Please don't forget to add the CHANGELOG @bprather :)

@Yurlungur (Collaborator) commented:

Approved assuming tests pass

@BenWibking (Collaborator) commented:

Ok, great. Actually it was @pgrete who discovered the finalization issue :)

@pgrete (Collaborator) left a comment

Nice clean solution!

@pgrete enabled auto-merge (squash) March 20, 2023 14:36
@BenWibking (Collaborator) commented:

It looks like the Poisson solver regression test fails:

9/10 Test #64: regression_mpi_test:poisson .....................***Failed    3.73 sec

@bprather (Collaborator, Author) commented Mar 20, 2023

Hmm, this is exactly the error & location I was seeing before adopting the Driver scoping fix -- but I no longer see it on MPICH. What MPI stack do the tests use? Maybe, e.g., OpenMPI holds onto comm handles more tightly?

Is there a way to access the full CTest log for the automated tests? That would hopefully confirm it's the same error and provide any implementation-specific context.

@BenWibking (Collaborator) commented Mar 20, 2023

I see what looks like the full log here: https://github.com/parthenon-hpc-lab/parthenon/actions/runs/4469416747/jobs/7851492978?pr=841#step:8:10205

The configure step says it's using OpenMPI:

-- Found MPI_CXX: /opt/openmpi/lib/libmpi_cxx.so (found version "3.1") 
-- Found MPI: TRUE (found version "3.1") found components: CXX 

@bprather (Collaborator, Author) commented:

Ah, should have checked the compile output, thanks. I'll try to reproduce this by switching to OpenMPI on my test setup.

Re: the log, I saw that much -- I think there may be an artifact, usually build/Testing/Temporary/LastTest.log, with a more complete picture that includes stderr. That would capture any useful "MPI crashing due to X" messages the uploaded log may have missed.

@BenWibking (Collaborator) commented:

It would be useful to have the stderr output. It looks like the artifact upload is canceled if the tests fail, but this can be changed: https://docs.github.com/en/actions/learn-github-actions/expressions#always

@bprather (Collaborator, Author) commented:

Okay, here's the situation as near as I can tell. I was able to reproduce the issue when compiling inside the cuda11.6-mpi-hdf5-ascent container, but never on my host machine regardless of MPI version.

Rather than MPI, the key difference appears to be the version of HDF5, with 1.12.2 specifically causing the issue (I believe even 1.12.1 is not affected, but I'm not sure of this yet). This appears to be due to an attempted call to MPI_Comm_dup during MPI-IO output in ParthenonFinalize, which causes a segfault (curiously, the backtrace was omitted from both CTest logs -- I had to run manually to get it).

Still testing different stacks (e.g., not sure if HDF5 1.12.2/MPICH exhibits this issue) but that's the basic idea. So, it'll be a little while to fix I think :/

@BenWibking (Collaborator) commented:

> Rather than MPI, the key difference appears to be the version of HDF5, with 1.12.2 specifically causing the issue (I believe even 1.12.1 is not affected, but I'm not sure of this yet). This appears to be due to an attempted call to MPI_Comm_dup during MPI-IO output in ParthenonFinalize, which causes a segfault (curiously, the backtrace was omitted from both CTest logs -- I had to run manually to get it).

Wow, that's incredibly annoying of HDF5. Can we switch the CI container to use 1.12.1, or do we have to use 1.12.2 for some other reason?

@Yurlungur (Collaborator) commented:

Why exactly does MPI_Comm_dup cause a segfault when HDF5 does it? That seems like a deeper cause for concern.

> Can we switch the CI container to use 1.12.1, or do we have to use 1.12.2 for some other reason?

I don't think we're tied specifically to 1.12.2 but it would be good to know what's going on here.

@bprather (Collaborator, Author) commented:

This is a cool bug, which is to say it's driving me insane. It is a race condition, in that it only appears when running under MPI with > 1 rank, and then only randomly:

$ ctest -R poisson
Test project /home/bprather/Code/parthenon/build
    Start 63: regression_test:poisson
1/2 Test #63: regression_test:poisson ..........   Passed    3.25 sec
    Start 64: regression_mpi_test:poisson
2/2 Test #64: regression_mpi_test:poisson ......   Passed    2.67 sec

100% tests passed, 0 tests failed out of 2

Label Time Summary:
mpi           =  10.69 sec*proc (1 test)
poisson       =  13.94 sec*proc (2 tests)
regression    =  13.94 sec*proc (2 tests)
serial        =   3.25 sec*proc (1 test)

Total Test time (real) =   5.93 sec
$ ctest -R poisson
Test project /home/bprather/Code/parthenon/build
    Start 63: regression_test:poisson
1/2 Test #63: regression_test:poisson ..........   Passed    3.28 sec
    Start 64: regression_mpi_test:poisson
2/2 Test #64: regression_mpi_test:poisson ......***Failed    4.57 sec

50% tests passed, 1 tests failed out of 2

Label Time Summary:
mpi           =  18.27 sec*proc (1 test)
poisson       =  21.54 sec*proc (2 tests)
regression    =  21.54 sec*proc (2 tests)
serial        =   3.28 sec*proc (1 test)

Total Test time (real) =   7.85 sec

The following tests FAILED:
	 64 - regression_mpi_test:poisson (Failed)
Errors while running CTest
Output from these tests are in: /home/bprather/Code/parthenon/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.

which is a delight to test for! FWIW I've never actually triggered the bug with 2 ranks, either, only 4.

It does not appear to be related to HDF5 or OpenMPI version, in that I can reproduce it with varying frequencies on every combination I've tried (I haven't tested on MPICH, since "don't use OpenMPI" is not a valid workaround). Sprinkling barriers on all MPI communicators and COMM_WORLD does not help at all. Even waiting 5s after the driver finishes executing before writing the output does not help.

Initially, I had thought this might be a pool/allocator issue: MPI frees "Comm1," HDF5 asks for a communicator, MPI gives it the memory that backed "Comm1" as "Comm2." Then, MPI sees that "Comm1" is unused, frees the backing memory, and thereby pulls the rug out from under "Comm2."

However, since we can wait 5s from the last MPI call and still hit the race condition, I'm now doubtful this is the case. Worse, I have no theory to replace it. The fact is, if any communicator has been freed during runtime, I can wait a seemingly arbitrary amount of time, and the next MPI_Comm_dup call from HDF5 will still be subject to this race condition and potentially fail. What!?

@Yurlungur (Collaborator) commented:

@bprather out of curiosity, does @lroberts36's version, which just added a destructor to the reducers, fix the issue? If so, maybe that's just the easiest solution?

@bprather (Collaborator, Author) commented:

I strongly believe it will not, because MPI_Comm_free is not actually reference-counted in the way this PR implements it.

According to the standard, each communicator should be reference counted between processes, by incrementing a shared counter in MPI_Comm_dup and decrementing on free. This is intended solely to allow any pending operations to complete even if one process has called free, by keeping the comm alive until its last message is processed. (As an interesting point, the refcount is only "recommended" as a way to implement this linger behavior).

Calling MPI_Comm_free (in, e.g., an object destructor) and then trying to use the communicator you just freed (in, e.g., a copy of the object) from within the same process will always segfault (indeed, the standard requires that free set the communicator handle to MPI_COMM_NULL regardless of whether the communicator itself is still alive). You could imagine calling MPI_Comm_dup in all object constructors, and then being able to call free in the destructor with confidence, but that's just this PR with more needless comm allocations.

MPI has a synchronous version of free without the lingering behavior, which waits to return until the communicator is well and truly dead. Switching this out for free in the code changes nothing, confirming my suspicion that this really isn't a race condition with lingering communicators.
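For concreteness, that dup-in-every-constructor alternative would look something like this (purely illustrative, not code from this PR or from Riot): every copy pays for another communicator, which is exactly what the shared_ptr deleter avoids.

```cpp
#include <mpi.h>

// Hypothetical dup-per-copy reducer: the per-object free is safe, but each
// copy allocates its own communicator.
struct DupPerCopyReducer {
  MPI_Comm comm;
  DupPerCopyReducer() { MPI_Comm_dup(MPI_COMM_WORLD, &comm); }
  DupPerCopyReducer(const DupPerCopyReducer &other) {
    MPI_Comm_dup(other.comm, &comm);  // each copy gets its own dup...
  }
  DupPerCopyReducer &operator=(const DupPerCopyReducer &other) {
    if (this != &other) {
      MPI_Comm_free(&comm);
      MPI_Comm_dup(other.comm, &comm);
    }
    return *this;
  }
  ~DupPerCopyReducer() { MPI_Comm_free(&comm); }  // ...so this free is safe.
};
```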

@Yurlungur (Collaborator) commented:

Actually hey I think I know what it is. @bprather move the reducer objects in Poisson out of the driver and into mutable params.

The problem is that the driver object persists beyond the MPI_Finalize call in ParthenonFinalize, but params do not, because the mesh object is cleared.

I had this same issue with reducers over Kokkos views. That's why the views are in params, not the driver, in the first place.

@bprather (Collaborator, Author) commented:

I think what you're describing is actually the original issue I ran into, which printed something about MPI_Comm_free called after MPI_Finalize etc etc. That issue was fixed by scoping the driver to be deallocated before ParthenonFinalize was called, which was a clean solution Philipp and I both arrived at for different issues, and seems like a tidy general best-practice to rely on (though, we should document that we apparently now rely on it).

I've verified that this issue is orthogonal to that one -- it appears regardless of Driver scoping when creating the last output files, which happens as a part of Execute() before the Driver-member communicators are even deallocated -- the code crashes well before any finalization can occur (or mysteriously completes all that finalization successfully, because race condition). Rather, there are a couple of step-scoped Reduce objects for vectors "for fun," which appear to be causing the issue as they're deallocated before the last files are written.

Continuing to poke at this a bit but probably no news before Mon.

@bprather (Collaborator, Author) commented:

After experimenting with this, I can find no reliable way of reproducing it on every invocation, even inside the testing container. However, adding MPI barriers alone, without the absurd waiting times between operations, actually made the race condition much less frequent. I found I could achieve the same stabilizing effect as the barriers (at least, on my machine) by switching out Comm_free for Comm_disconnect as the deleter.

I make no claim that this solves the underlying problem, but it does make the error so infrequent on my machine (5% of invocations? Less?) that I can't pursue it much further without a more reliable instance of the error occurring at all.

I'd recommend as many people as possible try this fix: either the fix works for all cases, or someone finds a way to more reliably reproduce the race condition here.
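For reference, the swap amounts to changing the deleter on the shared pointer, roughly as below (a sketch assuming the member shape from the earlier suggestion, with PARTHENON_MPI_CHECK omitted; not necessarily the final code):

```cpp
#include <memory>
#include <mpi.h>

struct ReducerSketch {
  std::shared_ptr<MPI_Comm> comm;
  ReducerSketch()
      // MPI_Comm_disconnect (instead of MPI_Comm_free) waits for pending
      // communication to complete before destroying the communicator.
      : comm(new MPI_Comm, MPI_Comm_disconnect) {
    MPI_Comm_dup(MPI_COMM_WORLD, comm.get());
  }
};
```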

@Yurlungur (Collaborator) commented:

Does this change resolve the issue on the CI? If so, that's probably good enough.

@bprather disabled auto-merge March 28, 2023 21:13
@bprather enabled auto-merge March 28, 2023 21:13
@bprather (Collaborator, Author) commented:

A manual CI run passed, so I've toggled auto-merge to trigger everything with actual reporting. I guess, speak now or etc etc
